CN110046943B - Optimization method and optimization system for network consumer subdivision - Google Patents

Info

Publication number
CN110046943B
Authority
CN
China
Prior art keywords
user
psychological
word
network
consumer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910398178.9A
Other languages
Chinese (zh)
Other versions
CN110046943A (en)
Inventor
王伟军
黄英辉
刘辉
李伟卿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central China Normal University
Original Assignee
Central China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central China Normal University
Priority to CN201910398178.9A
Publication of CN110046943A
Application granted
Publication of CN110046943B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0605 Supply or demand aggregation

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Data Mining & Analysis (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention belongs to the technical field of network consumer behavior information processing and discloses an optimization method and an optimization system for network consumer segmentation. First, using psychological lexical evidence, two main psychological segmentation dictionaries are constructed from the word-use behavior of consumers, and the psychographic features of the consumers are obtained. Second, a clustering method is applied with different cluster distances to obtain consumer segment clusters. Third, a machine learning prediction model of user consumption preference is built within each cluster, and the reliability and validity of the clusters are determined by comparison with reference methods. Finally, the optimal user preference prediction model is selected, the user clusters in that model are manually inspected, and market segment labels are assigned. Measured by two evaluation indices (RMSE and R²) and compared with unoptimized user segments, the invention can find the network consumer segments that most significantly improve user preference prediction.

Description

Optimization method and optimization system for network consumer subdivision
Technical Field
The invention belongs to the technical field of network consumer behavior information processing, and particularly relates to an optimization method and an optimization system for network consumer subdivision.
Background
Currently, the closest prior art is as follows:
Existing market segmentation techniques mainly segment users on the basis of demographic, psychographic and behavioral data. A priori segmentation uses predefined segment labels and classifies users with supervised machine learning such as support vector machines, decision trees and random forests. Post hoc segmentation generates segment clusters automatically with clustering methods such as K-means, and the clusters are then given segment labels by manual inspection, yielding segmented consumer groups and the corresponding market segment labels. Hybrid segmentation performs cluster analysis on the result of an a priori segmentation. Whichever segmentation variables and methods are chosen, the resulting segments must be actionable and useful in supporting the formulation and execution of marketing strategy. Specifically, five criteria distinguish a successful segmentation from a failed one: identifiability (whether the segments can be identified), substantiality (the size of the segments), accessibility (whether marketers can easily run campaigns for them), differentiability (whether the segments differ from one another), and operability (whether the segments are consistent with the competitiveness of the enterprise).
Generally, in network scenarios such as e-commerce, the prior art has the following problems:
(1) The existing a priori segmentation method has problems of substantiality, accessibility and operability. A priori segmentation takes predefined labels as its segmentation targets. Most of these labels derive from past experience and may not match real user data, so the segments may be unidentifiable, may differ too much in size, may be unsuitable for marketing campaigns, and may run counter to the competitiveness of the enterprise.
(2) The existing post hoc segmentation method has problems of identifiability, accessibility and operability. Post hoc segmentation identifies segment clusters in user data automatically with unsupervised methods such as clustering. The resulting segments differ only in a statistical sense and have not been manually inspected, so their effectiveness is open to question, they are unsuitable for marketing campaigns, and they may not serve the interests of the enterprise.
(3) The existing hybrid segmentation method inherits the advantages of both methods but also, to varying degrees, their shortcomings in substantiality, accessibility and operability.
Particularly in a network context, where automated and intelligent marketing has become mainstream, existing segmentation methods and systems are not combined with marketing activities that focus on user preference and demand, such as the positioning, promotion and delivery of online enterprise services and products. Existing methods therefore generally have prominent problems of accessibility and operability.
The difficulty of solving the technical problems is as follows:
To predict consumption demand and preference accurately and to explain them dynamically, network consumer behavior must be fully extracted and mined. On the one hand, consumer behavior in a network environment is dynamic and heterogeneous, and existing market segmentation methods can hardly support efficient mining of heterogeneous data. On the other hand, to meet the requirements of substantiality, accessibility and operability in consumer preference prediction, the segmentation method must adapt to dynamic consumer preferences. Existing consumer segmentation methods lack an implementation approach, an operating method and a system implementation for this functional requirement.
The significance of solving the technical problems is as follows:
The invention realizes accurate prediction and dynamic explanation of network consumer segmentation, optimizes the functional modules and technical route of network market segmentation, and supports electronic marketing decisions such as the positioning, promotion and delivery of network services and products, as well as the development of intelligent electronic marketing components.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an optimization method and an optimization system for network consumer segmentation.
The invention is realized as follows. A method for optimizing network consumer segmentation comprises the following steps:
step one, constructing psychological segmentation dictionaries from the word-use behavior of users by using psychological lexical evidence, and acquiring the psychographic features of the users;
step two, selecting different cluster distances by using a clustering method to obtain user segment clusters;
step three, constructing a machine learning prediction model of user consumption preference in each cluster, and determining the reliability and validity of the clusters through comparison with reference methods;
and step four, selecting the optimal user preference prediction model, manually inspecting the user clusters in the model, and assigning market segment labels.
Further, step one further comprises: automatically constructing a Schwartz Value Survey (SVS) lexicon and a Big Five personality model (BFF) lexicon by using natural language processing methods, the lexicons being used to support the acquisition of the user psychographic profile.
Further, the first step specifically comprises:
1) Automatic acquisition of the psychographic lexicon:
the similarity between word vectors generated by word embedding is calculated using cosine similarity; let
Figure BDA0002058852560000031
where m is the word vector dimension; based on the word vectors, the invention measures the semantic similarity of words using cosine similarity, calculated as follows:
Figure BDA0002058852560000032
based on cosine similarity, the Top 10 words most similar to each seed word in the synonym library are calculated using a natural language word embedding algorithm; 0.45 is set as the threshold of the similarity value, and the word embeddings trained on the Internet corpus are traversed to obtain the 10 words most similar to each seed word in the corresponding psychographic candidate lexicon; after threshold filtering, the expanded words are added to the candidate word set; the same calculation is repeated on the updated candidate word set using the natural language word embedding algorithm until no new words are extracted;
the score of each candidate word in each psychological segmentation dimension is calculated according to the membership formula for the psychographic dimensions of candidate words, where $w_{ext}$ is the expanded psychological segmentation word set and $w_{seed_1}, w_{seed_2}, \ldots, w_{seed_p}$ are the seed words of each psychographic dimension:
$\mathrm{SVS\_scores}(w_{ext}, w_{seed_1}) = \max\{\mathrm{sim}(w_{ext}, w_{seed_2}), \mathrm{sim}(w_{ext}, w_{seed_3}), \ldots, \mathrm{sim}(w_{ext}, w_{seed_p})\}$;
2) Lexicon-based automatic identification of the user psychographic profile:
the p-dimensional psychographic profile is defined as $L = \{L_1, L_2, \ldots, L_p\}$ and the unstructured data set of user reviews as $\{r_1, r_2, \ldots, r_m\}$, where m is the total number of user reviews and each $r_i$ is $\{w_{i1}, w_{i2}, \ldots, w_{in}\}$, n being the total number of words in the data; according to the obtained psychographic dictionary, vocabulary accumulation is applied:
Figure BDA0002058852560000041
where $L_u^p$ is the score for each dimension of the psychographic profile.
Further, the second step further comprises: setting different cluster distances, and identifying network user market segments using the DBSCAN density clustering algorithm.
Further, the second step specifically comprises:
a) DBSCAN-based acquisition of consumer psychological segmentation clusters:
DBSCAN clustering is performed on the psychographic scores of network users; within the clusters, consumption preference is predicted according to the segment to which each user belongs;
b) Consumer preference prediction integrating psychological segmentation and deep neural networks:
a deep neural network is used to capture the nonlinear relation between user psychological segments and product preference, and to produce higher-level data representations through complex abstract coding; if there are M users and N products, R denotes the training data set matrix and
Figure BDA0002058852560000042
denotes the test data set matrix; let $r_{ui}$ be the preference of consumer u for product i, and let
Figure BDA0002058852560000043
be the predicted preference score;
on this basis, a DNN architecture is constructed, with an input layer, several hidden layers and an output layer formed by nodes of a specific number of categories; the network is trained by minimizing
Figure BDA0002058852560000044
where
Figure BDA0002058852560000045
g is a linear combination of the node values in the hidden layers of the network and h is an activation function; training uses the mean squared error as the evaluation index, and the weights between the hidden layers are updated dynamically using gradient descent and back propagation.
Further, the third step further comprises: verifying the accuracy of the obtained user segments using deep-neural-network-based prediction of users' purchasing preference.
Further, in the fourth step, the optimal user preference prediction model is selected, and the root mean square error is defined as follows:
Figure BDA0002058852560000046
Figure BDA0002058852560000047
where
Figure BDA0002058852560000048
is the predicted score, $y_{u,i}$ is the true score,
Figure BDA0002058852560000049
is the average score of user u, and n is the size of the test data set.
Another object of the present invention is to provide an optimization control system for network consumer segment, which implements the optimization method for network consumer segment.
Another object of the present invention is to provide an optimization terminal for network consumer segment implementing the optimization method for network consumer segment.
In summary, the advantages and positive effects of the invention are:
The invention optimizes two typical user psychographic segmentation methods (SVS and BFF). Some experimental results are shown in Table 1: for specific segments (ClusterID column) and specific psychographic segmentation variables, the best results are obtained by the corresponding clustering and preference regression algorithms (for example, in the Beauty product category, the smallest preference prediction error, 0.6923, is obtained for the segment "Cluster_2" with BFF segmentation variables, DBSCAN clustering and the DNN component).
Statistical analysis of the results of the optimization method (Table 1) shows that, under the two evaluation indices (RMSE and R²), the optimized methods (BFF and SVS) have higher explanatory power for user preference (R²) and a smaller preference prediction error (RMSE) than the user segmentation of the control group (e.g. Random in Fig. 3, Fig. 4 and Table 2).
Table 1. Effects of the optimization method for network consumer segmentation provided by the embodiment of the present invention (partial)
Figure BDA0002058852560000051
Figure BDA0002058852560000061
Table 2. Comparison of the effects of the optimization method for network consumer segmentation provided in the embodiment of the present invention (differences between segmentation variables in preference prediction)
Figure BDA0002058852560000062
In addition, the invention analyzes and compares the performance of DNN, the core component of the optimization method and system. DNN has a significantly lower preference prediction error (RMSE of about 0.82, as shown in Fig. 5) than support vector regression (SVR), random forest (RF) and linear regression (LR), whose RMSE values are all greater than 0.95 (Fig. 5). Its explanatory power for consumer preference (R² of about 0.14, as shown in Fig. 6) is significantly higher than that of the other methods, whose R² values are below 0.05 (Table 3 and Fig. 6). In Table 3, the DNN component has the smallest prediction error (RMSE) and the greatest explanatory power (R²) relative to the reference components (LR, SVR and RF).
Table 3. Comparison of core component effects in the embodiment of the present invention
Figure BDA0002058852560000063
Figure BDA0002058852560000071
In general, the present invention effectively combines user segmentation with electronic marketing activities such as user preference prediction in a network scenario. User preference prediction is taken as the evaluation criterion for segmentation, and segmentation in turn serves as a concrete means of preference prediction. On the one hand, by identifying the segments with the best preference prediction effect, the method provides a basis for the subsequent manual inspection and use of those segments and for optimizing their substantiality and accessibility. On the other hand, the segmentation result is consistent with the competitiveness of the enterprise and the purpose of the electronic marketing components, which improves the accessibility and operability of the segmentation.
Drawings
Fig. 1 is a flowchart of a method for optimizing network consumer segments according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of an optimization method for network consumer segment according to an embodiment of the present invention.
Fig. 3 is a comparison graph (RMSE) of consumer segment optimization results provided by an embodiment of the present invention.
Fig. 4 is a comparison graph (R²) of consumer segment optimization results provided by an embodiment of the present invention.
Fig. 5 is a comparison graph (RMSE) of the performance of the preference prediction algorithms provided by an embodiment of the present invention.
Fig. 6 is a comparison graph (R²) of the performance of the preference prediction algorithms provided by an embodiment of the present invention.
Fig. 7 is a DNN training and testing diagram provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
The existing subdivision technology depends on manual labels to determine the effectiveness and reliability of the subdivision method. But the agility and accuracy of the method are insufficient, so that the preference behavior of the consumer in the network environment cannot be accurately predicted and dynamically explained.
To solve the above problems, the present invention will be described in detail with reference to the accompanying drawings.
The optimization method for network consumer segmentation provided by the embodiment of the invention comprises the following steps. First, using psychological lexical evidence, two main psychological segmentation dictionaries are constructed from the word-use behavior of consumers, and the psychographic features of the consumers are obtained.
Second, a clustering method is applied with different clustering distances to obtain consumer segment clusters.
Third, a machine learning prediction model of user consumption preference is built within each cluster, and the reliability and validity of the clusters are determined by comparison with reference methods.
Finally, the optimal user preference prediction model is selected, the user clusters in that model are manually inspected, and market segment labels are assigned.
As shown in fig. 1, the method for optimizing network consumer segments provided in the embodiment of the present invention specifically includes:
1) Acquisition of consumer psychographic profiles:
Two lexicons, one for the Schwartz Value Survey (SVS) and one for the Big Five personality model (BFF), are constructed automatically using natural language processing methods to support the acquisition of consumer psychographic profiles. First, two seed word sets are obtained according to the word-use characteristics of vocabulary related to the Schwartz values and the Big Five personality traits. Second, the seed words are expanded using the semantic knowledge in a synonym thesaurus to obtain a candidate word set. Third, associated vocabulary is extracted from a large-scale Internet corpus using a word embedding method to further expand the candidate word set. Finally, words with large deviations are removed by manual inspection to verify the reliability of the lexicons.
2) Optimization of consumer psychological segmentation based on user preference prediction:
The invention sets different clustering distances, identifies the market segments of network consumers using the DBSCAN density clustering algorithm, and verifies the accuracy of the obtained consumer segments using deep-neural-network-based prediction of consumers' purchasing preference.
3) Evaluation of user preference prediction based on psychological segmentation:
The root mean square error (RMSE) is defined as follows, where
Figure BDA0002058852560000081
is the predicted score, $y_{u,i}$ is the true score,
Figure BDA0002058852560000082
is the average score of consumer u, $y_{test}$ is the test data set, and n is the size of the test data set.
Figure BDA0002058852560000083
Figure BDA0002058852560000084
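The equation images are not reproduced in this text, so the following is only a minimal sketch of the two evaluation indices under the usual textbook definitions. It assumes that RMSE is the square root of the mean squared difference between predicted and true scores, and that R² compares the residual sum of squares with the spread of each true score around the mean score of its own user (an assumption based on the "average score of consumer u" defined above). The function and variable names are illustrative, not taken from the patent.

```python
import math
from collections import defaultdict


def rmse(y_true, y_pred):
    """Root mean square error over the n test ratings."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(users, y_true, y_pred):
    """R^2 where each rating is compared with the mean rating of its user.

    Assumption: the baseline is the per-user mean; undefined if every
    rating equals its user's mean (division by zero).
    """
    by_user = defaultdict(list)
    for user, score in zip(users, y_true):
        by_user[user].append(score)
    user_mean = {user: sum(s) / len(s) for user, s in by_user.items()}
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - user_mean[u]) ** 2 for u, t in zip(users, y_true))
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data: two users with two ratings each.
users = ["a", "a", "b", "b"]
y_true = [5.0, 3.0, 4.0, 2.0]
y_pred = [4.0, 3.0, 4.0, 3.0]
error = rmse(y_true, y_pred)
explained = r_squared(users, y_true, y_pred)
```

A lower RMSE and a higher R² both indicate that a segment supports better preference prediction, which is how the two indices are used in the comparisons reported here.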
The step 1) specifically comprises the following steps:
1.1) Automatic acquisition of the psychographic lexicon:
In the invention, the similarity between word vectors generated by word embedding is calculated using cosine similarity. Let
Figure BDA0002058852560000091
where m is the word vector dimension. Based on the word vectors, the invention measures the semantic similarity of words using cosine similarity, calculated as follows:
Figure BDA0002058852560000092
Based on cosine similarity, the Top 10 words most similar to each seed word in the synonym library are calculated using a natural language word embedding algorithm. The method sets 0.45 as the threshold of the similarity value and traverses the word embeddings trained on the Internet corpus to obtain the 10 words most similar to each seed word in the corresponding psychographic candidate lexicon. After threshold filtering, the expanded words are added to the candidate word set. The same process is then repeated on the updated candidate word set until no new words are extracted.
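The expansion loop described above can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the embeddings are a hypothetical in-memory dictionary of toy two-dimensional vectors, `expand_lexicon` is an invented name, and only newly added words are re-expanded in each round, which yields the same final set as re-running the search over the whole updated candidate set.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def expand_lexicon(seeds, embeddings, top_k=10, threshold=0.45):
    """Grow a candidate word set until no new words pass the threshold."""
    candidates = set(seeds)
    frontier = set(seeds)
    while frontier:
        new_words = set()
        for word in frontier:
            if word not in embeddings:
                continue
            scored = sorted(
                ((cosine(embeddings[word], vec), other)
                 for other, vec in embeddings.items() if other != word),
                reverse=True,
            )
            # Keep the top_k nearest words, then filter by the threshold.
            for similarity, other in scored[:top_k]:
                if similarity >= threshold and other not in candidates:
                    new_words.add(other)
        candidates |= new_words
        frontier = new_words
    return candidates


# Hypothetical toy embeddings; real ones would come from a trained model.
embeddings = {
    "kind": np.array([1.0, 0.0]),
    "caring": np.array([0.9, 0.1]),
    "helpful": np.array([0.8, 0.2]),
    "power": np.array([0.0, 1.0]),
    "wealth": np.array([0.1, 0.9]),
}
expanded = expand_lexicon(["kind"], embeddings)
```

With a real embedding model the nearest-neighbour search would normally use the model's own similarity query rather than a full scan; the full scan is used here only to keep the sketch self-contained.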
According to the membership formula for the psychographic dimensions of candidate words, the invention calculates the score of each candidate word in each psychological segmentation dimension, where $w_{ext}$ is the expanded psychological segmentation word set and $w_{seed_1}, w_{seed_2}, \ldots, w_{seed_p}$ are the seed words of each psychographic dimension:
$\mathrm{SVS\_scores}(w_{ext}, w_{seed_1}) = \max\{\mathrm{sim}(w_{ext}, w_{seed_2}), \mathrm{sim}(w_{ext}, w_{seed_3}), \ldots, \mathrm{sim}(w_{ext}, w_{seed_p})\}$
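As a rough illustration of the membership formula, the sketch below scores one expanded word against each dimension by taking the maximum cosine similarity to that dimension's seed words. The dimension names, vectors and function names are hypothetical, and the reading "maximum over the seed words of one dimension" is an interpretation of the formula above.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def membership_scores(word_vec, seeds_by_dimension):
    """Score a word on each dimension as its best match among the seeds."""
    return {
        dimension: max(cosine(word_vec, seed) for seed in seed_vectors)
        for dimension, seed_vectors in seeds_by_dimension.items()
    }


# Hypothetical seed vectors for two illustrative value dimensions.
seeds_by_dimension = {
    "benevolence": [np.array([1.0, 0.0]), np.array([0.9, 0.1])],
    "power": [np.array([0.0, 1.0])],
}
scores = membership_scores(np.array([0.8, 0.2]), seeds_by_dimension)
```

Taking the maximum rather than the mean means a single close seed word is enough to give a candidate word a high score on a dimension.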
1.2) Lexicon-based automatic identification of consumer psychographic profiles:
The invention defines the p-dimensional psychographic profile as $L = \{L_1, L_2, \ldots, L_p\}$ and unstructured data sets such as consumer reviews as $\{r_1, r_2, \ldots, r_m\}$, where m is the total number of user reviews and each $r_i$ is $\{w_{i1}, w_{i2}, \ldots, w_{in}\}$, n being the total number of words in the data. According to the psychographic dictionary obtained in 1.1), vocabulary accumulation is applied, i.e.
Figure BDA0002058852560000093
where $L_u^p$ is the score for each dimension of the psychographic profile.
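The accumulation formula itself is not reproduced in this text. One plausible reading, shown here purely as an assumption, is that the score of a user on a dimension is the number of times words from that dimension's lexicon occur across the user's reviews. The lexicon contents and names below are illustrative.

```python
from collections import Counter


def psychographic_scores(reviews, lexicon):
    """Count lexicon hits per dimension over all reviews of one user.

    reviews: list of token lists, one per review.
    lexicon: mapping from dimension name to a set of words.
    """
    counts = Counter(token for review in reviews for token in review)
    return {
        dimension: sum(counts[word] for word in words)
        for dimension, words in lexicon.items()
    }


# Hypothetical tokenized reviews and a tiny two-dimension lexicon.
reviews = [["great", "kind", "gift"], ["kind", "power", "tool"]]
lexicon = {
    "benevolence": {"kind", "caring"},
    "power": {"power", "wealth"},
}
profile = psychographic_scores(reviews, lexicon)
```

Raw counts grow with review length, which is one reason a later normalization step such as z-scoring is useful before clustering.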
Step 2) specifically comprises the following steps:
2.1) DBSCAN-based acquisition of consumer psychological segmentation clusters:
Unlike K-means, DBSCAN density clustering does not require the number of clusters to be specified in advance and can find clusters of arbitrary shape. DBSCAN clustering is performed on the psychographic scores of network consumers; within the clusters, the invention predicts consumption preference according to the segment to which each consumer belongs.
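A minimal sketch of clustering psychographic scores with several cluster distances is given below, assuming scikit-learn's `DBSCAN` with its `eps` parameter playing the role of the cluster distance. The synthetic two-blob data, the `eps` values and `min_samples=5` are illustrative assumptions, not settings from the patent.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical psychographic scores: two well separated groups of users.
rng = np.random.default_rng(0)
scores = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.05, size=(30, 2)),
    rng.normal(loc=[1.0, 1.0], scale=0.05, size=(30, 2)),
])


def cluster_counts(data, eps_values, min_samples=5):
    """Return the number of clusters found for each cluster distance."""
    result = {}
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(data)
        # Label -1 marks noise points, which belong to no cluster.
        result[eps] = len(set(labels) - {-1})
    return result


counts = cluster_counts(scores, eps_values=[0.2, 2.0])
```

A small distance separates the two groups, while a distance larger than the gap between them merges everything into one cluster, which is why the choice of distance has to be validated against a downstream criterion such as preference prediction error.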
2.2) Consumer preference prediction integrating psychological segmentation and deep neural networks:
Deep neural networks (DNN) are an advanced and rapidly developing artificial intelligence technique with significant advantages over traditional intelligent algorithms. A DNN can effectively capture the nonlinear relation between consumer psychological segments and product preference, and can produce higher-level data representations through complex abstract coding. If there are M consumers and N products, R denotes the training data set matrix and
Figure BDA0002058852560000101
denotes the test data set matrix. Let $r_{ui}$ be the preference of consumer u for product i, and let
Figure BDA0002058852560000102
be the predicted preference score. On this basis, a DNN architecture is constructed, with an input layer, several hidden layers and an output layer formed by nodes of a specific number of categories. The network is trained by minimizing
Figure BDA0002058852560000103
where
Figure BDA0002058852560000104
g is a linear combination of the node values in the hidden layers of the network and h is an activation function (typically a sigmoid or hyperbolic tangent function). Training uses the mean squared error (MSE) as the evaluation index, and the weights between the hidden layers are updated dynamically using gradient descent and back propagation.
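To make the training procedure concrete, here is a small NumPy sketch of a feed-forward network with two tanh hidden layers, trained by full-batch gradient descent and back propagation on mean squared error. It is an illustration under assumptions: the layer sizes, learning rate, number of iterations and the synthetic data are invented, and the patent's own network is built with Keras rather than by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical inputs (psychographic scores) and a nonlinear target.
X = rng.uniform(-1.0, 1.0, size=(200, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

sizes = [3, 8, 8, 1]  # input layer, two hidden layers, one output node
weights = [rng.normal(0.0, 0.5, size=(a, b))
           for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]


def forward(inputs):
    """Return the activations of every layer and the linear output."""
    activations = [inputs]
    for W, b in zip(weights[:-1], biases[:-1]):
        # h(g(x)): activation h applied to a linear combination g.
        activations.append(np.tanh(activations[-1] @ W + b))
    prediction = (activations[-1] @ weights[-1] + biases[-1]).ravel()
    return activations, prediction


def mse(prediction, target):
    return float(np.mean((prediction - target) ** 2))


initial_loss = mse(forward(X)[1], y)
learning_rate = 0.05
for _ in range(500):
    activations, prediction = forward(X)
    # Gradient of the MSE loss with respect to the network output.
    delta = (2.0 / len(y)) * (prediction - y)[:, None]
    for layer in reversed(range(len(weights))):
        grad_W = activations[layer].T @ delta
        grad_b = delta.sum(axis=0)
        if layer > 0:
            # Back-propagate through tanh before this layer is updated.
            delta = (delta @ weights[layer].T) * (1.0 - activations[layer] ** 2)
        weights[layer] -= learning_rate * grad_W
        biases[layer] -= learning_rate * grad_b
final_loss = mse(forward(X)[1], y)
```

The tanh activation is one of the two options named in the text; a sigmoid would only change the activation and its derivative inside the loop.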
In the embodiment of the invention, the overall framework principle of the optimization method of the network consumer subdivision is shown in fig. 2.
The present invention will be further described with reference to effects.
The invention optimizes two typical user psychographic segmentation methods (SVS and BFF). The results show that, measured by the two evaluation indices (RMSE and R²) and compared with unoptimized user segments (e.g. Random in Figs. 3 and 4), the invention can find the network consumer segments that most significantly improve user preference prediction (e.g. BFF and SVS in Figs. 3 and 4).
The invention also analyzes and compares the performance of DNN, the core component of the optimization method and system. DNN has a significantly lower preference prediction error (RMSE of about 0.82, as shown in Fig. 5) than support vector regression (SVR), random forest (RF) and linear regression (LR), whose RMSE values are all greater than 0.95 (Fig. 5). Its explanatory power for consumer preference (R² of about 0.14, as shown in Fig. 6) is significantly higher than that of the other methods, whose R² values are below 0.05 (Fig. 6).
The invention is further described below in connection with the experimental procedures.
The invention constructs positively and negatively correlated e-commerce psychographic dictionaries, namely the SVS-pos, SVS-neg, BFF-pos and BFF-neg dictionaries, and segments e-commerce consumers based on the identified SVS and BFF scores and the DBSCAN clustering algorithm. The invention further uses a DNN to build a regression model that predicts the ratings of the segmented consumers. Experiments are then carried out on Amazon online shopping data.
1) Data set description:
Amazon is one of the largest e-commerce platforms in the world and has accumulated a massive amount of user purchasing behavior data. The Amazon review data set published by McAuley et al. contains 142.8 million product reviews and metadata from amazon.com, collected from May 1996 to July 2014. The present invention selects 5 review data sets from 5 product categories with a "K-core" value of 10, which ensures that every user and every item has at least 10 reviews. The present invention considers 10 reviews (average review length of 189 words) sufficient for consumer and product psychographic inference, compared with 25 tweets. Table 4 gives a detailed description of the data sets.
Table 4. Description of the experimental data sets
Figure BDA0002058852560000111
The following is detailed information of the review sample:
{
"reviewerID":"A2SUAM1J3GDNN3B",
"asin":"0000013714",
"reviewerName":"J.McDonald",
"helpful":[2,3],
"reviewText":"I bought this for my husband who plays the piano.He is having a wonderful time playing these old hymns.The music is at times hard to read because we think the book was published for singing from more than playing from.Great purchase though!",
"overall":5.0,
"summary":"Heavenly Highway Hymns",
"unixReviewTime":1252800000,
"reviewTime":"09 13,2009"
}
2) Experimental procedure:
The data processing is divided into the following steps.
First, the present invention retains the "reviewerID", "asin", "overall", "reviewText" and "summary" fields in each of the above data sets, and merges "reviewText" and "summary" as the word-use behavior from which online psychographic profiles are recognized.
Second, the method uses Python, the machine learning tool Scikit-learn and the Natural Language Toolkit (NLTK) for text preprocessing, including normalization, tokenization, stop-word removal and stemming. Normalization is the process of converting a list of words into a more uniform sequence. Tokenization cuts a given character sequence and a defined document unit into pieces, called tokens. Some very common words in the reviews are of little value in selecting text that meets the needs of the present invention and should be excluded from the vocabulary; these are called stop words. Stemming reduces the inflected and derived forms of a word to a common base form. The invention stems the words in the dictionaries and the reviews with the Lancaster Stemmer of NLTK, removes stop words using the English stop-word list in NLTK, converts English words to lowercase using the lowercase conversion method in Python, and performs Z-score normalization using the scaling method in Scikit-learn. Based on the SVS-pos, SVS-neg, BFF-pos and BFF-neg dictionaries and all the preprocessing steps described above, the present invention calculates psychographic scores by matching the words in the reviews against the words in these dictionaries. The present invention thus obtains the SVS and BFF scores of each Amazon consumer and product.
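The preprocessing chain above can be approximated with the standard library alone, as in the sketch below. It covers lowercasing, tokenization, stop-word removal and z-score scaling; stemming is deliberately omitted, and the stop-word list is a tiny hypothetical stand-in for the NLTK English list. All names are illustrative.

```python
import re
from statistics import mean, pstdev

# Hypothetical miniature stop-word list; NLTK ships a much longer one.
STOP_WORDS = {"the", "a", "is", "for", "my", "this", "i", "and", "to", "of"}


def preprocess(text):
    """Lowercase, tokenize on letters and apostrophes, drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def z_scores(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (value - mu) / sigma for value in values]


tokens = preprocess("I bought this for my husband.")
scaled = z_scores([1.0, 2.0, 3.0])
```

In a fuller pipeline each remaining token would additionally be passed through a stemmer such as NLTK's `LancasterStemmer`, and the same stemmer would be applied to the dictionary entries so that the two sides match.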
Third, for each product category, the present invention runs the DBSCAN clustering algorithm on the consumers' SVS or BFF scores using Scikit-learn to obtain the consumers' positively and negatively correlated psychographic scores and the corresponding psychological segmentation labels. The present invention then constructs a score prediction data set that combines the psychographic scores (independent variables) with the scores given to the products by the consumers (dependent variable). The present invention also constructs, for each product category, a feature set containing random values between 0 and 1 as a control group. For each data set, the parameters of the DNN are optimized by a gradient descent algorithm. The optimal number of epochs of the neural network (one epoch being one pass through the complete training set) is determined by performance on the validation set. For the support vector machine, the present invention uses a validation set to optimize the cost parameter C. The DNN is developed with the Keras interface of Google TensorFlow, and the baselines, namely linear regression (LR), SVM (radial basis function kernel) and random forest (RF), are developed with Scikit-learn. 5-fold cross-validation is used to select the training and test data for each pass, to avoid overfitting of LR, SVM, RF and DNN. Fig. 7 shows an example of how the RMSE of the DNN evolves over the epochs; in Fig. 7, the best epoch is around 15.
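To make the clustering step concrete, the following is a minimal pure-Python sketch of the DBSCAN procedure applied to psychographic score vectors. The text itself uses the Scikit-learn implementation; the function names, the Euclidean distance, and the parameter names `eps` and `min_samples` are assumptions chosen for illustration only.

```python
from math import dist

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels
```

Each consumer would be represented by a vector of psychographic scores; consumers labeled -1 belong to no dense segment, and varying `eps` corresponds to trying different cluster distances.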
Fourth, the importance of the different psychographic sub-dimensions for understanding consumers' online preferences is studied using feature ranking and recursive feature elimination. Recursive feature elimination based on a support vector machine (RFE-SVM) is a commonly used feature selection technique for subsequent regression tasks, especially in consumer preference prediction. Each iteration trains a linear SVM and then deletes one or more "bad" features. The quality of a feature is determined by the absolute value of its weight in the SVM. The features that remain after many iterations are considered the most useful for analyzing the data. By incorporating RFE-SVM into segment-level consumer score prediction, the present invention can explore whether the sub-dimensions of SVS and BFF are effective in predicting and interpreting preferences.
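The elimination loop can be illustrated with a short sketch. Note the substitution: to keep the example dependency-free, a least-squares linear model trained by gradient descent stands in for the linear SVM named in the text, so this shows the recursive elimination scheme rather than RFE-SVM itself. The function names and hyperparameters are hypothetical, and the features are assumed to be on comparable scales so that weight magnitudes can be compared.

```python
def fit_linear(X, y, lr=0.5, steps=300):
    """Least-squares weights (no intercept) by batch gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        # Residuals are computed once per step, then every weight is updated.
        residuals = [sum(wj * xj for wj, xj in zip(w, row)) - t
                     for row, t in zip(X, y)]
        for j in range(d):
            grad = sum(r * row[j] for r, row in zip(residuals, X)) / n
            w[j] -= lr * grad
    return w

def rfe_order(X, y):
    """Return feature indices from first eliminated to last survivor."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > 1:
        sub = [[row[j] for j in remaining] for row in X]
        w = fit_linear(sub, y)
        # The "bad" feature is the one with the smallest absolute weight.
        worst = min(range(len(remaining)), key=lambda k: abs(w[k]))
        eliminated.append(remaining.pop(worst))
    return eliminated + remaining
```

Features that appear late in the returned order are the ones the linear model relies on most; in the setting of the text, these would be the SVS or BFF sub-dimensions most useful for predicting consumer scores.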
Finally, the present invention performs an online consumer preference prediction experiment covering 5 product categories × 4 prediction algorithms (LR, SVM, RF, DNN) × 3 psychographic variables (random, SVS and BFF) × 2 clustering conditions (with or without DBSCAN-based consumer clustering).
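The size of this experimental design can be checked by enumerating it. In the sketch below the category names are placeholders, since the text gives only their number, and "RF" denotes the random forest baseline; the dictionary keys are likewise hypothetical.

```python
from itertools import product

# Placeholder names (assumption): the text states the counts, not the category names.
categories = [f"category_{k}" for k in range(1, 6)]
algorithms = ["LR", "SVM", "RF", "DNN"]
variables = ["random", "SVS", "BFF"]
clustered = [False, True]  # without / with DBSCAN-based consumer clustering

# One dictionary per experimental configuration.
experiments = [
    {"category": c, "algorithm": a, "psychographic": v, "dbscan": d}
    for c, a, v, d in product(categories, algorithms, variables, clustered)
]
```

The full factorial design therefore contains 5 × 4 × 3 × 2 = 120 configurations.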
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. A method for optimizing a network consumer segment, the method comprising:
step one, constructing a psychological subdivision dictionary for the word use behaviors of a user by using psychological lexical evidences, and acquiring the characteristics of a user psychological map; the method specifically comprises the following steps:
1) Automatic acquisition of a psychological map word library:
calculating the similarity between word vectors generated by word embedding by using cosine similarity; let
Figure FDA0003912683110000011
where m is the word vector dimension; based on the word vectors, the cosine distance is used to measure the semantic similarity of words, and the specific calculation formula is as follows:
Figure FDA0003912683110000012
based on cosine similarity, calculating the top 10 words most similar to each seed word in a synonym library by using a natural language word embedding algorithm; setting 0.45 as the threshold of the similarity value, and traversing the word embeddings trained on an Internet corpus to obtain, for the corresponding psychological map candidate word library, the 10 words most similar to each seed word; after filtering by the threshold, adding the expanded words into a candidate word set; based on the updated candidate word set, repeating the same calculation with the natural language word embedding algorithm until no new word is extracted;
calculating the score of the candidate word in each psychological subdivision dimension according to the membership formula of the psychological map dimension of the candidate word, wherein w_ext is a word of the expanded psychological subdivision word set, and w_seed1, w_seed2, ..., w_seedp are the seed words of each dimension of each psychological map:
SVS_scores(w_ext, w_seed1) = Max{sim(w_ext, w_seed2), sim(w_ext, w_seed3), ..., sim(w_ext, w_seedp)};
2) The method comprises the following steps of automatically identifying a user psychological map based on a word bank:
the p-dimensional psychological map is defined as L = {L_1, L_2, ..., L_p}; the unstructured user review data set is {r_1, r_2, ..., r_m}, where m is the total number of user reviews, each r_i is {w_i1, w_i2, ..., w_in}, and n is the total number of words in the data; according to the obtained psychological map dictionary, adopting
Figure FDA0003912683110000013
vocabulary accumulation, where L_u^p is the score for each dimension of the psychological map;
step two, selecting different cluster distances by using a clustering method to obtain user segment clusters;
step three, constructing a machine learning prediction model facing user consumption preference in each cluster, and determining the reliability and effectiveness of the cluster through comparison with a reference method;
and step four, selecting an optimal user preference prediction model, manually inspecting the user cluster in the model, and endowing market subdivision labels.
2. The method for optimizing network consumer segments of claim 1, wherein step one further comprises: automatically constructing a Schwartz Value Survey (SVS) word library and a Big Five personality model word library by using a natural language processing method, wherein the word libraries are used for supporting the acquisition of the psychological map of the consumer.
3. The method of optimizing network consumer segments of claim 1,
the second step further comprises: setting different cluster distances, and identifying the network user market segmentation by using a DBSCAN density clustering algorithm.
4. The method of optimizing network consumer segments of claim 1,
the second step further comprises:
a) Obtaining user psychology fine clustering based on DBSCAN:
performing DBSCAN clustering on the scores of the psychological map of the network consumers; in the cluster, according to the subdivision group where the user is located, consumption preference is predicted;
b) Integrating psychological segmentation with user preference prediction for deep neural networks:
capturing a nonlinear user psychological subdivision-product preference relation by using a deep neural network, and performing higher-level data representation by using complex abstract coding; if there are M users and N products, R represents the training data set matrix,
Figure FDA0003912683110000021
representing a test dataset matrix; let r be ui For the preference of user u for product i,
Figure FDA0003912683110000022
scoring the predicted preference;
on this basis, a DNN architecture is constructed, comprising one input layer, a plurality of hidden layers, and an output layer composed of nodes of a specific number of categories; by minimizing
Figure FDA0003912683110000023
Wherein
Figure FDA0003912683110000024
g is a linear combination of node values in the hidden layer of the network, h is an activation function, the network is trained based on the evaluation index least mean square error, and weights among multiple hidden layers are dynamically updated by using gradient descent and back propagation.
5. The method for optimizing network consumer segments of claim 1, wherein step three further comprises: and verifying the accuracy of the obtained user sub-groups in the purchase preference prediction of the user based on the deep neural network.
6. The method for optimizing network consumer segments according to claim 1, wherein in step four, an optimal user preference prediction model is selected, and the root mean square error is defined as follows:
Figure FDA0003912683110000031
Figure FDA0003912683110000032
wherein
Figure FDA0003912683110000033
is the predicted score, y_{u,i} is the true score,
Figure FDA0003912683110000034
is the average score for user u, and n is the size of the test data set.
7. An optimization control system for network consumer segment implementing the optimization method for network consumer segment of claim 1.
8. An optimization terminal for network consumer segment implementing the method for optimizing network consumer segment of claim 1.
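For illustration only, the word library expansion recited in claim 1 (cosine similarity between word vectors, the 10 most similar words per word, a 0.45 threshold, repetition until no new word is extracted, and a membership score taken as the maximum similarity to the seed words of a dimension) can be sketched as follows. The function names, the in-memory `embeddings` dictionary, and the use of "greater than or equal to" for the threshold test are assumptions of this sketch and form no part of the claims.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_lexicon(seeds, embeddings, threshold=0.45, top_k=10):
    """Grow the seed set until a full pass adds no new word."""
    lexicon = set(seeds)
    frontier = set(seeds)  # only newly added words need to be queried again
    while frontier:
        added = set()
        for word in frontier:
            if word not in embeddings:
                continue
            scored = sorted(
                ((cosine(embeddings[word], vec), other)
                 for other, vec in embeddings.items() if other != word),
                reverse=True,
            )
            # Keep the top_k nearest words, then filter by the threshold.
            for sim, other in scored[:top_k]:
                if sim >= threshold and other not in lexicon:
                    added.add(other)
        lexicon |= added
        frontier = added
    return lexicon

def dimension_score(word, seed_words, embeddings):
    """Membership score: highest similarity to any seed word of the dimension."""
    return max(cosine(embeddings[word], embeddings[s]) for s in seed_words)
```

Because a word added in one pass can pull in further words in the next, the expanded set may contain words that are below the threshold with respect to every original seed word; their `dimension_score` reflects this.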
CN201910398178.9A 2019-05-14 2019-05-14 Optimization method and optimization system for network consumer subdivision Active CN110046943B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910398178.9A CN110046943B (en) 2019-05-14 2019-05-14 Optimization method and optimization system for network consumer subdivision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910398178.9A CN110046943B (en) 2019-05-14 2019-05-14 Optimization method and optimization system for network consumer subdivision

Publications (2)

Publication Number Publication Date
CN110046943A CN110046943A (en) 2019-07-23
CN110046943B true CN110046943B (en) 2023-01-03

Family

ID=67281967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910398178.9A Active CN110046943B (en) 2019-05-14 2019-05-14 Optimization method and optimization system for network consumer subdivision

Country Status (1)

Country Link
CN (1) CN110046943B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110941953B (en) * 2019-11-26 2023-08-01 华中师范大学 Automatic identification method and system for network false comments considering interpretability
CN111163057B (en) * 2019-12-09 2021-04-02 中国科学院信息工程研究所 User identification system and method based on heterogeneous information network embedding algorithm
CN111626822A (en) * 2020-05-26 2020-09-04 山东能源数智云科技有限公司 Bulk commodity consultation transaction chain based on internet and big data
CN112017062A (en) * 2020-07-15 2020-12-01 北京淇瑀信息科技有限公司 Resource limit distribution method and device based on guest group subdivision and electronic equipment
CN113781128A (en) * 2021-10-15 2021-12-10 北京明略软件系统有限公司 High-potential consumer identification method, system, electronic device, and medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902566B (en) * 2012-12-26 2018-04-24 中国科学院心理研究所 A kind of personality Forecasting Methodology based on microblog users behavior
US20170103402A1 (en) * 2015-10-13 2017-04-13 The Governing Council Of The University Of Toronto Systems and methods for online analysis of stakeholders
CN105760502A (en) * 2016-02-23 2016-07-13 常州普适信息科技有限公司 Commercial quality emotional dictionary construction system based on big data text mining

Also Published As

Publication number Publication date
CN110046943A (en) 2019-07-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant