CN112070543B

CN112070543B - Method for detecting comment quality in E-commerce website

Info

Publication number: CN112070543B
Application number: CN202010944581.XA
Authority: CN
Inventors: 刘嘉辉; 李喆
Original assignee: Harbin University of Science and Technology
Current assignee: Harbin University of Science and Technology
Priority date: 2020-09-10
Filing date: 2020-09-10
Publication date: 2023-04-07
Anticipated expiration: 2040-09-10
Also published as: CN112070543A

Abstract

The invention provides a comment evaluation model based on individual and group commentators, together with a fuzzy clustering method, for detecting the quality of comments on an e-commerce website. The method comprises the following steps: extracting relevant features of commentators and merchants; normalizing the feature values, which lie in a limited range, to the interval 0-1 with a series convergence model; constructing feature vectors and training set data from the target value set; establishing trueness logistic regression models for commentators and merchants from the features and the training set; dividing the commentators into individuals and groups according to a criterion and constructing an individual and a group comment evaluation model respectively; iterating the fuzzy C-means clustering algorithm to obtain the membership degrees of each comment in the real class and the false class; and comparing the membership degrees to detect comment quality. By building a more complete commentator-based evaluation model from both the individual and the group perspective, and by using fuzzy clustering analysis, the invention improves the authenticity and intuitiveness of the comment quality detection results and meets the needs of e-commerce comment quality detection.

Description

Method for detecting comment quality in E-commerce website

Technical Field

The invention relates to a comment evaluation model and a fuzzy clustering method based on individual and group comments of commentators, in particular to a method for detecting comment quality of an electronic commerce website, and belongs to the field of computer technology application.

Background

With the development and spread of internet technology, an era dominated by information and data has gradually entered ordinary people's lives, and shopping through online electronic transactions has become widely popular. As the main channel for online shopping, e-commerce websites have accumulated a large amount of product comment information from purchasing users. This comment information is one form of online word of mouth, and its quality has an obvious influence on users' purchasing decisions. However, as comments accumulate and commercial interests come into play, a large number of meaningless and even false comments have appeared on these websites. Detecting comment quality on e-commerce websites by technical means, separating the false from the true and improving the reputation of online shopping platforms, has become an urgent task.

The purpose of detecting e-commerce website comments is to identify the false comments among them. A false comment is one in which someone, driven by an improper motive such as personal gain, fabricates an untrue consumption experience in order to inflate or defame the quality of the evaluated object and thereby mislead consumers' purchasing behavior. Examples of false comments on an e-commerce website are as follows:

Case 1: On a certain e-commerce platform, user A has purchased goods and published a number of real comments. User A then temporarily accepts a commission from a merchant to make fake purchases and posts some false comments. User A is referred to herein as an individual false commentator: most of user A's comments are real and only a small proportion are false. False comments posted by such users are temporary and concealed.

Case 2: On an e-commerce platform, merchant B pays to hire a group of "water army" users C (paid posters) to post fake evaluations of a target commodity, and these users generate a certain number of false comments within a specified time period. The "water army" users C are referred to herein as group false commentators: most or even all of their comments are false. False comments posted by such users are produced promptly and are easily exposed.

Currently, research on false comment detection can start from two types of detection object: the comment text and the commentator. Detection methods based on the comment text mostly involve vocabulary, sentence patterns, grammar and sentiment. They target comment content well, but the complex and changeable collocation of Chinese vocabulary, its rich grammatical structures and its semantic associations limit their generality and make them subjective, so their actual detection performance is poor. Detection methods based on the commentator are therefore particularly important. Following the cases above, commentators are divided into individuals and groups according to the way the false comments are generated, so that features for comment evaluation can be extracted effectively and an evaluation model can be established reasonably.

To extract relevant features, the evaluation mechanism of e-commerce comments is analyzed and a mutual scoring model between commentators and merchants is introduced; the calculated commentator score and merchant score are taken as the first class of features.

In feature normalization, the distribution values in the feature set must satisfy the requirements of both statistical analysis and computer processing, so the values in the feature set need to be normalized. Because a limited feature set has an upper and a lower bound and thus meets the basic conditions of a series model, the convergence property of a series of functions is used to normalize values in the feature range to the interval 0-1, which makes it convenient to mine data that meet a given target value from a given data set.

The trueness of users and merchants is calculated from the features. Logistic regression is a generalized linear model that introduces nonlinearity into linear regression through the sigmoid function. It is commonly used for binary classification, can express the probability that an event occurs when certain constraints are satisfied, and is widely used for discrimination and prediction. A logistic regression equation can therefore be established from the relevant features to calculate the trueness of users and merchants.

Cluster analysis is a multivariate statistical method that classifies objects objectively and quantitatively determines how closely data are related. When the trueness of users and merchants is used as the classification criterion, comment quality has no strictly defined attribute boundaries, so forcing each comment into exactly one category is an unreasonable assumption. A fuzzy dynamic clustering method is therefore adopted, which relaxes this hard constraint through class membership: the closer a comment is to a class, the higher its membership degree in that class. This gives the quality detection results better authenticity, interpretability and intuitiveness.

The fuzzy C-means clustering algorithm obtains the membership degree of each data point to each class center by optimizing an objective function, and thereby determines the class to which each data point belongs. As a theoretically well-developed fuzzy dynamic clustering algorithm, fuzzy C-means is widely used at present.

The basic idea of the method for detecting comment quality in an e-commerce website is as follows. First, the first class of features is calculated with the mutual scoring model of commentators and merchants, other relevant features of commentators and merchants are extracted, the feature sets of limited range are normalized to the interval 0-1 with the series normalization convergence model, and a logistic regression model is established from the features to express the trueness of a commentator (or merchant). Second, commentators are divided into individuals and groups according to the criterion that commentators who have jointly commented on three or more merchants form one group. Then an individual commentator evaluation model is constructed: from the perspective of the individual commentator, merchants are sampled according to the price intervals of the purchased commodities, the trueness of the individual commentator and of the corresponding merchants is calculated with the logistic regression model, a sample feature evaluation matrix is constructed, and the fuzzy C-means clustering algorithm is iterated to obtain the optimal class membership function of the comments. A group commentator evaluation model is constructed likewise: from the perspective of the merchant, group commentators are sampled according to the time intervals of the comments on sold products, the trueness of the merchant and of the corresponding group commentators is calculated with the logistic regression model, a sample feature evaluation matrix is constructed, and the fuzzy C-means clustering algorithm is iterated to obtain the optimal class membership function of the comments. Finally, comment quality is judged from the class membership degrees of the individual comments and the group comments.

Disclosure of Invention

(I) Technical problem to be solved

In order to detect comment quality on e-commerce websites and identify the false comments among them, the invention establishes a commentator-based evaluation model that combines a mutual scoring method, a series convergence method, a logistic regression method and a fuzzy clustering method. First, a mutual scoring model is established and the scores of commentators and merchants are calculated as the first class of features; other relevant features of commentators and merchants are extracted, quantized and standardized to obtain dimensionless values that are convenient to analyze and compute. Because the quantized feature sets must be normalized to the interval 0-1 and different feature sets have different value ranges, a series normalization statistical analysis convergence model is adopted, in which feature sets with different ranges are normalized by series models with different convergence rates. Second, maximum likelihood estimation is used to obtain a set of feature weights from the training set data, and the trueness of the commentators and merchants to be detected is calculated by substituting into the logistic regression equation. Then, considering the different ways in which false comments are generated, a classification criterion is proposed that divides commentators into individual commentators and group commentators, and two comment quality evaluation models are established.
Further, the individual commentator evaluation model is constructed: the trueness of the individual commentator is calculated, the purchased commodities are divided by price interval into a commodity number sequence, the commodities in the sequence are randomly sampled according to a Poisson distribution with parameter λ, and the mean trueness of the corresponding merchants is calculated, giving the trueness of the individual commentator and of the corresponding merchants. The group commentator evaluation model is constructed: the trueness of the merchant is calculated, the received comments are divided by time interval into a group comment number sequence, the comments in the sequence are randomly sampled according to a Poisson distribution with parameter μ, and the mean trueness of the corresponding group commentators is calculated, giving the trueness of the merchant and of the corresponding group commentators. Finally, the trueness of the individual commentator and the trueness of the corresponding merchants are combined into an individual comment quality feature vector, and the trueness of the merchant and the trueness of the corresponding group commentators are combined into a group comment quality feature vector. Individual and group feature evaluation matrices are established, the fuzzy C-means clustering algorithm is iterated on each to obtain the optimal class center matrix and class membership matrix, comment quality is distinguished according to the class membership degrees, and false comments are identified, so that the comment quality of the e-commerce website can be detected.

(II) Technical scheme

In order to realize the detection of the comment quality of the e-commerce website and identify false comments in the e-commerce website, the invention provides a method for detecting the comment quality in the e-commerce website, which comprises the following steps:

(1) Calculating a first class of characteristics according to a mutual grading model of the commentator and the merchant, extracting other relevant characteristics of the commentator and the merchant, normalizing the characteristic set to a (0, 1) interval by adopting a series normalization statistical analysis convergence model, establishing a target characteristic vector, and marking out a training data set.

(2) Establishing trueness logistic regression models of the commentators and the merchants from the features, using the labeled training set data.

(3) Dividing the commentators into individual commentators and group commentators according to a classification criterion, and establishing the corresponding comment quality evaluation models.

(4) Individual evaluation model: sampling merchants from the perspective of the individual commentator according to the price intervals of the purchased commodities, and calculating the trueness of the individual commentator and of the corresponding merchants with the logistic regression equation. Group evaluation model: sampling group commentators from the perspective of the merchant according to the time intervals of the received comments, and calculating the trueness of the merchant and of the corresponding group commentators with the logistic regression equation.

(5) Establishing trueness feature evaluation matrices for individuals and groups, and iterating the fuzzy C-means clustering algorithm on each to obtain the final real-class and false-class membership degrees of the comments, so as to detect comment quality and identify the false comments among them.

A method for detecting comment quality in an E-commerce website comprises the following steps:

The scoring mechanism of the merchant for the commentator is as follows: calculation rules are defined that reflect three behaviors of the commentator, namely registration (remote registration or frequent registration), browsing (browsing similar commodities before purchase) and commenting (evaluating commodities after purchase).

In each transaction, the merchant marks each of the three behaviors of the commentator, 1 if the behavior occurs and 0 if it does not, which gives 8 possible combinations. The combinations 001, 101 and 111 are defined as suspicious transactions and the rest as normal transactions.

On an X-point scale, the merchant score is calculated from the probability of normal transactions among n transactions and taken as the first class of feature; other relevant features of the commentator are extracted, and the commentator feature vector is constructed.

Specifically, the commentator features are denoted $u = (1, \text{ufeature}_1, \text{ufeature}_2, \ldots, \text{ufeature}_k)$, where the parameters in the vector are the quantized feature values of the commentator, including "merchant score", "registration time", "purchase rate", "comment quantity" and so on;

The scoring mechanism of the commentator for the merchant is as follows: calculation rules are defined that reflect three attributes of the merchant, namely goods (whether they match the description), service (service attitude) and logistics (logistics situation).

In each transaction, the commentator marks each of the three attributes of the merchant, 1 for good and 0 for poor, which gives 8 possible combinations. The combinations 000, 001, 010 and 100 are defined as bad transactions and the rest as good transactions.

On an X-point scale, the commentator score is calculated from the probability of good transactions among n transactions and taken as the first class of feature; other relevant features of the merchant are extracted, and the merchant feature vector is constructed.

Specifically, the merchant features are denoted $m = (1, \text{mfeature}_1, \text{mfeature}_2, \ldots, \text{mfeature}_k)$, where the parameters in the vector are the quantized feature values of the merchant, including "commentator score", "registration time", "rating ratio", "sales quantity" and so on;

The distribution values in the feature sets of commentators and merchants are screened as reference values for statistical analysis, so the values in the feature sets need to be normalized before the processed data are input into the logistic regression model;

Because a feature set has an upper and a lower bound and thus meets the basic conditions of a series model, a convergence model $S_n(x) = x^n / (x^n + \mathrm{Range})$ is established according to the uniform convergence property of a series of functions. Range is defined as the maximum value of the feature range, the argument $x$ is a feature value within that range, and $S_n(x)$ is the partial-sum sequence of $S(x)$, which converges uniformly to 1 on the argument interval $(0, \mathrm{Range})$;

An initial value of n is then selected according to the value of Range so that all feature values in the range are mapped into the normalized interval (0, 1), which gives the processed feature data;
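The convergence model can be illustrated with a minimal Python sketch. The function name `series_normalize` and the sample inputs are hypothetical; only the formula $S_n(x) = x^n / (x^n + \mathrm{Range})$ comes from the description, and the choice of n for a given Range is left to the caller.

```python
def series_normalize(x, n, value_range):
    """Map a feature value x in (0, Range) into (0, 1) with S_n(x) = x**n / (x**n + Range).

    n controls how quickly the curve rises; value_range is the maximum of the feature range.
    """
    if x < 0 or value_range <= 0:
        raise ValueError("x must be non-negative and value_range must be positive")
    xn = x ** n
    return xn / (xn + value_range)


# Hypothetical feature with range (0, 10), normalized with n = 2.0
values = [1, 5, 10]
normalized = [series_normalize(v, 2.0, 10) for v in values]
print(normalized)
```

The mapping is monotonically increasing in x, so the ordering of feature values is preserved while every result stays strictly below 1.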

The normalized target feature values are combined into target feature vectors, namely the feature vector of each commentator and each merchant;

A statistical analysis labeling method is used to mine the feature vectors that satisfy the conditions from the given set of feature vectors, and their categories are labeled to form the training set for logistic regression.

In the commentator training set for logistic regression, the independent variables are the quantized and normalized commentator features, and the dependent variable is the labeled commentator category, which obeys a Bernoulli distribution; a real commentator is labeled 1 and a false commentator is labeled 0;

The feature matrix of the commentator training set is denoted $U = \{u_1, u_2, \ldots, u_n\}$, where $u_i$ is the $(k+1)$-dimensional feature vector of the i-th sample, and the labeling result of the training set is represented by an n-dimensional 0/1 vector;

A $(k+1)$-dimensional regression coefficient vector $\alpha$ is obtained using the maximum likelihood estimation principle and the batch gradient descent method, and the trueness of a commentator is expressed as $\mathrm{URE} = 1/[1+\exp(-\alpha^{T} u)]$.

In the merchant training set for logistic regression, the independent variables are the quantized and normalized merchant features, and the dependent variable is the labeled merchant category, which obeys a Bernoulli distribution; a real merchant is labeled 1 and a false merchant is labeled 0;

The feature matrix of the merchant training set is denoted $M = \{m_1, m_2, \ldots, m_n\}$, where $m_i$ is the $(k+1)$-dimensional feature vector of the i-th sample, and the labeling result of the training set is represented by an n-dimensional 0/1 vector;

A $(k+1)$-dimensional regression coefficient vector $\beta$ is obtained using the maximum likelihood estimation principle and the batch gradient descent method, and the trueness of a merchant is expressed as $\mathrm{MRE} = 1/[1+\exp(-\beta^{T} m)]$.

According to the criterion that commentators who have jointly commented on three or more merchants form one group, the commentators to be detected are divided into individual commentators and group commentators.
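One possible reading of this criterion is sketched below in Python. It assumes that each commentator is described by the set of merchants they have commented on, that two commentators are linked when they share at least three merchants, and that linked commentators are merged transitively into one group. The transitive merging, the function name and the sample data are assumptions made for illustration and are not stated in the description.

```python
from itertools import combinations


def split_individuals_and_groups(reviewed, min_shared=3):
    """reviewed maps a commentator id to the set of merchant ids they commented on."""
    parent = {c: c for c in reviewed}

    def find(c):
        # Union-find root lookup with path halving
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Link every pair of commentators that shares at least min_shared merchants
    for a, b in combinations(reviewed, 2):
        if len(reviewed[a] & reviewed[b]) >= min_shared:
            parent[find(a)] = find(b)

    clusters = {}
    for c in reviewed:
        clusters.setdefault(find(c), set()).add(c)

    groups = [members for members in clusters.values() if len(members) > 1]
    individuals = {next(iter(members)) for members in clusters.values() if len(members) == 1}
    return individuals, groups


# Hypothetical data: "a" and "b" share merchants 2, 3 and 4
reviewed = {"a": {1, 2, 3, 4}, "b": {2, 3, 4}, "c": {1, 9}, "d": {3, 4, 5}}
individuals, groups = split_individuals_and_groups(reviewed)
print(individuals, groups)
```

The pairwise comparison is quadratic in the number of commentators, which is acceptable for a sketch but would need indexing by merchant for a large platform.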

Individual commentator evaluation model: an individual commentator is referred to as an individual, and the details are as follows:

(1) Calculating the individual trueness URE with the logistic regression model;

(2) Dividing the number of all commodities purchased by the individual according to the specified price intervals $(p_1, p_2, \ldots, p_n)$ to obtain the initial commodity number sequence $(c_1, c_2, \ldots, c_n)$;

(3) The initial commodity number sequence obeys a Poisson distribution with parameter c_avg. random('poisson', c_avg, 1, n) is used to generate a sequence of Poisson-distributed random numbers, where 'poisson' specifies Poisson distribution sampling and c_avg is the mean of the distribution; the sampled commodity number sequence is denoted $(sc_1, sc_2, \ldots, sc_n)$;

(4) Calculating, with the logistic regression model, the mean trueness $(\mathrm{MRE\_avg}_1, \mathrm{MRE\_avg}_2, \ldots, \mathrm{MRE\_avg}_n)$ of the merchants corresponding to the sampled commodity number sequence.
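The sampling in step (3) can be sketched in Python using only the standard library. The description calls random('poisson', c_avg, 1, n); the sketch below substitutes Knuth's multiplication method for that call, caps each draw at the actual count of its interval (an interpretation of the requirement that a sampled number not exceed the data itself), and uses a hypothetical count sequence. All function and variable names are illustrative.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return k
        k += 1


def sample_count_sequence(counts, rng):
    """Sample each interval's count with Poisson draws whose parameter is the sequence mean."""
    lam = sum(counts) / len(counts)           # c_avg in the description
    draws = [poisson_draw(lam, rng) for _ in counts]
    # Assumption: a sampled number may not exceed the real count of its interval
    return [min(c, d) for c, d in zip(counts, draws)]


counts = [5, 6, 8, 10, 4]                     # hypothetical initial commodity number sequence
rng = random.Random(0)                        # fixed seed so the sketch is repeatable
sampled = sample_count_sequence(counts, rng)
print(sampled)
```

The same routine applies unchanged to the group comment number sequence, with r_avg taking the place of c_avg.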

Group commentator evaluation model: group commentators are referred to as a group, and the details are as follows:

(1) Calculating the merchant trueness MRE with the logistic regression model;

(2) Dividing the number of all comments received by the merchant according to the specified time intervals $(t_1, t_2, \ldots, t_n)$, ensuring that each interval contains the comments of only one group, to obtain the initial group comment number sequence $(r_1, r_2, \ldots, r_n)$;

(3) The initial group comment number sequence obeys a Poisson distribution with parameter r_avg. random('poisson', r_avg, 1, n) is used to generate a sequence of Poisson-distributed random numbers, where 'poisson' specifies Poisson distribution sampling and r_avg is the mean of the distribution; the sampled group comment number sequence is denoted $(sr_1, sr_2, \ldots, sr_n)$;

(4) Calculating, with the logistic regression model, the mean trueness $(\mathrm{URE\_avg}_1, \mathrm{URE\_avg}_2, \ldots, \mathrm{URE\_avg}_n)$ of the groups corresponding to the sampled group comment number sequence.

The individual comment sample feature evaluation matrix is denoted $X = \{x_1, x_2, \ldots, x_n\}$, where $x_j = (x_{j,\mathrm{URE}}, x_{j,\mathrm{MRE\_avg}})$ is the trueness of the j-th individual and of its corresponding class of merchants;

The group comment sample feature evaluation matrix is denoted $Y = \{y_1, y_2, \ldots, y_n\}$, where $y_j = (y_{j,\mathrm{MRE}}, y_{j,\mathrm{URE\_avg}})$ is the trueness of the j-th merchant and of its corresponding class of groups;

The algorithm for detecting individual comment quality applies in the same way to detecting group comment quality, and is as follows:

Step_1: randomly select feature evaluation vectors $c_1$ and $c_2$ as the class center of real comments and the class center of false comments respectively, dividing the samples into two classes. The Euclidean distance between sample $x_j$ and class center $c_i$ is $d_{ij} = \lVert x_j - c_i \rVert$, $u_{ij}$ denotes the membership of the j-th sample in the i-th class, U denotes the fuzzy classification matrix, and V denotes the class center matrix;

Step_2: the objective function and constraint of fuzzy C-means clustering are as follows:

$J(U,V) = u_{11}^{m} d_{11}^{2} + u_{12}^{m} d_{12}^{2} + \cdots + u_{1n}^{m} d_{1n}^{2} + u_{21}^{m} d_{21}^{2} + u_{22}^{m} d_{22}^{2} + \cdots + u_{2n}^{m} d_{2n}^{2}$

$u_{1j} + u_{2j} = 1, \quad j = 1, 2, \ldots, n$

Step_3: derive the membership function and the class centers:

$u_{ij} = \left[ (d_{ij}/d_{1j})^{2/(m-1)} + (d_{ij}/d_{2j})^{2/(m-1)} \right]^{-1}, \quad i = 1, 2; \; j = 1, 2, \ldots, n$

$c_i = \dfrac{u_{i1}^{m} x_1 + u_{i2}^{m} x_2 + \cdots + u_{in}^{m} x_n}{u_{i1}^{m} + u_{i2}^{m} + \cdots + u_{in}^{m}}, \quad i = 1, 2$

Step_4: take the threshold $\varepsilon = 0.001$ and $m = 2$; when $\lVert \Delta c_i \rVert < \varepsilon$ is satisfied, the iteration stops and the algorithm outputs the optimal fuzzy classification matrix U and class center matrix V;

From the fuzzy classification matrix U, the membership degree of each comment in the real class and in the false class is known. These values determine for which class of merchants an individual's comments are judged to be false, and likewise which class of group's comments on a merchant are judged to be false.
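Steps 1 to 4 can be sketched in plain Python for the two-class case. The sample points below are hypothetical (URE, MRE_avg) pairs, the initial centers are taken from the samples rather than drawn at random so that the run is repeatable, and the iteration cap `max_iter` is an added safeguard; none of these choices come from the description. Note that `memberships[j][i]` in the code corresponds to $u_{ij}$ in the text.

```python
import math


def fuzzy_c_means(samples, centers, m=2.0, eps=1e-3, max_iter=100):
    """Two-step FCM iteration: update memberships, then update class centers."""
    centers = [tuple(c) for c in centers]
    power = 2.0 / (m - 1.0)
    for _ in range(max_iter):
        memberships = []
        for x in samples:
            dists = [math.dist(x, c) for c in centers]
            if any(d == 0.0 for d in dists):
                # A sample that coincides with a center belongs fully to that center
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                row = [h / sum(hits) for h in hits]
            else:
                # u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1))
                row = [1.0 / sum((di / dk) ** power for dk in dists) for di in dists]
            memberships.append(row)

        new_centers = []
        for i in range(len(centers)):
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            new_centers.append(tuple(
                sum(w * x[d] for w, x in zip(weights, samples)) / total
                for d in range(len(samples[0]))
            ))

        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < eps:        # stop when every center moves less than epsilon
            break
    return memberships, centers


# Hypothetical (URE, MRE_avg) pairs: three high-trueness and three low-trueness samples
samples = [(0.90, 0.85), (0.80, 0.90), (0.85, 0.80),
           (0.20, 0.30), (0.15, 0.25), (0.30, 0.20)]
memberships, centers = fuzzy_c_means(samples, centers=[samples[0], samples[3]])
for x, row in zip(samples, memberships):
    print(x, [round(u, 3) for u in row])
```

In this sketch the class whose center has the higher trueness values would be read as the real class; each sample is then assigned to whichever class holds its larger membership.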

The comment quality of the e-commerce website is detected by combining the two models.

(III) Advantageous effects

The beneficial effects of the method are as follows: a mutual scoring model of commentators and merchants is established to calculate the first class of features; other relevant features of commentators and merchants are then extracted and the feature values are preprocessed with the series normalization convergence model; two comment quality evaluation models, based on individual commentators and on commentator groups, are established through regression and sampling; and comment quality is distinguished by fuzzy clustering analysis according to the trueness features of commentators and merchants. The relevant features can thus be processed uniformly and reasonably to build a more practical model, and solving the model by fuzzy partitioning makes the detection results more authentic, interpretable and intuitive.

Drawings

FIG. 1 is a series normalized statistical analysis convergence model image of a feature set.

Fig. 2 is a flow chart of a method for detecting review quality in an e-commerce website.

Detailed Description

Embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

Example 1: establishing a mutual scoring model to calculate the first kind of characteristics

The merchant scores the commentator: calculation rules are defined that reflect three behaviors of the commentator, namely registration (remote registration or frequent registration), browsing (browsing similar commodities before purchase) and commenting (evaluating commodities after purchase).

In each transaction, the merchant marks each of the three behaviors of the commentator, 1 if the behavior occurs and 0 if it does not, which gives 8 possible combinations. The combinations 001, 101 and 111 are defined as suspicious transactions and the rest as normal transactions.

Taking an X-point scale and n transactions as an example, if there are k suspicious transactions, the merchant score is calculated as $X(n-k)/n$.

The commentator scores the merchant: calculation rules are defined that reflect three attributes of the merchant, namely goods (whether they match the description), service (service attitude) and logistics (logistics situation).

In each transaction, the commentator marks each of the three attributes of the merchant, 1 for good and 0 for poor, which gives 8 possible combinations. The combinations 000, 001, 010 and 100 are defined as bad transactions and the rest as good transactions.

Taking an X-point scale and n transactions as an example, if there are k bad transactions, the commentator score is calculated as $X(n-k)/n$.
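The mutual scoring rules of Example 1 can be sketched in Python as follows. The flagged bit patterns are taken from the text; the score formula, full marks X scaled by the proportion of unflagged transactions, is an assumption inferred from the statement that the score follows the probability of normal (or good) transactions, and the function name and transaction records are hypothetical.

```python
# Bit patterns flagged in the description (one character per behavior or attribute)
SUSPICIOUS = {"001", "101", "111"}           # merchant's marks on a commentator
BAD = {"000", "001", "010", "100"}           # commentator's marks on a merchant


def mutual_score(records, flagged, full_marks):
    """Assumed score: full_marks * (n - k) / n, where k counts flagged transactions."""
    n = len(records)
    if n == 0:
        raise ValueError("at least one transaction is required")
    k = sum(1 for record in records if record in flagged)
    return full_marks * (n - k) / n


# Hypothetical history of five transactions on a 5-point scale
history = ["000", "001", "110", "101", "010"]
merchant_score = mutual_score(history, SUSPICIOUS, 5)    # score a merchant gives a commentator
commentator_score = mutual_score(history, BAD, 5)        # score a commentator gives a merchant
print(merchant_score, commentator_score)
```

Both directions of scoring share the same routine; only the set of flagged patterns differs.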

Example 2: Normalizing the feature set and constructing the feature vectors and the training set.

Take the "registration time" and "number of comments" of a commentator as an example. Registration time, as a limited feature set, is denoted $x_1$ with value range (0, 10), and $S_n(x_1)$ is its normalized value within that range; the number of comments, as a limited feature set, is denoted $x_2$ with value range (0, 100), and $S_n(x_2)$ is its normalized value within that range.

Establish the convergence model $S_n(x) = x^n / (x^n + \mathrm{Range})$. When $x = x_1$, take n = 2.0 and Range = 10; when $x = x_2$, take n = 1.5 and Range = 100. As can be seen from FIG. 1, $S_n(x_1)$ converges to the interval 0 to 1 over the value range of the $x_1$ feature set, and $S_n(x_2)$ likewise converges to the interval 0 to 1 over the value range of the $x_2$ feature set.

Similarly, for other features of commentators and merchants, such as purchase quantity per unit time and sales quantity per unit time, an initial value of n is selected according to the value interval of each feature set and substituted into the convergence model to obtain the normalized feature values.

From the above, the normalization result for a commentator who has been registered for 5 years and has 50 comments is (0.83, 0.78); this result, together with the commentator's labeled class, is put into the training set data, thereby establishing the commentator training set.

The commentator feature vector $u = (1, u_1, u_2, \ldots, u_k)$ and the merchant feature vector $m = (1, m_1, m_2, \ldots, m_k)$ are thus obtained for commentators and merchants.
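As a check on the arithmetic of Example 2, the convergence model can be evaluated directly with the parameters stated there (n = 2.0, Range = 10 for registration time; n = 1.5, Range = 100 for the number of comments). The figures below are computed from the formula alone and are offered only as a worked illustration.

```latex
S_{2.0}(5)  = \frac{5^{2}}{5^{2} + 10}        = \frac{25}{35}           \approx 0.71
\qquad
S_{1.5}(50) = \frac{50^{1.5}}{50^{1.5} + 100} \approx \frac{353.55}{453.55} \approx 0.78
```

The second value agrees with the quoted 0.78. The first evaluates to about 0.71 rather than the quoted 0.83, so the quoted figure presumably corresponds to a different choice of n, Range or registration time than the one stated.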

Example 3: Constructing the trueness logistic regression models of commentators and merchants.

Select n commentator training samples with feature vector u, labeling a real commentator as 1 and a false commentator as 0; similarly, select n merchant training samples with feature vector m, labeling a real merchant as 1 and a false merchant as 0.

Taking the commentator as an example, according to the logistic regression model the probabilities that a sample belongs to class 1 and class 0 can be expressed as:

$p(y=1 \mid u, \alpha) = h_{\alpha}(u) = 1/[1+\exp(-\alpha^{T} u)]$

$p(y=0 \mid u, \alpha) = 1 - h_{\alpha}(u)$

which can be written in the general form:

$p(y \mid u, \alpha) = (h_{\alpha}(u))^{y} \, (1 - h_{\alpha}(u))^{1-y}$

Using the maximum likelihood estimation principle, the loss function of logistic regression is calculated from the labeled training set as:

$J(h_{\alpha}(u), y) = -\dfrac{1}{n} \left[ y_1 \ln h_{\alpha}(u_1) + (1-y_1) \ln(1-h_{\alpha}(u_1)) + y_2 \ln h_{\alpha}(u_2) + (1-y_2) \ln(1-h_{\alpha}(u_2)) + \cdots + y_n \ln h_{\alpha}(u_n) + (1-y_n) \ln(1-h_{\alpha}(u_n)) \right]$

The loss function is minimized by the batch gradient descent method to obtain the final weight coefficient vector $\alpha$ of the commentator features, and the trueness of a commentator can be expressed as $\mathrm{URE} = 1/[1+\exp(-\alpha^{T} u)]$.

In the same way, the final weight coefficient vector $\beta$ of the merchant features is obtained, and the trueness of a merchant can be expressed as $\mathrm{MRE} = 1/[1+\exp(-\beta^{T} m)]$.
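The training procedure of Example 3 can be sketched in plain Python. The gradient used is that of the loss above, $(1/n)\sum_i (h_{\alpha}(u_i) - y_i)\,u_i$; the learning rate, the number of epochs, the function names and the small training set are all hypothetical choices made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_trueness_model(features, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss; each feature vector starts with 1."""
    n, dim = len(features), len(features[0])
    alpha = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for u, y in zip(features, labels):
            error = sigmoid(sum(a * x for a, x in zip(alpha, u))) - y
            for d in range(dim):
                grad[d] += error * u[d] / n   # average gradient over the whole batch
        alpha = [a - lr * g for a, g in zip(alpha, grad)]
    return alpha


def trueness(alpha, u):
    """URE (or MRE) = 1 / (1 + exp(-alpha^T u))."""
    return sigmoid(sum(a * x for a, x in zip(alpha, u)))


# Hypothetical normalized commentator features (1, u1, u2) with labels 1 = real, 0 = false
features = [(1, 0.90, 0.80), (1, 0.80, 0.70), (1, 0.85, 0.90),
            (1, 0.20, 0.10), (1, 0.10, 0.30), (1, 0.15, 0.20)]
labels = [1, 1, 1, 0, 0, 0]
alpha = train_trueness_model(features, labels)
print(trueness(alpha, (1, 0.90, 0.85)), trueness(alpha, (1, 0.10, 0.10)))
```

The merchant model is obtained by running the same routine on merchant feature vectors m and their labels, which yields β in place of α.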

Example 4: classifying individual commentators and group commentators, and constructing the corresponding evaluation models.

According to the criterion that commentators who have jointly reviewed three or more merchants form a group, the commentators to be detected are divided into individual commentators and group commentators.
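One possible reading of this grouping criterion is sketched below in Python. It assumes that two commentators are linked when they share at least three reviewed merchants and that linked commentators are merged transitively into one group; the transitive merging, the function name `split_commentators` and the sample data are assumptions made for illustration, not details given in the patent.

```python
from itertools import combinations

def split_commentators(reviews, min_shared=3):
    """reviews maps a commentator id to the set of merchant ids reviewed.
    Returns (groups, individuals)."""
    parent = {u: u for u in reviews}

    def find(u):
        # Union-find root lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    # Link every pair that shares at least min_shared merchants.
    for a, b in combinations(reviews, 2):
        if len(reviews[a] & reviews[b]) >= min_shared:
            parent[find(a)] = find(b)

    clusters = {}
    for u in reviews:
        clusters.setdefault(find(u), set()).add(u)
    groups = [g for g in clusters.values() if len(g) > 1]
    individuals = {u for g in clusters.values() if len(g) == 1 for u in g}
    return groups, individuals

# Hypothetical review records.
reviews = {
    "u1": {"m1", "m2", "m3", "m4"},
    "u2": {"m1", "m2", "m3"},
    "u3": {"m2", "m3", "m4", "m9"},
    "u4": {"m5"},
}
groups, individuals = split_commentators(reviews)
```

Here u1 shares three merchants with u2 and three with u3, so the three are merged into one group, while u4 remains an individual commentator.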

Individual commentator evaluation model:

For convenience of explanation, an individual commentator is referred to as an individual.

Step 1: obtain the feature values of the individual, construct the individual feature vector (1, u1, u2, …, uk), and substitute it into the commentator logistic regression equation to obtain the individual truth degree URE = 1/[1 + exp(-(α0 + α1·u1 + … + αk·uk))];

Step 2: suppose that all commodities purchased by an individual are divided according to the commodity price intervals {(1, 50], (50, 100], (100, 150], (150, 200], (200, 250], (250, 300], (300, 350], (350, 400], (400, 450], (450, 500]}, giving an initial commodity number sequence (5, 6, 8, 10, 4, 5, 6, 8, 4), denoted c, where 5 indicates that the individual purchased 5 commodities in the price range of 1-50 yuan, and so on.

Step 3: given that the sequence c obeys a Poisson distribution with parameter λ = 6, 10 Poisson-distributed random numbers are generated with random('poisson', 6, 1, 10) and used to sample the initial commodity number sequence; provided that the selected random numbers do not exceed the original data, the sampled commodity number sequence (4, 5, 4, 6, 4, 6, 5, 4) is denoted sc.

Step 4: calculate the mean merchant truth degree corresponding to each commodity number in the sequence sc. Specifically, suppose that the first commodity number, 4, corresponds to two merchants, merchant 1 and merchant 2. Obtain the feature values of merchant 1, construct its feature vector (1, m1, m2, …, mk), and substitute it into the merchant logistic regression equation to obtain the truth degree of merchant 1, MRE_1 = 1/[1 + exp(-(β0 + β1·m1 + … + βk·mk))]; the truth degree MRE_2 of merchant 2 is obtained by the same process, so that MRE_avg_1 = (MRE_1 + MRE_2)/2. Continuing in this way, the mean merchant truth degrees corresponding to the sequence sc are expressed as (MRE_avg_1, MRE_avg_2, …, MRE_avg_10).
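Steps 3 and 4 can be sketched in Python as follows. The patent's random('poisson', …) call appears to be MATLAB syntax; since the Python standard library has no Poisson sampler, the sketch uses Knuth's multiplication method instead. Redrawing a number until it does not exceed the original count is only one reading of the condition in Step 3, and the function names, the seed and the two MRE values are hypothetical.

```python
import math
import random

def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def sample_counts(counts, lam, rng):
    """Draw one Poisson number per interval, redrawing until it does not
    exceed the original count (an assumed reading of the patent's condition)."""
    sampled = []
    for c in counts:
        s = poisson(lam, rng)
        while s > c:
            s = poisson(lam, rng)
        sampled.append(s)
    return sampled

def mean_truth(truths):
    """MRE_avg for one interval: plain average of the merchants' MRE values."""
    return sum(truths) / len(truths)

rng = random.Random(0)              # fixed seed, only for repeatability
c = [5, 6, 8, 10, 4, 5, 6, 8, 4]    # initial sequence as printed in Step 2
sc = sample_counts(c, 6, rng)       # sampled sequence, element-wise <= c
mre_avg_1 = mean_truth([0.9, 0.7])  # hypothetical MRE_1 and MRE_2
```

The same two helpers would serve the group model, with the group comment number sequence r in place of c and the commentator truth degrees URE in place of MRE.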

Group commentator evaluation model:

for convenience of explanation, the group commentator is defined as a group.

Step 1: obtain the merchant feature values, construct the merchant feature vector (1, m1, m2, …, mk), and substitute it into the merchant logistic regression equation to obtain the merchant truth degree MRE = 1/[1 + exp(-(β0 + β1·m1 + … + βk·mk))];

Step 2: suppose that all group reviews received by a merchant are divided according to the review time intervals {(1, 9], (5, 12], (10, 15], (15, 25], (16, 30], (25, 40], (40, 50], (45, 55], (55, 70], (60, 75]}, giving an initial group comment number sequence (4, 5, 6, 5, 4, 5), denoted r, where 4 indicates that the merchant received 4 reviews from a group within the time range of 1-9 days, and so on.

Step 3: given that the sequence r obeys a Poisson distribution with parameter λ = 5, 10 Poisson-distributed random numbers are generated with random('poisson', 5, 1, 10) and used to sample the initial group comment number sequence; provided that the selected random numbers do not exceed the original data, the sampled group comment number sequence (4, 5, 4, 5, 4, 4, 5, 5, 3, 4) is denoted sr.

Step 4: calculate the mean group truth degree corresponding to each group comment number in the sequence sr. Specifically, suppose that the first group comment number, 4, corresponds to two commentators, commentator 1 and commentator 2. Obtain the feature values of commentator 1, construct its feature vector (1, u1, u2, …, uk), and substitute it into the commentator logistic regression equation to obtain the truth degree of commentator 1, URE_1 = 1/[1 + exp(-(α0 + α1·u1 + … + αk·uk))]; the truth degree URE_2 of commentator 2 is obtained by the same process, so that URE_avg_1 = (URE_1 + URE_2)/2. Continuing in this way, the mean group truth degrees corresponding to the sequence sr are expressed as (URE_avg_1, URE_avg_2, …, URE_avg_10).

Example 5: establishing the comment evaluation matrix, detecting comment quality, and identifying false comments.

The individual comment evaluation matrix is X = {x_1, x_2, …, x_n}, where x_j = (x_j_URE, x_j_MRE_avg_k), k ∈ [1, 10], represents the truth degree of the jth individual and of its corresponding class of merchants; the group comment evaluation matrix is Y = {y_1, y_2, …, y_n}, where y_j = (y_j_MRE, y_j_URE_avg_k), k ∈ [1, 10], represents the truth degree of the jth merchant and of its corresponding class of groups.

Individual comment quality detection is taken as the example below; the same method also applies to group comment quality detection.

Feature evaluation vectors c_1 and c_2 are randomly selected as the class center of real comments and the class center of false comments, respectively, and the samples are divided into two classes. The Euclidean distance between sample x_j and class center c_i is d_ij = ‖x_j − c_i‖; u_ij denotes the membership of the jth sample in the ith class, U denotes the fuzzy classification matrix, and V denotes the class center matrix;

the objective function and constraint condition of the fuzzy C-means clustering are as follows:

$$J(U,V)=\sum_{i=1}^{2}\sum_{j=1}^{n}(u_{ij})^{m}(d_{ij})^{2}$$

$$u_{1j}+u_{2j}=1,\quad j=1,2,\dots,n$$

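deriving the membership function and the class center:

$$u_{ij}=\left[\left(\frac{d_{ij}}{d_{1j}}\right)^{\frac{2}{m-1}}+\left(\frac{d_{ij}}{d_{2j}}\right)^{\frac{2}{m-1}}\right]^{-1},\quad i=1,2;\ j=1,2,\dots,n$$

$$c_i=\frac{\sum_{j=1}^{n}(u_{ij})^{m}x_j}{\sum_{j=1}^{n}(u_{ij})^{m}},\quad i=1,2$$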

taking the threshold ε = 0.001 and m = 2, when ‖Δc_i‖ < ε is satisfied the algorithm stops iterating and outputs the optimal fuzzy classification matrix U and the class center matrix V;

suppose that a certain vector in U is (0.6, 0.4): the comment belongs to the real class with degree 0.6 and to the false class with degree 0.4, so it is classified as a real comment. Conversely, if a certain vector is (0.4, 0.6), the comment belongs to the real class with degree 0.4 and to the false class with degree 0.6, so it is classified as a false comment. When a certain vector is (0.5, 0.5), the comment is equally likely to be real or false, and it is not classified.
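The clustering and decision rule of Example 5 can be sketched in Python as follows, using the membership and class-center updates given above with m = 2 and ε = 0.001. The evaluation vectors are made up, and the sketch initializes the two centers from two chosen samples rather than at random so that the first class is the high-truth one; the function names `fcm_two_class` and `classify` and the `max_iter` cap are likewise assumptions.

```python
import math

def fcm_two_class(samples, c1, c2, m=2.0, eps=0.001, max_iter=100):
    """Fuzzy C-means with two classes; returns (U, V).
    U[i][j] is the membership of sample j in class i, V holds the centers."""
    centers = [list(c1), list(c2)]
    power = 2.0 / (m - 1.0)
    for _ in range(max_iter):
        # Membership update: u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1)).
        U = [[0.0] * len(samples) for _ in centers]
        for j, x in enumerate(samples):
            d = [max(math.dist(x, c), 1e-12) for c in centers]
            for i in range(2):
                U[i][j] = 1.0 / sum((d[i] / dk) ** power for dk in d)
        # Center update: mean of the samples weighted by u_ij^m.
        new_centers = []
        for i in range(2):
            w = [u ** m for u in U[i]]
            new_centers.append([
                sum(wj * x[k] for wj, x in zip(w, samples)) / sum(w)
                for k in range(len(samples[0]))
            ])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < eps:  # stop once every center moves less than eps
            break
    return U, centers

def classify(u1, u2):
    """Compare memberships: real if u1 > u2, false if u1 < u2, else undivided."""
    if u1 > u2:
        return "real"
    if u1 < u2:
        return "false"
    return "undivided"

# Made-up evaluation vectors x_j = (URE, MRE_avg).
X = [(0.90, 0.85), (0.80, 0.90), (0.85, 0.80),
     (0.20, 0.15), (0.10, 0.25), (0.15, 0.20)]
U, V = fcm_two_class(X, c1=X[0], c2=X[3])
labels = [classify(U[0][j], U[1][j]) for j in range(len(X))]
```

With a random initialization, which class index ends up near the high-truth samples is not fixed in advance, so the centers in V would need to be inspected before the first class is read as the real class.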

Finally, it should be noted that: the above examples are intended only to illustrate the technical process of the invention, and not to limit it; although the invention has been described in detail with reference to the foregoing examples, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing examples can be modified, or some technical features can be equivalently replaced; such modifications or substitutions do not depart from the spirit and scope of the corresponding embodiments of the present invention.

Claims

1. A method for detecting comment quality in an E-commerce website is characterized by comprising the following steps:

the merchant's scoring mechanism for the commentator is as follows: the features reflecting the commentator are defined as three types of behavior, namely login (logging in from different places and logging in frequently), browsing (browsing similar commodities before purchase), and review (evaluating the commodity after purchase), and calculation rules are provided for these three types of behavior;

in each transaction, the merchant marks each of the three types of commentator behavior as 1 (yes) or 0 (no), giving 8 combinations, of which 001, 101 and 111 are defined as suspicious transactions and the rest as normal transactions;

an X-point scale is defined, the merchant score X·p_1 is calculated as a first-class feature from the probability p_1 of normal transactions among n_E61 transactions, and the commentator feature vector is constructed from the features;

the commentator is represented by a vector u = (1, u_feature_1, u_feature_2, …, u_feature_k), in which the parameters represent the quantized commentator feature values, with initial values including: u_feature_1 = "merchant score", u_feature_2 = "registration time", u_feature_k = null value;

the commentator's scoring mechanism for the merchant is as follows: the features reflecting the merchant are defined as three types of attributes, namely commodity (consistency with the description), service (service attitude), and logistics (logistics situation), and calculation rules are provided for these three types of attributes;

in each transaction, the commentator marks each of the three types of merchant attributes as 1 (good) or 0 (poor), giving 8 combinations, of which 000, 001, 010 and 100 are defined as inferior transactions and the rest as high-quality transactions;

an X-point scale is defined, the commentator score X·p_2 is calculated as a first-class feature from the probability p_2 of high-quality transactions among n_E62 transactions, and the merchant feature vector is constructed from the features;

the merchant is represented by a vector m_E7 = (1, m_feature_1, m_feature_2, …, m_feature_k), in which the parameters represent the quantized merchant feature values, with initial values including: m_feature_1 = "reviewer score", m_feature_2 = "registration time", m_feature_k = null value;

the distribution values discussed and screened in the feature sets of the commentators and the merchants serve as reference values for statistical analysis, so the values in the feature sets need to be normalized, and the processed data are input into the logistic regression model;

because the feature set has an upper limit and a lower limit, the basic characteristics of series planning are satisfied, and a convergence model S_{n_E1}(x) = x^{n_E1} / (x^{n_E1} + Range) is established according to the uniform convergence property of series of function terms, where Range is defined as the maximum value of the feature range and the argument x is the set of features defining the Range; S_{n_E1}(x) and the sequence S_PartSum(x) converge uniformly to 1 over the argument interval (0, Range);

then, according to the Range value, an initial n_E1 value is selected so that all feature values within the value range approach the normalized interval (0, 1), and the result of the feature data processing is obtained;

combining the normalized target characteristic values into a target characteristic vector, namely the characteristic vectors of each commentator and each merchant;

adopting a statistical analysis labeling method, mining the feature vectors that meet the conditions from the set of feature vectors, and labeling their categories to form the training set for logistic regression;

the feature matrix of the commentator training set is represented as U_E4 = {u_1, u_2, …, u_nE4}, where u_i is the (k + 1)-dimensional feature vector of the ith sample, and the labeling result of the training set is represented by an nE4-dimensional 0/1 vector;

a (k + 1)-dimensional regression coefficient vector α is obtained by the maximum likelihood estimation principle and the batch gradient descent method, and the truth degree of a commentator is expressed as URE = 1/[1 + exp(-α^T·u_E2)];

The independent variable in the logistic regression training set data is the merchant characteristic after quantization normalization, the dependent variable is the marked merchant category and obeys Bernoulli distribution, the marked real merchant is 1, and the false merchant is 0;

the feature matrix of the merchant training set is represented as M = {m_1, m_2, …, m_nProSet}, where m_i is the (k + 1)-dimensional feature vector of the ith sample, and the labeling result of the training set is represented by an nProSet-dimensional 0/1 vector;

a (k + 1)-dimensional regression coefficient vector β is obtained by the maximum likelihood estimation principle and the batch gradient descent method, and the truth degree of a merchant is expressed as MRE = 1/[1 + exp(-β^T·u_E2mre)];

dividing the commentators to be detected into individual commentators and group commentators according to the criterion that commentators who have jointly reviewed three or more merchants form a group;

individual critic evaluation model: individual commentators are defined as individuals, and are described in detail as follows:

(1) Calculating individual truth URE by adopting the logistic regression model;

(2) Dividing all commodity numbers purchased by an individual into specified price intervals (p_1, p_2, …, p_n) to obtain an initial commodity number sequence (GSet_c_1, GSet_c_2, …, GSet_c_n);

(3) The initial commodity number sequence obeys a Poisson distribution with parameter c_avg; random('poisson', c_avg, 1, n_E31) is used to generate a sequence of Poisson-distributed random numbers, where 'poisson' specifies Poisson distribution sampling and c_avg is defined as the average of the distribution values, and the sampled commodity number sequence is expressed as (sc_1, sc_2, …, sc_n);

(4) Calculating the mean merchant truth degrees (MRE_avg_1, MRE_avg_2, …, MRE_avg_n) corresponding to the sampled commodity number sequence by the logistic regression model;

the group commentator evaluation model comprises: the group commentator is defined as a group and is specifically as follows:

(2) Dividing all the comment numbers received by a merchant according to specified time intervals (t_1, t_2, …, t_n), ensuring that each interval contains the comments of only one group, and obtaining an initial group comment number sequence (r_1, r_2, …, r_n);

(3) The initial group comment number sequence obeys a Poisson distribution with parameter r_avg; random('poisson', r_avg, 1, n_E32) is used to generate a sequence of Poisson-distributed random numbers, where 'poisson' specifies Poisson distribution sampling and r_avg is defined as the average of the distribution values, and the sampled group comment number sequence is expressed as (sr_1, sr_2, …, sr_n);

(4) Calculating the mean group truth degrees (URE_avg_1, URE_avg_2, …, URE_avg_n) corresponding to the sampled group comment number sequence by the logistic regression model;

the individual comment sample feature evaluation matrix is denoted X = {x_1, x_2, …, x_n}, where x_j = (x_j_URE, x_j_MRE_avg) is the truth degree of the individual of the jth sample and of its corresponding class of merchants;

the group comment sample feature evaluation matrix is denoted Y = {y_1, y_2, …, y_n}, where y_j = (y_j_MRE, y_j_URE_avg) is the truth degree of the merchant of the jth sample and of its corresponding class of groups;

$$J(U,V)=\sum_{i=1}^{2}\sum_{j=1}^{n}(u_{ij})^{m}(d_{ij})^{2}$$

$$u_{1j}+u_{2j}=1,\quad j=1,2,\dots,n$$

step_3, deriving the membership function and the class center:

$$u_{ij}=\left[\left(\frac{d_{ij}}{d_{1j}}\right)^{\frac{2}{m-1}}+\left(\frac{d_{ij}}{d_{2j}}\right)^{\frac{2}{m-1}}\right]^{-1},\quad i=1,2;\ j=1,2,\dots,n$$

$$c_i=\frac{\sum_{j=1}^{n}(u_{ij})^{m}x_j}{\sum_{j=1}^{n}(u_{ij})^{m}},\quad i=1,2$$

step_4, taking the threshold ε = 0.001, when ‖Δc_i‖ < ε is satisfied the algorithm stops iterating and outputs the optimal fuzzy classification matrix U and the class center matrix V;

according to the fuzzy classification matrix U, the membership degree u1 of an individual comment in the real-comment class and its membership degree u2 in the false-comment class are known and are taken as the indexes for dividing comment quality: when u1 > u2 the comment belongs to the real-comment class, when u1 < u2 the comment belongs to the false-comment class, and when u1 = u2 the comment is not divided;

the quality detection of the group comments is the same as the quality detection method of the individual comments.