Perception data evaluation method and system
Technical Field
The invention relates to the technical field of data analysis, in particular to a perception data evaluation method and system.
Background
With the development of electronic commerce, the express logistics industry has grown rapidly. Many express companies now exist, and an objective evaluation standard is needed to select a suitable one among them. At present, however, no system on the market quantitatively analyzes customers' perception comments well; the difficulties lie in the selection of scoring indexes, natural-language semantic analysis and the like. It is therefore necessary to provide an evaluation method that performs automatic, quantitative and standardized evaluation of customer perception data, so that different express service providers can be compared.
The express logistics service industry is a typical production-oriented service industry, and an evaluation method developed with it as an example can serve as a demonstration for the development of certification in other service industries.
Disclosure of Invention
In view of the foregoing analysis, the present invention aims to provide a method and a system for evaluating perception data, so as to solve the problem that the service industry lacks a uniform certification system standard for evaluation.
The purpose of the invention is mainly realized by the following technical scheme:
The method for evaluating the perception data comprises the following steps:
S1, obtaining perception data as a training corpus;
S2, carrying out data preprocessing and manual labeling on the training corpus to obtain a training word bank;
S3, extracting features of the words in the training word bank to obtain a feature dictionary, generating feature vectors based on the feature dictionary, and constructing training samples;
S4, creating a classifier, and training the classifier by using the training samples;
S5, acquiring perception data to be evaluated, performing data preprocessing on the perception data to be evaluated, constructing a perception data vector, inputting the perception data vector into the trained classifier, and judging the category of the perception data;
S6, calculating the evaluation score of the perception data to be evaluated.
The preprocessing in steps S2 and S5 further includes formatting and word segmentation, with the following specific steps:
S21, formatting each piece of perception data in the training corpus and converting it into the same structured format, wherein the structured format comprises at least 4 fields: perception data content, topic domain, keywords and company name; there is at least one topic domain, and each topic domain defines at least one category;
S22, segmenting the perception data content; for Chinese perception data, a Chinese word segmenter is used; for English perception data, the text is split on spaces, and after segmentation the tense and singular/plural forms are normalized by stemming.
The preprocessing in step S2 and step S5 further includes stop word and synonym processing, and the specific steps are as follows:
a. processing the word segmentation result by using a pre-established stop word list, and removing stop words;
b. synonyms are replaced with a pre-established synonym table.
The manual labeling in step S2 labels the topic domain and the category under the topic domain.
In step S3, the feature extraction method is: counting the word frequency of each word in the training word bank, sorting the words by word frequency, and selecting the first N words to form the feature dictionary.
The method for generating the feature vector is specifically as follows: the number of words in the feature dictionary is taken as the total dimensionality of the feature vector, each word in the feature dictionary corresponds to one feature dimension, and a feature vector is established for each piece of perception data on this basis. If a word in the feature dictionary appears in the preprocessed perception data, the TF-IDF value of that word is taken as the value of the corresponding dimension; if a word in the feature dictionary does not appear in the preprocessed perception data, the value of the corresponding dimension is 0. The TF-IDF value is TF multiplied by IDF, where TF is the word frequency and IDF is the inverse document frequency, IDF = log(D/n), in which n denotes the number of pieces of perception data in which the word appears and D is the total number of pieces of perception data.
When there are multiple topic domains, training samples may be constructed for each topic domain in step S3, a classifier may be created for each topic domain in step S4, and the training samples of each topic domain are used to train the respective classifier. The classifier may employ a naive Bayes model.
The evaluation score $K_{CI}$ is calculated by the formula [formula image not reproduced], wherein $R_{CI}$ is given by [formula image not reproduced]. Here $n$ represents the number of categories in the topic domain; Max represents the highest value of the evaluation coefficient; $\Delta$ represents the highest value minus the lowest value of the evaluation coefficient; $h$ indexes the categories, $h = 1, \dots, n$; $\alpha_h$ is the evaluation coefficient of each category; $x_{Ch}$ is the number of items belonging to each category under the topic domain, satisfying $\sum_{h=1}^{n} x_{Ch} = X_{CI}$; and $X_{CI}$ represents the number of pieces of perception data assigned to the topic domain.
The invention also provides a system for evaluating the perception data, which comprises the following components:
the training corpus module is used for acquiring perception data as a training corpus;
the preprocessing module is used for preprocessing the corpus;
the training word library module is used for calling the preprocessing module to carry out data preprocessing on the training corpus and then carrying out manual labeling to obtain a training word library;
the training sample module is used for extracting the characteristics of the vocabularies in the training word stock to obtain a characteristic dictionary, generating a characteristic vector based on the characteristic dictionary and constructing a training sample;
the training module is used for creating a classifier and training the classifier by using the training samples;
the judging module is used for acquiring the perception data to be evaluated, calling the preprocessing module to perform data preprocessing on the perception data to be evaluated, constructing a perception data vector, inputting the perception data vector into the trained classifier, and judging the category of the perception data;
and the evaluation module is used for calculating the evaluation score of the perception data to be evaluated.
The invention has the following beneficial effects:
The invention performs evaluation based on customer perception data, and the resulting evaluation and certification system differs from traditional product and system certification in index selection, evaluation technology and certification mode. The service quality score is obtained through statistics and calculation over user perception data, thereby establishing a uniform certification system standard for a given service industry.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
Fig. 1 is a flowchart of an evaluation method of perception data.
Detailed Description
The preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and which together with the embodiments of the invention serve to explain the principles of the invention.
According to a specific embodiment of the invention, a method and a system are provided for evaluating perception data in the field of express delivery service. The sources of the perception data include, but are not limited to, evaluation content from microblogs, posts, public comments, post bureau evaluation websites and various e-commerce websites; the data can also be derived from user behavior logs, user behavior analysis and the like. In this embodiment, perception data refers to users' comments on the express delivery service.
The evaluation method specifically comprises the following steps:
s1, crawling comment contents on a network through URL links to serve as training corpora.
Specifically, an open-source nwebrowler program can be adopted to crawl HTML files and then extract evaluation information from the HTML files.
The more widely distributed the sources of the training comment data and the more comprehensive the data types collected, the more accurate the trained classifier; subsequent category prediction is then more accurate, and the final score better reflects the actual condition of the express company.
S2, carrying out data preprocessing and manual labeling on the training corpus to obtain a training word bank. The data preprocessing further comprises formatting, word segmentation, stop-word removal, synonym processing and the like, as detailed below:
S21, formatting each comment in the training corpus and converting the comments into the same structured format, such as JSON or XML. The fields of the structured format include comment content, topic domain, keywords, company name, etc. There may be multiple topic domains, with multiple level classes defined under each topic domain. The content of the keyword field is extracted from the original comment content.
Take a certain express company in the field of express service as an example. Six topic domains are identified: functionality, economy, safety, timeliness, comfort and civility, as shown in Table 1. Functionality reflects individual service offerings; economy reflects price; safety reflects privacy protection, insurance and cargo integrity; timeliness reflects delivery speed; comfort reflects whether consulting is convenient, whether pickup and delivery are convenient, whether reminders are timely, and the like; civility reflects the service attitude of the company. Under each topic domain, 4 level classes are defined: very good, good, bad and very bad. The 4 level classes may also be expressed in domain-specific terms: for timeliness, very fast, fast, slow and very slow; for economy, very cheap, cheap, expensive and very expensive.
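As an illustration of the structured format in S21, a single comment could be held as the record sketched below. The field names (`content`, `topic_domains`, `keywords`, `company`), the sample comment and the company name are hypothetical; the text only requires that these four kinds of fields exist and that the format be something like JSON or XML.

```python
import json

# Hypothetical field names and values; the source only lists the kinds of fields.
record = {
    "content": "The parcel arrived two days late but the box was intact.",
    "topic_domains": {},  # left empty here; filled in by manual labeling (S25)
    "keywords": ["parcel", "late", "box", "intact"],
    "company": "ExampleExpress",
}


def to_structured(content, company, keywords):
    """Return one comment in the shared structured format as a JSON string."""
    return json.dumps(
        {
            "content": content,
            "topic_domains": {},
            "keywords": keywords,
            "company": company,
        },
        ensure_ascii=False,  # keep Chinese comment text readable in the output
    )


serialized = to_structured(record["content"], record["company"], record["keywords"])
restored = json.loads(serialized)
```

After labeling, `topic_domains` might map each involved domain to exactly one level class, for example `{"safety": "good", "timeliness": "slow"}`.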
S22, segmenting the structured comment content. If the comment is in Chinese, a Chinese word segmenter is used; if it is in English, the text is split on spaces, and after segmentation the tense and singular/plural forms are normalized by stemming. Specifically, word segmentation tools such as ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) and IK Analyzer may be employed as the Chinese word segmenter.
S23, processing the word segmentation result by using a pre-established stop word list and removing stop words. Stop words include characters or words without practical meaning, such as "and", "have" and "not only ... but also", as well as some uncommon words and special symbols.
S24, replacing synonyms in the training word bank by using a pre-established synonym table, so that all synonyms are represented by a single word.
Table 1. Topic domains and their level classes
S25, manually labeling the topic domains involved in each comment in the training corpus and the level class under each such topic domain. It should be noted that a comment may involve multiple topic domains, but may correspond to only one level class under each topic domain. For example, if a comment involves safety and timeliness, it may be manually labeled, according to its semantics, as "good" for safety and "slow" for timeliness. If a comment is not related to any topic domain, the comment is deleted.
S26, storing the words obtained after word segmentation, stop-word removal and synonym processing, divided by topic domain, in vector form in the training word bank corresponding to each topic domain.
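The English branch of steps S22 to S24 can be sketched as follows. This is a minimal illustration under stated assumptions: the stop word list and synonym table are tiny hypothetical samples, and `naive_stem` is a crude stand-in for a real stemmer, which the text does not name.

```python
import re

# Hypothetical tables; in the described method these are pre-established lists.
STOP_WORDS = {"and", "the", "was", "but", "a", "is", "have"}
SYNONYMS = {"quick": "fast", "rapid": "fast", "parcel": "package"}


def naive_stem(word):
    """Very rough stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Space segmentation, stemming, stop-word removal, synonym replacement."""
    # S22: split on whitespace and drop punctuation and digits.
    tokens = [re.sub(r"[^a-z]", "", t) for t in text.lower().split()]
    # S22: normalize tense and singular/plural forms.
    tokens = [naive_stem(t) for t in tokens if t]
    # S23: remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # S24: map every synonym to a single representative word.
    return [SYNONYMS.get(t, t) for t in tokens]


tokens = preprocess("The parcels arrived and delivery was quick")
```

For Chinese comments, the whitespace split would be replaced by a Chinese word segmenter such as those named in S22, and the stemming step would not apply.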
S3, extracting features of the words in the training word bank to obtain a feature dictionary, generating the feature vector of each comment by using the feature dictionary, and forming the training samples of each topic domain from the feature vectors involved in that topic domain together with the manually labeled level classes.
The feature extraction method is: counting the word frequency of each word in the training word bank, and selecting the first N (N ≥ 1) high-frequency words as the feature dictionary.
The feature vector is generated as follows: the number of words in the feature dictionary is the total dimension, and each word corresponds to one feature dimension. On this basis, a feature vector is established for each comment. If a word in the feature dictionary appears in the preprocessed comment, the TF-IDF value of that word is taken as the value of the corresponding dimension; if a word in the feature dictionary does not appear in the comment, the value of the corresponding dimension is 0.
The form of the feature vector is illustrated by the following example: if 3 words of the feature dictionary appear in a preprocessed comment and correspond to the 1st, 32nd and 80th dimensions of the feature dictionary, then the values of the comment's feature vector in the 1st, 32nd and 80th dimensions are the TF-IDF values of those 3 words, for example 0.1, 0.4 and 0.32, and the values of all other dimensions are 0. A value of 0 indicates that the dictionary word corresponding to that dimension does not appear in the comment.
The TF-IDF value is TF multiplied by IDF, where TF is the word frequency and IDF is the inverse document frequency, IDF = log(D/n), in which n represents the number of comments in which the word appears and D is the total number of comments.
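The feature dictionary and TF-IDF vector construction described above can be sketched as follows. The function names are hypothetical, and two details are assumptions not fixed by the text: TF is taken as the raw count of the word within the comment, and the logarithm is the natural logarithm.

```python
import math
from collections import Counter


def build_feature_dictionary(comments, top_n):
    """Select the top_n most frequent words across all tokenized comments."""
    freq = Counter(word for tokens in comments for word in tokens)
    return [word for word, _ in freq.most_common(top_n)]


def idf_table(comments, dictionary):
    """IDF = log(D / n): D comments in total, n comments containing the word."""
    total = len(comments)
    table = {}
    for word in dictionary:
        containing = sum(1 for tokens in comments if word in tokens)
        table[word] = math.log(total / containing)
    return table


def tfidf_vector(tokens, dictionary, idf):
    """One dimension per dictionary word; 0 when the word is absent."""
    counts = Counter(tokens)
    # Assumption: TF is the raw count of the word within this comment.
    return [counts[w] * idf[w] if w in counts else 0.0 for w in dictionary]


comments = [["fast", "delivery"], ["slow", "delivery"], ["fast", "fast", "cheap"]]
dictionary = build_feature_dictionary(comments, top_n=2)
idf = idf_table(comments, dictionary)
vector = tfidf_vector(comments[2], dictionary, idf)
```

Note that with IDF = log(D/n), a word appearing in every comment receives a weight of 0 and therefore contributes nothing to the vector.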
S4, creating a classifier for each topic domain, and training the corresponding classifier by using the training samples of each topic domain. The classifier is used to predict the level class of a comment within the topic domain concerned.
This embodiment adopts a naive Bayes model as the classifier. Its classification principle is to estimate the probability that the features belong to each class and then take the class with the highest probability as the classification result. The invention is not limited to the naive Bayes model; other classifiers, such as an SVM (support vector machine) classifier, can also be adopted.
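The classification principle above can be sketched with a small multinomial naive Bayes written from scratch. This is an illustrative sketch only: the text does not say which naive Bayes variant or smoothing is used, so the multinomial form with add-one smoothing, the class name `NaiveBayes` and the toy samples are all assumptions.

```python
import math
from collections import defaultdict


class NaiveBayes:
    """Minimal multinomial naive Bayes over non-negative feature vectors."""

    def fit(self, vectors, labels, smoothing=1.0):
        dims = len(vectors[0])
        class_counts = defaultdict(int)
        feature_sums = defaultdict(lambda: [0.0] * dims)
        for vec, label in zip(vectors, labels):
            class_counts[label] += 1
            for i, value in enumerate(vec):
                feature_sums[label][i] += value
        total = len(labels)
        self.log_prior = {c: math.log(k / total) for c, k in class_counts.items()}
        self.log_likelihood = {}
        for c, sums in feature_sums.items():
            # Add-one style smoothing keeps every likelihood strictly positive.
            denom = sum(sums) + smoothing * dims
            self.log_likelihood[c] = [math.log((s + smoothing) / denom) for s in sums]
        return self

    def predict(self, vec):
        """Return the class with the highest posterior log-probability."""
        def score(c):
            weights = self.log_likelihood[c]
            return self.log_prior[c] + sum(v * w for v, w in zip(vec, weights))
        return max(self.log_prior, key=score)


# Toy timeliness samples over a hypothetical dictionary ["fast", "slow", "late"].
X = [[0.4, 0.0, 0.0], [0.5, 0.0, 0.0], [0.0, 0.4, 0.3], [0.0, 0.2, 0.5]]
y = ["fast", "fast", "slow", "slow"]
model = NaiveBayes().fit(X, y)
prediction = model.predict([0.3, 0.0, 0.0])
```

In the scheme described in S4, one such model would be trained per topic domain, each on that domain's own training samples.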
S5, crawling the comments on a certain company through URL links, performing data preprocessing on the comments, constructing comment vectors, inputting the comment vectors into the trained classifiers, and judging the level class of each comment in the topic domains involved, so as to obtain the level class distribution of the company in each topic domain.
The data preprocessing comprises formatting, word segmentation, stop word processing and the like.
S51, converting the crawled comments into the same structured format, such as JSON or XML. The fields of the structured format include comment content, topic domain, keywords, company name, etc.
S52, performing word segmentation on the structured comment content, using the same word segmentation method as in step S22.
S53, processing the word segmentation result by using a pre-established stop word list and removing stop words. The stop word list is the same as that used in step S23.
S54, constructing the comment feature vector as follows: the words obtained after data preprocessing are compared with the feature dictionary. If a word in the feature dictionary appears among the preprocessed words, the TF-IDF value of that word in the training samples is taken as the feature value at the corresponding position in the feature vector; if a word in the feature dictionary does not appear among the preprocessed words, the feature value at the corresponding position is 0.
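Step S54 and the resulting level class distribution can be sketched as below. This follows one reading of the text, namely that each dictionary word carries a single TF-IDF weight obtained at training time and looked up for new comments; the dictionary, the weights and the stand-in classifier `toy_classify` are hypothetical.

```python
from collections import Counter


def comment_vector(tokens, dictionary, trained_weight):
    """Look up the training-time TF-IDF weight of each dictionary word present."""
    present = set(tokens)
    return [trained_weight[w] if w in present else 0.0 for w in dictionary]


def class_distribution(vectors, classify):
    """Count how many comments the classifier assigns to each level class."""
    return Counter(classify(v) for v in vectors)


# Hypothetical dictionary, weights and stand-in classifier for illustration only.
dictionary = ["fast", "slow", "late"]
trained_weight = {"fast": 0.4, "slow": 0.3, "late": 0.5}


def toy_classify(vec):
    """Stand-in for a trained per-domain classifier."""
    return "fast" if vec[0] > vec[1] + vec[2] else "slow"


comments = [["fast", "delivery"], ["slow", "late"], ["very", "fast"]]
vectors = [comment_vector(c, dictionary, trained_weight) for c in comments]
distribution = class_distribution(vectors, toy_classify)
```

The per-class counts in `distribution` play the role of the per-level counts that the scoring step takes as input.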
Taking a certain express company on the timeliness topic domain as an example, its distribution over the level classes, as judged by the classifier, is shown in Table 2.
Table 2. Distribution of an express company over the level classes on the timeliness topic domain
S6, calculating the evaluation score of the company on each topic domain.
The score $K_{CI}$ of the company on a topic domain is calculated from the level class distribution of the company on that topic domain. The calculation formula is [formula image not reproduced], wherein $R_{CI}$ is given by [formula image not reproduced].
In the formulas, $n$ represents the number of level classes under the topic domain;
Max represents the highest value of the evaluation coefficient;
$\Delta$ is the highest value minus the lowest value of the evaluation coefficient;
$h$ indexes the level classes, $h = 1, \dots, n$;
$\alpha_h$ is the evaluation coefficient of each level class, whose value can be changed as required;
$x_{Ch}$ is the number of the company's comments belonging to each level class in a given topic domain;
$X_{CI}$ is the number of company C's comments assigned to topic domain I.
Taking the level class distribution of a certain express company on the timeliness topic domain in Table 2 as an example, the calculation of the evaluation score is as follows. Four level classes are set in each topic domain, i.e. $n = 4$ and $h = 1, 2, 3, 4$, and the evaluation coefficients of the level classes are set to $\alpha_1 = 1.2$, $\alpha_2 = 1$, $\alpha_3 = -1$, $\alpha_4 = -1.2$, so that Max = 1.2 and $\Delta = 1.2 - (-1.2) = 2.4$. The formula thus becomes [formula image not reproduced], wherein [formula image not reproduced].
Substituting the value of $R_{CI}$ into the formula for $K_{CI}$ gives a score of 3.1167 for the express company on the topic domain "timeliness".
Another specific embodiment of the invention provides an evaluation system of perception data for implementing the above perception data evaluation method, comprising:
a training corpus module, configured to implement step S1 to obtain perception data as a training corpus;
a preprocessing module, configured to preprocess the corpus; the preprocessing may include formatting and word segmentation, and may further include stop-word removal, synonym processing, etc., as described in steps S21 to S24 above;
a training word bank module, configured to call the preprocessing module to perform data preprocessing on the training corpus and then perform manual labeling to obtain a training word bank; the manual labeling may label the topic domains involved in each comment in the training corpus and the level class under each topic domain;
a training sample module, configured to extract features of the words in the training word bank to obtain a feature dictionary, generate feature vectors based on the feature dictionary, and construct training samples; specifically, the method of step S3 above may be employed;
a training module, configured to create a classifier and train the classifier using the training samples; specifically, the method of step S4 above may be employed;
a judging module, configured to obtain the perception data to be evaluated, call the preprocessing module to perform data preprocessing on the perception data to be evaluated, construct a perception data vector, input the perception data vector into the trained classifier, and judge the category of the perception data; specifically, the method of step S5 above may be employed;
and an evaluation module, configured to calculate the evaluation score of the perception data to be evaluated, wherein the calculation method is as described in step S6.
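How the judging and evaluation modules might be wired together can be sketched as follows. The class `EvaluationSystem`, its field names and all four stand-in components are hypothetical; in particular, the `score` stand-in simply returns a fraction and is not the scoring formula of step S6.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class EvaluationSystem:
    """Hypothetical wiring of the judging and evaluation modules."""

    preprocess: Callable[[str], List[str]]         # preprocessing module
    vectorize: Callable[[List[str]], List[float]]  # feature vector construction
    classify: Callable[[List[float]], str]         # trained classifier
    score: Callable[[Sequence[str]], float]        # evaluation module

    def evaluate(self, comments: Sequence[str]) -> float:
        """Run comments through preprocessing, classification, then scoring."""
        classes = [
            self.classify(self.vectorize(self.preprocess(c))) for c in comments
        ]
        return self.score(classes)


# Stand-in components so the sketch runs; none of them is the described logic.
system = EvaluationSystem(
    preprocess=lambda text: text.lower().split(),
    vectorize=lambda tokens: [float("fast" in tokens), float("slow" in tokens)],
    classify=lambda vec: "fast" if vec[0] >= vec[1] else "slow",
    score=lambda classes: sum(c == "fast" for c in classes) / len(classes),
)
result = system.evaluate(["Fast delivery", "Very slow courier"])
```

Passing the modules in as callables mirrors the description, in which the judging module calls the shared preprocessing module rather than duplicating it.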
In summary, the embodiments of the present invention provide an evaluation method and system for perception data in the field of express service, which classify and quantify user evaluations, and provide a scoring method for express company services, so as to establish a unified evaluation standard in the express service industry.
Those skilled in the art will appreciate that all or part of the flow of the method implementing the above embodiments may be implemented by a computer program, which is stored in a computer readable storage medium, to instruct related hardware. The computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.