CN106202481A

CN106202481A - Perception data evaluation method and system

Info

Publication number: CN106202481A
Application number: CN201610565797.9A
Authority: CN
Inventors: 李甫; 汪洋泽
Original assignee: Quantum Cloud Future (beijing) Mdt Infotech Ltd
Current assignee: Quantum Cloud Future (beijing) Mdt Infotech Ltd; Wuxi Liangziyun Digital New Media Technology Co Ltd
Priority date: 2016-07-18
Filing date: 2016-07-18
Publication date: 2016-12-07

Abstract

The present invention relates to a method and system for evaluating perception data. Perception data is obtained as a training corpus; the training corpus undergoes data preprocessing and manual labeling to obtain a training word bank; features are extracted from the vocabulary in the training word bank to obtain a feature dictionary, feature vectors are generated based on the feature dictionary, and training samples are constructed; a classifier is created and trained with the training samples; perception data to be evaluated is obtained and preprocessed, a perception data vector is constructed and input into the trained classifier, and the category of the perception data is determined; finally, the evaluation score of the perception data to be evaluated is calculated. With this evaluation method, a unified certification system standard can be established for a given service industry.

Description

Perception data evaluation method and system

Technical Field

The invention relates to the technical field of data analysis, in particular to a perception data evaluation method and system.

Background

With the development of e-commerce, the express logistics industry has grown rapidly. There are many express companies, and an objective evaluation standard is needed to select a suitable one among them. At present, however, no system on the market can quantitatively analyze customers' perception comments well; the difficulties lie in the selection of scoring indexes, natural-language semantic analysis, and the like. It is therefore necessary to provide an evaluation method that evaluates customer perception data in an automatic, quantitative and standardized way, so that different express service providers can be compared.

The express logistics service industry is a typical production-oriented service industry, and an evaluation method developed with it as an example can serve as a demonstration for the development of certification in other service industries.

Disclosure of Invention

In view of the foregoing analysis, the present invention aims to provide a method and a system for evaluating perception data, so as to solve the problem that the service industry lacks a unified certification system standard for evaluation.

The purpose of the invention is mainly realized by the following technical scheme:

The method for evaluating perception data comprises the following steps:

S1, obtaining perception data as a training corpus;

S2, carrying out data preprocessing and manual labeling on the training corpus to obtain a training word bank;

S3, extracting features from the words in the training word bank to obtain a feature dictionary, generating feature vectors based on the feature dictionary, and constructing training samples;

S4, creating a classifier and training it with the training samples;

S5, acquiring perception data to be evaluated, preprocessing it, constructing a perception data vector, inputting the perception data vector into the trained classifier, and determining the category of the perception data;

S6, calculating the evaluation score of the perception data to be evaluated.

The preprocessing in steps S2 and S5 further includes formatting and word segmentation, with the following specific steps:

S21, formatting each piece of perception data in the training corpus and converting it into the same structured format, wherein the structured format comprises at least 4 fields: perception data content, topic domain, keywords and company name; there is at least one topic domain, and each topic domain defines at least one category;

S22, segmenting the perception data content into words; for Chinese perception data, a Chinese word segmenter is used; for English perception data, words are split on spaces, and after segmentation, tense and singular/plural forms are normalized by stemming.

The preprocessing in steps S2 and S5 further includes stop-word and synonym processing, with the following specific steps:

a. processing the word segmentation result with a pre-established stop-word list to remove stop words;

b. replacing synonyms using a pre-established synonym table.

The manual labeling in step S2 labels the topic domain and the category under that topic domain.

In step S3, the feature extraction method is: counting the word frequency of each word in the training word bank, sorting the words by word frequency, and selecting the first N words to form the feature dictionary.

The feature vector is generated as follows: the number of words in the feature dictionary is taken as the total dimensionality of the feature vector, each word in the feature dictionary corresponds to one feature dimension, and a feature vector is built for each piece of perception data on this basis; if a word in the feature dictionary appears in the preprocessed perception data, the TF-IDF value of that word is taken as the value of the corresponding dimension; if a word in the feature dictionary does not appear in the preprocessed perception data, the value of the corresponding dimension is 0. The TF-IDF value is TF multiplied by IDF, where TF is the word frequency and IDF is the inverse document frequency, IDF = log(D/n), in which n denotes the number of pieces of perception data in which the word appears and D is the total number of pieces of perception data.

A set of training samples may be constructed for each topic domain in step S3 and a classifier created for each topic domain in step S4, with each classifier trained on the training samples of its own topic domain. The classifier may use a naive Bayes model.

The evaluation score is calculated by the formula

K_{C I} = 5 - \frac{4 * (\mathrm{Max} - R_{C I})}{\Delta}

where

R_{C I} = \frac{\sum_{h=1}^{n} \alpha_h x_{C h}}{X_{C I}}

Here n represents the number of categories in the topic domain; Max represents the highest value of the evaluation coefficient; Δ represents the highest value of the evaluation coefficient minus its lowest value; h indexes the categories, h = 1 to n; α_h is the evaluation coefficient of category h; x_Ch is the number of items belonging to category h under the topic domain, satisfying \sum_{h=1}^{n} x_{C h} = X_{C I}; and X_CI represents the number of pieces of perception data assigned to the topic domain.

The invention also provides a system for evaluating the perception data, which comprises the following components:

the training corpus module is used for acquiring perception data as the training corpus;

the preprocessing module is used for preprocessing the corpus;

the training word library module is used for calling the preprocessing module to carry out data preprocessing on the training corpus and then carrying out manual labeling to obtain a training word library;

the training sample module is used for extracting the characteristics of the vocabularies in the training word stock to obtain a characteristic dictionary, generating a characteristic vector based on the characteristic dictionary and constructing a training sample;

the training module is used for creating a classifier and training the classifier by using the training samples;

the judging module is used for acquiring the perception data to be evaluated, calling the preprocessing module to perform data preprocessing on the perception data to be evaluated, constructing a perception data vector, inputting the perception data vector into the trained classifier, and judging the category of the perception data;

and the evaluation module is used for calculating the evaluation score of the perception data to be evaluated.

The invention has the following beneficial effects:

The invention performs evaluation based on customer perception data; its evaluation and certification system differs from traditional product and system certification in index selection, evaluation technology and certification mode. The service quality score is obtained through statistics and calculation on user perception data, thereby establishing a unified certification system standard for a given service industry.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.

Fig. 1 is a flowchart of an evaluation method of perception data.

Detailed Description

The preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and which together with the embodiments of the invention serve to explain the principles of the invention.

According to a specific embodiment of the invention, a method and a system evaluate perception data in the field of express delivery service. Sources of the perception data include, but are not limited to, review content from microblogs, forum posts, public reviews, postal bureau evaluation websites and various e-commerce websites; the data can also be derived from user behavior logs, user behavior analysis and the like. In this embodiment, perception data refers to users' comments on express delivery services.

The evaluation method specifically comprises the following steps:

S1, crawling comment content from the web through URL links to serve as the training corpus.

Specifically, an open-source nwebrowler program can be adopted to crawl HTML files and then extract evaluation information from the HTML files.

The more widely distributed the sources of the training comment data and the more comprehensive the types of data collected, the more accurate the trained classifier, the more accurate the subsequent category predictions, and the better the final score reflects the actual condition of the express company.

S2, carrying out data preprocessing and manual labeling on the training corpus to obtain a training word bank. The data preprocessing further comprises formatting, word segmentation, stop-word removal, synonym processing and the like, as follows:

S21, formatting each comment in the training corpus and converting the comments into the same structured format, which may be JSON, XML or the like. The fields of the structured format include comment content, topic domain, keywords, company name, etc. There may be multiple topic domains, with multiple level classes defined under each topic domain. The content of the keyword field is extracted from the original comment content.
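As an illustration of the structured format in S21, a single comment might be stored as a JSON record like the one below. This is only a minimal sketch: the field names (`content`, `topic_domains`, `keywords`, `company`), the helper `make_record` and the sample values are assumptions made for illustration and are not given in the original.

```python
import json

# Hypothetical field names; the text only lists the kinds of fields
# (comment content, topic domain, keywords, company name).
def make_record(content, company, topic_domains=None, keywords=None):
    """Wrap one raw comment in a uniform structure."""
    return {
        "content": content,
        "topic_domains": topic_domains or {},  # filled in later by manual labeling
        "keywords": keywords or [],            # extracted from the original comment
        "company": company,
    }

# Illustrative comment and company name, not taken from any real data set
record = make_record("The parcel arrived late but intact.", "ExampleExpress")
serialized = json.dumps(record, ensure_ascii=False)
restored = json.loads(serialized)
```

Keeping every comment in one shape means the later segmentation and labeling steps can read the same fields regardless of which website the comment came from.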

Take an express company in the express service field as an example. Six topic domains are identified: functionality, economy, safety, timeliness, comfort and civility, as shown in Table 1. Functionality reflects personalized services; economy reflects price; safety reflects privacy protection, insurance and cargo integrity; timeliness reflects delivery speed; comfort reflects whether consultation is convenient, whether pickup and delivery are convenient, whether reminders are timely, and the like; civility reflects the company's service attitude. Under each topic domain, 4 level classes are defined: very good, good, bad and very bad. The 4 level classes may of course be worded differently per topic domain; for timeliness they may be very fast, fast, slow and very slow, and for economy they may be very cheap, cheap, expensive and very expensive.

S22, segmenting the structured comment content into words. If the comment is in Chinese, a Chinese word segmenter is used; if it is in English, words are split on spaces, and after segmentation, tense and singular/plural forms are normalized by stemming. Specifically, word segmentation tools such as ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) and IK Analyzer may be used as the Chinese word segmenter.

S23, processing the word segmentation result with a pre-established stop-word list to remove stop words. Stop words include characters or words without practical meaning, such as "and", "have", "not only" and "but also", as well as some uncommon words and special symbols.

S24, replacing synonyms in the training word bank using a pre-established synonym table, so that all synonyms are represented by a single word.
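The English branch of steps S22 to S24 can be sketched as follows. This is a rough illustration under stated assumptions: the stop-word list, the synonym table and the sample sentence are invented, and `crude_stem` is a deliberately simple suffix stripper standing in for a real stemmer, not the stemming method of the original.

```python
import string

# Hypothetical stop-word list and synonym table, for illustration only
STOP_WORDS = {"and", "the", "a", "was", "is", "but", "also"}
SYNONYMS = {"quick": "fast", "rapid": "fast", "parcel": "package"}

def crude_stem(word):
    """Rough suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess_english(text):
    # S22: split on whitespace, drop surrounding punctuation, then stem
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    tokens = [crude_stem(t) for t in tokens if t]
    # S23: remove stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # S24: map every synonym to a single representative word
    return [SYNONYMS.get(t, t) for t in tokens]

tokens = preprocess_english("The parcels arrived quick and the courier was polite.")
```

Because stemming runs before the synonym lookup, the synonym table in this sketch is keyed by stemmed forms ("parcel" rather than "parcels").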

Table 1. Topic domains and their level classes

S25, manually labeling the topic domains involved in each comment in the training corpus and the level class under each of those topic domains. Note that a comment may involve multiple topic domains but can correspond to only one level class under each topic domain. For example, if a comment involves safety and timeliness, it may be manually labeled, according to its semantics, as "good" on safety and "slow" on timeliness. If a comment is not related to any topic domain, it is deleted.

S26, storing the vocabulary obtained after word segmentation, stop-word removal and synonym processing, grouped by topic domain and in vector form, into the training word bank corresponding to each topic domain.
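The labeling and storage steps can be pictured as grouping preprocessed comments by topic domain. The sketch below assumes each labeled comment is a pair of a token list and a mapping from topic domain to a single level class; this data layout, the function name `build_word_banks` and the sample comments are hypothetical.

```python
from collections import defaultdict

def build_word_banks(labeled_comments):
    """Group preprocessed comments by topic domain.

    Each item is (tokens, labels), where labels maps a topic domain to
    exactly one level class. A comment with no labels contributes nothing,
    which corresponds to deleting comments unrelated to any topic domain.
    """
    banks = defaultdict(list)
    for tokens, labels in labeled_comments:
        for topic, level in labels.items():
            banks[topic].append((tokens, level))
    return dict(banks)

# Illustrative labeled comments (invented for this sketch)
labeled = [
    (["package", "late"], {"timeliness": "slow"}),
    (["package", "intact", "late"], {"safety": "good", "timeliness": "slow"}),
    (["hello"], {}),  # unrelated to every topic domain, so it is dropped
]
banks = build_word_banks(labeled)
```

A comment labeled in two topic domains appears in both word banks, each time paired with the level class it received in that domain.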

S3, extracting features from the words in the training word bank to obtain a feature dictionary, generating the feature vector of each comment with the feature dictionary, and forming the training samples of each topic domain from the feature vectors involved in that topic domain together with the manually labeled level classes.

The feature extraction method is: count the word frequency of each word in the training word bank and select the N (N ≥ 1) most frequent words as the feature dictionary.
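Selecting the N most frequent words can be sketched in a few lines. The function name `build_feature_dictionary`, the alphabetical tie-breaking rule and the sample token lists are assumptions added for illustration; the text does not say how ties are resolved.

```python
from collections import Counter

def build_feature_dictionary(token_lists, n):
    """Return the n most frequent words across all comments."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by descending frequency; break ties alphabetically (an assumption)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

# Illustrative preprocessed comments
docs = [["fast", "delivery", "fast"], ["slow", "delivery"], ["fast", "courier"]]
dictionary = build_feature_dictionary(docs, 3)
```

The position of a word in the returned list then serves as its feature dimension.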

The feature vector is generated as follows: the number of words in the feature dictionary gives the total dimension, and each word corresponds to one feature dimension. On this basis, a feature vector is built for each comment. If a word in the feature dictionary appears in the preprocessed comment, the TF-IDF value of that word is taken as the value of the corresponding dimension; if a word in the feature dictionary does not appear in the comment, the value of the corresponding dimension is 0.

The form of the feature vector can be illustrated with an example: if 3 words from the feature dictionary appear in a preprocessed comment and correspond to the 1st, 32nd and 80th dimensions of the feature dictionary, then the values of the comment's feature vector in the 1st, 32nd and 80th dimensions are the TF-IDF values of those 3 words, namely 0.1, 0.4 and 0.32, and the values in all other dimensions are 0. A value of 0 indicates that the word corresponding to that dimension in the feature dictionary does not appear in the comment.

The TF-IDF value is TF multiplied by IDF, where TF is the word frequency and IDF is the inverse document frequency, IDF = log(D/n), in which n is the number of comments in which the word appears and D is the total number of comments.
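A minimal sketch of the TF-IDF feature vector follows. Two details are assumptions, since the text does not fix them: TF is taken as the word count divided by the comment length, and the logarithm is the natural logarithm. The function names and sample data are likewise illustrative.

```python
import math
from collections import Counter

def idf_table(token_lists, dictionary):
    """IDF = log(D / n) for every dictionary word (natural log assumed)."""
    total = len(token_lists)  # D: total number of comments
    table = {}
    for word in dictionary:
        n = sum(1 for tokens in token_lists if word in tokens)
        table[word] = math.log(total / n) if n else 0.0
    return table

def tfidf_vector(tokens, dictionary, idf):
    """One dimension per dictionary word; 0 where the word is absent."""
    counts = Counter(tokens)
    length = len(tokens) or 1
    # TF assumed to be count / comment length
    return [(counts[w] / length) * idf[w] if counts[w] else 0.0 for w in dictionary]

# Illustrative data
docs = [["fast", "delivery", "fast"], ["slow", "delivery"], ["fast", "courier"]]
dictionary = ["fast", "delivery", "courier"]
idf = idf_table(docs, dictionary)
vector = tfidf_vector(docs[0], dictionary, idf)
```

With this definition, a word that appears in every comment gets IDF = log(1) = 0 and therefore carries no weight in any feature vector.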

S4, creating a classifier for each topic domain and training it with the training samples of that topic domain. The classifier is used to predict the level class of a comment in the topic domain.

This embodiment uses a naive Bayes model as the classifier. Its classification principle is to estimate the probability that the features belong to each class and then take the class with the highest probability as the classification result. The invention is not limited to the naive Bayes model; other classifiers, such as an SVM (support vector machine) classifier, may also be used.
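The naive Bayes principle described above can be sketched as follows. The text does not say which naive Bayes variant is used, so this example assumes a multinomial model with additive (Laplace) smoothing applied to non-negative feature vectors; the class name, the smoothing value and the toy training data are all hypothetical.

```python
import math
from collections import defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over non-negative feature vectors (assumed variant)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing

    def fit(self, vectors, labels):
        dims = len(vectors[0])
        class_counts = defaultdict(int)
        feature_sums = defaultdict(lambda: [0.0] * dims)
        for vec, label in zip(vectors, labels):
            class_counts[label] += 1
            for i, value in enumerate(vec):
                feature_sums[label][i] += value
        total = len(labels)
        self.log_prior = {c: math.log(k / total) for c, k in class_counts.items()}
        self.log_likelihood = {}
        for c, sums in feature_sums.items():
            denom = sum(sums) + self.smoothing * dims
            self.log_likelihood[c] = [math.log((s + self.smoothing) / denom) for s in sums]
        return self

    def predict(self, vec):
        # Pick the class with the highest log posterior (up to a constant)
        def score(c):
            return self.log_prior[c] + sum(v * w for v, w in zip(vec, self.log_likelihood[c]))
        return max(self.log_prior, key=score)

# Toy training samples for one topic domain; dimensions and labels are illustrative
X = [[1.0, 0.0, 0.0], [0.8, 0.0, 0.1], [0.0, 1.0, 0.5], [0.0, 0.7, 0.9]]
y = ["fast", "fast", "slow", "slow"]
model = NaiveBayes().fit(X, y)
```

One such model would be trained per topic domain, each on the training samples of its own domain.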

S5, crawling the comments about a given company through URL links, preprocessing them, constructing comment vectors, inputting the comment vectors into the trained classifiers, and determining the level class of each comment in the topic domains it involves, thereby obtaining the company's level-class distribution on each topic domain.

The data preprocessing includes formatting, word segmentation, stop-word processing and the like.

S51, converting the crawled comments into the same structured format, which may be JSON, XML or the like. The fields of the structured format include comment content, topic domain, keywords, company name, etc.

S52, segmenting the structured comment content into words, using the same word segmentation method as in step S22.

S53, processing the word segmentation result with a pre-established stop-word list to remove stop words. The stop-word list is the same as that used in step S23.

S54, constructing the comment feature vector as follows: the words obtained after data preprocessing are compared with the feature dictionary; if a word in the feature dictionary appears among the preprocessed words, its TF-IDF value in the training samples is taken as the feature value at the corresponding position of the feature vector; if a word in the feature dictionary does not appear among the preprocessed words, the feature value at the corresponding position is 0.

Taking an express company on the timeliness topic domain as an example, the distribution over the level classes determined by the classifier is shown in Table 2.

Table 2. Distribution of an express company over the level classes on the timeliness topic domain
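A level-class distribution of this kind is simply a tally of the classifier's outputs for one company on one topic domain. The sketch below uses invented level names and invented predictions; it does not reproduce the figures of Table 2.

```python
from collections import Counter

def level_distribution(predicted_levels, levels):
    """Count how many comments the classifier put in each level class."""
    counts = Counter(predicted_levels)
    # Report every level, including those that received no comments
    return {level: counts[level] for level in levels}

# Hypothetical level names and classifier outputs, for illustration only
levels = ["very fast", "fast", "slow", "very slow"]
predictions = ["fast", "slow", "fast", "very fast", "slow", "fast"]
distribution = level_distribution(predictions, levels)
```

The counts per level class and their total are the inputs needed for the score calculation in the next step.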

S6, calculating the evaluation score of the company on each topic domain.

The score K_CI of company C on topic domain I is calculated from the company's level-class distribution on that topic domain. The calculation formula is

K_{C I} = 5 - \frac{4 * (\mathrm{Max} - R_{C I})}{\Delta}

where

R_{C I} = \frac{\sum_{h=1}^{n} \alpha_h x_{C h}}{X_{C I}}

In the formula, n represents the number of level classes under the topic domain;

Max represents the highest value of the evaluation coefficient;

Δ is the highest value of the evaluation coefficient minus its lowest value;

h indexes the level classes, h = 1 to n;

α_h is the evaluation coefficient of level class h, and its value can be changed as required;

x_Ch is the number of company C's comments that belong to level class h in the topic domain, satisfying \sum_{h=1}^{n} x_{C h} = X_{C I};

X_CI is the number of company C's comments assigned to topic domain I.

Taking the level-class distribution of an express company on the timeliness topic domain in Table 2 as an example, the calculation of the evaluation score is as follows. Each topic domain has 4 level classes, i.e. n = 4 and h = 1, 2, 3, 4, and the evaluation coefficients of the level classes are set to α₁ = 1.2, α₂ = 1, α₃ = -1, α₄ = -1.2, so that Max = 1.2 and Δ = 1.2 - (-1.2) = 2.4. The formula thus becomes

K_{C I} = 5 - \frac{4 * (1.2 - R_{C I})}{2.4}

where R_CI is computed from the level-class distribution in Table 2. Substituting the value of R_CI into the formula for K_CI gives

K_{C I} = 5 - \frac{4 * (1.2 - R_{C I})}{2.4} = 3.1167.

That is, the express company scores 3.1167 on the topic domain "timeliness".
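The score calculation can be sketched as below. Two points are assumptions rather than statements from the text: R_CI is taken to be the coefficient-weighted average of the level-class counts, and the constants 5 and 4 are kept exactly as they appear in the worked formula. The counts in the example are hypothetical, chosen only so that R_CI = 0.07, the value implied by the score 3.1167; they are not the Table 2 data.

```python
def topic_score(counts, coefficients):
    """Score on one topic domain; counts[h] and coefficients[h] refer to the same level class."""
    total = sum(counts)  # X_CI: comments assigned to this topic domain
    if total == 0:
        raise ValueError("no comments fall in this topic domain")
    # R_CI, assumed to be the coefficient-weighted average of the counts
    r = sum(a * x for a, x in zip(coefficients, counts)) / total
    highest = max(coefficients)           # Max
    spread = highest - min(coefficients)  # highest minus lowest coefficient
    return 5 - 4 * (highest - r) / spread

coefficients = [1.2, 1, -1, -1.2]
hypothetical_counts = [15, 38, 37, 10]  # invented counts giving R_CI = 0.07
score = topic_score(hypothetical_counts, coefficients)
```

Under these assumptions the score is bounded: if every comment falls in the best level class the result is 5, and if every comment falls in the worst level class the result is 1.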

The invention discloses another specific embodiment, which provides an evaluation system of perception data for implementing the perception data evaluation method, comprising:

a training corpus module, configured to implement step S1 to obtain perception data as the training corpus;

the preprocessing module is used for preprocessing the corpus; the preprocessing may include formatting and word segmentation, and may further include stop-word processing, synonym processing, etc., as described in steps S21 to S24 above;

the training word library module is used for calling the preprocessing module to carry out data preprocessing on the training corpus and then carrying out manual labeling to obtain a training word library; the manual marking can mark a subject field related to each comment in the training corpus and a level class under the subject field;

the training sample module is used for extracting the characteristics of the vocabularies in the training word stock to obtain a characteristic dictionary, generating a characteristic vector based on the characteristic dictionary and constructing a training sample; specifically, the method of the above-described step S3 may be employed;

a training module, configured to create a classifier, train the classifier using the training sample, and specifically adopt the method in step S4;

the judging module is configured to obtain the perception data to be evaluated, call the preprocessing module to preprocess it, construct a perception data vector, input the perception data vector into the trained classifier, and determine the category of the perception data, where the method in step S5 may specifically be adopted;

and the evaluation module is used for calculating the evaluation score of the perception data to be evaluated, wherein the calculation method is as described in step S6.

In summary, the embodiments of the present invention provide an evaluation method and system for perception data in the field of express service, which classify and quantify user evaluations, and provide a scoring method for express company services, so as to establish a unified evaluation standard in the express service industry.

Those skilled in the art will appreciate that all or part of the flow of the method implementing the above embodiments may be implemented by a computer program, which is stored in a computer readable storage medium, to instruct related hardware. The computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims

1. A method for evaluating perception data is characterized by comprising the following steps:

S1, obtaining perception data as a training corpus;

2. The method for evaluating perception data according to claim 1, wherein the preprocessing in steps S2 and S5 further comprises formatting and word segmentation, with the following specific steps:

S21, formatting the perception data and converting it into the same structured format, wherein the structured format comprises at least 4 fields: perception data content, topic domain, keywords and company name; there is at least one topic domain, and each topic domain defines at least one category;

S22, segmenting the perception data content into words.

3. The method for evaluating perception data according to claim 2, wherein a Chinese word segmenter is used for Chinese perception data; for English perception data, words are split on spaces, and after segmentation, tense and singular/plural forms are normalized by stemming.

4. The method for evaluating perception data according to claim 2, wherein the preprocessing in steps S2 and S5 further comprises stop-word and synonym processing, with the following specific steps:

b. replacing synonyms using a pre-established synonym table.

5. The method for evaluating perception data according to claim 2, wherein the manual labeling in step S2 labels the topic domain and the category under the topic domain.

6. The method for evaluating perception data according to claim 1, wherein the feature extraction method in step S3 is: counting the word frequency of each word in the training word bank, sorting the words by word frequency, and selecting the first N words to form the feature dictionary.

7. The method for evaluating perception data according to claim 1, wherein the feature vector in step S3 is generated as follows: the number of words in the feature dictionary is taken as the total dimensionality of the feature vector, each word in the feature dictionary corresponds to one feature dimension, and a feature vector is built for each piece of perception data on this basis; if a word in the feature dictionary appears in the preprocessed perception data, the TF-IDF value of that word is taken as the value of the corresponding dimension of the feature vector; if a word in the feature dictionary does not appear in the preprocessed perception data, the value of the corresponding dimension is 0; the TF-IDF value is TF multiplied by IDF, where TF is the word frequency and IDF is the inverse document frequency, IDF = log(D/n), in which n denotes the number of pieces of perception data in which the word appears and D is the total number of pieces of perception data.

8. The method for evaluating perception data according to claim 2, wherein a set of training samples is constructed for each topic domain in step S3, a classifier is created for each topic domain in step S4, and each classifier is trained with the training samples of its own topic domain.

9. The method for evaluating perception data according to claim 2, wherein the evaluation score is calculated by the formula

K_{C I} = 5 - \frac{4 * (\mathrm{Max} - R_{C I})}{\Delta}

where

R_{C I} = \frac{\sum_{h=1}^{n} \alpha_h x_{C h}}{X_{C I}}

In the formula, n represents the number of categories under the topic domain; Max represents the highest value of the evaluation coefficient; Δ is the highest value of the evaluation coefficient minus its lowest value; h indexes the categories, h = 1 to n; α_h is the evaluation coefficient of category h; x_Ch is the number of items belonging to category h under the topic domain, satisfying \sum_{h=1}^{n} x_{C h} = X_{C I}; and X_CI represents the number of pieces of perception data assigned to the topic domain.

10. An evaluation system for implementing the perception data evaluation method of any one of claims 1 to 9, comprising:

the preprocessing module is used for preprocessing the corpus;