CN117235253A

CN117235253A - Truck user implicit demand mining method based on natural language processing technology

Info

Publication number: CN117235253A
Application number: CN202211724850.7A
Authority: CN
Inventors: 郑枫; 尹红雨
Original assignee: Qilu University of Technology
Current assignee: Qilu University of Technology
Priority date: 2022-12-30
Filing date: 2022-12-30
Publication date: 2023-12-15

Abstract

The invention discloses a method for mining the implicit demands of truck users based on natural language processing technology. The method mainly comprises: acquiring documents and materials about trucks, the HowNet emotion dictionary, and social media comment text about trucks; carrying out validity analysis on the original comment text to obtain valid comment text; carrying out corpus preprocessing on the truck documents and materials, the HowNet emotion dictionary, and the valid comment text, and constructing an attribute vocabulary library and an emotion vocabulary library; and manually labeling the valid comment text. The invention relates to the technical field of user demand mining, in particular to a method for mining the implicit demands of truck users based on natural language processing technology. The technical problem to be solved by the invention is to provide such a method, which makes it convenient to mine the implicit demands of truck users.

Description

Truck user implicit demand mining method based on natural language processing technology

Technical Field

The invention relates to the technical field of user demand mining, in particular to a method for mining implicit demands of truck users based on a natural language processing technology.

Background

In product improvement and design, it is important to obtain user demands in a timely manner. Traditional approaches, such as questionnaire surveys, user interviews, and eye-tracking experiments, suffer from time lag, long cycles, and high cost. Thanks to the rapid development of e-commerce and social networks, online comments have become an effective carrier for expressing user demands and for identifying product defects. Data mining tools can extract the product features that users care about from massive volumes of online comments, and natural language processing technology can mine and analyze the text to obtain potential customers' opinions, text tendencies, and emotional states. Online comments have become an important basis on which users select and use products, and users' latent demands for products are hidden in them. However, most research on analyzing latent user demands from online comments is still at an early stage. How to acquire users' explicit and implicit demands more comprehensively and accurately remains an important problem requiring in-depth study, and an efficient and accurate product demand mining method is needed.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method for mining the implicit demands of truck users based on natural language processing technology, which makes it convenient to mine the implicit demands of truck users.

The invention adopts the following technical solution to achieve its object:

A method for mining the implicit demands of truck users based on natural language processing technology, characterized by comprising the following steps:

Step one: an experimental platform is built with PyCharm and Python 3.6 on a Windows 10 operating system and an Ubuntu 18.04 operating system, and four folders are created: "truck data", "emotion dictionary", "online comments", and "valid comment text";

Step two: data collection: documents and materials about trucks, emotion dictionaries, and comment text are collected and stored in the corresponding folders;

Step three: comment text validity analysis: invalid comments are separated out, and valid comments are stored in the "valid comment text" folder;

Step four: corpus preprocessing: the data in the "truck data" and "valid comment text" folders are preprocessed with the NLPIR-ICTCLAS Chinese lexical analysis system;

Step five: lexicon construction: the lexicon is divided into an attribute vocabulary library and an emotion vocabulary library. The nouns or noun phrases obtained after the contents of the "truck data" folder are segmented by the NLPIR-ICTCLAS Chinese lexical analysis system serve as the initial attribute vocabulary library; the contents of the "emotion dictionary" folder serve as the initial emotion vocabulary library; among the words obtained after the contents of the "valid comment text" folder are segmented by the NLPIR-ICTCLAS Chinese lexical analysis system, the nouns or noun phrases serve as the candidate attribute vocabulary library and the adjectives or adjective phrases serve as the candidate emotion vocabulary library; finally, screening and de-duplication are carried out to obtain the final attribute vocabulary library and the final emotion vocabulary library;

Step six: valid comment text classification: after manual labeling, the valid comment text is classified with text classification models;

Step seven: attribute vocabulary-emotion vocabulary extraction: criteria for explicit and implicit sentence patterns are defined: if a comment text contains both attribute words and emotion words, it is judged to be an explicit sentence pattern; if it contains only attribute words, it is judged to be an implicit sentence pattern;

Step eight: emotion quantification: emotion polarity analysis and quantification are carried out on the results using the SO-PMI algorithm and the HowNet emotion polarity quantification standard, and the results are organized into a set of <attribute vocabulary, emotion mean> pairs;

Step nine: demand ordering: a truck demand ordering model is constructed by combining the KANO model with the DEMATEL analysis method; the KANO model is used to find the relationship between demands and user satisfaction and to classify the demand attributes; the DEMATEL analysis method is used to quantify the influence relationships between demands; and a clear ordering strategy is finally given;

Step ten: a questionnaire is used to survey online the satisfaction of some truck users with each design element of truck products, and the satisfaction results are compared with the demand ordering results.
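The sentence-pattern rule of step seven can be illustrated with a minimal Python sketch. It assumes each comment has already been segmented into tokens and that the two lexicons are available as sets; the lexicon contents, the function name `classify_sentence_pattern`, and the extra "other" outcome (for comments with no attribute word, a case the rule does not cover) are illustrative assumptions, not part of the described method.

```python
def classify_sentence_pattern(tokens, attribute_lexicon, emotion_lexicon):
    """Return 'explicit', 'implicit', or 'other' for one tokenized comment."""
    has_attribute = any(tok in attribute_lexicon for tok in tokens)
    has_emotion = any(tok in emotion_lexicon for tok in tokens)
    if has_attribute and has_emotion:
        return "explicit"   # attribute word and emotion word both present
    if has_attribute:
        return "implicit"   # attribute word only
    return "other"          # assumption: case not covered by the stated rule


# Hypothetical lexicons and pre-segmented comments, for illustration only.
ATTRIBUTES = {"engine", "cab", "tires"}
EMOTIONS = {"powerful", "noisy", "comfortable"}

print(classify_sentence_pattern(["engine", "is", "powerful"], ATTRIBUTES, EMOTIONS))
print(classify_sentence_pattern(["tires", "after", "rain"], ATTRIBUTES, EMOTIONS))
```

Comments judged implicit are the ones whose sentiment must later be inferred, which is why the rule is applied before emotion quantification.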

As a further limitation of the present technical solution, the data collection in step two is specifically as follows:

Documents and materials about trucks, such as truck market reports, truck news, and truck advertisements, are acquired using the NLPIR-ICTCLAS Chinese word segmentation system and the Octopus web crawler tool, and are stored in the "truck data" folder;

The HowNet emotion dictionary, the Dalian University of Technology emotion vocabulary ontology, the Tsinghua University Li Jun Chinese commendatory and derogatory dictionary, and the National Taiwan University simplified Chinese emotion dictionary are obtained by literature retrieval and are stored in the "emotion dictionary" folder;

Social media comment text about trucks is acquired by manual retrieval and with the Octopus web crawler tool, and is stored in the "online comments" folder.

As a further limitation of the present technical solution, in step three:

Valid comments are defined as follows:

(1) A valid comment must be related to actual product information and must not consist of irrelevant noise; in addition, it should contain information such as user experience rather than only a simple product introduction;

(2) A valid comment covers more product feature dimensions and has more detailed content;

(3) A valid comment has a common sentence structure and content, and is not a sentence built from rare words or unusual structures;

Invalid comments are defined as follows:

(1) An invalid comment has low relevance to the truck product, and its content does not describe specific design elements of the truck product in detail, for example a comment that concerns only the e-commerce platform;

(2) A comment whose content clearly departs from the actual product characteristics is also defined as invalid;

Invalid comments are cleaned to reduce the interference of noisy data with the experimental results:

(1) Deletion of overly short comments: comments that contain only a few characters, a string of characters, or only punctuation marks have no information value and need to be deleted;

(2) Deletion of repeated comments: repeated comments increase the workload and distort the results of emotion analysis and user attention, and need to be deleted;

(3) Special symbol processing: emoticons, special symbols, and garbled characters need to be deleted.
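The three cleaning rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the set of characters kept (CJK characters, ASCII letters and digits, whitespace, and common punctuation), the minimum length of 5 content characters, and the function name `clean_comments` are all hypothetical choices; the relevance judgments in the definitions of valid and invalid comments are manual and are not covered here.

```python
import re

# Assumption: anything outside these ranges counts as a "special symbol".
_SPECIAL = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9\s,.!?;:，。！？；：、]")
# Characters that carry content (used to detect too-short or punctuation-only comments).
_CONTENT = re.compile(r"[\u4e00-\u9fffA-Za-z0-9]")


def clean_comments(comments, min_chars=5):
    """Apply the three cleaning rules and return the surviving comments in order."""
    seen = set()
    cleaned = []
    for raw in comments:
        text = _SPECIAL.sub("", raw)                 # rule (3): drop special symbols
        text = re.sub(r"\s+", " ", text).strip()
        if len(_CONTENT.findall(text)) < min_chars:  # rule (1): too short / punctuation only
            continue
        if text in seen:                             # rule (2): repeated comment
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


sample = [
    "刹车很灵敏，驾驶室也很安静😀",
    "好",
    "！！！。。。",
    "刹车很灵敏，驾驶室也很安静",
    "发动机动力足★★油耗偏高",
]
print(clean_comments(sample))
```

Special symbols are stripped before the duplicate check so that two comments differing only in emoticons are treated as repeats.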

As a further limitation of the present technical solution, the specific flow of step four is:

Step 4.1: data cleaning: content of interest is identified in the corpus, and content that is not of interest is treated as noise and is cleaned out and deleted;

Step 4.2: word segmentation: all text data are processed into characters or words, the smallest units of granularity;

Step 4.3: part-of-speech tagging: each character or word is tagged with a part-of-speech label, such as adjective, verb, or noun;

Step 4.4: stop-word removal: stop words generally refer to words that contribute nothing to the text features, such as punctuation marks, modal particles, and personal pronouns.
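Stop-word removal on already segmented and tagged text can be sketched as below. The sketch assumes the tagger returns (word, tag) pairs with ICTCLAS-style tags in which tags beginning with "w" mark punctuation, "y" modal particles, and "rr" personal pronouns; these tag prefixes, the tiny stop-word list, and the function name are assumptions for illustration and should be checked against the tag set actually in use.

```python
# Assumed ICTCLAS-style tag prefixes: punctuation, modal particle, personal pronoun.
STOP_TAG_PREFIXES = ("w", "y", "rr")
# Hypothetical stop-word list; a real list would be much longer.
STOP_WORDS = {"的", "了"}


def remove_stop_words(tagged_tokens, prefixes=STOP_TAG_PREFIXES, stop_words=STOP_WORDS):
    """Drop tokens whose tag or surface form marks them as stop words."""
    return [
        (word, tag)
        for word, tag in tagged_tokens
        if not tag.startswith(prefixes) and word not in stop_words
    ]


tagged = [("我", "rr"), ("的", "u"), ("卡车", "n"), ("动力", "n"),
          ("很", "d"), ("强", "a"), ("啊", "y"), ("！", "w")]
print(remove_stop_words(tagged))
```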

As a further limitation of the present technical solution, the specific flow of step six is:

Step 6.1: to prevent overfitting during text classification, the truck product design elements are classified first, that is, the valid comment text is classified by attribute according to the ten first-level functional attributes in the truck platform product feature catalog;

Step 6.2: the valid comment text is manually labeled, that is, it is manually judged which of the ten first-level functional attribute labels a valid comment text belongs to; "1" is marked in the label box of the corresponding class and "0" in the label boxes of the other classes, and the resulting data set is used as the training set;

Step 6.3: machine learning models and deep learning models are trained on the training set and are then used to classify by attribute the valid comment text that has not been manually labeled;

Step 6.4: K-fold cross-validation is used to select the text classification result with the highest accuracy for subsequent research.
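The model selection in the last sub-step can be sketched in plain Python. The sketch below is an illustration only: it uses contiguous folds without shuffling, and the two toy "models" (a majority-class baseline and a word-vote rule) merely stand in for the real classifiers; every function name here is hypothetical.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_val_accuracy(fit, samples, labels, k):
    """Mean held-out accuracy; fit(X, y) must return a predict(sample) function."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        predict = fit([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict(samples[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


def select_best_model(models, samples, labels, k):
    """Return (name, accuracy) of the model with the highest mean accuracy."""
    results = {name: cross_val_accuracy(fit, samples, labels, k)
               for name, fit in models.items()}
    best = max(results, key=results.get)
    return best, results[best]


def fit_majority(train_samples, train_labels):
    """Toy baseline: always predict the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda sample: majority


def fit_word_vote(train_samples, train_labels):
    """Toy rule: each known word votes for the labels it was seen with."""
    votes = {}
    for words, label in zip(train_samples, train_labels):
        for w in words:
            votes.setdefault(w, Counter())[label] += 1
    fallback = Counter(train_labels).most_common(1)[0][0]

    def predict(words):
        tally = Counter()
        for w in words:
            tally.update(votes.get(w, {}))
        return tally.most_common(1)[0][0] if tally else fallback

    return predict


X = [["engine", "power"], ["cab", "seat"], ["engine"], ["seat"],
     ["power"], ["cab"], ["engine", "power"], ["cab", "seat"]]
y = ["powertrain", "cab", "powertrain", "cab",
     "powertrain", "cab", "powertrain", "cab"]
print(select_best_model({"majority": fit_majority, "word_vote": fit_word_vote}, X, y, k=4))
```

In practice the data would be shuffled (and ideally stratified by class) before splitting; the fixed order here just keeps the example deterministic.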

As a further limitation of the present technical solution, the specific flow of step eight is:

Emotion polarity analysis uses the SO-PMI algorithm, that is, the semantic orientation pointwise mutual information algorithm. The overall idea of the algorithm is simple: it evaluates the probability that the word to be judged, P(word), and a reference word, P(base), appear together. If the word to be judged is more likely to appear together with positive words, it is judged to be a positive word; if it is more likely to appear together with negative words, it is judged to be a negative word; and if the two probabilities are the same, it is judged to be a neutral word;

The formula of the algorithm is as follows:

$$\text{SO-PMI}(word) = \sum_{i=1}^{num(pos)} PMI(word, pos_i) - \sum_{i=1}^{num(neg)} PMI(word, neg_i) \tag{1}$$

where: num(pos) is the total number of positive reference words;

num(neg), likewise, is the total number of negative reference words;

pos_i is the i-th positive reference word;

neg_i is the i-th negative reference word;

PMI(word, pos_i) is the pointwise mutual information between the word and the positive reference word;

PMI(word, neg_i) is the pointwise mutual information between the word and the negative reference word;

The result of the formula is interpreted as follows:

SO-PMI > 0: the word is judged to be a positive word;

SO-PMI = 0: the word is judged to be a neutral word;

SO-PMI < 0: the word is judged to be a negative word.

After the emotion polarity analysis is completed, the emotion mean is calculated according to the HowNet emotion polarity quantification standard table.
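A minimal Python sketch of the SO-PMI decision is given below. It assumes the common form of the score, the sum of PMI values with the positive reference words minus the sum with the negative reference words, and estimates probabilities from document-level co-occurrence in a small tokenized corpus. Treating a pair that never co-occurs as contributing 0, and all names and sample data, are assumptions made for illustration.

```python
import math


def pmi(word, base, docs):
    """Pointwise mutual information of two words over a list of token sets."""
    n = len(docs)
    df_word = sum(word in d for d in docs)
    df_base = sum(base in d for d in docs)
    co = sum(word in d and base in d for d in docs)
    if co == 0:
        return 0.0  # assumption: no co-occurrence contributes nothing
    # P(word, base) / (P(word) * P(base)) with probabilities = counts / n
    return math.log2(co * n / (df_word * df_base))


def so_pmi(word, pos_words, neg_words, docs):
    """Sum of PMI with positive reference words minus sum with negative ones."""
    positive = sum(pmi(word, p, docs) for p in pos_words)
    negative = sum(pmi(word, q, docs) for q in neg_words)
    return positive - negative


def polarity(score):
    """Map an SO-PMI score to the three-way decision described in the text."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical tokenized comments and reference words.
docs = [{"fuel-saving", "good"}, {"fuel-saving", "good"},
        {"thirsty", "bad"}, {"bumpy", "bad"}, {"seat"}]
for w in ("fuel-saving", "bumpy", "seat"):
    print(w, polarity(so_pmi(w, ["good"], ["bad"], docs)))
```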

As a further limitation of the present technical solution, in step nine the KANO model is used to classify the demands by attribute and to identify their priority, implemented as follows:

Step 9.1.1: questionnaire preparation: a questionnaire is designed according to the <attribute vocabulary, satisfaction, user attention> set, to obtain a comprehensive view of user feedback on truck demands;

Step 9.1.2: data processing: to obtain the demand priority ordering more intuitively, the satisfaction coefficient and dissatisfaction coefficient of each demand are calculated with the Better-Worse satisfaction coefficient formulas. During the calculation, indifferent demands and reverse demands are excluded; the specific formulas are formula (2) and formula (3);

When the product provides this demand, the Better coefficient is:

$$\text{Better/SI} = \frac{A + O}{A + O + M + I} \tag{2}$$

When the product does not provide this demand, the Worse coefficient is:

$$\text{Worse/DSI} = -\frac{O + M}{A + O + M + I} \tag{3}$$

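where: A is the number of responses selecting the attractive (charm) demand option;

O is the number of responses selecting the expected demand option;

M is the number of responses selecting the must-be (requisite) demand option;

I is the number of responses selecting the indifferent demand option;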
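The two coefficients can be computed directly from the four response counts. The sketch below uses the commonly cited form of the Better-Worse coefficients, in which the Worse numerator is O + M; that choice, the function name, and the sample counts are assumptions for illustration.

```python
def better_worse(a, o, m, i):
    """Return (Better, Worse) from counts of A, O, M and I responses."""
    total = a + o + m + i
    if total == 0:
        raise ValueError("at least one A/O/M/I response is required")
    better = (a + o) / total    # satisfaction gained when the demand is met
    worse = -(o + m) / total    # satisfaction lost when the demand is not met
    return better, worse


# Hypothetical counts for one demand item: 30 A, 40 O, 20 M, 10 I.
print(better_worse(30, 40, 20, 10))
```

A Better value near 1 means that meeting the demand raises satisfaction strongly; a Worse value near -1 means that failing to meet it lowers satisfaction strongly.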

A four-quadrant demand diagram is drawn from the Better and Worse coefficient values. The ordering principle is: must-be demand M > expected demand O > attractive demand A > indifferent demand I. When several demand items belong to the same attribute, their internal ordering requires an importance calculation; the weight of the i-th demand item is calculated as follows:

where: w_i is the initial weight of the i-th product demand;

k_i is the adjustment coefficient, taken as 1, 2, 4, and 0 for the must-be attribute, the expected attribute, the attractive attribute, and the indifferent attribute, respectively.

From the values of w_i and k_i, the weight w'_i of each demand item can be calculated to determine the ordering of the demand items, and the final product demand priority ordering is determined in combination with the classification principle.

As a further limitation of the present technical solution, in step nine, in combination with the DEMATEL analysis method, an expert group is established and expert opinion is used to further explore the influence relationships between demands, so as to arrive at a more objective product demand ordering method. The calculation steps are briefly as follows:

Step 9.2.1: the influence between demands is scored pairwise to quantify the causal relationships between them, giving the direct influence matrix A, in which a_ij denotes the degree of influence of demand i on demand j;

Step 9.2.2: the normalized direct influence matrix N is obtained by formula (5) and formula (6):

Step 9.2.3: the comprehensive influence matrix T is solved:

$$T = N(I - N)^{-1} \tag{7}$$

where: I is the identity matrix;

Step 9.2.4: a threshold is set, and the coordinate information of the influence causal relationship diagram is output;

Step 9.2.5: the influence causal relationship diagram is divided into four quadrants, and the demand categories are determined.
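The DEMATEL computation can be sketched in plain Python. Because the normalization formulas are not reproduced in the text, the sketch assumes a common choice, dividing the direct influence matrix by its largest row sum. It also assumes that the diagram coordinates are the conventional prominence (D + R) and relation (D - R) values, where D and R are the row and column sums of T. The helper names are hypothetical, and the threshold step is omitted.

```python
def mat_mul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def mat_inv(m):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def dematel(direct):
    """Return (T, prominence, relation) for a direct influence matrix."""
    n = len(direct)
    scale = max(sum(row) for row in direct)  # assumption: largest row sum
    if scale == 0:
        raise ValueError("direct influence matrix has no non-zero entries")
    norm = [[v / scale for v in row] for row in direct]
    i_minus_n = [[float(i == j) - norm[i][j] for j in range(n)] for i in range(n)]
    total = mat_mul(norm, mat_inv(i_minus_n))  # T = N (I - N)^-1
    d = [sum(row) for row in total]            # influence exerted by each demand
    r = [sum(col) for col in zip(*total)]      # influence received by each demand
    prominence = [di + ri for di, ri in zip(d, r)]
    relation = [di - ri for di, ri in zip(d, r)]
    return total, prominence, relation


# Hypothetical expert scores for two demands.
T, prominence, relation = dematel([[0, 2], [1, 0]])
print(T, prominence, relation)
```

A positive relation value marks a demand that mainly influences others (a cause); a negative value marks one that is mainly influenced (an effect).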

Compared with the prior art, the advantages and positive effects of the invention are:

1. The invention takes the online comments of truck users as the data source, uses natural language processing technology to carry out demand mining, uses text classification technology to make implicit demands explicit, and determines demand priority by jointly considering user attention and user satisfaction. Compared with psychological modeling, natural language processing technology can acquire user demands quickly and accurately, which improves accuracy, saves enterprise cost, and allows opinions and suggestions for product design or improvement to be provided in real time; this is of practical significance in the era of big data.

2. Building on current research, the method uses several text classification models and selects the result with the highest accuracy by K-fold cross-validation for mining users' implicit demands, thereby improving accuracy.

3. Some truck product attributes, such as appearance, tires, and the braking system, are affected by environmental factors to different degrees, so that user satisfaction with these attributes is influenced by natural conditions. An environmental factor influence coefficient is therefore proposed, to minimize the influence of environmental factors on users' subjective evaluation of truck attributes.

Drawings

Fig. 1 shows the main technical route of the invention.

Fig. 2 is a schematic diagram of crawling social media comment text with the web crawler of the invention.

Fig. 3 is a schematic diagram of the web crawler code of the invention for crawling test-drive comments from e-commerce websites.

Fig. 4 is a schematic diagram of corpus preprocessing with the NLPIR-ICTCLAS Chinese lexical analysis system according to the invention.

Fig. 5 is a schematic diagram of the lexicon construction process of the invention.

Fig. 6 is a schematic diagram of the manual labeling of valid comment text according to the invention.

Fig. 7 is a schematic diagram of reading the manually labeled text corpus data according to the invention.

Fig. 8 is a schematic diagram of the cleaning of text data according to the invention.

Fig. 9 is a schematic diagram of the word sequence of the invention.

Fig. 10 is a schematic diagram of the conversion of text to numeric codes according to the invention.

Fig. 11 is a detailed process diagram of the three machine learning models of the invention: naive Bayes, random forest, and support vector machine.

Fig. 12 is a schematic diagram of the operation of the BiLSTM model of the invention.

Fig. 13 is a schematic diagram of the operation of the Attention_BiLSTM model of the invention.

Fig. 14 is a schematic diagram of the KANO model of the invention.

Fig. 15 is the four-quadrant influence causal relationship diagram of the invention.

Fig. 16 is a diagram of the truck demand prioritization model of the invention.

Detailed Description

One embodiment of the present invention will be described in detail below with reference to the attached drawings, but it should be understood that the scope of the present invention is not limited by the embodiment.

As shown in Figs. 1 to 16, the invention comprises the following steps:

Step one: an experimental platform is built with PyCharm and Python 3.6 on a Windows 10 operating system and an Ubuntu 18.04 operating system, and four folders are created: "truck data", "emotion dictionary", "online comments", and "valid comment text". Text mining, demand analysis, and related experiments are carried out with the NLPIR-ICTCLAS Chinese word segmentation system, and online comments and product community posts are crawled with the Octopus web crawler tool to serve as the comment corpus and experimental data set.

Step two: data collection: documents and materials about trucks, emotion dictionaries, and comment text are collected and stored in the corresponding folders.

The NLPIR-ICTCLAS Chinese word segmentation system and the Octopus web crawler tool are used to acquire documents and materials about trucks, such as truck product feature catalogs, truck market reports from January 1, 2019 to December 31, 2020, truck news, and truck advertisements, which are stored in the "truck data" folder.

The HowNet emotion dictionary, the Dalian University of Technology emotion vocabulary ontology, the Tsinghua University Li Jun Chinese commendatory and derogatory dictionary, and the National Taiwan University simplified Chinese emotion dictionary are obtained by literature retrieval and stored in the "emotion dictionary" folder.

Social media comment text about trucks from January 1, 2019 to December 31, 2021 is acquired by manual retrieval and with the Octopus web crawler tool, including comments from the "Truck Home" forum, test-drive comments on e-commerce websites, and comments on short-video apps, for a total of 4463 comments, which are stored in the "online comments" folder.

Step three: comment text validity analysis: invalid comments are separated out, and valid comments are stored in the "valid comment text" folder.

The validity of an online comment means that the recipient of the comment can obtain useful information from it to support decision-making; for users, the main role of a valid comment is to eliminate uncertainty; from the standpoint of product design or user demand mining research, the aim is to obtain from the comment data users' detailed evaluations of product design elements.

Online comments contain a great deal of useless information. To ensure that the collected text has research value, valid comments are defined from the perspective of truck products, drawing on other scholars' findings on the factors that influence comment validity, in terms of the relevance of the comment content to truck product features, the specificity of the comment content, and the readability of the comment content:

(1) A valid comment must be related to actual product information and must not consist of irrelevant noise; in addition, it should contain information such as user experience rather than only a simple product introduction.

(2) A valid comment should cover more product feature dimensions and have more detailed content.

(3) A valid comment has a common sentence structure and content, and is not a sentence built from rare words or unusual structures.

The characteristics of invalid comments are the opposite of those of valid comments. Invalid comments cannot convey users' real experience to the enterprise; they make user demand mining more difficult and are likely to interfere with the demand mining results.

Invalid comments are defined as follows:

(1) An invalid comment has low relevance to the truck product, and its content does not describe specific design elements of the truck product in detail, for example a comment that concerns only the e-commerce platform.

(2) A comment whose content clearly departs from the actual product characteristics is also defined as invalid.

Invalid comments are cleaned to reduce the interference of noisy data with the experimental results.

(1) Deletion of overly short comments: comments that contain only a few characters, a string of characters, or only punctuation marks have no information value and need to be deleted.

(2) Deletion of repeated comments: repeated comments increase the workload and distort the results of emotion analysis and user attention, and need to be deleted.

The above criteria serve as the basis for manually labeling the validity of comment text. After validity analysis of the comment text, 2713 valid comments are obtained and stored in the "valid comment text" folder.

Step four: corpus preprocessing: the data in the "truck data" and "valid comment text" folders are preprocessed with the NLPIR-ICTCLAS Chinese lexical analysis system.

Step 4.1: data cleaning: content of interest is identified in the corpus, and content that is not of interest is treated as noise and is cleaned out and deleted; this includes extracting information such as titles, abstracts, and body text from the original text, and removing advertisements, tags, HTML, JS code, code comments, and the like from crawled web content.
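Removing markup from crawled pages can be sketched with Python's standard `html.parser` module. This is only an illustration of the tag, script, and HTML-comment removal mentioned above; identifying advertisements or extracting titles and abstracts needs site-specific rules that are not shown, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # HTML comments never reach this method, so they are dropped implicitly.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_markup(html):
    """Return the visible text of an HTML fragment, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = ('<div class="post"><script>var x = 1;</script><p>刹车很灵敏</p>'
        '<!-- note --><span>油耗偏高</span></div>')
print(strip_markup(page))
```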

Step 4.2: word segmentation: Chinese corpus data generally take the form of short or long texts; for text mining analysis, all text data need to be processed into characters or words, the smallest units of granularity.

Step 4.3: part-of-speech tagging: each character or word is tagged with a part-of-speech label, such as adjective, verb, or noun; this allows more useful linguistic information to be incorporated in later processing. The ICTCLAS Chinese part-of-speech tag set is used, as shown in Table 1.

Table 1 ICTCLAS Chinese part-of-speech tag set

Step five: lexicon construction: the lexicon is divided into an attribute vocabulary library and an emotion vocabulary library. The nouns or noun phrases obtained after the contents of the "truck data" folder are segmented by the NLPIR-ICTCLAS Chinese lexical analysis system serve as the initial attribute vocabulary library; the contents of the "emotion dictionary" folder serve as the initial emotion vocabulary library; among the words obtained after the contents of the "valid comment text" folder are segmented by the NLPIR-ICTCLAS Chinese lexical analysis system, the nouns or noun phrases serve as the candidate attribute vocabulary library and the adjectives or adjective phrases serve as the candidate emotion vocabulary library; finally, screening and de-duplication are carried out to obtain the final attribute vocabulary library and the final emotion vocabulary library.
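The merging and de-duplication part of lexicon construction can be sketched as follows. The sketch assumes the segmenter output is available as (word, tag) pairs in which noun tags begin with "n" and adjective tags begin with "a" (ICTCLAS-style, assumed). The screening step is a manual judgment and is not modeled; the function name and sample data are hypothetical.

```python
def build_lexicons(truck_tokens, comment_tokens, emotion_dictionary):
    """Return (attribute_lexicon, emotion_lexicon) as sorted, de-duplicated lists."""
    initial_attributes = {w for w, t in truck_tokens if t.startswith("n")}
    candidate_attributes = {w for w, t in comment_tokens if t.startswith("n")}
    candidate_emotions = {w for w, t in comment_tokens if t.startswith("a")}
    # Set union merges the initial and candidate libraries and removes duplicates.
    attribute_lexicon = sorted(initial_attributes | candidate_attributes)
    emotion_lexicon = sorted(set(emotion_dictionary) | candidate_emotions)
    return attribute_lexicon, emotion_lexicon


# Hypothetical tagged tokens from the two folders and a tiny emotion dictionary.
truck_tokens = [("engine", "n"), ("gearbox", "n"), ("sells", "v")]
comment_tokens = [("engine", "n"), ("seat", "n"), ("comfortable", "a"), ("noisy", "a")]
print(build_lexicons(truck_tokens, comment_tokens, ["comfortable", "reliable"]))
```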

Step six: valid comment text classification: after manual labeling, the valid comment text is classified with text classification models.

Step 6.1: to prevent overfitting during text classification, the truck product design elements are classified first, that is, the valid comment text is classified by attribute according to the ten first-level functional attributes in the truck platform product feature catalog.

A focus group of 12 members engaged in industrial design research was established, 6 men and 6 women, aged 24 to 28, to carry out this classification work.

The product feature catalog is a complete language system for describing exactly what a product is, ranging from top-level index categories down to specific product function items. Indices from product definition research can be refined to the third level, and the refined development target values can go down to the fourth level. The first-level feature catalog represents the customer domain: vehicle characteristics that directly affect the purchase decision, mainly those related to the customer. The second-level feature catalog represents the functional domain: secondary or functional vehicle characteristics, which are the functions and attributes necessary to achieve the first-level characteristics. The third-level feature catalog represents the physical domain: the physical preconditions for realizing the functional-domain characteristics, consisting of a large number of refined target values. Feature catalogs at the fourth level and above are described in more detail as required.

The truck platform product feature catalog contains 1250 items. The catalog is general and applies to various vehicle types; its first-level feature catalog (customer domain) contains 10 items, its second-level feature catalog (functional domain) contains 54 items, and its third-level feature catalog (physical domain) contains 265 items, with some features further decomposed into fourth-level and fifth-level technical index layers. A specific project needs to draw up its own feature catalog according to its characteristics, selecting representative items from the general catalog.

Step 6.2: the valid comment text is manually labeled, that is, it is manually judged which of the ten first-level functional attribute labels a valid comment text belongs to; "1" is marked in the label box of the corresponding class and "0" in the label boxes of the other classes, and the resulting data set is used as the training set.

Hierarchical division is carried out according to the first-level functional attributes and second-level functional attributes in the truck platform product feature catalog and the nouns or noun phrases in the final attribute vocabulary library; the division process is shown in Table 2. The classification results for the first-level functional attributes and the final attribute vocabulary are retained, as shown in Table 3. The same focus group then manually labeled a total of 275 valid comment texts, as shown in Fig. 6.

Table 2 Truck product design element classification process

Table 3 Layering results for truck product design elements
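The 1/0 labeling scheme of the manual annotation step amounts to a one-hot encoding over the ten first-level functional attributes. A minimal sketch is shown below; because the attribute names are not listed in the text, the placeholder names `category_1` to `category_10` and the function name are hypothetical.

```python
# Placeholder names for the ten first-level functional attributes (assumed).
CATEGORIES = [f"category_{i}" for i in range(1, 11)]


def one_hot(label, categories=CATEGORIES):
    """Return a 1/0 label row with a single 1 in the column of `label`."""
    if label not in categories:
        raise ValueError(f"unknown category: {label!r}")
    return [1 if c == label else 0 for c in categories]


print(one_hot("category_3"))
```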

Step 6.3: machine learning models and deep learning models are trained on the training set and are then used to classify by attribute the valid comment text that has not been manually labeled.

Six text classification models are used in this study: three machine learning models, namely naive Bayes, random forest, and the support vector machine (SVM), and three deep learning models, namely LSTM (Long Short-Term Memory network), BiLSTM (Bi-directional Long Short-Term Memory network), and Attention-BiLSTM (attention-based bi-directional long short-term memory network).

Naive Bayes is a classification method based on Bayes' theorem and the assumption of conditional independence among features. For a given training data set, the joint probability distribution of input and output is first learned under the conditional independence assumption; then, based on this model, Bayes' theorem is used to find the output with the maximum posterior probability for a given input. Naive Bayes is thus a classification algorithm grounded in probability theory.
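As an illustration of the maximum-posterior decision described above, the following is a minimal multinomial naive Bayes sketch for tokenized text. It uses add-one (Laplace) smoothing and ignores words never seen in training; both choices, the class name, and the sample data are assumptions made for the example and are not taken from the source.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial naive Bayes over word lists with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # used for the prior P(c)
        self.word_counts = defaultdict(Counter)   # used for P(word | c)
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_posterior(self, words, label):
        """log P(c) + sum of log P(word | c), up to a constant shared by all classes."""
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_docs)
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for w in words:
            if w in self.vocab:  # assumption: unseen words are ignored
                score += math.log((self.word_counts[label][w] + 1) / denom)
        return score

    def predict(self, words):
        return max(self.class_counts, key=lambda c: self.log_posterior(words, c))


# Hypothetical labeled comments.
train_docs = [["engine", "power"], ["engine", "torque"],
              ["seat", "comfort"], ["seat", "cab"]]
train_labels = ["powertrain", "powertrain", "cab", "cab"]
model = NaiveBayesText().fit(train_docs, train_labels)
print(model.predict(["engine"]), model.predict(["seat", "comfort"]))
```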

The random forest algorithm is a supervised learning algorithm and an ensemble learning algorithm that uses decision trees as base learners. First, samples are drawn randomly with replacement from the original data set to generate m training sets, and one decision tree model is trained on each of the m training sets. For a single decision tree model, given n training sample features, the best feature is selected at each split according to information gain, information gain ratio, or the Gini coefficient. Finally, the generated decision trees together form the random forest. For classification problems, the final result is decided by a vote among the tree classifiers; for regression problems, the final prediction is the mean of the predictions of the trees.
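Three building blocks named in this description can be sketched in a few lines: sampling with replacement, a Gini-based split criterion, and majority voting. The sketch deliberately stops short of growing trees, so it is not a full random forest, and the function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(samples, labels, rng):
    """Draw len(samples) items with replacement to form one training set."""
    idx = [rng.randrange(len(samples)) for _ in samples]
    return [samples[i] for i in idx], [labels[i] for i in idx]


def gini(labels):
    """Gini impurity of a label list: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left_labels, right_labels):
    """Size-weighted Gini impurity of a candidate split (lower is better)."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) * gini(left_labels)
            + len(right_labels) * gini(right_labels)) / n


def majority_vote(predictions):
    """Final class for one sample, given the predictions of all trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
xs, ys = bootstrap_sample(["c1", "c2", "c3", "c4"], ["cab", "cab", "engine", "engine"], rng)
print(len(xs), gini(["cab", "cab", "engine", "engine"]), majority_vote(["cab", "engine", "cab"]))
```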

The support vector machine (SVM) is a binary classification model. Its basic model is the linear classifier with the maximum margin in feature space, and the maximum margin is what distinguishes it from the perceptron. The SVM also includes the kernel trick, which makes it an essentially nonlinear classifier. The learning strategy of the SVM is margin maximization, which can be formulated as a convex quadratic programming problem and is equivalent to minimizing a regularized hinge loss function. The learning algorithm of the SVM is therefore an optimization algorithm for solving convex quadratic programming. The core of the algorithm is to compute the geometric margin, maximize it, and handle the linearly separable problem.

LSTM (Long Short-Term Memory network) is an important and currently widely used time-series algorithm. It is a special RNN (Recurrent Neural Network) capable of learning long-term dependencies, designed mainly to address the vanishing gradient and exploding gradient problems that arise when training on long sequences. In short, LSTM performs better than an ordinary RNN on longer sequences.

BiLSTM (Bi-directional Long Short-Term Memory network) consists of two LSTMs: one processes the input sequence in the forward direction and the other processes it in the reverse direction, and once processing is complete the outputs of the two LSTMs are concatenated. The final BiLSTM output is available only after all time steps have been computed. The forward LSTM produces one result vector after m time steps, the reverse LSTM produces another result vector after n time steps, and the two vectors are concatenated to give the final BiLSTM output.

Attention-BiLSTM (attention-based bidirectional long short-term memory network) was first proposed in the paper "Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification", published by the Chinese Academy of Sciences in 2015 and aimed mainly at relation classification. It combines BiLSTM with an attention mechanism that automatically focuses on the words with a decisive influence on classification, so that the most important semantic information in a sentence is captured without relying on additional knowledge or natural language processing systems.

The manually annotated text corpus is first imported and read, as shown in fig. 7. The text data are then cleaned, as shown in fig. 8, and the word sequence is constructed, as shown in fig. 9. Cleaning consists of extracting the valid characters with a regular expression, segmenting the text with the open-source Jieba word segmentation tool, and removing stop words, that is, characters and words that are of no use to the study. After segmentation, words are also filtered by length: words of length 1 or less are discarded and all others are kept. Labels are identified by id number, that is, the labels are de-duplicated and then numbered, so that the text is converted into numeric codes, as shown in fig. 10.
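The cleaning steps above can be sketched as follows. This is a minimal illustration, not the embodiment's actual code: the function names, the character ranges treated as "valid" and the toy stop-word list are assumptions, and the segmenter is passed in as a parameter so that the sketch runs without Jieba (with Jieba installed, `jieba.lcut` would be passed instead of `str.split`).

```python
import re

def clean_and_tokenize(text, segment, stopwords):
    # Replace everything except Chinese characters, letters and digits with a
    # space (an assumed definition of "valid characters").
    valid = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9]+", " ", text)
    tokens = (tok.strip() for tok in segment(valid))
    # Drop stop words and tokens whose length is 1 or less.
    return [t for t in tokens if len(t) > 1 and t not in stopwords]

def encode_labels(labels):
    # De-duplicate labels in first-seen order and give each an integer id.
    label_to_id = {}
    for label in labels:
        label_to_id.setdefault(label, len(label_to_id))
    return label_to_id

# str.split stands in for the segmenter in this demo.
tokens = clean_and_tokenize("the engine power!! is very strong", str.split, {"the", "is"})
label_ids = encode_labels(["power", "comfort", "power", "safety"])
```

Passing the segmenter as an argument keeps the length filter and stop-word filter independent of the segmentation tool.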

After processing, word vectors are trained with the Word2vec tool, the data are divided into a training set and a test set, and an emotion classification model based on deep learning is constructed.

The machine learning process is as follows:

1. perform word segmentation and remove stop words;

2. obtain the text representation by training TF-IDF;

3. input the features into the corresponding model for training and evaluate the metrics.

The detailed process for the three machine learning models, Naive Bayes, Random Forest and Support Vector Machine (SVM), is shown in fig. 11.
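The TF-IDF step of the machine learning process can be illustrated with a small pure-Python sketch. The weighting used here (term frequency divided by document length, idf = log(N/df)) is one common variant and is an assumption; the embodiment does not state which variant or library is used, and the function name `tfidf_vectors` is hypothetical. Documents are assumed to be non-empty token lists.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of non-empty token lists. Returns (vocabulary, TF-IDF vectors)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[tok] / len(doc) * idf[tok] for tok in vocab])
    return vocab, vectors

docs = [["engine", "power", "strong"], ["seat", "comfort"], ["engine", "noise"]]
vocab, vectors = tfidf_vectors(docs)
```

The resulting vectors would then be fed to the chosen classifier (Naive Bayes, Random Forest or SVM) for training and evaluation.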

The deep learning process is as follows:

1. construct a word sequence dictionary from the segmented data to obtain the mapping word: index;

2. train the embedding of each word with Word2vec and combine it with the word sequence dictionary to obtain the mapping index: embedding;

3. convert the data into id numbers and randomly divide them into a training set and a test set;

4. build the LSTM, BiLSTM and Attention_BiLSTM models, set the corresponding parameters, and train.
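Steps 1 and 3 of the deep learning process can be sketched as follows. The helper names, the reserved index 0 for padding and unknown words, the fixed sequence length and the 80/20 split ratio are all assumptions made for illustration; the Word2vec training and the network definitions are not shown.

```python
import random

def build_word_index(tokenized_docs):
    # Index 0 is reserved for padding/unknown words (an assumption of this sketch).
    word_index = {}
    for doc in tokenized_docs:
        for tok in doc:
            word_index.setdefault(tok, len(word_index) + 1)
    return word_index

def to_ids(doc, word_index, max_len):
    # Map tokens to ids, truncate to max_len, then pad with zeros.
    ids = [word_index.get(tok, 0) for tok in doc][:max_len]
    return ids + [0] * (max_len - len(ids))

def split_train_test(samples, test_ratio=0.2, seed=42):
    # Shuffle a copy with a fixed seed so the split is reproducible.
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

docs = [["engine", "power", "strong"], ["engine", "noise"]]
word_index = build_word_index(docs)
encoded = [to_ids(d, word_index, max_len=4) for d in docs]
train, test = split_train_test(list(range(10)))
```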

The operation of BiLSTM (Bi-directional Long Short-Term Memory network) is shown in fig. 12, and the operation of Attention-BiLSTM (attention-based bidirectional long short-term memory network) is shown in fig. 13.

The basic idea of cross-validation is to partition the original data set into groups, one part serving as the training set and the other as the validation set (or test set). The classifier is first trained on the training set, and the trained model is then tested on the validation set; the result serves as a performance index for evaluating the classifier. This study uses K-fold cross-validation and selects the text classification result with the highest accuracy for subsequent research.

The sample data are divided into 10 parts, training and testing are carried out by K-fold cross-validation, the evaluation metrics are computed on each fold, and the mean over the 10 runs is taken as the final result. Precision (P), recall (R) and the F1 value (F1-score) are metrics commonly used for model evaluation. Precision is the proportion of correct predictions among the samples assigned to a class, recall is the proportion of the samples of a class in the data set that are predicted correctly, and the F1 value is a composite index, the harmonic mean of precision and recall, with a maximum of 1 and a minimum of 0. The higher the F1 value, the better the text classification model performs. The average evaluation results of the 6 models are shown in table 4; the LSTM model achieves the highest accuracy.
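The evaluation procedure can be sketched in plain Python as follows. The function names are hypothetical, the folds are contiguous blocks (so the data are assumed to have been shuffled beforehand), and precision, recall and F1 are computed for a single class treated as positive; how the embodiment averages over the ten classes is not stated.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; folds are contiguous blocks."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def precision_recall_f1(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

folds = list(k_fold_indices(25, k=10))
p, r, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1], positive=1)
```

In a full run, the model would be trained on each `train_idx`, scored on the matching `test_idx`, and the ten scores averaged.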

Table 4 Average evaluation results of the 6 text classification models

The results of the LSTM model were exported; part of them are shown in table 5.

Table 5 Classification results of the LSTM model (partial)

Step seven: attribute vocabulary-emotion vocabulary extraction. A criterion is defined for distinguishing explicit and implicit sentence patterns: if the comment text contains both attribute words and emotion words, it is judged to be an explicit sentence pattern; if the comment text contains only attribute words, it is judged to be an implicit sentence pattern.
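The judging criterion of step seven amounts to two membership tests against the attribute and emotion vocabulary libraries. A minimal sketch, with a hypothetical function name and toy vocabularies, could look like this; the "other" return value for comments without any attribute word is an assumption, since the criterion only defines the two patterns.

```python
def sentence_pattern(tokens, attribute_words, emotion_words):
    has_attribute = any(t in attribute_words for t in tokens)
    has_emotion = any(t in emotion_words for t in tokens)
    if has_attribute and has_emotion:
        return "explicit"
    if has_attribute:
        return "implicit"
    return "other"  # no attribute word: outside the two defined patterns

# Toy vocabularies for illustration only.
attribute_words = {"engine", "seat"}
emotion_words = {"strong", "comfortable"}
explicit = sentence_pattern(["engine", "strong"], attribute_words, emotion_words)
implicit = sentence_pattern(["engine", "mountain", "road"], attribute_words, emotion_words)
```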

Step eight: emotion quantification. Emotion polarity analysis and quantification are carried out on the result using the SO-PMI algorithm and the HOWNET emotion polarity quantification standard, and the results are organized into the <attribute word, emotion mean> set.

The formula of the algorithm is as follows:

wherein: num(pos) is the total number of positive reference words;

similarly, num(neg) is the total number of negative reference words;

pos_i is the i-th positive reference word;

neg_i is the i-th negative reference word;
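For reference, a commonly used form of SO-PMI that is consistent with the symbols defined above is given below. It is stated here as an assumption based on the standard definition of the algorithm; some variants additionally divide each sum by num(pos) or num(neg), and the probabilities are typically estimated from co-occurrence counts in the comment corpus.

```latex
\mathrm{SO\text{-}PMI}(word) = \sum_{i=1}^{num(pos)} \mathrm{PMI}(word, pos_i) - \sum_{i=1}^{num(neg)} \mathrm{PMI}(word, neg_i),
\qquad
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}
```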

The results of the formula are interpreted as follows:

if SO-PMI > 0, the word is judged to be a positive word;

if SO-PMI = 0, the word is judged to be a neutral word;

if SO-PMI < 0, the word is judged to be a negative word.
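The polarity decision can be illustrated with a short sketch that computes SO-PMI from document-level co-occurrence counts. It assumes the standard unnormalized form of SO-PMI (sum of PMI with the positive reference words minus sum of PMI with the negative reference words) and treats pairs that never co-occur as PMI = 0; the function names and the toy counts are made up for illustration.

```python
import math

def pmi(word, ref, count, pair_count, total):
    """PMI from document counts; 0.0 when the pair never co-occurs (assumed)."""
    joint = pair_count.get(frozenset((word, ref)), 0)
    if joint == 0 or count.get(word, 0) == 0 or count.get(ref, 0) == 0:
        return 0.0
    # P(w, r) / (P(w) P(r)) expressed with raw counts.
    return math.log2(joint * total / (count[word] * count[ref]))

def so_pmi(word, pos_refs, neg_refs, count, pair_count, total):
    pos = sum(pmi(word, p, count, pair_count, total) for p in pos_refs)
    neg = sum(pmi(word, n, count, pair_count, total) for n in neg_refs)
    return pos - neg

def polarity(score):
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

# Toy counts over 100 comments (made-up numbers).
count = {"quiet": 10, "good": 20, "bad": 20}
pair_count = {frozenset(("quiet", "good")): 8, frozenset(("quiet", "bad")): 1}
score = so_pmi("quiet", ["good"], ["bad"], count, pair_count, total=100)
```

With these counts the word co-occurs with the positive reference word far more often than chance and with the negative one less often than chance, so the score is positive.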

Table 6 HOWNET emotion polarity quantification standard

Considering that natural conditions affect some functional attributes, the focus group method is used to determine influence coefficients for environmental factors, and all truck attributes are divided into three types: unaffected, slightly affected and strongly affected. Satisfaction is then calculated from these coefficients together with the emotion means, and the results are organized into the <attribute word, satisfaction> set. Combined with the occurrence frequency of each attribute word, user attention is obtained by word frequency statistics. Summarizing these results gives the <attribute word, satisfaction, user attention> set.
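The word frequency statistics behind user attention can be sketched as follows. Expressing attention as each attribute word's share of all attribute-word occurrences is an assumption made for illustration; the text only states that attention is derived from occurrence frequency, and the function name is hypothetical.

```python
from collections import Counter

def user_attention(tokenized_comments, attribute_words):
    """Share of all attribute-word occurrences taken by each attribute word."""
    freq = Counter(
        t for doc in tokenized_comments for t in doc if t in attribute_words
    )
    total = sum(freq.values())
    return {word: n / total for word, n in freq.items()} if total else {}

comments = [["engine", "noise"], ["engine", "seat"], ["engine"]]
attention = user_attention(comments, {"engine", "seat"})
```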

Step nine: demand ordering. The KANO model and the DEMATEL analysis method are combined to construct a truck demand ordering model; the specific flow is shown in table 7. The KANO model is used to find the relation between demands and user satisfaction and to classify the demand attributes; the DEMATEL analysis method is used to quantify the influence relations among demands; finally, a clear ordering strategy is given.

Table 7 Construction of the truck demand ordering model

The KANO model is used to classify the attributes and identify the demand priorities, with the following implementation steps:

Step nine-one-one: questionnaire preparation. A questionnaire is designed according to the <attribute word, satisfaction, user attention> set in order to gain a comprehensive understanding of user feedback on truck demands.

Step nine two: data processing. To obtain the priority ordering of the demands more intuitively, the satisfaction coefficient and dissatisfaction coefficient of every demand are calculated with the Better-Worse satisfaction coefficient formulas. Indifferent demands and reverse demands are removed in the calculation; the specific formulas are given as formula (2) and formula (3).

When the product provides this demand, the Better coefficient is:

Better/SI ＝ (A+O)/(A+O+M+I) (2)

When the product does not provide this demand, the Worse coefficient is:

Worse/DSI ＝ -(O+M)/(A+O+M+I) (3)

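wherein: A is the number of responses classifying the demand as a charm demand;

O is the number of responses classifying the demand as an expected demand;

M is the number of responses classifying the demand as a must-be demand;

I is the number of responses classifying the demand as an indifferent demand.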
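The Better-Worse calculation can be sketched as follows. The sketch uses the form usually given in the KANO literature, in which the Worse numerator counts the expected (O) and must-be (M) responses; this choice, the function name and the response counts are assumptions for illustration.

```python
def better_worse(a, o, m, i):
    """Return (Better, Worse) from the counts of A, O, M and I responses."""
    total = a + o + m + i
    if total == 0:
        raise ValueError("no valid responses")
    better = (a + o) / total    # satisfaction gained when the demand is met
    worse = -(o + m) / total    # satisfaction lost when the demand is not met
    return better, worse

# Made-up questionnaire counts for one demand.
better, worse = better_worse(a=30, o=40, m=20, i=10)
```

A Better value close to 1 means that providing the demand raises satisfaction strongly, and a Worse value close to -1 means that omitting it lowers satisfaction strongly.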

wherein: w_i is the initial weight of the i-th product demand;

k_i is the adjustment coefficient, taken as 1, 2, 4 and 0 for the must-be, expected, charm and indifferent attributes respectively.

In combination with the DEMATEL analysis method, an expert group is established and expert opinions are used to further explore the influence relations among demands, turning the procedure into a more objective product demand ordering method. The calculation steps are, in brief:

Step nine-two-one: the influence between demands is scored pairwise to quantify the causal relations among them, giving the direct influence matrix A, where a_ij denotes the degree of influence of demand i on demand j.

Step nine-two-three: solve the comprehensive influence matrix T:

T ＝ N(I-N)^(-1) (7)

wherein: I is the identity matrix.
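Formula (7) can be illustrated with a small pure-Python sketch. The normalization of the direct influence matrix A into N is not reproduced in this text; dividing by the largest row sum, as done here, is a common DEMATEL choice and is an assumption, as are the function names and the toy matrix. In practice a numerical library would normally perform the inversion.

```python
def mat_mul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def mat_inv(m):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(m)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def total_influence(direct):
    n = len(direct)
    # Assumed normalization: divide by the largest row sum of A.
    scale = max(sum(row) for row in direct)
    norm = [[v / scale for v in row] for row in direct]
    i_minus_n = [[float(i == j) - norm[i][j] for j in range(n)] for i in range(n)]
    return mat_mul(norm, mat_inv(i_minus_n))  # T = N (I - N)^(-1)

A = [[0, 2], [1, 0]]  # toy direct influence scores between two demands
T = total_influence(A)
```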

Step nine-two-four: set a threshold and output the coordinate information of the influence causal relation graph. The influence causal relation graph (INRM) is constructed from the information in the comprehensive influence matrix T, and its coordinate information is defined in table 8.

Table 8 INRM coordinate information definition

The INRM is divided into four quadrants by the mean of the centrality D+R (horizontal axis) and the mean of the causality D-R (vertical axis), as shown in fig. 15. In the four-quadrant influence causal relation graph, each quadrant has a different meaning and different characteristics, and the position of a specific demand in the graph determines the category to which it belongs.
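The quadrant division can be sketched as follows. The sketch follows the usual DEMATEL convention that D is the row sum of T (influence given) and R is the column sum (influence received), and numbers the quadrants counter-clockwise from the upper right; both conventions, the function names and the example matrix are assumptions, since the contents of table 8 are not reproduced here.

```python
def inrm_coordinates(total):
    """Return (centrality D+R, causality D-R) for each demand."""
    d = [sum(row) for row in total]          # influence given (row sums)
    r = [sum(col) for col in zip(*total)]    # influence received (column sums)
    return [(di + ri, di - ri) for di, ri in zip(d, r)]

def quadrants(coords):
    mean_x = sum(x for x, _ in coords) / len(coords)
    mean_y = sum(y for _, y in coords) / len(coords)
    result = []
    for x, y in coords:
        if y >= mean_y:
            result.append(1 if x >= mean_x else 2)
        else:
            result.append(4 if x >= mean_x else 3)
    return result

# Made-up comprehensive influence matrix for three demands.
T = [[1.0, 2.0, 0.5], [1.0, 1.0, 0.2], [0.3, 0.1, 0.4]]
coords = inrm_coordinates(T)
quads = quadrants(coords)
```

Because the D-R values always sum to zero, the mean of the vertical axis is zero up to rounding, so the horizontal dividing line separates cause demands from effect demands.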

The KANO model, which classifies user demands, is taken as the primary criterion, and the DEMATEL analysis method, in which experts make the judgments, is taken as the secondary criterion for adjustment and supplementation. Together they form a hierarchical truck demand ordering model comprising a target layer, a KANO criterion layer, a DEMATEL criterion layer and a demand ordering table. The target layer is the prioritization of truck demands. The KANO criterion layer gives a preliminary prioritization of the product demands in the order M > O > A > I. The DEMATEL criterion layer identifies the key demands and their mutual influence relations through the INRM; demands are ordered first quadrant > second quadrant > third quadrant > fourth quadrant, and demands within the same quadrant are further ordered from high to low centrality. After this combined qualitative and quantitative ordering of all demands, a truck demand priority ordering table is formed, which conveys the demand priorities and influence relations more effectively; the specific process is shown in fig. 16.
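The hierarchical ordering can be read as a lexicographic sort: KANO category first, INRM quadrant second, centrality third. The sketch below implements that reading; the combination rule, the field names and the sample demands are assumptions for illustration rather than the exact procedure of fig. 16.

```python
KANO_RANK = {"M": 0, "O": 1, "A": 2, "I": 3}  # M > O > A > I

def rank_demands(demands):
    """demands: list of dicts with keys name, kano, quadrant, centrality."""
    return sorted(
        demands,
        key=lambda d: (KANO_RANK[d["kano"]], d["quadrant"], -d["centrality"]),
    )

# Hypothetical demands with made-up values.
demands = [
    {"name": "d1", "kano": "O", "quadrant": 1, "centrality": 5.8},
    {"name": "d2", "kano": "M", "quadrant": 4, "centrality": 5.3},
    {"name": "d3", "kano": "O", "quadrant": 1, "centrality": 6.1},
    {"name": "d4", "kano": "A", "quadrant": 2, "centrality": 1.9},
]
ordered = [d["name"] for d in rank_demands(demands)]
```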

The above disclosure presents only specific embodiments of the present invention; the invention is not limited thereto, and any variation conceivable by those skilled in the art shall fall within the scope of protection of the present invention.

Claims

1. A method for mining implicit demands of truck users based on natural language processing technology, characterized by comprising the following steps:

step one: build an experimental platform with PyCharm and Python 3.6 on a Windows 10 64-bit operating system and an Ubuntu 18.04 operating system, and create folders for truck data, the emotion dictionary, online comments and effective comment text;

2. The method for mining implicit demands of truck users based on natural language processing technology according to claim 1, characterized in that the data collection in step two is specifically as follows:

3. The method for mining implicit demands of truck users based on natural language processing technology according to claim 2, characterized in that, in step three:

defining valid comments:

defining invalid comments:

4. The method for mining implicit demands of truck users based on natural language processing technology according to claim 3, characterized in that the specific flow of step four is as follows:

5. The method for mining implicit demands of truck users based on natural language processing technology according to claim 4, characterized in that the specific flow of step six is as follows:

6. The method for mining implicit demands of truck users based on natural language processing technology according to claim 5, characterized in that the specific flow of step eight is as follows:

the formula of the algorithm is as follows:

wherein: num(pos) is the total number of positive reference words;

similarly, num(neg) is the total number of negative reference words;

pos_i is the i-th positive reference word;

neg_i is the i-th negative reference word;

the results of the formula are interpreted as follows:

if SO-PMI > 0, the word is judged to be a positive word;

if SO-PMI = 0, the word is judged to be a neutral word;

if SO-PMI < 0, the word is judged to be a negative word.

7. The method for mining implicit demands of truck users based on natural language processing technology according to claim 6, characterized in that in step nine the KANO model is used to classify the attributes and identify the demand priorities, with the following implementation steps:

when the product provides this demand, the Better coefficient is:

Better/SI＝(A+O)/(A+O+M+I) (2)

when the product does not provide this demand, the Worse coefficient is:

Worse/DSI＝-(O+M)/(A+O+M+I) (3)

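wherein: A is the number of responses classifying the demand as a charm demand;

O is the number of responses classifying the demand as an expected demand;

M is the number of responses classifying the demand as a must-be demand;

I is the number of responses classifying the demand as an indifferent demand;

wherein: w_i is the initial weight of the i-th product demand;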

8. The method for mining implicit demands of truck users based on natural language processing technology according to claim 6, characterized in that in step nine, in combination with the DEMATEL analysis method, an expert group is established and expert opinions are used to further explore the influence relations among demands, turning the procedure into a more objective product demand ordering method, the calculation steps being, in brief:

step nine-two-three: solve the comprehensive influence matrix T:

T＝N(I-N)^(-1) (7)

wherein: I is the identity matrix;