CN114265931A

CN114265931A - Big data text mining-based consumer policy perception analysis method and system

Info

Publication number: CN114265931A
Application number: CN202111434036.7A
Authority: CN
Inventors: 刘勤; 詹若贤; 贾梦婷; 谢春晖; 温晓楠
Original assignee: Wuhan University of Technology WUT
Current assignee: Wuhan University of Technology WUT
Priority date: 2021-11-29
Filing date: 2021-11-29
Publication date: 2022-04-01

Abstract

The invention discloses a consumer policy perception mining method and system based on big data text mining, comprising the following steps: acquiring policy text data and consumer comment text data in a specific field from relevant government websites and social platforms; preprocessing the acquired text data, including de-duplication, noise reduction and short-sentence processing; mining professional vocabulary in depth based on machine recognition and an expert knowledge corpus, and constructing a policy corpus; and carrying out policy perception mining analysis on consumers based on the consumer comments in combination with the policy corpus. The method makes full use of online public opinion information, mines consumer comment texts from the perspective of consumer policy perception, and establishes a corpus, grounded in expert knowledge, for consumer policy perception mining, thereby improving text word segmentation.

Description

Big data text mining-based consumer policy perception analysis method and system

Technical Field

The invention relates to the technical field of data mining, and in particular to a consumer policy perception analysis method and system based on big data text mining.

Background

Studying how consumers perceive policies is of great practical significance. Research on consumer policy perception examines which policy content consumers pay attention to, how much attention they pay, how they evaluate a policy's rationality and necessity, and what emotional attitude they hold toward it.

With the spread of internet technology, more and more consumers post their own comments online, producing a large amount of user-generated content (UGC). The text data generated on social media platforms largely reflects what consumers really think, and analyzing it reveals the topics consumers focus on and their emotional attitudes. Meanwhile, the steadily growing body of policy text on government websites is an open and accessible information resource that also carries a great deal of information; mining and analyzing it in depth is an important way to trace policy intentions and understand the attitudes of decision makers. In policy research, the traditional content analysis method relies mainly on experts and scholars studying the policy texts, which involves a heavy workload, narrow coverage and low efficiency. With the continued development of big data technology, quantitative analysis has gained a place in policy text mining research.

Researchers have proposed many solutions for policy mining. However, existing policy mining research rarely addresses the government-consumer relationship, and solutions for consumer policy perception mining are scarce, so there is room for improvement. For example, Chinese patent application No. 202011260570.6, published on February 19, 2021, discloses a tag-similarity-based enterprise policy matching method, which matches enterprises with policies by constructing enterprise tags and policy tags and calculating the similarity between them. That scheme matches the policy supply side with the policy demand side, but the research is conducted at the 'government-enterprise' level and neglects the role consumer public opinion can play in optimizing policy making and implementation, leaving room for improvement.

Among existing solutions that mine policy perception using text mining, few use online public opinion data. For example, Chinese patent application No. 201710934706.9, disclosed on October 9, 2017, describes a matching recommendation method and system based on specific urban populations and associated policies. However, that scheme draws its information mainly from the information systems of government commissions and offices and from questionnaire surveys of specific groups; it does not make full use of online public opinion information and ignores the perception value contained in consumer opinion. The questionnaire approach also suffers from uncertain response and validity rates and insufficient population coverage, and the truthfulness of the answers is difficult to guarantee.

Moreover, the word segmentation tools used in existing solutions lack a policy corpus, combined with an expert knowledge corpus, that would help identify policy-related words in consumer comments, so the accuracy of policy text word segmentation needs improvement. For example, Chinese patent application No. 202110204049.9, published on May 14, 2021, discloses a government-decision-oriented government affairs big data analysis method and device. That scheme constructs a multi-dimensional data mining model from hierarchy division indexes and classification summary indexes related to government affairs data, and performs multi-dimensional mining analysis on the data. However, policy texts differ from ordinary texts in language style and characteristics: their wording is strictly standardized and uses a specific official register, while consumer comments contain a rich variety of policy-related words. Because the existing solution does not construct a dedicated policy corpus, the accuracy of its word segmentation results needs improvement.

It can be seen that the existing policy-aware research methods have the following problems:

most consumer perception data comes from questionnaire surveys, whose response and validity rates cannot be guaranteed, whose population coverage is not wide enough, and whose answers are difficult to verify as truthful;

the existing research based on text mining often ignores the perception value contained in the consumer comment;

existing word segmentation tools lack a policy corpus, refined with an expert knowledge corpus, for identifying policy-related words in consumer comments, so the accuracy of policy text word segmentation needs improvement; in addition, manual labeling is generally adopted, so constructing a policy corpus consumes a large amount of manpower.

Disclosure of Invention

In view of the problems in the prior art, the invention provides a consumer policy perception analysis method and system based on big data text mining, in order to obtain more accurate text mining analysis results and to reveal the differences in policy perception among consumer groups.

According to an aspect of the present specification, there is provided a consumer policy perception analysis method based on big data text mining, including:

acquiring policy text data and consumer comment text data and preprocessing the policy text data and the consumer comment text data;

constructing a policy corpus based on the preprocessed text data;

performing consumer policy perception analysis based on the policy corpus.

This technical scheme makes full use of online information, mines consumer comment texts from the perspective of consumer perception, and builds a policy corpus that improves text word segmentation; by combining techniques such as deep learning with an expert knowledge corpus, it systematically and accurately mines and analyzes how consumers perceive policies.

It also allows the policy perception of different types of consumer groups to be compared and analyzed, providing suggestions for optimizing policy supply and improving policy accuracy and benefit.

As a further technical solution, the step of obtaining policy text data and consumer comment data further includes:

determining a data source;

acquiring policy text data and corresponding consumer comment text data by using a web crawler tool;

randomly sampling the acquired data source, and performing matching verification on the data source and the crawled data;

and if the crawled data passes verification, locally storing the corresponding data in a persistent mode.

In particular, the data sources may be relevant government websites and social networking platforms.

When the web crawler tool is used for collection, the collection rules can be customized to obtain the required policy text data and the corresponding consumer comment text data.

As a further technical scheme, the preprocessing further comprises the steps of de-duplication, de-noising and text phrase deletion.

De-duplication deletes repeated data from the text data to reduce interference from redundant information.

De-noising deletes emoticons, URLs and other special characters from the text data. Such content lacks practical significance and research value and also degrades the results of subsequent word segmentation and other text mining analysis, so de-noising is needed to concentrate the text features on words and semantics.

Text phrase deletion deletes data containing too few words. From an analysis perspective, a very short comment carries little information; it is likely to be content a consumer typed at random or a platform's default review, and has no feedback or research value.

As a further technical solution, constructing a policy text corpus based on the preprocessed text data further includes:

splitting the preprocessed policy text and consumer comment text data into words, and counting the word frequency of all generated words;

performing pointwise mutual information (PMI) screening based on the word frequency statistics;

performing left and right information entropy screening on the result of the PMI screening;

removing stop words and common general words from the result of the left and right information entropy screening to obtain policy professional vocabulary;

and screening the obtained policy professional vocabulary against the expert knowledge corpus, supplementing it with vocabulary related to policy perception identified in the consumer comments, and classifying the screened and supplemented vocabulary into four policy categories to form the final policy text corpus.

In this technical scheme, screening by pointwise mutual information and left and right information entropy, followed by removal of stop words and common general words, first mines the professional vocabulary in the policy texts and the policy-related vocabulary in the consumer comment texts to build a policy corpus; then, on the basis of this machine recognition, an expert knowledge corpus is used to screen and supplement the content of the policy corpus to obtain the final policy corpus.

The pointwise mutual information (PMI) screening specifically comprises: calculating, from the word frequency results, the degree of solidification (cohesion) of each word segment, where for a binary word string AB the degree of solidification is PMI(A, B), for a ternary word string ABC it is min[PMI(A, BC), PMI(AB, C)], and so on; setting different thresholds for word segments of different lengths; and screening by these thresholds to obtain a set G of segments with a high degree of solidification.

The left and right information entropy screening specifically comprises: counting, for each segment in set G, all of its left and right adjacent characters in the text; calculating the left and right information entropy of each segment; and sorting the word segments by left and right information entropy to obtain a set F.

Stop words and some common words are then removed from the words in set F to obtain the policy professional vocabulary.

The professional vocabulary obtained through machine recognition is screened and supplemented using the expert knowledge corpus: meaningless words are removed, vocabulary related to policy perception in the consumer comments is identified more completely, and the vocabulary is classified by policy into four categories (supply promotion, demand pull, environmental regulation and environmental support) to form the final policy corpus.

As a further technical solution, performing the consumer policy perception mining analysis further comprises:

classifying consumers into different categories;

mining the comment data of each category of consumers, the mining comprising word frequency intensity analysis, topic identification, semantic network analysis and emotional tendency analysis;

and comparing the mining analysis results across consumer categories to obtain the differences among groups in their perception of different types of policies.

In this technical scheme, text mining is performed on consumer comments based on the policy corpus. Consumers are divided into categories according to their consumer portraits and regions, the comment data of each category is mined, and the results of word frequency intensity analysis, topic identification, semantic network analysis and emotional tendency analysis are compared to obtain the differences among consumer groups in their perception of different categories of policies.

As a further technical solution, the word frequency intensity analysis further includes: importing the constructed policy corpus and cell lexicons for the relevant field into a custom dictionary, segmenting the consumer comment text with jieba, and counting word frequency intensity; then dividing the word frequency results by policy category and generating corresponding word clouds according to word frequency intensity.

As a further technical solution, the topic identification further comprises:

storing the segmented consumer comment texts with each document as a list;

converting the texts into the sparse vector representation of the bag-of-words model, and constructing a word frequency matrix;

calculating the perplexity of models with different numbers of topics, and determining the optimal number of topics;

and training an LDA model with the optimal number of topics estimated by the perplexity method as a parameter, and analyzing it to obtain the topics consumers focus on in their comments.

As a further technical solution, the semantic network analysis further includes:

screening comment texts related to the policy out of the consumer comment texts according to the policy corpus;

counting how many times high-frequency words appear together in the comment texts to obtain how closely the words are related;

counting the frequency of co-occurring word pairs, and constructing a high-frequency word co-occurrence matrix from the result;

and drawing a semantic network graph with NetDraw from the high-frequency word co-occurrence matrix to obtain the hierarchy of, and closeness between, the nodes in the graph, so as to understand how consumers perceive the policy.

As a further technical solution, the emotional tendency analysis further includes:

extracting policy-related fragment documents from the consumer comments based on the policy corpus, segmenting each comment document into words, and finding the emotional words, negative words and degree adverbs in it;

judging whether a negative word or a degree adverb precedes each emotional word and, if so, grouping the emotional word with the preceding negative word or degree adverb and multiplying the emotion score of the emotional word by the corresponding weight coefficient of the group; the weight coefficient of a negative word is set to -1, and degree adverbs are given different weight coefficients according to their semantics;

judging whether each sentence in the comment document is an exclamatory sentence or a rhetorical question and, if so, increasing the emotion score;

and adding up all the emotion scores in the comment to obtain the final emotion analysis score of the comment document.

According to another aspect of the present specification, there is provided a consumer policy perception analysis system based on big data text mining, comprising: a data acquisition module for acquiring policy text data and consumer comment text data; a data preprocessing module for preprocessing the acquired text data; a corpus construction module for constructing a policy corpus based on the preprocessed text data; and a policy perception mining module for carrying out consumer policy perception analysis based on the policy corpus.

Compared with the prior art, the invention has the beneficial effects that:

(1) The invention constructs a policy corpus by combining machine recognition of professional vocabulary with manual semantic screening based on an expert knowledge corpus. The constructed policy corpus can be used as a word segmentation dictionary to improve text word segmentation. On this basis, consumer comments are mined and the results of word frequency intensity, topic identification, semantic network and emotion analysis are compared. Dimensions such as consumer portrait and region are added to those used in existing research, and attention is paid to the differences in policy perception among different types of consumer groups, so as to provide suggestions for optimizing policy supply and improving policy accuracy and benefit.

(2) The invention adopts natural language processing technology and uses machine recognition of professional vocabulary in the policy field to construct the policy corpus, which improves information screening efficiency and saves a great deal of manpower.

(3) The policy corpus constructed by the method is used as a word segmentation dictionary, so that the word segmentation effect of the text is improved, and the subsequent text mining analysis result is more accurate.

(4) The method is based on big data text mining, mining analysis is carried out by adopting multi-source data, network information is fully utilized, and policy perception differences of specific consumer groups are compared through different categories of consumer comment analysis.

Drawings

FIG. 1 is a flowchart of a consumer policy perception analysis method based on big data text mining according to a first embodiment of the present invention;

FIG. 2 is a flowchart of the data acquisition in step S1 according to the first embodiment of the present invention;

FIG. 3 is a flowchart of the construction of the policy corpus in step S3 according to the first embodiment of the present invention;

FIG. 4 is a flowchart of the consumer policy perception mining analysis in step S4 according to the first embodiment of the present invention;

FIG. 5 is a flowchart of the topic identification in sub-step S43 of step S4 according to the first embodiment of the present invention;

FIG. 6 is a flowchart of the semantic network analysis in sub-step S44 of step S4 according to the first embodiment of the present invention;

FIG. 7 is a flowchart of the emotional tendency analysis in sub-step S45 of step S4 according to the first embodiment of the present invention.

Detailed Description

The technical solutions of the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without any inventive step, are within the scope of the present invention.

As shown in FIG. 1, this embodiment takes the analysis of new energy automobile policies as an example and provides a new energy automobile policy perception analysis method based on text mining of consumer comments. The method includes the following steps:

S1, collecting policy text data in a specific field and consumer comment data from relevant government websites and social platforms.

S2, preprocessing the collected policy text and consumer comment text data.

S3, constructing a policy corpus by combining machine recognition with an expert knowledge corpus.

S4, carrying out policy perception mining analysis on new energy automobile consumers.

As shown in FIG. 2, step S1 specifically includes:

S11, determining the data sources, which are the China Industry Information Network, official government websites related to the Ministry of Industry and Information Technology, laws and regulations databases, and new energy automobile vertical portals such as XCAR (爱卡汽车网) and Autohome (汽车之家).

S12, writing a web crawler program in Python, or using collection tools such as the Houyi collector (后羿采集器) and Octopus (八爪鱼), with customized collection rules, to collect the policy text data and the corresponding consumer comment text data.

S13, randomly sampling the collection source and checking the samples against the crawled data, to eliminate incomplete or erroneous data caused by errors during crawling.

S14, once the crawled data has passed verification, storing the collected data locally in persistent form.
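The sampling check of S13 and the conditional persistence of S14 can be sketched as follows. This is only a minimal illustration under stated assumptions: the record layout (a dict from record id to text), the `fetch_source` callback that re-reads a record from the original page, and the JSON output file are all hypothetical and are not specified in the source.

```python
import json
import random


def verify_sample(crawled, fetch_source, sample_size=5, seed=0):
    """Return the ids of sampled records whose crawled text differs from the source.

    crawled: dict mapping record id -> crawled text (assumed layout).
    fetch_source: hypothetical callable that re-reads one record from the source.
    """
    rng = random.Random(seed)
    ids = sorted(crawled)
    sampled = rng.sample(ids, min(sample_size, len(ids)))
    mismatched = []
    for record_id in sampled:
        source_text = fetch_source(record_id)
        if source_text is None or source_text.strip() != crawled[record_id].strip():
            mismatched.append(record_id)
    return mismatched


def persist_if_valid(crawled, fetch_source, path, sample_size=5, seed=0):
    """Store the crawled data locally only when the sampled check finds no mismatch."""
    if verify_sample(crawled, fetch_source, sample_size, seed):
        return False
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(crawled, handle, ensure_ascii=False)
    return True
```

In practice the mismatching ids would be re-crawled before the check is repeated; that retry loop is left out here.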

In this embodiment, after the data collection of step S1 is completed, the data preprocessing of step S2 is performed. The main purpose of data preprocessing is to remove useless information from the text; it mainly comprises three operations: de-duplication, de-noising and text phrase deletion.

S21, de-duplication, namely deleting repeated data from the text data to reduce interference from redundant information.

S22, de-noising, namely deleting emoticons, URLs and other special characters from the text data. Such content lacks practical significance and research value and also degrades the results of subsequent word segmentation and other text mining analysis, so de-noising is needed to concentrate the text features on words and semantics.

S23, text phrase deletion, namely deleting data containing too few words. From an analysis perspective, a very short comment carries little information; it is likely to be content a consumer typed at random or a platform's default review, and has no feedback or research value.
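The three preprocessing operations S21 to S23 can be illustrated with a short sketch. The regular expressions and the `min_chars` threshold below are assumptions chosen for illustration; the source does not state which characters count as noise or how short a text must be to be dropped. De-duplication is applied after de-noising here so that texts differing only in noise are merged, which is a design choice of this sketch rather than something the source prescribes.

```python
import re

# Assumed noise definition: URLs, plus anything that is not a CJK character,
# an ASCII letter or digit, common punctuation, or whitespace (e.g. emoji).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NOISE_RE = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9，。！？、；：,.!?;:\s]")


def preprocess(texts, min_chars=5):
    """De-noise, drop very short texts, and de-duplicate, keeping first occurrences."""
    seen = set()
    cleaned = []
    for text in texts:
        text = URL_RE.sub("", text)                 # S22: remove URLs
        text = NOISE_RE.sub("", text)               # S22: remove emoji and special characters
        text = re.sub(r"\s+", " ", text).strip()
        if len(text) < min_chars:                   # S23: drop texts with too few characters
            continue
        if text in seen:                            # S21: drop exact duplicates
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```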

In this embodiment, the process of constructing the policy corpus in step S3 is shown in FIG. 3. First, screening by pointwise mutual information and left and right information entropy, followed by removal of stop words and common general words, mines the policy-related vocabulary in the policy texts and consumer comment texts to build a policy corpus. Then, on the basis of this machine recognition, an expert knowledge corpus is used to screen and supplement the content of the policy corpus to obtain the final policy corpus. The specific steps are as follows:

S31, splitting the input text into candidate words using N-grams, and counting the word frequency of all generated words.

S32, pointwise mutual information (PMI) screening. PMI measures the correlation between two variables, and is used here to reflect how closely adjacent characters or words are bound together. The calculation formula is as follows:

$$\mathrm{PMI}(x, y) = \log \frac{P(xy)}{P(x)\,P(y)}$$

wherein: x and y are adjacent character strings; xy is the word formed by combining x and y; P(x), P(y) and P(xy) are the probabilities of x, y and xy appearing in the corpus. The larger the PMI, the more often the adjacent strings occur together, i.e. the more likely x and y are to form a fixed word.

The degree of solidification of each word segment is calculated from the word frequencies: for a binary word string AB it is PMI(A, B); for a ternary word string ABC it is min[PMI(A, BC), PMI(AB, C)]; and so on. Different thresholds are set for word segments of different lengths, and the segments are screened by these thresholds. This yields a set G of segments with a high degree of solidification.
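Steps S31 and S32 can be sketched together as character N-gram counting followed by the minimum-PMI screening described above. Several choices here are assumptions made for illustration: probabilities are estimated as the segment count divided by the total number of characters, the logarithm is taken to base 2, and the threshold values and `min_freq` cutoff are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(texts, max_n=3):
    """Count every character n-gram of length 1..max_n (step S31)."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts


def cohesion(segment, counts, total):
    """Degree of solidification: the minimum PMI over every two-way split."""
    p_seg = counts[segment] / total
    return min(
        math.log2(p_seg / ((counts[segment[:k]] / total) * (counts[segment[k:]] / total)))
        for k in range(1, len(segment))
    )


def filter_by_cohesion(counts, thresholds, min_freq=2):
    """Keep segments whose cohesion reaches the threshold for their length (set G)."""
    total = sum(c for seg, c in counts.items() if len(seg) == 1)
    selected = {}
    for seg, c in counts.items():
        n = len(seg)
        if n < 2 or c < min_freq or n not in thresholds:
            continue
        score = cohesion(seg, counts, total)
        if score >= thresholds[n]:
            selected[seg] = score
    return selected
```

A call such as `filter_by_cohesion(ngram_counts(texts), {2: 3.0, 3: 4.0})` would apply a separate threshold to each segment length, as the step describes; the numbers 3.0 and 4.0 are placeholders only.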

S33, left and right information entropy screening. Left and right information entropy is used to judge whether a word string has rich left and right collocations. The calculation formulas are respectively as follows:

$$H_l(x) = -\sum_{w_l \in s_l} P(w_l x \mid x)\,\log P(w_l x \mid x)$$

$$H_r(x) = -\sum_{w_r \in s_r} P(x w_r \mid x)\,\log P(x w_r \mid x)$$

wherein H_l(x) and H_r(x) are the left and right information entropies of the word string x; s_l and s_r are the sets of left-adjacent and right-adjacent characters of x; P(w_l x | x) is the conditional probability that the left-adjacent character is w_l when x appears; and P(x w_r | x) is the conditional probability that the right-adjacent character is w_r when x appears.

The higher the left and right information entropy of a word string, the more uncertain its left and right adjacent characters are, and the more likely the string is to be an independent word. After word strings have been selected by PMI screening, the boundary (left and right) information entropy is calculated to judge whether each string is an independent word.

In this step, all left and right adjacent characters in the text of each segment in set G are counted, the left and right information entropies of each segment are calculated, and the word segments are sorted by left and right information entropy to obtain a set F.
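The entropy computation of S33 can be sketched as below. The source says only that segments are sorted by left and right information entropy; ranking by the smaller of the two values is an assumption of this sketch, as is the use of base-2 logarithms.

```python
import math
from collections import Counter, defaultdict


def neighbor_entropy(neighbors):
    """Shannon entropy (base 2) of a Counter of adjacent characters."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())


def left_right_entropy(texts, segments):
    """Return {segment: (left entropy, right entropy)} for the segments of set G."""
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for text in texts:
        for seg in segments:
            start = text.find(seg)
            while start != -1:
                end = start + len(seg)
                if start > 0:
                    left[seg][text[start - 1]] += 1
                if end < len(text):
                    right[seg][text[end]] += 1
                start = text.find(seg, start + 1)
    return {seg: (neighbor_entropy(left[seg]), neighbor_entropy(right[seg]))
            for seg in segments}


def rank_segments(entropies):
    """Order segments by min(left, right) entropy, highest first (assumed rule for set F)."""
    return sorted(entropies, key=lambda seg: min(entropies[seg]), reverse=True)
```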

S34, removing stop words and common words from the words in set F to obtain the policy professional vocabulary.

S35, screening and supplementing the professional vocabulary obtained through machine recognition using the expert knowledge corpus: removing meaningless words, identifying more completely the vocabulary related to policy perception in the consumer comments, and classifying the vocabulary by policy into four categories (supply promotion, demand pull, environmental regulation and environmental support) to form the final policy corpus.

S4, consumer policy perception mining; the flow is shown in FIG. 4. Text mining is performed on the consumer comments based on the policy corpus: consumers are divided into categories according to their consumer portraits and regions, the comment data of each category is mined, and the word frequency intensity, topic identification, semantic network and emotion analysis results are compared to obtain the differences among groups in their perception of different types of policies.

S41, consumer classification

The vehicle word-of-mouth review information collected from XCAR (爱卡汽车网) includes the consumer's nickname, vehicle model, place of purchase, purpose of purchase, bare vehicle price, membership level, comments on each dimension of the vehicle, and so on. Consumer information is extracted from it, a consumer portrait is constructed from membership level, spending power (bare vehicle price), vehicle model and purpose of purchase, and consumers are classified by consumer portrait and region. The consumer comment data is then screened and grouped according to this classification so that the perception differences among different types of consumers can be compared and analyzed.
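As a rough illustration of grouping comments by consumer category, the sketch below keys each review on region, spending band and purchase purpose. The field names, the price bands (in units of 10,000 yuan) and the choice of only three portrait dimensions are all hypothetical; the source does not give concrete classification rules.

```python
from collections import defaultdict

# Hypothetical spending bands; the source does not specify any thresholds.
PRICE_BANDS = [(15.0, "economy"), (30.0, "mid-range")]


def spending_band(bare_price):
    """Map a bare vehicle price to an assumed spending-power label."""
    for upper, label in PRICE_BANDS:
        if bare_price < upper:
            return label
    return "premium"


def group_comments(reviews):
    """Group review comments by (region, spending band, purchase purpose)."""
    groups = defaultdict(list)
    for review in reviews:
        key = (review["region"], spending_band(review["bare_price"]), review["purpose"])
        groups[key].append(review["comment"])
    return dict(groups)
```

Each resulting group can then be passed separately through the analyses of S42 to S45 so that the groups can be compared.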

S42, word frequency intensity analysis

Text word segmentation is performed with jieba: the constructed policy corpus and cell lexicons for the relevant field are imported into a custom dictionary for word segmentation, and word frequency intensity is then counted. The word frequency results are divided by policy category, and corresponding word clouds are generated according to word frequency intensity so that differences in consumers' points of attention can be compared more intuitively.
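The counting and per-category division of word frequencies can be sketched as follows. To keep the example free of third-party dependencies, it assumes the comments have already been segmented (in the embodiment this would be done with jieba, for example via `jieba.load_userdict` and `jieba.lcut`); the category names and sample words in the usage below are illustrative only.

```python
from collections import Counter, defaultdict


def category_frequencies(tokenized_comments, policy_corpus):
    """Count policy words per policy category.

    tokenized_comments: list of token lists (already segmented).
    policy_corpus: dict mapping category name -> iterable of policy words.
    """
    word_to_category = {
        word: category
        for category, words in policy_corpus.items()
        for word in words
    }
    freqs = defaultdict(Counter)
    for tokens in tokenized_comments:
        for token in tokens:
            category = word_to_category.get(token)
            if category is not None:
                freqs[category][token] += 1
    return dict(freqs)
```

Each per-category `Counter` is a plain word-to-frequency mapping, which is the kind of input a word cloud generator typically accepts.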

S43, topic identification

An LDA (latent Dirichlet allocation) model is used to extract the topics of the consumer comments and analyze the topics consumers focus on. The process is shown in FIG. 5 and includes the following steps:

S431, storing the segmented text with each document as a list.

S432, converting the text into the sparse vector representation of the bag-of-words model, and constructing a word frequency matrix.

S433, determining the optimal number of topics by evaluating the perplexity of models with different numbers of topics.

S434, training an LDA model with the optimal number of topics obtained from the perplexity as a parameter, and analyzing it to obtain the topics consumers focus on in their comments.

In sub-step S431, the text must be stored in the format required by the gensim library (gensim requires input of the form ['New energy automobile', 'policy', 'subsidy', …], where each document is a list whose elements are words).

Determining the number of LDA topics by the perplexity method in sub-step S433 specifically includes:

a, calculating the occurrence probability of the words, with the formula:

$$P(w) = P(L \mid M)\,P(w \mid L)$$

b, calculating the perplexity for different numbers of topics according to the perplexity formula:

$$\mathrm{perplexity} = \exp\left(-\frac{\sum_{w} \log P(w)}{N}\right)$$

where N represents the total number of words in the text, P(w) represents the probability of occurrence of word w, P(L|M) represents the probability of topic L in text M, and P(w|L) represents the probability of word w in topic L. The lower the perplexity, the smaller the uncertainty and the better the final clustering result.

c, plotting a line chart of perplexity against number of topics from the perplexity values for the different numbers of topics; by the elbow method, the inflection point gives the optimal number of topics.
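The perplexity calculation and the elbow selection can be illustrated with a small sketch. It takes the per-word probabilities P(w) as given (in practice they would come from a trained LDA model), and it locates the elbow as the point with the largest second difference, which is one common heuristic and an assumption here; the source says only that the inflection point of the line chart is chosen. Evenly spaced candidate topic numbers are also assumed.

```python
import math


def perplexity(word_probs):
    """exp(-sum(log P(w)) / N) over the N word occurrences of the text."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)


def elbow(topic_numbers, perplexities):
    """Return the topic number at the sharpest bend of the perplexity curve."""
    best_index, best_bend = None, float("-inf")
    for i in range(1, len(perplexities) - 1):
        bend = perplexities[i - 1] - 2 * perplexities[i] + perplexities[i + 1]
        if bend > best_bend:
            best_index, best_bend = i, bend
    return topic_numbers[best_index]
```

For example, with candidate topic numbers 2 to 6 and perplexities 900, 600, 400, 390 and 385, the curve flattens after 4 topics, and `elbow` returns 4.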

S44, semantic network analysis

By performing semantic network analysis on the comment text of the consumer, the perception situation of the consumer on the new energy automobile policy is known on the whole, and the flow is shown in fig. 6 and specifically includes the following steps:

S441, screening out the comment texts related to the policies from the consumer comment texts according to the policy corpus.

S442, counting the number of times the high-frequency words occur together in the text to obtain the degree of closeness between the words.

S443, counting the occurrence frequency of the co-occurring word pairs and constructing a high-frequency word co-occurrence matrix from the result.

S444, drawing the semantic network graph with NetDraw from the high-frequency word co-occurrence matrix to obtain the hierarchy and the closeness of the relations among the nodes in the graph, so as to understand how consumers perceive the policy.
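Sub-steps S441 to S443 can be sketched as follows. This is a minimal illustration that assumes the comments are already segmented into word lists; the function names, the toy comments and the choice of counting one co-occurrence per comment are assumptions. The resulting matrix is the kind of input that would be exported for drawing in S444.

```python
from collections import Counter
from itertools import combinations
from typing import List, Set


def filter_policy_comments(docs: List[List[str]],
                           policy_terms: Set[str]) -> List[List[str]]:
    """S441: keep comments containing at least one policy corpus term."""
    return [doc for doc in docs if policy_terms.intersection(doc)]


def high_frequency_words(docs: List[List[str]], top_n: int) -> List[str]:
    """Return the top_n most frequent words, ties broken alphabetically."""
    freq = Counter(word for doc in docs for word in doc)
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


def cooccurrence_matrix(docs: List[List[str]],
                        words: List[str]) -> List[List[int]]:
    """S442-S443: symmetric matrix of how many comments contain both words."""
    index = {word: i for i, word in enumerate(words)}
    matrix = [[0] * len(words) for _ in words]
    for doc in docs:
        present = sorted({word for word in doc if word in index})
        for a, b in combinations(present, 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return matrix


# Toy segmented comments and policy terms (hypothetical data).
comments = [
    ["subsidy", "price", "cheap"],
    ["subsidy", "price"],
    ["license", "subsidy"],
    ["color", "nice"],
]
policy_terms = {"subsidy", "license"}

related = filter_policy_comments(comments, policy_terms)
top_words = high_frequency_words(related, top_n=3)
matrix = cooccurrence_matrix(related, top_words)
```

A larger entry in the matrix indicates a closer relation between the two words, which appears as a heavier edge once the graph is drawn.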

S45, emotional tendency analysis

Based on the policy corpus, document segments relevant to the policy are extracted from the consumer comments and subjected to emotional tendency analysis. The invention performs text emotional tendency analysis based on an emotion dictionary; the flow is shown in fig. 7 and specifically includes the following steps:

S451, before emotion analysis, segmenting the document into words based on the policy corpus and finding the emotion words, negative words and degree adverbs in the document.

S452, processing degree adverbs and negative words: judging whether a negative word or a degree adverb precedes each emotion word; if so, grouping the emotion word with the preceding negative word or degree adverb and multiplying the emotion score of the emotion word by the corresponding weight coefficient. The weight coefficient of a negative word is set to -1; degree adverbs are given different weight coefficients according to their semantics. Common degree adverbs such as "very", "especially", "somewhat" and "less" are each assigned a different weight coefficient.

S453, special sentence processing: exclamatory and interrogative sentences are handled specially. It is only necessary to judge whether the sentence-final punctuation mark is an exclamation mark or a question mark; if so, a set emotion score value is added.

S454, adding up all the emotion scores in the comment document to obtain the final emotion analysis score of the comment document.

S455, dividing the comment data into two groups (positive, negative) according to the emotion score. The value 0 is used as the boundary: comments scoring 0 or more are classed as positive and comments scoring below 0 as negative.
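Sub-steps S451 to S455 can be sketched as follows. This is a minimal illustration, not the emotion dictionary of the invention: the lexicon entries, the degree weights and the punctuation scores are all hypothetical values chosen for the example, and only the word immediately before each emotion word is checked. Treating a score of exactly 0 as positive is likewise an assumption of this sketch.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicons; a real emotion dictionary would replace these.
SENTIMENT: Dict[str, float] = {"good": 1.0, "convenient": 1.0, "expensive": -1.0}
NEGATIONS: Set[str] = {"not"}
DEGREE: Dict[str, float] = {"very": 2.0, "especially": 2.0,
                            "somewhat": 0.8, "less": 0.5}
# Assumed scores added for sentence-final "!" and "?" (values are illustrative).
PUNCTUATION_SCORE: Dict[str, float] = {"!": 0.5, "?": -0.5}


def score_comment(tokens: List[str]) -> float:
    """S452-S454: weighted sum of emotion word scores for one comment."""
    score = 0.0
    for i, token in enumerate(tokens):
        if token not in SENTIMENT:
            continue
        weight = 1.0
        if i > 0:
            previous = tokens[i - 1]
            if previous in NEGATIONS:
                weight = -1.0  # negative word flips the polarity
            elif previous in DEGREE:
                weight = DEGREE[previous]  # degree adverb scales the score
        score += SENTIMENT[token] * weight
    # S453: add a set score if the comment ends with "!" or "?".
    if tokens and tokens[-1] in PUNCTUATION_SCORE:
        score += PUNCTUATION_SCORE[tokens[-1]]
    return score


def classify(score: float) -> str:
    """S455: 0 is the boundary; 0 is counted as positive in this sketch."""
    return "positive" if score >= 0 else "negative"


# S451 is assumed done: comments arrive as segmented token lists.
praise = score_comment(["very", "good", "!"])
complaint = score_comment(["not", "convenient"])
```

Checking only the immediately preceding word keeps the grouping rule of S452 simple; a comment that stacks a negative word and a degree adverb before one emotion word would need a wider look-back window.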

S46, result comparison

Finally, the text mining analysis results of different types of consumers are compared to obtain the differences in policy perception among consumer groups.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, platforms (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

According to another aspect of the present invention, there is also provided a consumer policy perception analysis system based on big data text mining, including: a data acquisition module for acquiring policy text data and consumer comment text data; a data preprocessing module for preprocessing the acquired text data; a corpus construction module for constructing a policy corpus based on the preprocessed text data; and a policy perception mining module for performing consumer policy perception analysis based on the policy corpus.
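The way the four modules hand data to one another can be sketched as follows. This is only an illustration of the data flow: the class name `PolicyPerceptionSystem`, the callable signatures and the stub modules are hypothetical, and each stub stands in for the much richer processing described in the method.

```python
from typing import Callable, List, Set, Tuple

Texts = List[str]


class PolicyPerceptionSystem:
    """Chains the four modules; each module is supplied as a callable."""

    def __init__(
        self,
        acquire: Callable[[], Tuple[Texts, Texts]],
        preprocess: Callable[[Texts], Texts],
        build_corpus: Callable[[Texts, Texts], Set[str]],
        analyze: Callable[[Texts, Set[str]], Texts],
    ) -> None:
        self.acquire = acquire
        self.preprocess = preprocess
        self.build_corpus = build_corpus
        self.analyze = analyze

    def run(self) -> Texts:
        # Data acquisition module: policy texts and consumer comments.
        policy_texts, comments = self.acquire()
        # Data preprocessing module, applied to both text sources.
        policy_texts = self.preprocess(policy_texts)
        comments = self.preprocess(comments)
        # Corpus construction module.
        corpus = self.build_corpus(policy_texts, comments)
        # Policy perception mining module.
        return self.analyze(comments, corpus)


# Stub modules with toy data, for illustration only.
def acquire_stub() -> Tuple[Texts, Texts]:
    return (["subsidy policy for new energy vehicles"],
            ["the subsidy helps", "the subsidy helps", "nice color"])


def dedupe(texts: Texts) -> Texts:
    return list(dict.fromkeys(texts))  # drops duplicates, keeps order


def corpus_stub(policy_texts: Texts, comments: Texts) -> Set[str]:
    return {"subsidy"}


def mentions(comments: Texts, corpus: Set[str]) -> Texts:
    return [c for c in comments if any(term in c for term in corpus)]


system = PolicyPerceptionSystem(acquire_stub, dedupe, corpus_stub, mentions)
result = system.run()
```

Passing the modules in as callables mirrors the statement that each module may be realized in hardware, software or a combination, since any stage can be swapped without changing the others.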

The modules referred to in the above system may be implemented by a computer chip or an entity, or by a product with certain functions. The present invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions deviate from the technical solutions of the embodiments of the present invention.

Claims

1. A consumer policy perception analysis method based on big data text mining, characterized by comprising the following steps:

constructing a policy corpus based on the preprocessed text data;

performing consumer policy perception analysis based on the policy corpus.

2. The consumer policy perception analysis method based on big data text mining of claim 1, wherein the step of obtaining policy text data and consumer comment text data further comprises:

determining a data source;

3. The consumer policy perception analysis method based on big data text mining of claim 1, wherein the preprocessing further comprises de-duplication, de-noising and short-sentence deletion.

4. The consumer policy perception analysis method based on big data text mining of claim 1, wherein constructing a policy corpus based on the preprocessed text data further comprises:

screening the obtained policy-specific vocabulary using the expert knowledge corpus, refining it and identifying vocabulary related to policy perception in the consumer comments, and classifying the screened and refined vocabulary into four classes according to policy to form the final policy corpus.

5. The consumer policy perception analysis method based on big data text mining of claim 1, wherein performing consumer policy perception mining analysis further comprises:

classifying consumers into different categories;

and comparing the mining analysis results of different types of consumers to obtain the differences among consumer groups in their perception of different types of policy.

6. The consumer policy perception analysis method based on big data text mining of claim 5, wherein the word frequency intensity analysis further comprises: importing the constructed policy corpus and the cell lexicon of the related field into a custom dictionary, segmenting the consumer comment text into words using jieba, and counting the word frequency intensity; and classifying the word frequency results by policy and generating a corresponding word cloud according to the word frequency intensity.

7. The consumer policy perception analysis method based on big data text mining of claim 5, wherein the topic identification further comprises:

8. The consumer policy perception analysis method based on big data text mining of claim 5, wherein the semantic network analysis further comprises:

9. The consumer policy perception analysis method based on big data text mining of claim 5, wherein the emotional tendency analysis further comprises:

and adding up all the emotion scores in the comment document to obtain the final emotion analysis score of the comment document.

10. A consumer policy perception analysis system based on big data text mining, comprising: a data acquisition module for acquiring policy text data and consumer comment text data; a data preprocessing module for preprocessing the acquired text data; a policy corpus establishing module for establishing a policy corpus based on the preprocessed text data; and a policy perception mining module for performing consumer policy perception analysis based on the policy corpus.