WO2020186627A1

WO2020186627A1 - Public opinion polarity prediction method and apparatus, computer device, and storage medium

Info

Publication number: WO2020186627A1
Application number: PCT/CN2019/089224
Authority: WO
Inventors: 耿伟; 谷国栋; 周起如
Original assignee: 深圳市赛为智能股份有限公司
Priority date: 2019-03-15
Filing date: 2019-05-30
Publication date: 2020-09-24
Also published as: CN109933656A; CN109933656B

Abstract

A public opinion polarity prediction method and apparatus, a computer device, and a storage medium. The method comprises: obtaining public opinion data (S110); performing, by an AC automaton based on a double-array trie tree, emotional feature information extraction on data to be analyzed to obtain feature data (S120); performing polarity prediction on the feature data by a public opinion polarity prediction model to obtain a prediction result (S130); and outputting the prediction result (S140). An emotional dictionary is constructed by means of the storage structure of a double-array trie tree, thereby reducing the number of disk IO reads/writes and the occupied physical storage space; emotional feature information extraction is performed on public opinion data in the emotional dictionary by an AC automaton based on a double-array trie tree, character comparison is converted into state transition, backtracking is not needed at all when data to be analyzed is scanned, and the problem of repeated backward scanning is avoided; polarity prediction is conducted on feature data by a public opinion polarity prediction model, and the efficiency and accuracy of public opinion polarity prediction analysis are effectively improved.

Description

Public opinion polarity prediction method, device, computer equipment and storage medium

This application is based on, and claims priority to, Chinese patent application No. 201910199451.5, filed on March 15, 2019. The entire content of that application is hereby incorporated into this application by reference.

Technical field

This application relates to information processing methods, and more specifically to public opinion polarity prediction methods, devices, computer equipment and storage media.

Background technique

With the rapid development of applications such as WeChat and Weibo, more and more netizens express their opinions through the Internet. The integration of network information and social information has an increasingly greater impact on society, and it is even related to the country’s information security and long-term stability. Due to the huge amount of information on the Internet, it is impossible to process massive public opinion data by manual methods. To obtain the overall situation of public opinion comprehensively and completely, it is necessary to rely on sentiment polarity analysis technology to automatically monitor and analyze public opinion information.

Existing public opinion analysis systems generally use keyword analysis methods, which are both inefficient and inaccurate. First, pattern matching based on traditional Chinese word segmentation has to scan the text backward multiple times, so its performance is relatively low. Second, existing systems use a relatively crude statistical method to calculate emotional polarity, and because of the limited feature information and the influence of context, the accuracy is not high. Third, the public opinion sentiment dictionary occupies a relatively large storage space, which causes a performance loss.

Therefore, it is necessary to design a new method to solve the problems of low speed of Chinese word segmentation, low accuracy of polarity prediction, and large performance loss.

Application content

The purpose of this application is to overcome the shortcomings of the prior art and provide a public opinion polarity prediction method, device, computer equipment and storage medium.

In order to achieve the above objectives, this application adopts the following technical solutions: a public opinion polarity prediction method, including:

Get public opinion data;

The AC automata based on the double-array dictionary tree extracts emotional feature information from the data to be analyzed to obtain feature data;

Use the public opinion polarity prediction model to predict the polarity of feature data to obtain the prediction result;

Output the prediction result.

A further technical solution is: the AC automaton based on the double-array dictionary tree is a multi-pattern matching algorithm that extracts emotional feature information from the data to be analyzed based on the sentiment dictionary, and the sentiment dictionary is constructed based on the double-array dictionary tree.

Its further technical solution is: the AC automata based on the double-array dictionary tree extracts emotional feature information from the data to be analyzed to obtain feature data, including:

Use AC automata based on double-array dictionary tree to perform pattern matching on the data to be analyzed to obtain output results;

Perform emotional feature information extraction on the output result to obtain feature data.

The further technical solution is: the pattern matching of the AC automata based on the double-array dictionary tree to obtain the output result includes:

Split the data to be analyzed into several characters;

Searching the emotional dictionary according to the characters;

Determine whether the character matches;

If it matches, output the matched characters to a designated set to form the output result;

Determine whether the current character is the last character;

If yes, proceed to the extraction of emotional feature information on the output result to obtain feature data;

If not, get the next character;

Return to the step of searching the emotional dictionary according to the characters;

If it does not match, turn to the character pointed to by the failure function;

Determine whether the character pointed to by the failure function is empty;

If it is not empty, output the character pointed to by the failure function to a designated set to form the output result;

Return to the step of determining whether the current character is the last character;

If it is empty, enter the end step.

The further technical solution is: the extraction of emotional feature information from the output result to obtain feature data includes:

Divide the output result into several atomic words;

Establish an adjacency table for storing array graphs;

Use the offset of the atomic word to determine the position of the atomic word;

Add the atomic word to the corresponding position of the array in the adjacency list;

Calculate the distance between the atomic words of two nodes in the array based on the Viterbi algorithm;

Score the entire array graph stored in the adjacency table;

The atom words, positions and attribute information with the shortest distance are added to the set emotion feature data set to form feature data.

A further technical solution is: in performing polarity prediction on the feature data with the public opinion polarity prediction model to obtain the prediction result, the public opinion polarity prediction model is a model obtained by inputting the emotional feature data set extracted with the sentiment dictionary into the XGBoost model to obtain classification features, and then inputting the classification features into a logistic regression model for training.

The further technical solution is: the public opinion polarity prediction model is a model obtained by inputting the emotional feature data set extracted from the emotional dictionary into the XGBoost model to obtain classification features, and then inputting the classification features into the logistic regression model for training, including:

Construct a decision tree based on the emotional feature data set extracted from the emotional dictionary;

Input the decision tree into the XGBoost model to obtain the residual between the output of the XGBoost model and the actual output of the emotional feature data set extracted from the emotional dictionary;

Construct a new decision tree according to the residual;

Iterating the decision tree using the new decision tree to obtain a combination of emotional feature information;

Input the emotional feature information combination into a logistic regression model, and train the logistic regression model;

Perform model persistence processing on the trained logistic regression model to obtain a public opinion polarity prediction model.

This application also provides a public opinion polarity prediction device, including:

Public opinion data acquisition unit for acquiring public opinion data;

The extraction unit is used to extract emotional feature information from the data to be analyzed based on the AC automaton of the double-array dictionary tree to obtain feature data;

The prediction unit is used to predict the polarity of the feature data through the public opinion polarity prediction model to obtain the prediction result;

The output unit is used to output the prediction result.

The present application also provides a computer device that includes a memory and a processor, the memory stores a computer program, and the processor implements the above-mentioned method when the computer program is executed.

The present application also provides a storage medium storing a computer program, and the computer program can implement the above-mentioned method when being executed by a processor.

Compared with the prior art, the present application has the following beneficial effects. The emotional dictionary is constructed with the storage structure of the double-array dictionary tree, which reduces the number of disk IO reads and writes and the physical storage space occupied. The AC automaton based on the double-array dictionary tree extracts the emotional feature information of the public opinion data against the emotional dictionary and converts character comparison into state transition, so no backtracking is needed when scanning the data to be analyzed, which avoids the problem of repeated backward scanning. Polarity prediction is performed on the feature data by the public opinion polarity prediction model, which effectively improves the efficiency and accuracy of public opinion polarity prediction analysis.

The application will be further described below in conjunction with the drawings and specific embodiments.

Description of the drawings

In order to explain the technical solutions of the embodiments of the present application more clearly, the following will briefly introduce the drawings needed in the description of the embodiments. Obviously, the drawings in the following description are some embodiments of the present application. Ordinary technicians can obtain other drawings based on these drawings without creative work.

FIG. 1 is a schematic diagram of an application scenario of a public opinion polarity prediction method provided by an embodiment of the application;

FIG. 2 is a schematic flowchart of a public opinion polarity prediction method provided by an embodiment of the application;

FIG. 3 is a schematic diagram of a sub-process of the method for predicting public opinion polarity provided by an embodiment of the application;

FIG. 4 is a schematic diagram of a sub-process of a method for predicting public opinion polarity provided by an embodiment of the application;

FIG. 5 is a schematic diagram of a sub-process of a method for predicting public opinion polarity provided by an embodiment of the application;

FIG. 6 is a schematic diagram of a sub-process of the method for predicting public opinion polarity provided by an embodiment of the application;

FIG. 7 is a state transition diagram provided by an embodiment of the application;

FIG. 8 is a schematic diagram of a failure function provided by an embodiment of the application;

FIG. 9 is a schematic diagram of public opinion polarity prediction provided by an embodiment of the application;

FIG. 10 is a schematic block diagram of a public opinion polarity prediction device provided by an embodiment of the application;

FIG. 11 is a schematic block diagram of a computer device provided by an embodiment of the application.

Detailed description

The technical solutions in the embodiments of the present application will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

It should be understood that when used in this specification and the appended claims, the terms "comprising" and "including" indicate the existence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the existence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.

It should also be understood that the terms used in the specification of this application are only for the purpose of describing specific embodiments and are not intended to limit the application. As used in the specification of this application and the appended claims, unless the context clearly indicates other circumstances, the singular forms "a", "an" and "the" are intended to include plural forms.

It should be further understood that the term "and/or" used in the specification and appended claims of this application refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.

Please refer to FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram of an application scenario of a public opinion polarity prediction method provided by an embodiment of this application. FIG. 2 is a schematic flowchart of a public opinion polarity prediction method provided by an embodiment of the application. The public opinion polarity prediction method is applied to a server. The server preprocesses the crawled content of the target public opinion website, analyzes it with the AC automaton based on the double-array dictionary tree, and runs the public opinion polarity prediction model to obtain the public opinion polarity result, which is output to the terminal for display.

As shown in FIG. 2, the method includes the following steps S110 to S140.

S110. Obtain public opinion data.

In this embodiment, public opinion data refers to data representing the emotions of reviewers.

In an embodiment, the aforementioned step S110 may include the following steps:

Crawl the content of the target public opinion website;

In this embodiment, the content of the target public opinion website refers to content originating from web pages, which is crawled using crawler technology.

The content of the target public opinion website undergoes preprocessing, web page parsing, and denoising to obtain public opinion data.

In this embodiment, it is necessary to perform preliminary processing on the content of the target public opinion website to obtain public opinion data and remove unnecessary data.

S120. The AC automata based on the double-array dictionary tree extracts emotional feature information from the data to be analyzed to obtain feature data.

In this embodiment, the AC automaton based on the double-array dictionary tree is a multi-pattern matching algorithm that extracts emotional feature information from the data to be analyzed based on the emotional dictionary.

The sentiment dictionary is constructed based on a double array dictionary tree.

In this embodiment, the emotional dictionary refers to a collection of all emotional words.

Based on the dictionary storage structure of the double-array dictionary tree, first determine the states and the steering (goto) function, and then calculate the failure function; the calculation of the output function is completed within these two steps. The double-array dictionary tree is a compressed dictionary tree that represents the entire tree with two one-dimensional arrays, BASE and CHECK.
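As a rough illustration of how two one-dimensional arrays can stand in for a whole dictionary tree, the following Python sketch builds BASE and CHECK for a small word list and then looks words up with the rule "move from slot s on character c to t = BASE[s] + code(c), valid only if CHECK[t] == s". The function names `build_double_array` and `contains`, the character-code table, the first-fit search for BASE values, and the separate set of terminal slots are all assumptions made for this example; the application does not describe its construction procedure at this level of detail.

```python
from collections import deque

FREE = -1   # marks an unused slot in CHECK
ROOT = -2   # marks the root slot (index 0), which has no parent


def build_double_array(words):
    """Build BASE/CHECK arrays for a word list (illustrative first-fit layout)."""
    # Step 1: build an ordinary trie; trie[n] maps a character to a child id.
    trie, terminal_nodes = [{}], set()
    for word in words:
        node = 0
        for ch in word:
            if ch not in trie[node]:
                trie.append({})
                trie[node][ch] = len(trie) - 1
            node = trie[node][ch]
        terminal_nodes.add(node)

    # Step 2: give every distinct character a positive integer code.
    alphabet = sorted({c for w in words for c in w})
    codes = {ch: i + 1 for i, ch in enumerate(alphabet)}

    # Step 3: place the trie nodes into the two arrays in breadth-first order.
    base, check = [0], [ROOT]
    slot_of = {0: 0}                      # trie node id -> array index
    queue = deque([0])
    while queue:
        node = queue.popleft()
        children = trie[node]
        if not children:
            continue                      # leaf: BASE stays 0
        b = 1
        while True:                       # first-fit search for a usable BASE
            needed = b + max(codes[c] for c in children) + 1
            if needed > len(base):
                grow = needed - len(base)
                base.extend([0] * grow)
                check.extend([FREE] * grow)
            if all(check[b + codes[c]] == FREE for c in children):
                break
            b += 1
        base[slot_of[node]] = b
        for ch, child in children.items():
            t = b + codes[ch]
            check[t] = slot_of[node]      # CHECK records the parent slot
            slot_of[child] = t
            queue.append(child)
    terminals = {slot_of[n] for n in terminal_nodes}
    return base, check, codes, terminals


def contains(word, base, check, codes, terminals):
    """Follow t = BASE[s] + code(c); a move is valid only if CHECK[t] == s."""
    s = 0
    for ch in word:
        code = codes.get(ch)
        if code is None:
            return False
        t = base[s] + code
        if t >= len(check) or check[t] != s:
            return False
        s = t
    return s in terminals


arrays = build_double_array(["he", "she", "his", "hers"])
print(contains("hers", *arrays), contains("her", *arrays))   # True False
```

Each lookup step is one addition and one array comparison, which is the property that makes the structure compact and fast to traverse.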

For example, to construct an emotional dictionary composed of {Chinese national team national team}, a state transition diagram needs to be constructed in order to build the steering function. Initially, the state transition diagram contains only a starting state 0. Each keyword p is then entered into the diagram in turn by adding a path that starts from the starting state: new vertices and edges are added to the diagram until a path that spells out the keyword p is produced. To complete the construction of the steering function, a loop from state 0 to state 0 is added for every character that does not begin a keyword, which yields the state transition diagram shown in FIG. 7. This diagram represents the steering function.

The failure function is established based on the steering function. First, the failure function values of all states with a depth of 1 are calculated, then those of all states with a depth of 2, and so on, until the failure function values of all states except state 0 have been calculated; the failure function of state 0 is not defined. For states i = 1, 2, 3, 4, 5, 6, 7, 8, 9, the corresponding failure function values are 0, 0, 0, 1, 2, 0, 3, 0, 3. Finally, the failure function shown in FIG. 8 is obtained.
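The depth-by-depth calculation described above can be sketched in Python as a breadth-first pass over the states. The keyword set {he, she, his, hers} is an assumption: it is the usual textbook example for the Aho-Corasick automaton, chosen here because, with states numbered in insertion order, it yields the same nine failure values as listed above. The function names `build_goto` and `build_failure` are likewise illustrative.

```python
from collections import deque


def build_goto(keywords):
    """Return the steering (goto) function as a list of dicts, one per state."""
    goto = [{}]                            # state 0 is the start state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
    return goto


def build_failure(goto):
    """Compute failure values level by level (depth 1, then depth 2, ...)."""
    fail = [0] * len(goto)
    queue = deque(goto[0].values())        # depth-1 states all fail to state 0
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            # Walk failure links until some state has a transition on ch.
            while f != 0 and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
    return fail


goto = build_goto(["he", "she", "his", "hers"])
fail = build_failure(goto)
print(fail[1:])   # [0, 0, 0, 1, 2, 0, 3, 0, 3]
```

Processing states in breadth-first order guarantees that the failure value of every shallower state is already known when a deeper state needs it.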

In addition, when the AC automaton is run for the first time, the emotional dictionary needs to be loaded into memory. The model object of the AC automaton's emotional dictionary is designed with the singleton design pattern, and the persisted model is loaded into memory on the first run, so subsequent calls do not need to compile or load it again. The dictionary is thus compiled and loaded once and run many times, which makes full use of fast memory access and improves the efficiency of emotional feature information extraction. The double-array dictionary tree compresses the stored dictionary, and this compression reduces the number of disk IO reads and writes and the storage space occupied, which improves the efficiency of memory access.
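The load-once behaviour described above might look like the following Python sketch. The class name `SentimentDictionary`, the `get_instance` method, the `load_count` counter, and the `load_words` stand-in for reading the persisted dictionary are all hypothetical; the application does not name its classes or its persistence format.

```python
import threading


class SentimentDictionary:
    """Process-wide single instance; the word list is loaded only once."""

    _instance = None
    _lock = threading.Lock()
    load_count = 0                         # counts how often loading happens

    def __init__(self, words):
        self.words = frozenset(words)

    @classmethod
    def get_instance(cls, loader):
        # Double-checked locking: only the first caller pays the loading cost.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls(loader())
                    cls.load_count += 1
        return cls._instance


def load_words():
    # Hypothetical stand-in for reading the persisted dictionary from disk.
    return ["good", "bad", "excellent"]


first = SentimentDictionary.get_instance(load_words)
second = SentimentDictionary.get_instance(load_words)
print(first is second, SentimentDictionary.load_count)   # True 1
```

Every later call returns the object already held in memory, so the cost of compiling and loading the dictionary is paid a single time per process.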

Feature data refers to data with emotional feature information, that is, words that represent the emotion of the reviewer.

In an embodiment, referring to FIG. 3, the above-mentioned step S120 may include steps S121 to S122.

S121: Use an AC automata based on a double-array dictionary tree to perform pattern matching on the data to be analyzed to obtain an output result;

The output result refers to a collection of words that match emotional words.

In an embodiment, referring to FIG. 4, the above-mentioned step S121 may include steps S121a to S121i.

S121a. Split the data to be analyzed into several characters;

S121b. Search the emotional dictionary according to the characters.

The characters are searched for in the emotional dictionary. Because the emotional dictionary is constructed with the steering function and the failure function, the AC automaton converts character comparison into state transition when matching characters against the emotional dictionary during emotional feature information extraction. No backtracking is needed when scanning the data to be analyzed, which avoids the problem of repeated backward scanning.

S121c. Determine whether the characters match;

S121d. If they match, output the matched characters to a designated set to form an output result.

When the characters match and the output function of the emotional dictionary is not empty, the AC automaton outputs the matched pattern and adds the matched characters to the designated set to form the output result.

S121e. Determine whether the current character is the last character;

If yes, go to step S122;

S121f. If not, get the next character;

Return to step S121b;

S121g. If there is no match, turn to the character pointed to by the failure function.

When the current character does not match, the match fails at the current character, and the AC automaton turns to the character pointed to by the failure function.

S121h. Determine whether the character pointed to by the failure function is empty;

S121i. If not, output the character pointed to by the failure function to a designated set to form an output result.

When the character pointed to by the failure function is not empty, the character is output to the designated set to form the output result.

Return to step S121e;

If it is empty, enter the end step.

Repeat the above steps to match all characters in the data to be analyzed to obtain a complete output result.
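The scanning loop of steps S121a to S121i can be sketched in Python as a single left-to-right pass that follows the failure (invalidation) function on a mismatch instead of re-reading earlier characters. This is a minimal sketch using plain dictionaries rather than the double-array storage; the function names `build_automaton` and `match`, the English keyword set, and the `(start offset, word)` output format are assumptions made for illustration.

```python
from collections import deque


def build_automaton(keywords):
    """Build goto, failure, and output tables for a keyword list."""
    goto, fail, output = [{}], [0], [set()]
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(word)
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f != 0 and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the keywords that end at the failure state.
            output[nxt] |= output[fail[nxt]]
    return goto, fail, output


def match(text, automaton):
    """Scan text once; every character is read exactly one time."""
    goto, fail, output = automaton
    state, found = 0, []
    for pos, ch in enumerate(text):
        # On a mismatch, follow failure links rather than backtracking in text.
        while state != 0 and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in sorted(output[state]):
            found.append((pos - len(word) + 1, word))
    return found


automaton = build_automaton(["he", "she", "his", "hers"])
print(match("ushers", automaton))   # [(2, 'he'), (1, 'she'), (2, 'hers')]
```

The recorded start offset of each match is what a later step can use to place the matched word in a position-indexed structure.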

S122: Perform emotional feature information extraction on the output result to obtain feature data.

The sentiment dictionary provides prior knowledge of the emotion of a word, representing the emotional polarity and intensity of the word in most contexts. Emotional feature information is extracted based on the emotional dictionary: valuable emotional information is extracted from public opinion texts, and irregular, unstructured text is converted into structured feature information that the computer can understand and recognize. The final emotional feature information, that is, the feature data, is represented in the format: {emotional word, part of speech, position in the sentence, emotional tendency, emotional intensity}.

In an embodiment, referring to FIG. 5, the above step S122 may include steps S1221 to S1227.

S1221. Divide the output result into several atomic words.

Atomic words refer to the smallest unit of words. Based on AC automata, a sentence is split into all possible atomic words.

S1222. Establish an adjacency list for storing the array graph.

Use an adjacency list to store the graph.

S1223. Determine the position of the atomic word by using the offset of the atomic word;

S1224. Add the atomic word to the corresponding position of the array in the adjacency list;

Use the offset of each atomic word to determine its position, and add the atomic word to the adjacency list array at terms[offset].

S1225: Calculate the distance between the word frequencies of the atomic words of two nodes in the array based on the Viterbi algorithm;

S1226. Score the entire array graph stored in the adjacency table;

Based on the Viterbi algorithm, the distance between the atomic words of two nodes is calculated, and each node is assigned a distance that represents the length of the cumulative shortest path from the root node to the current node. The whole graph is then scored by depth-first traversal; each scoring step simply adds the distance from the root node to the current node.

S1227. Add the atomic word, location and attribute information with the shortest distance to the set emotion feature data set to form feature data.

Add the emotional words, location and attributes on the shortest path to the emotional feature data set. In this embodiment, the attribute information refers to information such as part of speech, position in a sentence, emotional tendency, and emotional strength.
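Steps S1221 to S1227 can be approximated by the following Python sketch, which stores candidate atomic words in an adjacency list indexed by start offset and then keeps the cumulative shortest distance to every offset, in the spirit of the Viterbi algorithm. Several points are assumptions rather than details from the application: the edge weight is taken to be the negative log of a word's relative frequency, the graph is scored in a left-to-right dynamic-programming pass rather than by depth-first traversal, unknown characters fall back to single-character words, and the names `segment`, `freq`, and `terms` are illustrative.

```python
import math


def segment(text, freq):
    """Pick the lowest-cost sequence of atomic words covering text."""
    n = len(text)
    total = sum(freq.values()) + 1         # +1 keeps the ratio below 1
    # Adjacency list indexed by start offset: terms[offset] -> candidates.
    terms = [[] for _ in range(n)]
    for start in range(n):
        for end in range(start + 1, n + 1):
            word = text[start:end]
            if word in freq:
                terms[start].append(word)
        if not terms[start]:
            terms[start].append(text[start])   # fall back to one character

    # dist[i] = cost of the cheapest path from offset 0 to offset i.
    dist = [math.inf] * (n + 1)
    back = [None] * (n + 1)
    dist[0] = 0.0
    for start in range(n):
        if dist[start] == math.inf:
            continue                       # offset not reachable
        for word in terms[start]:
            cost = -math.log(freq.get(word, 1) / total)   # assumed weight
            end = start + len(word)
            if dist[start] + cost < dist[end]:
                dist[end] = dist[start] + cost
                back[end] = (start, word)

    # Trace the shortest path back and keep each word with its offset.
    path, pos = [], n
    while pos > 0:
        start, word = back[pos]
        path.append((start, word))
        pos = start
    return path[::-1]


freq = {"very": 50, "good": 40, "go": 30, "od": 1}
print(segment("verygood", freq))   # [(0, 'very'), (4, 'good')]
```

In a fuller version each returned word would also carry its attribute information (part of speech, emotional tendency, emotional intensity) looked up from the dictionary.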

S130: Perform polarity prediction on the feature data through the public opinion polarity prediction model to obtain a prediction result;

In this embodiment, the prediction result refers to the polarity value of the public opinion data. The public opinion polarity prediction model is a model obtained by inputting the emotional feature data set extracted from the emotional dictionary into the XGBoost model to obtain the classification features, and then inputting the classification features to the logistic regression model for training.

The XGBoost model is used to construct new features from the input feature data. The new feature vector takes values of 0 or 1, and each element of the vector corresponds to a leaf node of a tree in the XGBoost model. When a sample passes through a tree and finally falls on one of its leaf nodes, the element corresponding to that leaf node is set to 1 and the elements corresponding to the other leaf nodes of the same tree are set to 0. The length of the new feature vector equals the total number of leaf nodes over all trees in the XGBoost model. Finally, these new features are added to the original features to train the model and obtain the public opinion polarity prediction model. The output of each individual tree is treated as a categorical input feature of a sparse linear classifier. As shown in FIG. 9, there are two trees: the upper tree has two leaf nodes and the lower tree has three leaf nodes, so the final feature is a five-dimensional vector. For an input x, suppose it falls on the second leaf node of the upper tree, coded [0,1], and on the first leaf node of the lower tree, coded [1,0,0]; the final code is then [0,1,1,0,0]. This code is used as the input feature of the prediction model and is input into the logistic regression model for prediction.
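The leaf encoding in the two-tree example can be reproduced with a few lines of Python. The helper name `encode_leaves` and its argument layout (a zero-based leaf index per tree plus the number of leaves per tree) are assumptions made for this sketch.

```python
def encode_leaves(leaf_indices, leaves_per_tree):
    """Concatenate one one-hot block per tree; exactly one 1 per block."""
    vector = []
    for leaf, n_leaves in zip(leaf_indices, leaves_per_tree):
        block = [0] * n_leaves
        block[leaf] = 1                    # the leaf the sample fell on
        vector.extend(block)
    return vector


# Upper tree: 2 leaves, sample falls on the second one (index 1).
# Lower tree: 3 leaves, sample falls on the first one (index 0).
print(encode_leaves([1, 0], [2, 3]))   # [0, 1, 1, 0, 0]
```

The vector length is the sum of the leaf counts, so adding trees widens the input of the downstream linear model without changing how each block is filled.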

In one embodiment, referring to FIG. 6, obtaining the public opinion polarity prediction model by inputting the emotional feature data set extracted from the emotional dictionary into the XGBoost model to obtain classification features, and then inputting the classification features into the logistic regression model for training, includes steps S131 to S136.

S131: Construct a decision tree according to the emotional feature data set extracted from the emotional dictionary;

S132. Input the decision tree into the XGBoost model to obtain the residual between the output of the XGBoost model and the actual output of the emotional feature data set extracted from the emotional dictionary.

S133. Construct a new decision tree according to the residual;

S134. Use the new decision tree to iterate the decision tree to obtain a combination of emotional feature information.

The aforementioned XGBoost (eXtreme Gradient Boosting) model is an open-source toolkit for massively parallel boosted trees. It is described here as the fastest and best open-source boosted tree toolkit at the time of filing. The XGBoost model is an ensemble of many CART regression trees.

A decision tree is constructed on the residuals between the output of the existing model and the actual sample output, and this is iterated continuously. Each iteration produces a classification feature with a large gain, and multiple discriminative combinations of emotional feature information are obtained through the multiple trees.

S135: Input the emotional feature information combination into a logistic regression model, and train the logistic regression model;

S136. Perform model persistence processing on the trained logistic regression model to obtain a public opinion polarity prediction model.

The emotional feature information combination is used as the input of the logistic regression model; the logistic regression model is trained and the model is persisted.
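The residual-fitting loop of steps S131 to S134 can be illustrated with a deliberately simplified Python sketch: each round fits a one-split tree (a stump) to the current residuals and adds it to the ensemble. This is plain squared-loss boosting on a single numeric feature, not XGBoost itself, which additionally uses second-order gradient statistics and regularization; the names `fit_stump` and `boost`, the learning rate, and the toy data are assumptions.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split minimizing squared error on residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        error = (sum((r - lmean) ** 2 for r in left)
                 + sum((r - rmean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, lmean, rmean)
    if best is None:                       # all inputs identical: no split
        mean = sum(residuals) / len(residuals)
        return xs[0], mean, mean
    return best[1:]


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Repeatedly fit a new stump to the residuals of the current model."""
    prediction = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, prediction)]
        threshold, lmean, rmean = fit_stump(xs, residuals)
        stumps.append((threshold, lmean, rmean))
        prediction = [
            p + learning_rate * (lmean if x <= threshold else rmean)
            for x, p in zip(xs, prediction)
        ]
    return stumps, prediction


stumps, prediction = boost([1, 2, 3, 4], [0, 0, 1, 1])
print([round(p, 3) for p in prediction])   # [0.0, 0.0, 1.0, 1.0]
```

In the pipeline described here, the leaves that each sample reaches in the fitted trees, rather than the summed prediction, would then be encoded and passed to the logistic regression model.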

XGBoost is an efficient implementation of the GBDT algorithm and supports parallel processing. The base learner is a CART regression tree. XGBoost approximates the objective function with a Taylor expansion, so that when the pseudo-residuals are calculated to learn the function FM(x), both the first derivative and the second derivative are used. In addition, a regularization term, which depends on the number of leaf nodes of the tree and the values of the leaf nodes, is added to the model cost function to control the complexity of the model and make the learned model simpler.
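For reference, the second-order approximation and the regularization term mentioned above are commonly written as follows. The notation is the one usually used for XGBoost and is an assumption here, since the application gives no formula: $l$ is the loss, $\hat{y}_i^{(t-1)}$ is the prediction after $t-1$ trees, $f_t$ is the tree added in round $t$, $g_i$ and $h_i$ are the first and second derivatives of the loss, $T$ is the number of leaf nodes, $w_j$ is the value of leaf $j$, and $\gamma$, $\lambda$ are regularization weights.

```latex
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t^{2}(x_i) \right] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\, \lambda \sum_{j=1}^{T} w_j^{2},

g_i = \frac{\partial\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^{2} l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^{2}}
```

The term $\gamma T$ penalizes the number of leaf nodes and the term $\tfrac{1}{2}\lambda \sum_j w_j^2$ penalizes large leaf values, which matches the two dependencies of the regularization term described in the text.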

Use the public opinion polarity prediction model to predict the content of the network public opinion text to obtain the polarity result, and use F-Score to evaluate the final classification result, which is defined as follows:

F-Score = (2 × Precision × Recall) / (Precision + Recall), where Precision is the precision and Recall is the recall rate.

Precision = the number of correctly classified instances of a certain class / the total number of instances predicted as that class by the public opinion polarity prediction model.

Recall = the number of correctly classified instances of a certain class / the total number of instances of that class in the test data.
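The three definitions above translate directly into Python. The function name `f_score`, the label strings, and the toy label lists are assumptions for illustration; the zero-denominator handling is also a choice made for this sketch.

```python
def f_score(y_true, y_pred, label):
    """Return (precision, recall, F-Score) for one class label."""
    correct = sum(1 for t, p in zip(y_true, y_pred)
                  if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)   # predicted as label
    actual = sum(1 for t in y_true if t == label)      # truly of that class
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


y_true = ["pos", "pos", "neg", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "pos", "pos"]
precision, recall, score = f_score(y_true, y_pred, "pos")
print(round(precision, 3), round(recall, 3), round(score, 3))   # 0.667 0.667 0.667
```

Here two of the three instances predicted as "pos" are correct and two of the three true "pos" instances are found, so precision, recall, and F-Score all equal 2/3.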

S140. Output the prediction result.

The prediction result is output as a JSON-formatted string. The output format is as follows: {"sentiTrend":"front","sentineg":0.278,"sentipos":0.722}.
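A string in this shape could be produced as in the following Python sketch. Only the three field names and the sample values come from the text; the function name `format_result`, the rounding to three decimals, and the assumption that the negative score is one minus the positive score (as the sample values suggest) are illustrative. The trend label is passed in unchanged because the text shows only one label value.

```python
import json


def format_result(trend, positive_probability):
    """Serialize one prediction using the field names shown in the text."""
    pos = round(positive_probability, 3)
    neg = round(1.0 - pos, 3)              # assumes the two scores sum to 1
    return json.dumps({"sentiTrend": trend, "sentineg": neg, "sentipos": pos})


result = format_result("front", 0.722)
print(json.loads(result)["sentineg"])   # 0.278
```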

A test was run on 200,000 (20w) microblog posts captured by crawlers. The feature data extraction speed of different methods is compared in Table 1, and the accuracy of different public opinion polarity prediction algorithms is compared in Table 2.

Table 1. Feature data extraction speed comparison

Algorithm	Dictionary size	Extraction speed
IK word segmentation	350,000	800,000/s
Ansj word segmentation	350,000	2,100,000/s
Fnlp word segmentation	350,000	1,200,000/s
Double-array AC automaton	350,000	16,000,000/s

Table 2. Accuracy comparison

Prediction algorithm	Accuracy	F1
Keyword statistical method	0.703	0.633
Logistic regression algorithm	0.718	0.646
GBDT + LR algorithm	0.803	0.725
XGBoost + LR algorithm	0.812	0.736

The above-mentioned public opinion polarity prediction method constructs the emotional dictionary with the storage structure of the double-array dictionary tree, which reduces the number of disk IO reads and writes and the physical storage space occupied. The AC automaton based on the double-array dictionary tree extracts the emotional feature information of the public opinion data against the emotional dictionary and converts character comparison into state transition, so no backtracking is needed when scanning the data to be analyzed, which avoids the problem of repeated backward scanning. Polarity prediction is performed on the feature data by the public opinion polarity prediction model, which effectively improves the efficiency and accuracy of public opinion polarity prediction analysis.

FIG. 10 is a schematic block diagram of a public opinion polarity prediction device provided by an embodiment of the present application. As shown in FIG. 10, corresponding to the above public opinion polarity prediction method, this application also provides a public opinion polarity prediction device. The public opinion polarity prediction device includes a unit for executing the above public opinion polarity prediction method, and the device can be configured in a server.

Specifically, referring to FIG. 10, the public opinion polarity prediction device 300 includes:

The public opinion data obtaining unit 301 is used to obtain public opinion data;

The extraction unit 302 is configured to extract emotional feature information from the data to be analyzed based on the AC automaton of the double-array dictionary tree to obtain feature data;

The prediction unit 303 is configured to perform polarity prediction on the feature data through the public opinion polarity prediction model to obtain a prediction result;

The output unit 304 is configured to output the prediction result.

In an embodiment, the extraction unit 302 includes:

The matching subunit is used to perform pattern matching on the data to be analyzed using the AC automata based on the double-array dictionary tree to obtain the output result;

The feature data forming subunit is used to extract emotional feature information from the output result to obtain feature data.

In an embodiment, the aforementioned matching subunit includes:

The splitting module is used to split the data to be analyzed into several characters;

The search module is used to search the emotional dictionary according to the characters;

The character judgment module is used to judge whether the character matches;

The first output module is used to output the matched character to the preset set to form the output result if the character matches;

The last character judgment module is used to judge whether the current character is the last character and, if so, to proceed to extracting emotional feature information from the output result to obtain the feature data;

The character acquisition module is used to acquire the next character if the current character is not the last character, and to return to searching the emotional dictionary according to the character;

The steering module is used to turn to the character pointed to by the failure function (the "invalid function") if the character does not match;

The pointing judgment module is used to judge whether the character pointed to by the failure function is empty and, if so, to enter the end step;

The second output module is used, if the character pointed to by the failure function is not empty, to output that character to the preset set to form the output result, and to return to judging whether the current character is the last character.
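As a non-limiting illustration, the matching flow carried out by the modules above can be sketched in Python as a standard Aho-Corasick (AC) automaton: each character drives a state transition, a mismatch follows the failure link instead of rescanning the text, and matched dictionary words are collected in an output list. For brevity this sketch stores transitions in nested dictionaries rather than in a double-array dictionary tree, and it reports whole matched words with their offsets; both choices, along with all function names, are assumptions of the sketch.

```python
from collections import deque


def build_automaton(patterns):
    """Return (goto, fail, out) tables for the given dictionary words."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # dictionary words recognised on reaching each state
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]  # inherit matches that end at the same position
    return goto, fail, out


def find_matches(text, automaton):
    """Scan text once, left to right, and return (start_offset, word) pairs."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]           # mismatch: follow the failure link
        state = goto[state].get(ch, 0)    # match: ordinary state transition
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits


automaton = build_automaton(["good", "not good", "bad"])
print(find_matches("this is not good", automaton))
```

The text index `i` only ever increases, which is the sense in which no backtracking over the data to be analyzed is needed.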

In an embodiment, the aforementioned feature data forming subunit includes:

The division module is used to divide the output result into several atomic words;

The adjacency list establishment module is used to establish the adjacency list that stores the array graph;

The position determination module is used to determine the position of each atomic word by using the offset of the atomic word;

The adding module is used to add each atomic word to the corresponding position of the array in the adjacency list;

The distance calculation module is used to calculate the distance between the atomic words of two nodes in the array based on the Viterbi algorithm;

The scoring module is used to score the entire array graph stored in the adjacency list;

The integration module is used to add the atomic words with the shortest distance, together with their positions and attribute information, to the preset emotional feature data set to form the feature data.
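As a non-limiting illustration, the steps performed by the modules above can be sketched in Python as a word graph stored in an adjacency list indexed by character offset, followed by a Viterbi-style forward pass that keeps the lowest total cost for each offset. The lexicon layout (word mapped to a cost and an attribute), the cost values, and the high-cost single-character fallback edge are assumptions of this sketch; this application does not specify how the distances are scored.

```python
UNKNOWN_COST = 10.0  # hypothetical penalty for a character not covered by the lexicon


def best_segmentation(text, lexicon):
    """lexicon maps word -> (cost, attribute). Returns (path, total_cost)."""
    n = len(text)
    # Adjacency list: edges[i] holds (end_offset, word, cost, attribute)
    # for every candidate atomic word that starts at offset i.
    edges = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n + 1):
            word = text[i:j]
            if word in lexicon:
                cost, attr = lexicon[word]
                edges[i].append((j, word, cost, attr))
        if not edges[i]:
            # Fallback edge so that every offset has a successor.
            edges[i].append((i + 1, text[i], UNKNOWN_COST, "unknown"))

    # Viterbi-style forward pass: dist[j] is the lowest cost of reaching offset j.
    inf = float("inf")
    dist = [inf] * (n + 1)
    back = [None] * (n + 1)
    dist[0] = 0.0
    for i in range(n):
        if dist[i] == inf:
            continue
        for j, word, cost, attr in edges[i]:
            if dist[i] + cost < dist[j]:
                dist[j] = dist[i] + cost
                back[j] = (i, word, attr)

    # Backtrack to collect (atomic word, position, attribute) on the shortest path.
    path, j = [], n
    while j > 0:
        i, word, attr = back[j]
        path.append((word, i, attr))
        j = i
    path.reverse()
    return path, dist[n]


lexicon = {"not": (1.0, "negation"), "no": (1.0, "negation"), "good": (1.0, "positive")}
features, total = best_segmentation("notgood", lexicon)
print(features, total)
```

Each entry of `features` carries the atomic word, its offset, and its attribute, which corresponds to the information added to the emotional feature data set.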

In an embodiment, the aforementioned device further includes:

The model training unit is used to input the emotional feature data set extracted by the emotional dictionary into the XGBoost model to obtain the classification features, and then input the classification features into the logistic regression model for training to obtain the public opinion polarity prediction model.

In an embodiment, the aforementioned model training unit includes:

The first construction subunit is used to construct a decision tree according to the emotional feature data set extracted from the emotional dictionary;

The first input subunit is used to input the decision tree into the XGBoost model to obtain the residual between the output of the XGBoost model and the actual output of the emotional feature data set extracted by the emotional dictionary;

The second construction subunit is used to construct a new decision tree according to the residual;

An iterative subunit for iterating the decision tree using the new decision tree to obtain a combination of emotional feature information;

The combined input subunit is used to input the emotional feature information combination into a logistic regression model to train the logistic regression model;

The processing subunit is used to perform model persistence processing on the trained logistic regression model to obtain a public opinion polarity prediction model.
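As a non-limiting illustration, the training flow of the subunits above can be sketched in plain Python. To keep the example self-contained, hand-written depth-1 regression trees fitted to residuals stand in for the XGBoost model, the leaf reached in each tree is one-hot encoded as the feature combination, and a small gradient-descent logistic regression is trained on those features; `pickle` stands in for model persistence. None of the function names, hyperparameters, or sample data below come from this application or from the XGBoost library.

```python
import math
import pickle


def fit_stump(X, residuals):
    """Return (feature, threshold, left_value, right_value) minimising squared error.

    Assumes at least one feature takes two or more distinct values.
    """
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[f] <= thr]
            right = [r for row, r in zip(X, residuals) if row[f] > thr]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, f, thr, lv, rv)
    return best[1:]


def boost(X, y, rounds=3, shrinkage=0.5):
    """Each round fits a new tree to the residual of the current prediction."""
    pred, stumps = [0.0] * len(X), []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        f, thr, lv, rv = fit_stump(X, residuals)
        stumps.append((f, thr, lv, rv))
        pred = [p + shrinkage * (lv if row[f] <= thr else rv) for p, row in zip(pred, X)]
    return stumps


def leaf_features(stumps, row):
    """One-hot encode the leaf reached in every tree: the feature combination."""
    feats = []
    for f, thr, _, _ in stumps:
        feats += [1.0, 0.0] if row[f] <= thr else [0.0, 1.0]
    return feats


def train_logistic(F, y, epochs=200, lr=0.5):
    w, b = [0.0] * len(F[0]), 0.0
    for _ in range(epochs):
        for x, yi in zip(F, y):
            p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            g = p - yi  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def train_polarity_model(X, y):
    stumps = boost(X, y)
    w, b = train_logistic([leaf_features(stumps, row) for row in X], y)
    return stumps, w, b


def predict(model, row):
    stumps, w, b = model
    z = sum(wi * xi for wi, xi in zip(w, leaf_features(stumps, row))) + b
    return 1 if z > 0 else 0


# Hypothetical rows: [positive-word hits, negative-word hits]; label 1 = positive polarity.
X = [[3, 0], [2, 1], [0, 2], [1, 3], [4, 1], [0, 4]]
y = [1, 1, 0, 0, 1, 0]
model = train_polarity_model(X, y)
restored = pickle.loads(pickle.dumps(model))  # persistence round trip
print(predict(restored, [5, 0]), predict(restored, [0, 3]))
```

In a practical deployment the boosted trees would more likely come from a gradient boosting library and the logistic regression from a machine learning library; the sketch only shows how the residual-driven trees and the logistic regression fit together.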

It should be noted that those skilled in the art can clearly understand that, for the specific implementation process of the above public opinion polarity prediction device and each of its units, reference may be made to the corresponding description in the foregoing method embodiment. For convenience and brevity of description, the details are not repeated here.

The above-mentioned public opinion polarity prediction device can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 11.

Please refer to FIG. 11, which is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 is a server.

Referring to FIG. 11, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.

The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. The computer program 5032 includes program instructions. When the program instructions are executed, the processor 502 can execute a public opinion polarity prediction method.

The processor 502 is used to provide calculation and control capabilities to support the operation of the entire computer device 500.

The internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503. When the computer program 5032 is executed by the processor 502, the processor 502 can execute a public opinion polarity prediction method.

The network interface 505 is used for network communication with other devices. Those skilled in the art can understand that the structure shown in FIG. 11 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied. The specific computer device 500 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.

Wherein, the processor 502 is configured to run a computer program 5032 stored in the memory to implement the following steps:

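Get public opinion data;

Extract emotional feature information from the data to be analyzed using the AC automaton based on the double-array dictionary tree to obtain feature data;

Perform polarity prediction on the feature data through the public opinion polarity prediction model to obtain a prediction result;

Output the prediction result.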

Wherein, the AC automaton based on the double-array dictionary tree is a multi-pattern matching algorithm that extracts emotional feature information from the data to be analyzed based on the emotional dictionary, and the emotional dictionary is constructed based on the double-array dictionary tree.

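In one embodiment, when the processor 502 implements the step of extracting emotional feature information from the data to be analyzed by the AC automaton based on the double-array dictionary tree to obtain feature data, it specifically implements the following steps:

Perform pattern matching on the data to be analyzed using the AC automaton based on the double-array dictionary tree to obtain an output result;

Extract emotional feature information from the output result to obtain feature data.

In one embodiment, when the processor 502 implements the step of performing pattern matching on the data to be analyzed using the AC automaton based on the double-array dictionary tree to obtain the output result, it specifically implements the following steps: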

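Split the data to be analyzed into several characters;

Search the emotional dictionary according to the characters;

Determine whether the character matches;

If it matches, output the matched character to the preset set to form the output result;

Determine whether the current character is the last character;

If yes, proceed to extracting emotional feature information from the output result to obtain feature data;

If not, get the next character;

Return to searching the emotional dictionary according to the character;

If it does not match, turn to the character pointed to by the failure function (the "invalid function");

Determine whether the character pointed to by the failure function is empty;

If not, output the character pointed to by the failure function to the preset set to form the output result;

Return to determining whether the current character is the last character;

If yes, enter the end step.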

Wherein, the public opinion polarity prediction model used to perform polarity prediction on the feature data to obtain the prediction result is a model obtained by inputting the emotional feature data set extracted by the emotional dictionary into the XGBoost model to obtain classification features, and then inputting the classification features into a logistic regression model for training.

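In one embodiment, when the processor 502 implements the step in which the public opinion polarity prediction model is obtained by inputting the emotional feature data set extracted by the emotional dictionary into the XGBoost model to obtain classification features, and then inputting the classification features into the logistic regression model for training, it specifically implements the following steps:

Construct a decision tree based on the emotional feature data set extracted by the emotional dictionary;

Input the decision tree into the XGBoost model to obtain the residual between the output of the XGBoost model and the actual output of the emotional feature data set extracted by the emotional dictionary;

Construct a new decision tree according to the residual;

Iterate the decision tree using the new decision tree to obtain a combination of emotional feature information;

Input the emotional feature information combination into a logistic regression model, and train the logistic regression model;

Perform model persistence processing on the trained logistic regression model to obtain the public opinion polarity prediction model.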

It should be understood that, in this embodiment of the application, the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSPs), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. Among them, the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the foregoing embodiments can be implemented by computer programs instructing relevant hardware. The computer program includes program instructions, and the computer program can be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the process steps of the foregoing method embodiments.

Therefore, this application also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps:

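Get public opinion data;

Extract emotional feature information from the data to be analyzed using the AC automaton based on the double-array dictionary tree to obtain feature data;

Perform polarity prediction on the feature data through the public opinion polarity prediction model to obtain a prediction result;

Output the prediction result.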

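In one embodiment, when the processor executes the computer program to implement the step of extracting emotional feature information from the data to be analyzed by the AC automaton based on the double-array dictionary tree to obtain feature data, the following steps are specifically implemented:

Perform pattern matching on the data to be analyzed using the AC automaton based on the double-array dictionary tree to obtain an output result;

Extract emotional feature information from the output result to obtain feature data.

In an embodiment, when the processor executes the computer program to implement the step of performing pattern matching on the data to be analyzed using the AC automaton based on the double-array dictionary tree to obtain the output result, the following steps are specifically implemented: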

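Split the data to be analyzed into several characters;

Search the emotional dictionary according to the characters;

Determine whether the character matches;

If it matches, output the matched character to the preset set to form the output result;

Determine whether the current character is the last character;

If yes, proceed to extracting emotional feature information from the output result to obtain feature data;

If not, get the next character;

Return to searching the emotional dictionary according to the character;

If it does not match, turn to the character pointed to by the failure function (the "invalid function");

Determine whether the character pointed to by the failure function is empty;

If not, output the character pointed to by the failure function to the preset set to form the output result;

Return to determining whether the current character is the last character;

If yes, enter the end step.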

In an embodiment, when the processor executes the computer program to implement the step of extracting emotional feature information from the output result to obtain feature data, the following steps are specifically implemented:

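Divide the output result into several atomic words;

Establish an adjacency list for storing the array graph;

Use the offset of each atomic word to determine the position of the atomic word;

Add each atomic word to the corresponding position of the array in the adjacency list;

Calculate the distance between the atomic words of two nodes in the array based on the Viterbi algorithm;

Score the entire array graph stored in the adjacency list;

Add the atomic words with the shortest distance, together with their positions and attribute information, to the preset emotional feature data set to form the feature data.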

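In an embodiment, when the processor executes the computer program to implement the step in which the public opinion polarity prediction model is obtained by inputting the emotional feature data set extracted by the emotional dictionary into the XGBoost model to obtain classification features, and then inputting the classification features into the logistic regression model for training, the following steps are specifically implemented:

Construct a decision tree based on the emotional feature data set extracted by the emotional dictionary;

Input the decision tree into the XGBoost model to obtain the residual between the output of the XGBoost model and the actual output of the emotional feature data set extracted by the emotional dictionary;

Construct a new decision tree according to the residual;

Iterate the decision tree using the new decision tree to obtain a combination of emotional feature information;

Input the emotional feature information combination into a logistic regression model, and train the logistic regression model;

Perform model persistence processing on the trained logistic regression model to obtain the public opinion polarity prediction model.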

The storage medium may be a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, an optical disk, or any other computer-readable storage medium that can store program code.

A person of ordinary skill in the art may realize that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.

In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of each unit is only a logical function division, and there may be other division methods in actual implementation. For example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.

The steps in the method of the embodiment of the present application can be reordered, merged, and deleted according to actual needs. The units in the device in the embodiment of the present application may be combined, divided, and deleted according to actual needs. In addition, the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium. Based on this understanding, the technical solution of this application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a terminal, or a network device, etc.) to execute all or part of the steps of the method described in each embodiment of the present application.

The above are only specific implementations of this application, but the protection scope of this application is not limited to them. Anyone familiar with the technical field can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and these modifications or replacements shall be covered by the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims

A public opinion polarity prediction method, characterized in that it includes:

Get public opinion data;

Extract emotional feature information from the data to be analyzed using the AC automaton based on the double-array dictionary tree to obtain feature data;

Use the public opinion polarity prediction model to predict the polarity of the feature data to obtain the prediction result;

Output the prediction result.
The public opinion polarity prediction method according to claim 1, wherein the AC automaton based on the double-array dictionary tree is a multi-pattern matching algorithm that extracts emotional feature information from the data to be analyzed based on the emotional dictionary, and the emotional dictionary is constructed based on the double-array dictionary tree.
The public opinion polarity prediction method according to claim 2, wherein the AC automata based on the double-array dictionary tree extracts emotional feature information from the data to be analyzed to obtain the feature data, comprising:

Use AC automata based on double-array dictionary tree to perform pattern matching on the data to be analyzed to obtain output results;

Perform emotional feature information extraction on the output result to obtain feature data.
The method for predicting the polarity of public opinion according to claim 3, wherein performing pattern matching on the data to be analyzed using the AC automaton based on the double-array dictionary tree to obtain the output result comprises:

Split the data to be analyzed into several characters;

Searching the emotional dictionary according to the characters;

Determine whether the character matches;

If it matches, output the matched character to the preset set to form the output result;

Determine whether the current character is the last character;

If yes, proceed to the extraction of emotional feature information on the output result to obtain feature data;

If not, get the next character;

Return to the search emotion dictionary according to the character;

If it does not match, turn to the character pointed to by the failure function (the "invalid function");

Determine whether the character pointed to by the failure function is empty;

If not, output the character pointed to by the failure function to the preset set to form the output result;

Return the judgment whether the current character is the last character;

If yes, enter the end step.
The method for predicting the polarity of public opinion according to claim 4, wherein said extracting emotional feature information from the output result to obtain feature data comprises:

Divide the output result into several atomic words;

Establish an adjacency table for storing array graphs;

Use the offset of the atomic word to determine the position of the atomic word;

Add the atomic word to the corresponding position of the array in the adjacency list;

Calculate the distance between the atomic words of two nodes in the array based on the Viterbi algorithm;

Score the entire array graph stored in the adjacency table;

The atomic words with the shortest distance, together with their positions and attribute information, are added to the preset emotional feature data set to form the feature data.
The method for predicting the polarity of public opinion according to claim 2, wherein, in performing polarity prediction on the feature data through the public opinion polarity prediction model to obtain the prediction result, the public opinion polarity prediction model is a model obtained by inputting the emotional feature data set extracted by the emotional dictionary into the XGBoost model to obtain classification features, and then inputting the classification features into a logistic regression model for training.
The public opinion polarity prediction method according to claim 6, characterized in that obtaining the public opinion polarity prediction model by inputting the emotional feature data set extracted by the emotional dictionary into the XGBoost model to obtain classification features, and then inputting the classification features into the logistic regression model for training, includes:

Construct a decision tree based on the emotional feature data set extracted from the emotional dictionary;

Input the decision tree into the XGBoost model to obtain the residual between the output of the XGBoost model and the actual output of the emotional feature data set extracted by the emotional dictionary;

Construct a new decision tree according to the residual;

Iterating the decision tree using the new decision tree to obtain a combination of emotional feature information;

Input the emotional feature information combination into a logistic regression model, and train the logistic regression model;

Perform model persistence processing on the trained logistic regression model to obtain a public opinion polarity prediction model.
A public opinion polarity prediction device, characterized in that it includes:

A public opinion data acquisition unit, used to acquire public opinion data;

An extraction unit, used to extract emotional feature information from the data to be analyzed using the AC automaton based on the double-array dictionary tree to obtain feature data;

The prediction unit is used to predict the polarity of the feature data through the public opinion polarity prediction model to obtain the prediction result;

The output unit is used to output the prediction result.
A computer device, characterized in that the computer device includes a memory and a processor, a computer program is stored on the memory, and the processor implements the method according to any one of claims 1 to 7 when executing the computer program.
A storage medium, wherein the storage medium stores a computer program, and when the computer program is executed by a processor, the method according to any one of claims 1 to 7 can be implemented.