WO2020186627A1 - Public opinion polarity prediction method and apparatus, computer device, and storage medium

Info

Publication number
WO2020186627A1
WO2020186627A1 PCT/CN2019/089224 CN2019089224W WO2020186627A1 WO 2020186627 A1 WO2020186627 A1 WO 2020186627A1 CN 2019089224 W CN2019089224 W CN 2019089224W WO 2020186627 A1 WO2020186627 A1 WO 2020186627A1
Authority
WO
WIPO (PCT)
Prior art keywords
public opinion
emotional
data
polarity
model
Prior art date
Application number
PCT/CN2019/089224
Other languages
French (fr)
Chinese (zh)
Inventor
耿伟
谷国栋
周起如
Original Assignee
深圳市赛为智能股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市赛为智能股份有限公司
Publication of WO2020186627A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to information processing methods, and more specifically to public opinion polarity prediction methods, devices, computer equipment and storage media.
  • Existing public opinion analysis systems generally rely on keyword analysis, which is both inefficient and inaccurate. Pattern matching based on traditional Chinese word segmentation has to rescan the text backward multiple times, so its performance is relatively low; existing systems compute emotional polarity with fairly crude statistical methods, and because of limited feature information and the influence of context, their accuracy is low; the public opinion sentiment dictionary also occupies a relatively large amount of storage space, which costs performance.
  • the purpose of this application is to overcome the shortcomings of the prior art and provide a public opinion polarity prediction method, device, computer equipment and storage medium.
  • a public opinion polarity prediction method including:
  • the AC automata based on the double-array dictionary tree extracts emotional feature information from the data to be analyzed to obtain feature data;
  • the AC automaton based on the double-array dictionary tree is a multi-pattern matching algorithm that extracts emotional feature information from the data to be analyzed based on the sentiment dictionary, and the sentiment dictionary is constructed based on the double-array dictionary tree.
  • the AC automata based on the double-array dictionary tree extracts emotional feature information from the data to be analyzed to obtain feature data, including:
  • the pattern matching of the AC automata based on the double-array dictionary tree to obtain the output result includes:
  • the extraction of emotional feature information from the output result to obtain feature data includes:
  • the atomic words with the shortest distance, together with their positions and attribute information, are added to the preset emotion feature data set to form feature data.
  • A further technical solution is: in performing polarity prediction on the feature data through the public opinion polarity prediction model to obtain the prediction result, the public opinion polarity prediction model is a model obtained by inputting the sentiment feature data set extracted with the sentiment dictionary into the XGBoost model to obtain classification features, and then inputting the classification features into a logistic regression model for training.
  • the public opinion polarity prediction model is a model obtained by inputting the emotional feature data set extracted from the emotional dictionary into the XGBoost model to obtain classification features, and then inputting the classification features into the logistic regression model for training, including:
  • This application also provides a public opinion polarity prediction device, including:
  • Public opinion data acquisition unit for acquiring public opinion data
  • the extraction unit is used to extract emotional feature information from the data to be analyzed based on the AC automaton of the double-array dictionary tree to obtain feature data;
  • the prediction unit is used to predict the polarity of the feature data through the public opinion polarity prediction model to obtain the prediction result;
  • the output unit is used to output the prediction result.
  • the present application also provides a computer device that includes a memory and a processor, the memory stores a computer program, and the processor implements the above-mentioned method when the computer program is executed.
  • the present application also provides a storage medium storing a computer program, and the computer program can implement the above-mentioned method when being executed by a processor.
  • The present application constructs the emotional dictionary with the storage structure of the double-array dictionary tree, which reduces the number of disk IO reads and writes and the physical storage space occupied.
  • An AC automaton based on the double-array dictionary tree extracts the sentiment feature information of the public opinion data against the sentiment dictionary, converting character comparison into state transition.
  • The public opinion polarity prediction model then performs polarity prediction on the feature data, effectively improving the efficiency and accuracy of public opinion polarity prediction analysis.
  • FIG. 1 is a schematic diagram of an application scenario of a public opinion polarity prediction method provided by an embodiment of the application
  • FIG. 2 is a schematic flowchart of a public opinion polarity prediction method provided by an embodiment of the application
  • FIG. 3 is a schematic diagram of a sub-process of the method for predicting public opinion polarity provided by an embodiment of the application;
  • FIG. 4 is a schematic diagram of a sub-process of a method for predicting public opinion polarity provided by an embodiment of the application;
  • FIG. 5 is a schematic diagram of a sub-process of a method for predicting public opinion polarity provided by an embodiment of the application
  • FIG. 6 is a schematic diagram of a sub-process of the method for predicting public opinion polarity provided by an embodiment of the application
  • Fig. 7 is a state transition diagram provided by an embodiment of the application.
  • FIG. 8 is a schematic diagram of a failure function provided by an embodiment of the application.
  • FIG. 9 is a schematic diagram of public opinion polarity prediction provided by an embodiment of the application.
  • FIG. 10 is a schematic block diagram of a public opinion polarity prediction device provided by an embodiment of the application.
  • FIG. 11 is a schematic block diagram of a computer device provided by an embodiment of the application.
  • the public opinion polarity prediction method is applied to the server.
  • Based on the crawled content of the target public opinion website, the server applies preprocessing, analysis by the AC automaton based on the double-array dictionary tree, and prediction by the public opinion polarity prediction model to obtain the public opinion polarity result, which it outputs to the terminal for display.
  • FIG. 2 is a schematic flowchart of a public opinion polarity prediction method provided by an embodiment of the present application. As shown in FIG. 2, the method includes the following steps S110 to S140.
  • public opinion data refers to data representing the emotions of reviewers.
  • step S110 may include the following steps:
  • the content of the target public opinion website refers to content originating from a webpage website.
  • the content of the target public opinion website undergoes preprocessing, web page parsing, and de-noising to obtain public opinion data.
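  • The source does not specify how parsing and de-noising are done. As a minimal sketch, assuming that "de-noising" means dropping markup, scripts, and styles and normalizing whitespace, the step could look like this in Python (the names `clean_page` and `_TextExtractor` are hypothetical):

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of <script>/<style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_page(html):
    """Return the visible text of a page with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()
```

  • A real crawler would likely also remove navigation bars, advertisements, and similar page furniture; that is outside this sketch.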
  • the AC automata based on the double-array dictionary tree extracts emotional feature information from the data to be analyzed to obtain feature data.
  • the AC automaton based on the double-array dictionary tree is a multi-pattern matching algorithm that extracts emotional feature information from the data to be analyzed based on the emotional dictionary.
  • the sentiment dictionary is constructed based on a double array dictionary tree.
  • the emotional dictionary refers to a collection of all emotional words.
  • the double-array dictionary tree (double-array trie) is a compressed dictionary tree that represents the entire tree with two one-dimensional arrays, BASE and CHECK.
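  • To make the BASE/CHECK representation concrete, here is a minimal Python sketch of a double-array trie. It follows the usual convention that a transition from state `s` on character code `c` goes to `t = BASE[s] + c` and is valid only if `CHECK[t] == s`. The function names, the character coding, the free-slot marker `-1`, and the separate `terminal` set are assumptions made for illustration, not details from the source.

```python
from collections import deque


def build_double_array(words):
    """Build BASE/CHECK arrays for a set of words (illustrative, not optimized)."""
    # Step 1: an ordinary trie; node 0 is the root.
    children, is_end = [{}], [False]
    for word in words:
        node = 0
        for ch in word:
            if ch not in children[node]:
                children.append({})
                is_end.append(False)
                children[node][ch] = len(children) - 1
            node = children[node][ch]
        is_end[node] = True

    # Step 2: character codes start at 1 so that slot 0 stays reserved for the root.
    alphabet = sorted({ch for word in words for ch in word})
    code = {ch: i + 1 for i, ch in enumerate(alphabet)}

    base, check = [0], [-2]  # -1 marks a free slot, -2 marks the root slot
    terminal = set()         # states at which a dictionary word ends

    def ensure(size):
        while len(base) < size:
            base.append(0)
            check.append(-1)

    queue = deque([(0, 0)])  # (trie node, double-array state)
    while queue:
        node, state = queue.popleft()
        if is_end[node]:
            terminal.add(state)
        codes = [code[ch] for ch in children[node]]
        if not codes:
            continue
        # Find the smallest base whose target slots are all free.
        b = 1
        while True:
            ensure(b + max(codes) + 1)
            if all(check[b + c] == -1 for c in codes):
                break
            b += 1
        base[state] = b
        for ch, child in children[node].items():
            t = b + code[ch]
            check[t] = state
            queue.append((child, t))
    return base, check, code, terminal


def contains(word, base, check, code, terminal):
    """Look a word up using only array arithmetic on BASE and CHECK."""
    state = 0
    for ch in word:
        c = code.get(ch)
        if c is None:
            return False
        t = base[state] + c
        if t >= len(check) or check[t] != state:
            return False
        state = t
    return state in terminal
```

  • A lookup touches only two flat arrays, which is where the storage and access savings described above would come from.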
  • For example, for an emotional dictionary composed of {Chinese national team national team}, a state transition diagram must first be built in order to construct the steering (goto) function.
  • Initially, the state transition graph contains only a starting state 0.
  • Each keyword p is then inserted into the graph in turn, adding new vertices and edges, so that the graph ends up containing a path that spells out the keyword p.
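  • The construction described above can be sketched with a standard Aho-Corasick automaton in Python. For readability this sketch stores the goto function in plain dictionaries rather than in a double-array trie, and the function names `build_automaton` and `find_all` are hypothetical.

```python
from collections import deque


def build_automaton(keywords):
    """Build the goto, failure, and output functions of an AC automaton."""
    goto = [{}]       # goto[state][char] -> next state; state 0 is the start state
    output = [set()]  # output[state] -> keywords recognized on reaching that state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(word)

    # Failure links are computed breadth-first; depth-1 states fail to state 0.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] |= output[fail[nxt]]
    return goto, fail, output


def find_all(text, automaton):
    """Scan the text once and return (start offset, keyword) for every match."""
    goto, fail, output = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            hits.append((i - len(word) + 1, word))
    return hits
```

  • The text pointer `i` only moves forward; on a mismatch the automaton follows failure links instead of rescanning, which is the "character comparison converted into state transition" behavior the application relies on.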
  • The emotion dictionary needs to be loaded into memory. The model object for the AC automaton's emotion dictionary is designed with the singleton pattern: the persisted model is loaded into memory on the first run, and subsequent calls need no further compilation or loading. The dictionary is thus compiled and loaded once and run many times, which takes full advantage of fast memory access and improves the efficiency of emotional feature information extraction.
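  • A minimal Python sketch of this load-once behavior follows. The class name `SentimentDictionary`, the `get` method, and the `loader` callback are hypothetical; the source only states that a singleton design is used.

```python
import threading


class SentimentDictionary:
    """Process-wide holder for the compiled dictionary (singleton sketch)."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self, words):
        self.words = frozenset(words)

    @classmethod
    def get(cls, loader):
        """Return the shared instance, calling loader() only on first use."""
        if cls._instance is None:
            with cls._lock:
                # Re-check inside the lock so concurrent callers load only once.
                if cls._instance is None:
                    cls._instance = cls(loader())
        return cls._instance
```

  • Every later call to `SentimentDictionary.get(...)` returns the same in-memory object, so the cost of reading and compiling the dictionary is paid once.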
  • The double-array dictionary tree compresses the storage space; this compression reduces the number of disk IO reads and writes and the storage space occupied, which improves the efficiency of memory access.
  • Feature data refers to data with emotional feature information, that is, words that represent the emotion of the reviewer.
  • the above-mentioned step S120 may include steps S121 to S122.
  • S121 Use an AC automata based on a double-array dictionary tree to perform pattern matching on the data to be analyzed to obtain an output result;
  • the output result refers to a collection of words that match emotional words.
  • the above-mentioned step S121 may include steps S121a to S121i.
  • When the characters match and the output function of the emotional dictionary is not empty, the AC automaton outputs the matched pattern and adds the matched characters to the preset set to form the output result.
  • If yes, go to step S122;
  • S122 Perform emotional feature information extraction on the output result to obtain feature data.
  • the sentiment dictionary provides a priori knowledge of the emotion of a word, which represents the emotion polarity and intensity of the word in most contexts. Extract emotional feature information based on the emotional dictionary, extract valuable emotional information from public opinion texts, and convert unstructured text with no regularity into structured feature information that the computer can understand and recognize.
  • the final emotional feature information, that is, the feature data, is represented in the format {emotional word, part of speech, position in the sentence, emotional tendency, emotional intensity}.
  • step S122 may include steps S1221 to S1227.
  • Atomic words refer to the smallest unit of words. Based on AC automata, a sentence is split into all possible atomic words.
  • S1225 Calculate the distance between the word frequencies of the atomic words of two nodes in the array based on the Viterbi algorithm
  • The distance between the atomic terms of two nodes is calculated, and each node is assigned a distance representing the length of the accumulated shortest path from the root node to that node. The whole graph is then scored by depth-first traversal; each scoring step only needs to add the distance from the root node to the current node.
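  • The source does not give the distance formula, so the following Python sketch makes two labeled assumptions: the cost of an atomic word is the negative log of its relative frequency, and the shortest path is found by a dynamic-programming pass in offset order (one common way to realize a Viterbi-style search) rather than by the depth-first scoring the text mentions. The function name `shortest_segmentation` and the fallback cost for unknown characters are also hypothetical.

```python
import math


def shortest_segmentation(text, freq):
    """Pick the lowest-cost sequence of atomic words covering the text."""
    total = sum(freq.values())
    n = len(text)

    # Adjacency list indexed by start offset: edges[i] = [(end, word, cost), ...]
    edges = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n + 1):
            word = text[i:j]
            if word in freq:
                edges[i].append((j, word, -math.log(freq[word] / total)))
        if not edges[i]:
            # Assumption: an unknown single character gets a high fixed cost.
            edges[i].append((i + 1, text[i], -math.log(1 / (total + 1))))

    # best[j] is the accumulated shortest distance from the root (offset 0) to j.
    best = [math.inf] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    for i in range(n):
        for j, word, cost in edges[i]:
            if best[i] + cost < best[j]:
                best[j] = best[i] + cost
                back[j] = (i, word)

    # Walk the back-pointers to recover (offset, word) pairs in order.
    path, pos = [], n
    while pos > 0:
        i, word = back[pos]
        path.append((i, word))
        pos = i
    return path[::-1]
```

  • Each returned pair carries the offset of the atomic word, which corresponds to the "position" later stored alongside the attribute information.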
  • the attribute information refers to information such as part of speech, position in a sentence, emotional tendency, and emotional strength.
  • the prediction result refers to the polarity value of the public opinion data.
  • the public opinion polarity prediction model is a model obtained by inputting the emotional feature data set extracted from the emotional dictionary into the XGBoost model to obtain the classification features, and then inputting the classification features to the logistic regression model for training.
  • the input feature data uses the XGBoost model to construct new features.
  • the constructed new feature vector takes values 0/1, and each element of the vector corresponds to a leaf node of a tree in the XGBoost model.
  • When a sample falls on a leaf node of a tree, the element corresponding to that leaf node in the new feature vector is 1, and the elements corresponding to the other leaf nodes of that tree are 0.
  • The length of the new feature vector is equal to the total number of leaf nodes contained in all trees in the XGBoost model.
  • these new features are added to the original features to train the model to obtain the public opinion polarity prediction model.
  • each individual tree is regarded as the classification input feature of the sparse linear classifier.
  • For example, suppose there are two trees: the upper tree has two leaf nodes and the lower tree has three leaf nodes, so the final feature is a five-dimensional vector.
  • Suppose an input falls on the second leaf node of the upper tree, coded [0,1], and on the first leaf node of the lower tree, coded [1,0,0]. The final code is then [0,1,1,0,0]; this code is used as the input feature of the prediction model and is input into the logistic regression model for prediction.
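  • The two-tree example above can be reproduced with a few lines of Python. This sketch assumes the per-tree leaf indices are already known (in practice they would come from the trained boosted-tree model); the function name `encode_leaves` is hypothetical.

```python
def encode_leaves(leaf_indices, leaves_per_tree):
    """Concatenate one one-hot block per tree.

    leaf_indices[k] is the zero-based leaf the sample falls on in tree k,
    and leaves_per_tree[k] is the number of leaves in tree k.
    """
    vector = []
    for leaf, n_leaves in zip(leaf_indices, leaves_per_tree):
        one_hot = [0] * n_leaves
        one_hot[leaf] = 1
        vector.extend(one_hot)
    return vector


# Second leaf of the two-leaf tree, first leaf of the three-leaf tree.
features = encode_leaves([1, 0], [2, 3])
```

  • Here `features` is the five-dimensional vector [0, 1, 1, 0, 0] from the example, which would then be fed to the logistic regression model.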
  • As described above, the public opinion polarity prediction model is obtained by inputting the emotional feature data set extracted with the emotional dictionary into the XGBoost model to obtain classification features, and then inputting the classification features into the logistic regression model for training; this training includes steps S131 to S136.
  • S132: Input the decision tree into the XGBoost model to obtain the residual between the output of the XGBoost model and the actual output of the emotional feature data set extracted with the emotional dictionary.
  • XGBoost stands for eXtreme Gradient Boosting; the XGBoost model is an ensemble of many CART regression trees.
  • the emotional feature information combination is used as the input of the logistic regression model; the logistic regression model is trained and the model is persisted.
  • XGBoost is an efficient implementation of the GBDT algorithm and supports parallel processing.
  • the base learner uses a CART regression tree.
  • the regularization term is related to the number of leaf nodes of the tree and the value of the leaf nodes;
  • XGBoost approximates the objective function according to the Taylor expansion and calculates the pseudo residual
  • the learning function FM(x) uses not only the first derivative but also the second derivative.
  • a regular term is added to the model cost function to control the complexity of the model and make the learned model simpler.
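  • For reference, the second-order approximation and the regular term mentioned above are usually written as follows in the standard XGBoost formulation. The symbols $g_i$, $h_i$, $\gamma$, $\lambda$, $T$, and $w_j$ do not appear in the source and are introduced here only for illustration.

```latex
\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t^{2}(x_i) \right] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\, \lambda \sum_{j=1}^{T} w_j^{2}
```

  • Here $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the previous round's prediction for sample $i$, $T$ is the number of leaf nodes of the new tree $f_t$, and $w_j$ is the value of leaf $j$, which matches the statement that the regular term depends on the number of leaf nodes and their values.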
  • F-Score = (2 × Precision × Recall)/(Precision + Recall), where Precision denotes the precision and Recall denotes the recall.
  • Precision = the number of correctly classified instances of a certain class / the total number of instances predicted as that class by the public opinion polarity prediction model.
  • Recall = the number of correctly classified instances of a certain class / the total number of instances of that class in the test data.
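  • As a small worked illustration of these three definitions, the following Python sketch computes them for one class; the function name `precision_recall_f` and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def precision_recall_f(y_true, y_pred, label):
    """Per-class precision, recall, and F-score as defined above."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # predicted as this class
    actual = sum(1 for t in y_true if t == label)     # truly of this class
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score
```

  • For instance, with true labels ["pos", "pos", "neg", "neg"] and predictions ["pos", "neg", "neg", "neg"], the "pos" class has one correct prediction out of one predicted and two actual instances.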
  • the output of the prediction result adopts a json formatted string.
  • the output format is as follows: {"sentiTrend":"front","sentineg":0.278,"sentipos":0.722}.
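  • A minimal Python sketch of producing such a string from a predicted positive-class probability is shown below. The key names and the "front" value are taken from the example above; the function name `format_prediction`, the "negative" label, the tie-breaking rule, and the rounding to three decimals are assumptions.

```python
import json


def format_prediction(p_positive):
    """Serialize a polarity prediction as a JSON string."""
    p_negative = 1.0 - p_positive
    # "negative" is a hypothetical label; the source only shows "front".
    trend = "front" if p_positive >= p_negative else "negative"
    return json.dumps({
        "sentiTrend": trend,
        "sentineg": round(p_negative, 3),
        "sentipos": round(p_positive, 3),
    })
```

  • Calling `format_prediction(0.722)` yields a string carrying the same three fields and values as the example output.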
  • The above-mentioned public opinion polarity prediction method constructs the emotional dictionary with the storage structure of the double-array dictionary tree, which reduces the number of disk IO reads and writes and the physical storage space occupied. The AC automaton based on the double-array dictionary tree extracts emotional feature information from the public opinion data against the emotional dictionary, transforming character comparison into state transition.
  • FIG. 10 is a schematic block diagram of a public opinion polarity prediction device provided by an embodiment of the present application. As shown in FIG. 10, corresponding to the above public opinion polarity prediction method, this application also provides a public opinion polarity prediction device.
  • the public opinion polarity prediction device includes a unit for executing the above public opinion polarity prediction method, and the device can be configured in a server.
  • the public opinion polarity prediction device 300 includes:
  • the public opinion data obtaining unit 301 is used to obtain public opinion data
  • the extraction unit 302 is configured to extract emotional feature information from the data to be analyzed based on the AC automaton of the double-array dictionary tree to obtain feature data;
  • the prediction unit 303 is configured to perform polarity prediction on the feature data through the public opinion polarity prediction model to obtain a prediction result;
  • the output unit 304 is configured to output the prediction result.
  • the extraction unit 302 includes:
  • the matching subunit is used to perform pattern matching on the data to be analyzed using the AC automata based on the double-array dictionary tree to obtain the output result;
  • the feature data forming subunit is used to extract emotional feature information from the output result to obtain feature data.
  • the aforementioned matching subunit includes:
  • the search module is used to search the emotional dictionary according to the characters
  • the character judgment module is used to judge whether the character matches
  • the first output module is used to, if the character matches, output the matched character to the preset set to form the output result;
  • the last character judging module is used to judge whether the current character is the last character and, if so, to proceed to extracting emotional feature information from the output result to obtain the feature data;
  • the character acquisition module is used to, if not, acquire the next character and return to searching the emotional dictionary according to the character;
  • the steering module is used to, if the character does not match, turn to the character pointed to by the failure function;
  • the pointing judgment module is used to judge whether the character pointed to by the failure function is empty and, if so, to proceed to the end step;
  • the second output module is used to, if not, output the character pointed to by the failure function to the preset set to form the output result, and to return to judging whether the current character is the last character.
  • the aforementioned feature data forming subunit includes:
  • the division module is used to divide the output result into several atomic words
  • the adjacency list establishment module is used to establish the adjacency list for storing the array graph
  • the position determination module is used to determine the position of the atomic word by using the offset of the atomic word
  • the adding module is used to add the atomic words to the corresponding positions of the array in the adjacency list
  • the distance calculation module is used to calculate the distance between the atomic words of two nodes in the array based on the Viterbi algorithm
  • the scoring module is used to score the entire array graph stored in the adjacency list
  • the integration module is used to add the atomic words with the shortest distance, together with their positions and attribute information, to the preset emotion feature data set to form feature data.
  • the aforementioned device further includes:
  • the model training unit is used to input the emotional feature data set extracted by the emotional dictionary into the XGBoost model to obtain the classification features, and then input the classification features into the logistic regression model for training to obtain the public opinion polarity prediction model.
  • the aforementioned model training unit includes:
  • the first construction subunit is used to construct a decision tree according to the emotional feature data set extracted from the emotional dictionary
  • the first input subunit is used to input the decision tree into the XGBoost model to obtain the residual between the output of the XGBoost model and the actual output of the emotional feature data set extracted with the emotional dictionary;
  • the second construction subunit is used to construct a new decision tree according to the residual
  • the combined input subunit is used to input the emotional feature information combination into a logistic regression model to train the logistic regression model;
  • the processing subunit is used to perform model persistence processing on the trained logistic regression model to obtain a public opinion polarity prediction model.
  • the above-mentioned public opinion polarity prediction device can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 11.
  • FIG. 11 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device 500 is a server.
  • the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
  • the non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032.
  • the computer program 5032 includes program instructions.
  • the processor 502 can execute a public opinion polarity prediction method.
  • the processor 502 is used to provide calculation and control capabilities to support the operation of the entire computer device 500.
  • the internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503.
  • the network interface 505 is used for network communication with other devices.
  • the structure shown in FIG. 11 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied.
  • the specific computer device 500 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.
  • the processor 502 is configured to run a computer program 5032 stored in the memory to implement the following steps:
  • the AC automata based on the double-array dictionary tree extracts emotional feature information from the data to be analyzed to obtain feature data;
  • the AC automaton based on the double-array dictionary tree is a multi-pattern matching algorithm that extracts emotional feature information from the data to be analyzed based on the sentiment dictionary, and the sentiment dictionary is constructed based on the double-array dictionary tree.
  • When the processor 502 implements the step of extracting emotional feature information from the data to be analyzed by the AC automaton based on the double-array dictionary tree to obtain feature data, it specifically implements the following steps:
  • When the processor 502 implements the step of performing pattern matching with the AC automaton based on the double-array dictionary tree to obtain the output result, it specifically implements the following steps:
  • the polarity prediction of the feature data is performed by the public opinion polarity prediction model to obtain the prediction result.
  • the sentiment feature data set extracted by the sentiment dictionary is input into the XGBoost model to obtain the classification features.
  • When the processor 502 implements the step in which the public opinion polarity prediction model is obtained by inputting the emotional feature data set extracted with the emotional dictionary into the XGBoost model to obtain classification features, and then inputting the classification features into the logistic regression model for training, it specifically implements the following steps:
  • the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSPs), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • the computer program includes program instructions, and the computer program can be stored in a storage medium, which is a computer-readable storage medium.
  • the program instructions are executed by at least one processor in the computer system to implement the process steps of the foregoing method embodiments.
  • the storage medium may be a computer-readable storage medium.
  • the storage medium stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps:
  • the AC automata based on the double-array dictionary tree extracts emotional feature information from the data to be analyzed to obtain feature data;
  • the AC automaton based on the double-array dictionary tree is a multi-pattern matching algorithm that extracts emotional feature information from the data to be analyzed based on the sentiment dictionary, and the sentiment dictionary is constructed based on the double-array dictionary tree.
  • When the processor executes the computer program to implement the step of extracting emotional feature information from the data to be analyzed by the AC automaton based on the double-array dictionary tree to obtain feature data, the following steps are specifically implemented:
  • the atomic words with the shortest distance, together with their positions and attribute information, are added to the preset emotion feature data set to form feature data.
  • the polarity prediction of the feature data is performed by the public opinion polarity prediction model to obtain the prediction result.
  • the sentiment feature data set extracted by the sentiment dictionary is input into the XGBoost model to obtain the classification features.
  • When the processor executes the computer program to implement the step in which the public opinion polarity prediction model is obtained by inputting the emotional feature data set extracted with the emotional dictionary into the XGBoost model to obtain classification features, and then inputting the classification features into the logistic regression model for training, the following steps are specifically implemented:
  • the storage medium may be a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, an optical disk, or any other computer-readable storage medium that can store program code.
  • the disclosed device and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of each unit is only a logical function division, and there may be other division methods in actual implementation.
  • multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the steps in the method of the embodiment of the present application can be adjusted, merged, and deleted in order according to actual needs.
  • the units in the device in the embodiment of the present application may be combined, divided, and deleted according to actual needs.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium.
  • The technical solution of this application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a terminal, or a network device, etc.) to execute all or part of the steps of the method described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A public opinion polarity prediction method and apparatus, a computer device, and a storage medium. The method comprises: obtaining public opinion data (S110); performing, by an AC automaton based on a double-array trie tree, emotional feature information extraction on data to be analyzed to obtain feature data (S120); performing polarity prediction on the feature data by a public opinion polarity prediction model to obtain a prediction result (S130); and outputting the prediction result (S140). An emotional dictionary is constructed by means of the storage structure of a double-array trie tree, thereby reducing the number of disk IO reads/writes and the occupied physical storage space; emotional feature information extraction is performed on public opinion data in the emotional dictionary by an AC automaton based on a double-array trie tree, character comparison is converted into state transition, backtracking is not needed at all when data to be analyzed is scanned, and the problem of repeated backward scanning is avoided; polarity prediction is conducted on feature data by a public opinion polarity prediction model, and the efficiency and accuracy of public opinion polarity prediction analysis are effectively improved.

Description

Public opinion polarity prediction method, apparatus, computer device, and storage medium
This application is based on, and claims priority to, Chinese patent application No. 201910199451.5, filed on March 15, 2019, the entire content of which is incorporated herein by reference.
Technical field
This application relates to information processing methods, and more specifically to a public opinion polarity prediction method and apparatus, a computer device, and a storage medium.
Background
With the rapid development of applications such as WeChat and Weibo, more and more internet users express their opinions online. The merging of online information with social information has a growing impact on society and even bears on national information security and long-term stability. Because the volume of information on the Internet is enormous, manual methods cannot process massive public opinion data. Obtaining a comprehensive and complete picture of the overall public opinion situation requires sentiment polarity analysis technology that monitors and analyzes public opinion information automatically.
Existing public opinion analysis systems generally use keyword analysis, which is both inefficient and inaccurate. Pattern matching built on traditional Chinese word segmentation has to back up and rescan the text many times, so performance is low. Existing systems also compute sentiment polarity with fairly crude statistical methods, and accuracy suffers from limited feature information and the influence of context. In addition, the public opinion emotional dictionary occupies a large amount of storage, which costs performance.
Therefore, a new method is needed to solve the problems of slow Chinese word segmentation, low polarity prediction accuracy, and high performance overhead.
Summary of the application
The purpose of this application is to overcome the shortcomings of the prior art by providing a public opinion polarity prediction method and apparatus, a computer device, and a storage medium.
To achieve this purpose, this application adopts the following technical solution: a public opinion polarity prediction method, including:
obtaining public opinion data;
performing, by an AC automaton based on a double-array trie, emotional feature information extraction on the data to be analyzed to obtain feature data;
performing polarity prediction on the feature data by a public opinion polarity prediction model to obtain a prediction result;
outputting the prediction result.
In a further technical solution, the AC automaton based on a double-array trie is a multi-pattern matching algorithm that extracts emotional feature information from the data to be analyzed on the basis of an emotional dictionary, and the emotional dictionary is built on a double-array trie.
In a further technical solution, performing, by the AC automaton based on a double-array trie, emotional feature information extraction on the data to be analyzed to obtain feature data includes:
performing pattern matching on the data to be analyzed using the AC automaton based on a double-array trie to obtain an output result;
performing emotional feature information extraction on the output result to obtain feature data.
In a further technical solution, performing pattern matching using the AC automaton based on a double-array trie to obtain an output result includes:
splitting the data to be analyzed into a number of characters;
searching the emotional dictionary according to the character;
determining whether the character matches;
if it matches, outputting the matched character to a designated set to form the output result;
determining whether the current character is the last character;
if so, proceeding to the step of performing emotional feature information extraction on the output result to obtain feature data;
if not, obtaining the next character;
returning to the step of searching the emotional dictionary according to the character;
if it does not match, moving to the character pointed to by the failure function;
determining whether the character pointed to by the failure function is empty;
if not, outputting the character pointed to by the failure function to the designated set to form the output result;
returning to the step of determining whether the current character is the last character;
if so, proceeding to the end step.
In a further technical solution, performing emotional feature information extraction on the output result to obtain feature data includes:
dividing the output result into a number of atomic words;
creating an adjacency list for storing the array graph;
determining the position of each atomic word from its offset;
adding the atomic word at the corresponding position of the array in the adjacency list;
calculating, based on the Viterbi algorithm, the distance between the atomic words of two nodes in the array;
scoring the entire array graph stored in the adjacency list;
adding the atomic words with the shortest distance, together with their positions and attribute information, to a designated emotional feature data set to form the feature data.
In a further technical solution, in performing polarity prediction on the feature data by the public opinion polarity prediction model to obtain a prediction result, the public opinion polarity prediction model is obtained by inputting the emotional feature data set extracted with the emotional dictionary into an XGBoost model to obtain classification features, and then inputting the classification features into a logistic regression model for training.
In a further technical solution, obtaining the public opinion polarity prediction model by inputting the emotional feature data set extracted with the emotional dictionary into the XGBoost model to obtain classification features, and then inputting the classification features into the logistic regression model for training, includes:
constructing a decision tree from the emotional feature data set extracted with the emotional dictionary;
inputting the decision tree into the XGBoost model to obtain the residual between the output of the XGBoost model and the actual output of the emotional feature data set extracted with the emotional dictionary;
constructing a new decision tree from the residual;
iterating the decision tree with the new decision tree to obtain a combination of emotional feature information;
inputting the combination of emotional feature information into a logistic regression model and training the logistic regression model;
persisting the trained logistic regression model to obtain the public opinion polarity prediction model.
This application also provides a public opinion polarity prediction apparatus, including:
a public opinion data acquisition unit, configured to obtain public opinion data;
an extraction unit, configured to perform, by an AC automaton based on a double-array trie, emotional feature information extraction on the data to be analyzed to obtain feature data;
a prediction unit, configured to perform polarity prediction on the feature data by a public opinion polarity prediction model to obtain a prediction result;
an output unit, configured to output the prediction result.
This application also provides a computer device including a memory and a processor, where the memory stores a computer program and the processor implements the above method when executing the computer program.
This application also provides a storage medium storing a computer program that, when executed by a processor, implements the above method.
Compared with the prior art, this application has the following beneficial effects. The emotional dictionary is built on the storage structure of a double-array trie, which reduces the number of disk I/O reads and writes and the physical storage space occupied. The AC automaton based on a double-array trie extracts emotional feature information from the public opinion data against the emotional dictionary, turning character comparison into state transitions, so that scanning the data to be analyzed requires no backtracking at all and the problem of repeated backward scanning is avoided. The public opinion polarity prediction model performs polarity prediction on the feature data, effectively improving the efficiency and accuracy of public opinion polarity prediction analysis.
This application is further described below with reference to the drawings and specific embodiments.
Description of the drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of this application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application scenario of the public opinion polarity prediction method provided by an embodiment of this application;
FIG. 2 is a schematic flowchart of the public opinion polarity prediction method provided by an embodiment of this application;
FIG. 3 is a schematic diagram of a sub-process of the public opinion polarity prediction method provided by an embodiment of this application;
FIG. 4 is a schematic diagram of a sub-process of the public opinion polarity prediction method provided by an embodiment of this application;
FIG. 5 is a schematic diagram of a sub-process of the public opinion polarity prediction method provided by an embodiment of this application;
FIG. 6 is a schematic diagram of a sub-process of the public opinion polarity prediction method provided by an embodiment of this application;
FIG. 7 is a state transition diagram provided by an embodiment of this application;
FIG. 8 is a schematic diagram of the failure function provided by an embodiment of this application;
FIG. 9 is a schematic diagram of public opinion polarity prediction provided by an embodiment of this application;
FIG. 10 is a schematic block diagram of the public opinion polarity prediction apparatus provided by an embodiment of this application;
FIG. 11 is a schematic block diagram of the computer device provided by an embodiment of this application.
Detailed description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings of those embodiments. Obviously, the described embodiments are some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in this application, without creative effort, fall within the scope of protection of this application.
It should be understood that, when used in this specification and the appended claims, the terms "including" and "comprising" indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terms used in this specification serve only to describe specific embodiments and are not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
Please refer to FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram of an application scenario of the public opinion polarity prediction method provided by an embodiment of this application, and FIG. 2 is a schematic flowchart of that method. The method is applied in a server. The server takes the crawled content of the target public opinion website and applies preprocessing, analysis by the AC automaton based on a double-array trie, and prediction by the public opinion polarity prediction model to obtain a public opinion polarity result, which it outputs to a terminal for display.
As shown in FIG. 2, the method includes the following steps S110 to S140.
S110. Obtain public opinion data.
In this embodiment, public opinion data refers to data that represents the sentiment of commenters.
In an embodiment, step S110 may include the following steps:
crawling the content of the target public opinion website;
In this embodiment, the content of the target public opinion website refers to content originating from web pages and websites. Crawler technology is used to crawl this content.
performing preprocessing, web page analysis, and denoising on the content of the target public opinion website to obtain public opinion data.
In this embodiment, the content of the target public opinion website needs preliminary processing that removes unnecessary data and yields the public opinion data.
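The description does not say what the preprocessing and denoising consist of. As a minimal sketch, assuming that denoising means dropping markup, script and style content, and redundant whitespace from a crawled page, the step could look like the following. The names `TextExtractor` and `clean_page` are hypothetical and only the Python standard library is used.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores the content of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.parts.append(data)


def clean_page(html):
    """Return the visible text of a page with whitespace normalized (assumed denoising rule)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()
```

A production crawler would likely also filter navigation bars, advertisements, and duplicate posts, which this sketch does not attempt.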
S120. The AC automaton based on a double-array trie performs emotional feature information extraction on the data to be analyzed to obtain feature data.
In this embodiment, the AC automaton based on a double-array trie is a multi-pattern matching algorithm that extracts emotional feature information from the data to be analyzed on the basis of an emotional dictionary.
The emotional dictionary is built on a double-array trie.
In this embodiment, the emotional dictionary is the set of all emotionally colored words.
With the dictionary storage structure based on a double-array trie, the states of the words and the goto function are determined first, and then the failure function is computed; the output function is computed along the way, interleaved with these two steps. A double-array trie is a compressed trie that represents the entire tree with two one-dimensional arrays, BASE and CHECK.
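To make the BASE and CHECK arrays concrete, here is a minimal sketch of a double-array trie, not the applicant's implementation. It assumes the common transition rule `t = BASE[s] + code(c)` with `CHECK[t] == s`, a naive first-fit search for each BASE value, and a separate set of terminal slots (real implementations usually encode word ends inside the arrays). The function names and the sample words, an assumed segmentation of the example keywords used later in the description, are illustrative.

```python
from collections import deque


def build_double_array(words):
    """Build BASE/CHECK arrays from a word list using a naive first-fit placement."""
    # Character codes start at 1 so that no child ever lands on the root slot 0.
    codes = {ch: i + 1 for i, ch in enumerate(sorted({c for w in words for c in w}))}

    # Step 1: an ordinary trie, trie[node] = {char: child node}.
    trie, ends = [{}], set()
    for word in words:
        node = 0
        for ch in word:
            if ch not in trie[node]:
                trie.append({})
                trie[node][ch] = len(trie) - 1
            node = trie[node][ch]
        ends.add(node)

    # Step 2: place every node into the two arrays, breadth first.
    base, check, terminal = [0], [0], set()   # slot 0 is the root; -1 marks a free slot

    def ensure(size):
        while len(base) <= size:
            base.append(0)
            check.append(-1)

    queue = deque([(0, 0)])                   # (trie node, slot in the arrays)
    while queue:
        node, slot = queue.popleft()
        if node in ends:
            terminal.add(slot)
        children = trie[node]
        if not children:
            continue
        largest = max(codes[c] for c in children)
        b = 1
        while True:                           # first BASE value whose child slots are all free
            ensure(b + largest)
            if all(check[b + codes[c]] == -1 for c in children):
                break
            b += 1
        base[slot] = b
        for ch, child in children.items():
            target = b + codes[ch]
            check[target] = slot              # CHECK records the parent slot
            queue.append((child, target))
    return codes, base, check, terminal


def contains(word, codes, base, check, terminal):
    """Follow transitions with t = BASE[s] + code and accept only if CHECK[t] == s."""
    state = 0
    for ch in word:
        code = codes.get(ch)
        if code is None:
            return False
        target = base[state] + code
        if target >= len(check) or check[target] != state:
            return False
        state = target
    return state in terminal
```

Each transition costs one addition and one array comparison, and the whole dictionary lives in two flat integer arrays, which is the source of the compactness the description refers to.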
For example, consider building an emotional dictionary from the keywords {国人中国人国家队国人团队}. To build the goto function, a state transition diagram has to be constructed. Initially the diagram contains only the start state 0. Each keyword p is then entered into the diagram in turn by adding a path that begins at the start state: new vertices and edges are added so that there is a path that spells out the keyword p. To complete the goto function, a loop from state 0 back to state 0 is added for every character other than the starting characters (those that begin a keyword). The result is the state transition diagram shown in FIG. 7, and this diagram represents the goto function.
The failure function is built from the goto function. The failure values of all states of depth 1 are computed first, then those of all states of depth 2, and so on, until the failure values of all states other than state 0 have been computed; the failure function is not defined for state 0. For states i = 1, 2, 3, 4, 5, 6, 7, 8, 9, the corresponding failure values are 0, 0, 0, 1, 2, 0, 3, 0, 3, which gives the failure function shown in FIG. 8.
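The depth-by-depth construction of the failure function can be sketched as a breadth-first traversal over the goto (transition) function. This is a minimal illustration, not the applicant's code: it uses plain dictionaries instead of a double-array trie, and because the character boundaries of the Chinese keywords are not delimited in the text, it assumes the textbook keyword set {he, she, his, hers} as a stand-in. With states numbered in creation order, that set produces the same nine failure values as listed above.

```python
from collections import deque


def build_goto(keywords):
    """goto[state] = {char: next_state}; state 0 is the start state."""
    goto = [{}]
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                goto[state][ch] = len(goto) - 1   # new states are numbered in creation order
            state = goto[state][ch]
    return goto


def build_failure(goto):
    """Compute failure values level by level (depth 1 first, then depth 2, ...)."""
    fail = [0] * len(goto)
    queue = deque(goto[0].values())               # every depth-1 state fails to state 0
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            # Walk the failure chain until a state with a transition on ch is found.
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
    return fail
```

Calling `build_failure(build_goto(["he", "she", "his", "hers"]))[1:]` returns the failure values for states 1 through 9.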
In addition, the emotional dictionary has to be loaded into memory the first time the AC automaton runs. The model object for the AC automaton's emotional dictionary is designed with the singleton design pattern: the persisted model is loaded into memory on the first run, so later calls do not need to compile or load it again. Compiling and loading once and running many times takes full advantage of fast memory access and improves the efficiency of emotional feature information extraction. The double-array trie compresses the stored dictionary, and this compression reduces both the number of disk I/O reads and writes and the storage space occupied, improving the efficiency of memory access.
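The load-once behaviour can be sketched with a lazily initialized singleton. This is a minimal sketch under assumptions: the class name `EmotionDictionary`, the `get` method, and the loader `load_from_disk` are hypothetical, and the loader returns two sample words instead of reading a persisted model file.

```python
import threading


class EmotionDictionary:
    """Process-wide holder for the loaded dictionary (singleton)."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self, words):
        self.words = frozenset(words)

    @classmethod
    def get(cls, loader):
        # Double-checked locking: the loader runs only on the first call.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls(loader())
        return cls._instance


load_calls = []


def load_from_disk():
    # Hypothetical loader; a real one would deserialize the persisted dictionary.
    load_calls.append(1)
    return ["高兴", "失望"]
```

Every later call to `EmotionDictionary.get(load_from_disk)` returns the same in-memory object without touching the loader again.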
Feature data refers to data carrying emotional feature information, that is, words that represent the commenter's sentiment.
In an embodiment, referring to FIG. 3, step S120 may include steps S121 to S122.
S121. Perform pattern matching on the data to be analyzed using the AC automaton based on a double-array trie to obtain an output result.
The output result is the set of words that match emotional words.
In an embodiment, referring to FIG. 4, step S121 may include steps S121a to S121i.
S121a. Split the data to be analyzed into a number of characters.
S121b. Search the emotional dictionary according to the character.
The character is looked up in the emotional dictionary. Because the emotional dictionary is built from the goto function and the failure function, the AC automaton turns character comparison into state transitions when it extracts emotional feature information, and matches characters against the dictionary in that way. Scanning the data to be analyzed therefore requires no backtracking at all, which avoids repeated backward scanning.
S121c. Determine whether the character matches.
S121d. If it matches, output the matched character to the designated set to form the output result.
When a character matches and the output function of the emotional dictionary is not empty, the AC automaton emits the matched pattern: the matched characters are output to the designated set to form the output result.
S121e. Determine whether the current character is the last character.
If so, go to step S122.
S121f. If not, obtain the next character.
Return to step S121b.
S121g. If it does not match, move to the character pointed to by the failure function.
When the current character does not match, the current character has failed, and the AC automaton moves to the character pointed to by the failure function.
S121h. Determine whether the character pointed to by the failure function is empty.
S121i. If not, output the character pointed to by the failure function to the designated set to form the output result.
When the character pointed to by the failure function is not empty, that character is output to the designated set to form the output result.
Return to step S121e.
If so, go to the end step.
The above steps are repeated until every character in the data to be analyzed has been matched, which yields the complete output result.
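The matching loop above can be sketched as a single left-to-right scan. This is a minimal sketch, not the applicant's implementation: it stores transitions in dictionaries rather than a double-array trie, and the class name `ACAutomaton`, the `search` method, and the sample keywords are assumptions. The list returned by `search` plays the role of the designated set.

```python
from collections import deque


class ACAutomaton:
    """Aho-Corasick automaton with goto, failure, and output functions."""

    def __init__(self, keywords):
        self.goto, self.fail, self.out = [{}], [0], [[]]
        for word in keywords:
            state = 0
            for ch in word:
                if ch not in self.goto[state]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[state][ch] = len(self.goto) - 1
                state = self.goto[state][ch]
            self.out[state].append(word)
        # Failure links are filled breadth first; outputs are merged along them.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def search(self, text):
        """Scan text once and return (start_offset, keyword) for every match."""
        state, hits = 0, []
        for i, ch in enumerate(text):
            # On a mismatch, follow failure links instead of rescanning the text.
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for word in self.out[state]:
                hits.append((i - len(word) + 1, word))
        return hits
```

The input index `i` only ever moves forward, which is the "no backtracking" property the description relies on; overlapping keywords such as 不高兴 and 高兴 are both reported.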
S122. Perform emotional feature information extraction on the output result to obtain feature data.
The emotional dictionary supplies prior knowledge about a word's sentiment, indicating information such as its sentiment polarity and intensity in most contexts. Emotional feature information is extracted on the basis of the emotional dictionary: the valuable sentiment information in the public opinion text is extracted, and irregular, unstructured text is converted into structured feature information that a computer can understand and recognize. The resulting emotional feature information, that is, the feature data, is represented in the format {emotional word, part of speech, position in sentence, sentiment orientation, sentiment intensity}.
In an embodiment, referring to FIG. 5, step S122 may include steps S1221 to S1227.
S1221. Divide the output result into a number of atomic words.
An atomic word is a word of the smallest unit. The AC automaton is used to split a sentence into all possible atomic words.
S1222. Create an adjacency list for storing the array graph.
An adjacency list is used to store the graph.
S1223. Determine the position of each atomic word from its offset.
S1224. Add the atomic word at the corresponding position of the array in the adjacency list.
The offset of each atomic word (term) determines its position, and the term is added to the adjacency list array at terms[offset].
S1225. Calculate, based on the Viterbi algorithm, the distance between the word frequencies of the atomic words of two nodes in the array.
S1226. Score the entire array graph stored in the adjacency list.
The distance between the atomic words (terms) of two nodes is calculated with the Viterbi algorithm. Each node is assigned a distance that represents the length of the cumulative shortest path from the root node to that node. The whole graph is then scored by a depth-first traversal, where each score only needs to add the distance from the root node to the current node.
S1227. Add the atomic words with the shortest distance, together with their positions and attribute information, to the designated emotional feature data set to form the feature data.
The emotional words on the shortest path, together with information such as their positions and attributes, are added to the emotional feature data set. In this embodiment, attribute information refers to information such as part of speech, position in sentence, sentiment orientation, and sentiment intensity.
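Steps S1221 to S1227 can be illustrated with a small word lattice and a shortest-path search. The sketch below rests on several assumptions that are not in the source: candidate words are found by substring lookup in a frequency table instead of by the AC automaton, the edge cost of a word is its smoothed negative log frequency, and the path is found with a forward dynamic-programming pass (Viterbi style) in place of the depth-first scoring the description mentions. The function name `best_path` and the frequencies are hypothetical.

```python
import math


def best_path(text, freq):
    """Return the lowest-cost segmentation of text as (offset, word) pairs."""
    n = len(text)
    total = sum(freq.values())

    # Adjacency list: terms[offset] holds every candidate word starting at offset.
    terms = [[] for _ in range(n)]
    for start in range(n):
        for end in range(start + 1, n + 1):
            word = text[start:end]
            if word in freq:
                terms[start].append(word)
        if not terms[start]:
            terms[start].append(text[start])   # unknown single character as fallback

    def cost(word):
        # Assumed distance: frequent words are "shorter" edges.
        return -math.log((freq.get(word, 0) + 1) / (total + 1))

    # dist[i] is the cumulative shortest distance from the root (offset 0) to offset i.
    dist = [math.inf] * (n + 1)
    prev = [None] * (n + 1)
    dist[0] = 0.0
    for start in range(n):
        if dist[start] == math.inf:
            continue
        for word in terms[start]:
            end = start + len(word)
            candidate = dist[start] + cost(word)
            if candidate < dist[end]:
                dist[end] = candidate
                prev[end] = (start, word)

    # Walk back from the end to recover the words on the shortest path.
    path, pos = [], n
    while pos > 0:
        start, word = prev[pos]
        path.append((start, word))
        pos = start
    return path[::-1]
```

The words on the returned path, with their offsets, are what would then be enriched with attribute information and added to the feature data set.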
S130. Perform polarity prediction on the feature data by the public opinion polarity prediction model to obtain a prediction result.
In this embodiment, the prediction result is the polarity value of the public opinion data. The public opinion polarity prediction model is obtained by inputting the emotional feature data set extracted with the emotional dictionary into an XGBoost model to obtain classification features, and then inputting the classification features into a logistic regression model for training.
The XGBoost model is used to construct new features from the input feature data. The new feature vector takes values 0/1, and each element corresponds to a leaf node of a tree in the XGBoost model. When a sample passes through a tree and ends up in one of its leaf nodes, the element for that leaf is 1 in the new feature vector and the elements for the other leaves of that tree are 0. The length of the new feature vector equals the total number of leaf nodes over all trees in the XGBoost model. Finally, these new features are added to the original features and the model is trained on both to obtain the public opinion polarity prediction model. The output of each individual tree is treated as a categorical input feature to a sparse linear classifier. As shown in FIG. 9, there are two trees: the upper tree has two leaf nodes and the lower tree has three, so the final feature is a five-dimensional vector. For an input x that falls in the second leaf of the upper tree, the encoding is [0,1]; if it falls in the first leaf of the lower tree, the encoding is [1,0,0]. The final encoding is therefore [0,1,1,0,0], which is fed into the logistic regression model as the input feature for prediction.
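The leaf encoding in the FIG. 9 example can be reproduced with a few lines of code. This is a minimal sketch: the function name `encode_leaves` and its arguments are hypothetical, and the leaf indices are given directly instead of being read from a trained XGBoost model.

```python
def encode_leaves(leaf_indices, leaves_per_tree):
    """Concatenate one one-hot block per tree.

    leaf_indices[k] is the zero-based leaf that the sample reaches in tree k,
    and leaves_per_tree[k] is the number of leaves of tree k.
    """
    if len(leaf_indices) != len(leaves_per_tree):
        raise ValueError("one leaf index is required per tree")
    vector = []
    for leaf, count in zip(leaf_indices, leaves_per_tree):
        one_hot = [0] * count
        one_hot[leaf] = 1
        vector.extend(one_hot)
    return vector
```

For the example above, the sample reaches the second leaf of a two-leaf tree and the first leaf of a three-leaf tree, so `encode_leaves([1, 0], [2, 3])` gives `[0, 1, 1, 0, 0]`.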
In an embodiment, referring to FIG. 6, obtaining the public opinion polarity prediction model by inputting the emotional feature data set extracted with the emotional dictionary into the XGBoost model to obtain classification features, and then inputting the classification features into the logistic regression model for training, includes steps S131 to S136.
S131. Construct a decision tree from the emotional feature data set extracted with the emotional dictionary.
S132. Input the decision tree into the XGBoost model to obtain the residual between the output of the XGBoost model and the actual output of the emotional feature data set extracted with the emotional dictionary.
S133. Construct a new decision tree from the residual.
S134. Iterate the decision tree with the new decision tree to obtain a combination of emotional feature information.
The XGBoost (eXtreme Gradient Boosting) model is a tool for large-scale parallel boosted trees, described here as currently the fastest and best open-source boosted-tree toolkit. An XGBoost model is an ensemble of many CART regression trees.
A further decision tree is constructed on the residual between the existing model and the actual sample output, and this is iterated repeatedly. Each iteration yields a classification feature with a large gain, and the multiple trees provide multiple discriminative combinations of emotional feature information.
S135. Input the combination of emotional feature information into a logistic regression model and train the logistic regression model.
S136. Persist the trained logistic regression model to obtain the public opinion polarity prediction model.
The combination of emotional feature information is used as the input of the logistic regression model; the logistic regression model is trained and then persisted.
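The idea in S132 to S134 of repeatedly fitting a new tree to the residual of the current model can be shown with a deliberately small example. This is not XGBoost: it is a sketch of plain residual boosting with squared loss, one-dimensional inputs, and depth-1 regression trees (stumps). The function names, the learning rate, and the number of rounds are assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """Find the threshold split that best fits the residuals in the least-squares sense."""
    best = None
    for thr in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, thr, left_mean, right_mean)
    if best is None:
        raise ValueError("at least two distinct x values are required")
    return best[1:]


def boost(xs, ys, rounds=5, learning_rate=0.5):
    """Each round fits a new stump to the residual of the current ensemble."""
    preds = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((thr, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= thr else right_mean)
                 for x, p in zip(xs, preds)]
    return stumps, preds
```

In the pipeline described here, the leaves that a sample reaches in each fitted tree would then be encoded as features for the logistic regression model, which this sketch does not include.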
XGBoost is an efficient implementation of the GBDT algorithm and supports parallel processing. Its base learner is a CART regression tree, and its regularization term depends on the number of leaf nodes of the tree and on the values of the leaf nodes. XGBoost approximates the objective function with a Taylor expansion when computing the pseudo-residuals used to learn the function F_M(x), using the second derivative as well as the first. A regularization term is also added to the model cost function to control the complexity of the model, so that the learned model is simpler.
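For reference, the second-order approximation and the regularization term described above are commonly written as follows in the XGBoost literature. The symbols g_i, h_i, T, w_j, γ and λ follow that convention and are not defined in the application itself: l is the loss function, f_t is the tree added at iteration t, T is its number of leaves, and w_j is the value of leaf j.

```latex
\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \right] + \Omega(f_t),
\qquad
g_i = \frac{\partial\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^2 l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2}

\Omega(f) = \gamma\, T + \tfrac{1}{2}\, \lambda \sum_{j=1}^{T} w_j^2
```

The first-order terms g_i correspond to the pseudo-residuals of ordinary gradient boosting, while the h_i terms carry the second-derivative information; Ω penalizes both the number of leaves and the magnitude of the leaf values.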
The public opinion polarity prediction model is used to predict the polarity of online public opinion text, and the final classification result is evaluated with the F-Score, defined as follows:
F-Score = (2 × Precision × Recall) / (Precision + Recall), where Precision is the precision and Recall is the recall.
Precision = number of instances of a class that are correctly classified / total number of instances that the public opinion polarity prediction model predicts as that class.
Recall = number of instances of a class that are correctly classified / total number of instances of that class in the test data.
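The three definitions above can be computed per class as in the following minimal Python sketch. The function name `precision_recall_f` and the example labels are illustrative assumptions; returning 0.0 when a denominator is zero is also a convention chosen here, not something stated in the application.

```python
def precision_recall_f(y_true, y_pred, label):
    """Per-class precision, recall and F-Score as defined above."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # predicted as this class
    actual = sum(1 for t in y_true if t == label)     # truly of this class
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical test labels and predictions.
y_true = ["pos", "pos", "neg", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "pos", "pos"]
print(precision_recall_f(y_true, y_pred, "pos"))
```

In this example two of the three instances predicted as "pos" are correct and two of the three true "pos" instances are found, so precision, recall and F-Score are all 2/3.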
S140. Output the prediction result.
The prediction result is output as a JSON-formatted string. An example of the output format is: {"sentiTrend":"正面","sentineg":0.278,"sentipos":0.722}, where "正面" means "positive".
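A string in this format could be produced as in the following sketch, which uses Python's standard `json` module. The field names are taken from the example above; the function name `format_prediction`, the negative label "负面" ("negative") and the rule of choosing the class with the larger probability are assumptions made for illustration.

```python
import json


def format_prediction(p_positive: float) -> str:
    """Serialise a positive-class probability in the example output format."""
    sentipos = round(p_positive, 3)
    sentineg = round(1.0 - sentipos, 3)
    # Assumed rule: the trend is the class with the larger probability.
    trend = "正面" if sentipos >= sentineg else "负面"
    result = {"sentiTrend": trend, "sentineg": sentineg, "sentipos": sentipos}
    # ensure_ascii=False keeps the Chinese label readable in the output string.
    return json.dumps(result, ensure_ascii=False, separators=(",", ":"))


print(format_prediction(0.722))
```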
A test was run on 200,000 microblog (Weibo) posts collected by a crawler. Table 1 compares the feature data extraction speed of different matching algorithms, and Table 2 compares the accuracy of different public opinion polarity prediction algorithms.
Table 1. Comparison of feature data extraction speed
Algorithm                     Dictionary size    Extraction speed
IK segmenter                  350,000            800,000/s
Ansj segmenter                350,000            2,100,000/s
Fnlp segmenter                350,000            1,200,000/s
Double-array AC automaton     350,000            16,000,000/s
Table 2. Comparison of accuracy
Prediction algorithm          Accuracy    F1
Keyword statistics method     0.703       0.633
Logistic regression           0.718       0.646
GBDT+LR                       0.803       0.725
XGBoost+LR                    0.812       0.736
The public opinion polarity prediction method described above builds the sentiment dictionary on the storage structure of a double-array trie, which reduces the number of disk I/O reads and writes and the physical storage space occupied. An Aho-Corasick (AC) automaton based on the double-array trie extracts sentiment feature information from the public opinion data against the sentiment dictionary, turning character comparisons into state transitions. The data to be analyzed is scanned without any backtracking, which avoids repeated rescans. The public opinion polarity prediction model then performs polarity prediction on the feature data, which effectively improves the efficiency and accuracy of public opinion polarity prediction and analysis.
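The way an AC automaton replaces character comparison with state transitions can be illustrated with the following minimal Python sketch. It stores transitions in ordinary dictionaries rather than in a double-array trie, so it shows only the matching logic and not the compact storage described above. The class name `AhoCorasick` and the three-word dictionary are hypothetical.

```python
from collections import deque


class AhoCorasick:
    """Multi-pattern matcher with goto, failure and output tables."""

    def __init__(self, words):
        self.goto = [{}]   # goto[state][char] -> next state
        self.fail = [0]    # failure link of each state
        self.out = [[]]    # dictionary words recognised at each state
        for word in words:
            self._add(word)
        self._build_failure_links()

    def _add(self, word):
        state = 0
        for ch in word:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(word)

    def _build_failure_links(self):
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit the words that end at the failure target.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]
                queue.append(nxt)

    def search(self, text):
        """Scan the text once, left to right, and return (offset, word) hits."""
        state, hits = 0, []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]  # follow failure links, never re-read text
            state = self.goto[state].get(ch, 0)
            for word in self.out[state]:
                hits.append((i - len(word) + 1, word))
        return hits


matcher = AhoCorasick(["happy", "unhappy", "sad"])
print(matcher.search("unhappy and sad"))
```

Each input character is consumed exactly once; on a mismatch the automaton follows failure links between states instead of moving backwards in the text, which is the no-backtracking property referred to above.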
FIG. 10 is a schematic block diagram of a public opinion polarity prediction apparatus provided by an embodiment of the present application. As shown in FIG. 10, corresponding to the public opinion polarity prediction method above, the present application further provides a public opinion polarity prediction apparatus. The apparatus includes units for executing the public opinion polarity prediction method described above and may be configured in a server.
Specifically, referring to FIG. 10, the public opinion polarity prediction apparatus 300 includes:
a public opinion data acquisition unit 301, configured to acquire public opinion data;
an extraction unit 302, configured to extract sentiment feature information from the data to be analyzed by means of the AC automaton based on the double-array trie, so as to obtain feature data;
a prediction unit 303, configured to perform polarity prediction on the feature data by means of the public opinion polarity prediction model, so as to obtain a prediction result; and
an output unit 304, configured to output the prediction result.
In an embodiment, the extraction unit 302 includes:
a matching subunit, configured to perform pattern matching on the data to be analyzed using the AC automaton based on the double-array trie, so as to obtain an output result; and
a feature data forming subunit, configured to extract sentiment feature information from the output result, so as to obtain the feature data.
In an embodiment, the matching subunit includes:
a splitting module, configured to split the data to be analyzed into a number of characters;
a search module, configured to search the sentiment dictionary according to a character;
a character judgment module, configured to judge whether the character matches;
a first output module, configured to, if the character matches, output the matched character to a designated set to form the output result;
a last-character judgment module, configured to judge whether the current character is the last character and, if so, to proceed to extracting sentiment feature information from the output result to obtain the feature data;
a character acquisition module, configured to, if the current character is not the last character, acquire the next character and return to searching the sentiment dictionary according to the character;
a redirection module, configured to, if the character does not match, move to the character pointed to by the failure function;
a pointer judgment module, configured to judge whether the character pointed to by the failure function is empty and, if so, to proceed to the end step; and
a second output module, configured to, if the character pointed to by the failure function is not empty, output that character to the designated set to form the output result, and return to judging whether the current character is the last character.
In an embodiment, the feature data forming subunit includes:
a division module, configured to divide the output result into a number of atomic words;
an adjacency list building module, configured to build an adjacency list for storing the array graph;
a position determination module, configured to determine the position of each atomic word from its offset;
an adding module, configured to add each atomic word at the corresponding position of the array in the adjacency list;
a distance calculation module, configured to calculate, based on the Viterbi algorithm, the distance between the atomic words of two nodes in the array;
a scoring module, configured to score the whole array graph stored in the adjacency list; and
an integration module, configured to add the atomic words with the shortest distance, together with their positions and attribute information, to a designated sentiment feature data set to form the feature data.
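One possible reading of the modules above is a shortest-path search over a word graph indexed by character offset. The following Python sketch illustrates that reading only: the function name `best_path`, the per-word costs and the example candidates are assumptions, and the application does not specify how distances or scores are actually computed.

```python
def best_path(text_len, candidates):
    """Lowest-cost sequence of atomic words covering offsets 0..text_len.

    candidates: iterable of (offset, word, cost) triples (hypothetical format).
    Returns (path, total_cost), where path is a list of (offset, word).
    """
    inf = float("inf")
    # Adjacency list indexed by start offset, as in the array graph above.
    starts = [[] for _ in range(text_len + 1)]
    for offset, word, cost in candidates:
        starts[offset].append((word, cost))

    best = [inf] * (text_len + 1)   # best[j]: lowest cost to reach offset j
    back = [None] * (text_len + 1)  # back[j]: (previous offset, word used)
    best[0] = 0.0
    for i in range(text_len):       # Viterbi-style forward pass over offsets
        if best[i] == inf:
            continue
        for word, cost in starts[i]:
            j = i + len(word)
            if j <= text_len and best[i] + cost < best[j]:
                best[j] = best[i] + cost
                back[j] = (i, word)

    path, j = [], text_len
    while j > 0 and back[j] is not None:  # trace the best path backwards
        i, word = back[j]
        path.append((i, word))
        j = i
    return list(reversed(path)), best[text_len]


# Hypothetical candidates for the text "notgood" (7 characters).
candidates = [(0, "not", 1.0), (3, "good", 1.0), (0, "notgood", 1.5), (0, "no", 1.0)]
print(best_path(7, candidates))
```

With these example costs the single word "notgood" (cost 1.5) is preferred over "not" followed by "good" (cost 2.0), so a negated expression is kept as one sentiment feature instead of being split.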
In an embodiment, the apparatus further includes:
a model training unit, configured to input the sentiment feature data set extracted with the sentiment dictionary into the XGBoost model to obtain classification features, and then input the classification features into the logistic regression model for training, so as to obtain the public opinion polarity prediction model.
In an embodiment, the model training unit includes:
a first construction subunit, configured to construct a decision tree from the sentiment feature data set extracted with the sentiment dictionary;
a first input subunit, configured to input the decision tree into the XGBoost model to obtain the residual between the output of the XGBoost model and the actual output of the sentiment feature data set extracted with the sentiment dictionary;
a second construction subunit, configured to construct a new decision tree from the residual;
an iteration subunit, configured to iterate on the decision tree using the new decision tree to obtain combinations of sentiment feature information;
a combination input subunit, configured to input the combinations of sentiment feature information into the logistic regression model and train the logistic regression model; and
a processing subunit, configured to persist the trained logistic regression model to obtain the public opinion polarity prediction model.
It should be noted that, as those skilled in the art will clearly understand, the specific implementation of the public opinion polarity prediction apparatus and of each of its units can be found in the corresponding description of the foregoing method embodiments and, for convenience and brevity, is not repeated here.
The public opinion polarity prediction apparatus may be implemented in the form of a computer program, and the computer program can run on a computer device as shown in FIG. 11.
Referring to FIG. 11, FIG. 11 is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device 500 is a server.
Referring to FIG. 11, the computer device 500 includes a processor 502, a memory and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. The computer program 5032 includes program instructions which, when executed, cause the processor 502 to execute a public opinion polarity prediction method.
The processor 502 provides computing and control capabilities to support the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503. When the computer program 5032 is executed by the processor 502, it causes the processor 502 to execute a public opinion polarity prediction method.
The network interface 505 is used for network communication with other devices. Those skilled in the art will understand that the structure shown in FIG. 11 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device 500 to which the solution is applied. A specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps:
acquiring public opinion data;
extracting sentiment feature information from the data to be analyzed by means of the AC automaton based on the double-array trie, so as to obtain feature data;
performing polarity prediction on the feature data by means of the public opinion polarity prediction model, so as to obtain a prediction result; and
outputting the prediction result.
Here, the AC automaton based on the double-array trie is a multi-pattern matching algorithm that extracts sentiment feature information from the data to be analyzed on the basis of the sentiment dictionary, and the sentiment dictionary is built on a double-array trie.
In an embodiment, when implementing the step of extracting sentiment feature information from the data to be analyzed by means of the AC automaton based on the double-array trie to obtain feature data, the processor 502 specifically implements the following steps:
performing pattern matching on the data to be analyzed using the AC automaton based on the double-array trie, so as to obtain an output result; and
extracting sentiment feature information from the output result, so as to obtain the feature data.
In an embodiment, when implementing the step of performing pattern matching on the data to be analyzed using the AC automaton based on the double-array trie to obtain an output result, the processor 502 specifically implements the following steps:
splitting the data to be analyzed into a number of characters;
searching the sentiment dictionary according to a character;
judging whether the character matches;
if the character matches, outputting the matched character to a designated set to form the output result;
judging whether the current character is the last character;
if it is the last character, proceeding to the step of extracting sentiment feature information from the output result to obtain the feature data;
if it is not the last character, acquiring the next character;
returning to the step of searching the sentiment dictionary according to the character;
if the character does not match, moving to the character pointed to by the failure function;
judging whether the character pointed to by the failure function is empty;
if it is not empty, outputting the character pointed to by the failure function to the designated set to form the output result;
returning to the step of judging whether the current character is the last character; and
if it is empty, proceeding to the end step.
In the step of performing polarity prediction on the feature data by means of the public opinion polarity prediction model to obtain a prediction result, the public opinion polarity prediction model is a model obtained by inputting the sentiment feature data set extracted with the sentiment dictionary into the XGBoost model to obtain classification features, and then inputting the classification features into the logistic regression model for training.
In an embodiment, when implementing the step in which the public opinion polarity prediction model is obtained by inputting the sentiment feature data set extracted with the sentiment dictionary into the XGBoost model to obtain classification features and then inputting the classification features into the logistic regression model for training, the processor 502 specifically implements the following steps:
constructing a decision tree from the sentiment feature data set extracted with the sentiment dictionary;
inputting the decision tree into the XGBoost model to obtain the residual between the output of the XGBoost model and the actual output of the sentiment feature data set extracted with the sentiment dictionary;
constructing a new decision tree from the residual;
iterating on the decision tree using the new decision tree to obtain combinations of sentiment feature information;
inputting the combinations of sentiment feature information into the logistic regression model and training the logistic regression model; and
persisting the trained logistic regression model to obtain the public opinion polarity prediction model.
It should be understood that, in the embodiments of the present application, the processor 502 may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the foregoing embodiments can be completed by a computer program instructing the relevant hardware. The computer program includes program instructions and can be stored in a storage medium, the storage medium being a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the process steps of the foregoing method embodiments.
Accordingly, the present application further provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program which, when executed by a processor, causes the processor to execute the following steps:
acquiring public opinion data;
extracting sentiment feature information from the data to be analyzed by means of the AC automaton based on the double-array trie, so as to obtain feature data;
performing polarity prediction on the feature data by means of the public opinion polarity prediction model, so as to obtain a prediction result; and
outputting the prediction result.
Here, the AC automaton based on the double-array trie is a multi-pattern matching algorithm that extracts sentiment feature information from the data to be analyzed on the basis of the sentiment dictionary, and the sentiment dictionary is built on a double-array trie.
In an embodiment, when executing the computer program to implement the step of extracting sentiment feature information from the data to be analyzed by means of the AC automaton based on the double-array trie to obtain feature data, the processor specifically implements the following steps:
performing pattern matching on the data to be analyzed using the AC automaton based on the double-array trie, so as to obtain an output result; and
extracting sentiment feature information from the output result, so as to obtain the feature data.
In an embodiment, when executing the computer program to implement the step of performing pattern matching on the data to be analyzed using the AC automaton based on the double-array trie to obtain an output result, the processor specifically implements the following steps:
splitting the data to be analyzed into a number of characters;
searching the sentiment dictionary according to a character;
judging whether the character matches;
if the character matches, outputting the matched character to a designated set to form the output result;
judging whether the current character is the last character;
if it is the last character, proceeding to the step of extracting sentiment feature information from the output result to obtain the feature data;
if it is not the last character, acquiring the next character;
returning to the step of searching the sentiment dictionary according to the character;
if the character does not match, moving to the character pointed to by the failure function;
judging whether the character pointed to by the failure function is empty;
if it is not empty, outputting the character pointed to by the failure function to the designated set to form the output result;
returning to the step of judging whether the current character is the last character; and
if it is empty, proceeding to the end step.
In an embodiment, when executing the computer program to implement the step of extracting sentiment feature information from the output result to obtain feature data, the processor specifically implements the following steps:
dividing the output result into a number of atomic words;
building an adjacency list for storing the array graph;
determining the position of each atomic word from its offset;
adding each atomic word at the corresponding position of the array in the adjacency list;
calculating, based on the Viterbi algorithm, the distance between the atomic words of two nodes in the array;
scoring the whole array graph stored in the adjacency list; and
adding the atomic words with the shortest distance, together with their positions and attribute information, to a designated sentiment feature data set to form the feature data.
In the step of performing polarity prediction on the feature data by means of the public opinion polarity prediction model to obtain a prediction result, the public opinion polarity prediction model is a model obtained by inputting the sentiment feature data set extracted with the sentiment dictionary into the XGBoost model to obtain classification features, and then inputting the classification features into the logistic regression model for training.
In an embodiment, when executing the computer program to implement the step in which the public opinion polarity prediction model is obtained by inputting the sentiment feature data set extracted with the sentiment dictionary into the XGBoost model to obtain classification features and then inputting the classification features into the logistic regression model for training, the processor specifically implements the following steps:
constructing a decision tree from the sentiment feature data set extracted with the sentiment dictionary;
inputting the decision tree into the XGBoost model to obtain the residual between the output of the XGBoost model and the actual output of the sentiment feature data set extracted with the sentiment dictionary;
constructing a new decision tree from the residual;
iterating on the decision tree using the new decision tree to obtain combinations of sentiment feature information;
inputting the combinations of sentiment feature information into the logistic regression model and training the logistic regression model; and
persisting the trained logistic regression model to obtain the public opinion polarity prediction model.
The storage medium may be any computer-readable storage medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk or an optical disc.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed.
The steps in the methods of the embodiments of the present application may be reordered, merged and deleted according to actual needs. The units in the apparatus of the embodiments of the present application may be merged, divided and deleted according to actual needs. In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a terminal, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application.
The above are only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and such modifications or replacements shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims (10)

  1. A public opinion polarity prediction method, characterized by comprising:
    acquiring public opinion data;
    extracting sentiment feature information from the data to be analyzed by means of an AC automaton based on a double-array trie, so as to obtain feature data;
    performing polarity prediction on the feature data by means of a public opinion polarity prediction model, so as to obtain a prediction result; and
    outputting the prediction result.
  2. The public opinion polarity prediction method according to claim 1, characterized in that the AC automaton based on the double-array trie is a multi-pattern matching algorithm that extracts sentiment feature information from the data to be analyzed on the basis of a sentiment dictionary, and the sentiment dictionary is built on a double-array trie.
  3. The public opinion polarity prediction method according to claim 2, characterized in that extracting sentiment feature information from the data to be analyzed by means of the AC automaton based on the double-array trie, so as to obtain feature data, comprises:
    performing pattern matching on the data to be analyzed using the AC automaton based on the double-array trie, so as to obtain an output result; and
    extracting sentiment feature information from the output result, so as to obtain the feature data.
  4. The public opinion polarity prediction method according to claim 3, wherein performing pattern matching using the AC automaton based on the double-array dictionary tree to obtain the output result comprises:
    splitting the data to be analyzed into a plurality of characters;
    searching the emotional dictionary according to the character;
    determining whether the character matches;
    if it matches, outputting the matched character to a preset collection to form the output result;
    determining whether the current character is the last character;
    if so, proceeding to the step of extracting emotional feature information from the output result to obtain the feature data;
    if not, obtaining the next character;
    returning to the step of searching the emotional dictionary according to the character;
    if it does not match, moving to the character pointed to by the failure function;
    determining whether the character pointed to by the failure function is empty;
    if not, outputting the character pointed to by the failure function to the preset collection to form the output result;
    returning to the step of determining whether the current character is the last character;
    if so, proceeding to an end step.
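By way of illustration only, the matching flow of claim 4 can be sketched in Python as a conventional Aho-Corasick automaton. This is a minimal sketch under stated assumptions, not the claimed implementation: a plain dictionary-based trie stands in for the double-array dictionary tree, and the names `build_automaton` and `match` as well as the sample dictionary are hypothetical.

```python
from collections import deque


def build_automaton(words):
    """Build goto, failure and output tables for a list of dictionary words."""
    # goto[s] maps a character to the next state; state 0 is the root.
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass that fills in the failure function.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Words ending at the failure target also end here.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def match(text, automaton):
    """Scan the text character by character and collect (offset, word) hits."""
    goto, fail, out = automaton
    state, hits = 0, []
    for end, ch in enumerate(text):
        # On a mismatch, follow the failure function until a transition exists.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((end - len(word) + 1, word))
    return hits


# Hypothetical emotional dictionary and input text.
automaton = build_automaton(["happy", "unhappy", "sad"])
print(match("unhappy but not sad", automaton))
# -> [(0, 'unhappy'), (2, 'happy'), (16, 'sad')]
```

The failure function lets the scan continue without backtracking in the text, which is why overlapping entries such as "unhappy" and "happy" are both reported in a single pass.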
  5. The public opinion polarity prediction method according to claim 4, wherein extracting emotional feature information from the output result to obtain the feature data comprises:
    dividing the output result into a plurality of atomic words;
    establishing an adjacency list for storing an array graph;
    determining the position of each atomic word using the offset of the atomic word;
    adding the atomic word to the corresponding position of the array in the adjacency list;
    calculating, based on the Viterbi algorithm, the distance between the atomic words of two nodes in the array;
    scoring the entire array graph stored in the adjacency list;
    adding the atomic words with the shortest distance, together with their positions and attribute information, to a preset emotional feature data collection to form the feature data.
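As a non-limiting illustration of claim 5, the sketch below stores candidate words in an adjacency list indexed by character offset and runs a Viterbi-style dynamic program to find the lowest-cost path through the resulting graph. The cost table, the `default_cost` for unknown single characters, and the function name `shortest_segmentation` are assumptions made for the example; the claim does not specify how distances are assigned.

```python
import math


def shortest_segmentation(text, word_cost, default_cost=10.0):
    """Return the cheapest sequence of (offset, word) pairs covering the text."""
    n = len(text)
    # Adjacency list indexed by start offset: starts[i] holds every
    # candidate word that begins at character offset i.
    starts = [[] for _ in range(n)]
    for i in range(n):
        starts[i].append(text[i])  # single-character fallback atom
        for j in range(i + 2, n + 1):
            if text[i:j] in word_cost:
                starts[i].append(text[i:j])
    # best[i] = (cost of the cheapest path covering text[:i], back-pointer).
    best = [(math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for i in range(n):
        for word in starts[i]:
            cost = best[i][0] + word_cost.get(word, default_cost)
            end = i + len(word)
            if cost < best[end][0]:
                best[end] = (cost, (i, word))
    # Walk the back-pointers from the end to recover the chosen words.
    path, pos = [], n
    while pos > 0:
        i, word = best[pos][1]
        path.append((i, word))
        pos = i
    return path[::-1]


# Hypothetical costs: a lower cost means a more plausible word.
costs = {"not": 2.0, "good": 1.0, "notgood": 5.0, "go": 1.5}
print(shortest_segmentation("notgood", costs))
# -> [(0, 'not'), (3, 'good')]
```

Indexing the adjacency list by offset means each word's successors are found directly at `offset + len(word)`, so the graph is scored in one left-to-right pass.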
  6. The public opinion polarity prediction method according to claim 2, wherein, in performing polarity prediction on the feature data by means of the public opinion polarity prediction model to obtain the prediction result, the public opinion polarity prediction model is a model obtained by inputting the emotional feature data set extracted by means of the emotional dictionary into an XGBoost model to obtain classification features, and then inputting the classification features into a logistic regression model for training.
  7. The public opinion polarity prediction method according to claim 6, wherein obtaining the public opinion polarity prediction model by inputting the emotional feature data set extracted by means of the emotional dictionary into the XGBoost model to obtain classification features, and then inputting the classification features into the logistic regression model for training, comprises:
    constructing a decision tree according to the emotional feature data set extracted by means of the emotional dictionary;
    inputting the decision tree into the XGBoost model to obtain the residual between the output of the XGBoost model and the actual output of the emotional feature data set extracted by means of the emotional dictionary;
    constructing a new decision tree according to the residual;
    iterating the decision tree using the new decision tree to obtain an emotional feature information combination;
    inputting the emotional feature information combination into the logistic regression model and training the logistic regression model;
    performing model persistence processing on the trained logistic regression model to obtain the public opinion polarity prediction model.
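The training flow of claims 6 and 7 can be illustrated, purely as a sketch, with a dependency-free stand-in: plain gradient boosting over depth-1 decision stumps takes the place of the XGBoost library, the leaf reached in each stump is one-hot encoded as the classification feature, and a logistic regression is fitted to those features by stochastic gradient descent. The two-column feature layout (positive and negative word counts), the toy data, the hyperparameters, and the use of JSON for persistence are all assumptions for the example.

```python
import json
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Choose the (feature, threshold) split with the least squared error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[f] <= t]
            right = [r for row, r in zip(X, residuals) if row[f] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, f, t, lm, rm)
    return best[1:]


def boost(X, y, rounds=3, rate=0.5):
    """Fit each new stump to the residual left by the stumps before it."""
    scores, stumps = [0.0] * len(X), []
    for _ in range(rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        f, t, lm, rm = fit_stump(X, residuals)
        stumps.append((f, t))
        scores = [s + rate * (lm if row[f] <= t else rm) for row, s in zip(X, scores)]
    return stumps


def leaf_features(stumps, row):
    """One-hot encode the leaf (left or right) reached in every stump."""
    feats = []
    for f, t in stumps:
        feats += [1.0, 0.0] if row[f] <= t else [0.0, 1.0]
    return feats


def train_logistic(F, y, epochs=200, rate=0.5):
    """Logistic regression trained by stochastic gradient descent."""
    w, b = [0.0] * len(F[0]), 0.0
    for _ in range(epochs):
        for x, yi in zip(F, y):
            g = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - yi
            w = [wi - rate * g * xi for wi, xi in zip(w, x)]
            b -= rate * g
    return w, b


def predict(stumps, w, b, row):
    x = leaf_features(stumps, row)
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


# Hypothetical data: [positive word count, negative word count], label 1 = positive.
X = [[3, 0], [2, 1], [0, 2], [1, 3], [4, 1], [0, 4]]
y = [1, 1, 0, 0, 1, 0]
stumps = boost(X, y)
w, b = train_logistic([leaf_features(stumps, row) for row in X], y)

# Persist the trained model as JSON, then reload it for prediction.
model = json.loads(json.dumps({"stumps": stumps, "w": w, "b": b}))
print(predict(model["stumps"], model["w"], model["b"], [3, 0]) > 0.5)
```

Feeding leaf indices rather than raw counts to the logistic regression lets the linear model weight feature combinations discovered by the trees, which is the usual motivation for stacking the two models in this order.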
  8. A public opinion polarity prediction apparatus, characterized by comprising:
    a public opinion data acquisition unit, configured to acquire public opinion data;
    an extraction unit, configured to extract emotional feature information from the data to be analyzed by means of an AC automaton based on a double-array dictionary tree to obtain feature data;
    a prediction unit, configured to perform polarity prediction on the feature data by means of a public opinion polarity prediction model to obtain a prediction result;
    an output unit, configured to output the prediction result.
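The four units of claim 8 can be pictured as a simple pipeline. The following sketch is only one possible arrangement: the class name `PolarityPredictionApparatus`, the callable signatures, and the stand-in lambdas are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PolarityPredictionApparatus:
    acquire: Callable[[], str]             # public opinion data acquisition unit
    extract: Callable[[str], List[float]]  # extraction unit
    predict: Callable[[List[float]], str]  # prediction unit
    output: Callable[[str], None]          # output unit

    def run(self) -> str:
        """Pass the data through the four units in order."""
        data = self.acquire()
        features = self.extract(data)
        result = self.predict(features)
        self.output(result)
        return result


# Stand-in units for demonstration; real units would wrap the matcher and model.
apparatus = PolarityPredictionApparatus(
    acquire=lambda: "great service",
    extract=lambda text: [float(text.count("great"))],
    predict=lambda feats: "positive" if feats[0] > 0 else "negative",
    output=print,
)
apparatus.run()
```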
  9. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program, and the processor implementing the method according to any one of claims 1 to 7 when executing the computer program.
  10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.
PCT/CN2019/089224 2019-03-15 2019-05-30 Public opinion polarity prediction method and apparatus, computer device, and storage medium WO2020186627A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910199451.5A CN109933656B (en) 2019-03-15 2019-03-15 Public opinion polarity prediction method, public opinion polarity prediction device, computer equipment and storage medium
CN201910199451.5 2019-03-15

Publications (1)

Publication Number Publication Date
WO2020186627A1 (en) 2020-09-24

Family

ID=66987288

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/089224 WO2020186627A1 (en) 2019-03-15 2019-05-30 Public opinion polarity prediction method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN109933656B (en)
WO (1) WO2020186627A1 (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328793A (en) * 2020-11-09 2021-02-05 北京小米松果电子有限公司 Comment text data processing method and device and storage medium
CN113642881A (en) * 2021-08-09 2021-11-12 平安国际智慧城市科技股份有限公司 Public opinion data risk identification method and device, computer equipment and storage medium
CN113643060A (en) * 2021-08-12 2021-11-12 工银科技有限公司 Product price prediction method and device
CN114117149A (en) * 2021-11-25 2022-03-01 深圳前海微众银行股份有限公司 Sensitive word filtering method and device and storage medium
CN114701870A (en) * 2022-02-11 2022-07-05 国能黄骅港务有限责任公司 Tippler feeding system and high material level detection method and device thereof
CN114722723A (en) * 2022-04-29 2022-07-08 湖北工业大学 Emotion tendency prediction method and equipment based on kernel extreme learning machine optimization
CN114861027A (en) * 2022-04-29 2022-08-05 深圳市东晟数据有限公司 Multi-dimensional public opinion recommendation method based on big data and natural language processing
CN114897270A (en) * 2022-06-15 2022-08-12 青岛文达通科技股份有限公司 Semantic information fused public opinion propagation quantity prediction method and system
CN117407527A (en) * 2023-10-19 2024-01-16 重庆邮电大学 Education field public opinion big data classification method
CN117640259A (en) * 2024-01-25 2024-03-01 武汉思普崚技术有限公司 Script step-by-step detection method and device, electronic equipment and medium

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362669B (en) * 2019-07-18 2022-07-01 中科信息安全共性技术国家工程研究中心有限公司 Method suitable for fast keyword retrieval
CN110674297B (en) * 2019-09-24 2022-04-29 支付宝(杭州)信息技术有限公司 Public opinion text classification model construction method, public opinion text classification device and public opinion text classification equipment
CN113051925B (en) * 2019-12-26 2024-06-18 中国移动通信集团有限公司 Time identification method, device, equipment and computer storage medium
CN111831824B (en) * 2020-07-16 2024-02-09 民生科技有限责任公司 Public opinion positive and negative surface classification method
CN111859074B (en) * 2020-07-29 2023-12-29 东北大学 Network public opinion information source influence evaluation method and system based on deep learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102200969A (en) * 2010-03-25 2011-09-28 日电(中国)有限公司 Text sentiment polarity classification system and method based on sentence sequence
CN105512687A (en) * 2015-12-15 2016-04-20 北京锐安科技有限公司 Emotion classification model training and textual emotion polarity analysis method and system
CN106294326A (en) * 2016-08-23 2017-01-04 成都科来软件有限公司 A kind of news report Sentiment orientation analyzes method
CN108021569A (en) * 2016-11-01 2018-05-11 中国移动通信有限公司研究院 The structure of AC automatic machines and Chinese multi-model matching method and relevant apparatus

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102779174B (en) * 2012-06-26 2016-03-30 北京奇虎科技有限公司 A kind of public opinion information display system and method
CN103365991B (en) * 2013-07-03 2017-03-08 深圳市华傲数据技术有限公司 A kind of dictionaries store management method realizing Trie tree based on one-dimensional linear space

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328793B (en) * 2020-11-09 2024-07-09 北京小米松果电子有限公司 Comment text data processing method and device and storage medium
CN112328793A (en) * 2020-11-09 2021-02-05 北京小米松果电子有限公司 Comment text data processing method and device and storage medium
CN113642881A (en) * 2021-08-09 2021-11-12 平安国际智慧城市科技股份有限公司 Public opinion data risk identification method and device, computer equipment and storage medium
CN113643060A (en) * 2021-08-12 2021-11-12 工银科技有限公司 Product price prediction method and device
CN114117149A (en) * 2021-11-25 2022-03-01 深圳前海微众银行股份有限公司 Sensitive word filtering method and device and storage medium
CN114701870B (en) * 2022-02-11 2024-03-29 国能黄骅港务有限责任公司 Feeding system of dumper and high material level detection method and device thereof
CN114701870A (en) * 2022-02-11 2022-07-05 国能黄骅港务有限责任公司 Tippler feeding system and high material level detection method and device thereof
CN114861027A (en) * 2022-04-29 2022-08-05 深圳市东晟数据有限公司 Multi-dimensional public opinion recommendation method based on big data and natural language processing
CN114722723A (en) * 2022-04-29 2022-07-08 湖北工业大学 Emotion tendency prediction method and equipment based on kernel extreme learning machine optimization
CN114897270A (en) * 2022-06-15 2022-08-12 青岛文达通科技股份有限公司 Semantic information fused public opinion propagation quantity prediction method and system
CN117407527A (en) * 2023-10-19 2024-01-16 重庆邮电大学 Education field public opinion big data classification method
CN117640259A (en) * 2024-01-25 2024-03-01 武汉思普崚技术有限公司 Script step-by-step detection method and device, electronic equipment and medium
CN117640259B (en) * 2024-01-25 2024-06-04 武汉思普崚技术有限公司 Script step-by-step detection method and device, electronic equipment and medium

Also Published As

Publication number Publication date
CN109933656A (en) 2019-06-25
CN109933656B (en) 2023-08-15

Similar Documents

Publication Publication Date Title
WO2020186627A1 (en) Public opinion polarity prediction method and apparatus, computer device, and storage medium
CN108647205B (en) Fine-grained emotion analysis model construction method and device and readable storage medium
CN108121700B (en) Keyword extraction method and device and electronic equipment
WO2021042503A1 (en) Information classification extraction method, apparatus, computer device and storage medium
CN106844368B (en) Method for man-machine conversation, neural network system and user equipment
CN113239186B (en) Graph convolution network relation extraction method based on multi-dependency relation representation mechanism
WO2019019860A1 (en) Method and apparatus for training classification model
CN105095204B (en) The acquisition methods and device of synonym
US20180053107A1 (en) Aspect-based sentiment analysis
US20220318275A1 (en) Search method, electronic device and storage medium
WO2018153215A1 (en) Method for automatically generating sentence sample with similar semantics
CN104391942A (en) Short text characteristic expanding method based on semantic atlas
CN110781673B (en) Document acceptance method and device, computer equipment and storage medium
CN105069647A (en) Improved method for extracting evaluation object in Chinese commodity review
CN112528653B (en) Short text entity recognition method and system
CN112579729B (en) Training method and device for document quality evaluation model, electronic equipment and medium
CN111160014A (en) Intelligent word segmentation method
US20220198358A1 (en) Method for generating user interest profile, electronic device and storage medium
CN114626367A (en) Sentiment analysis method, system, equipment and medium based on news article content
CN114202443A (en) Policy classification method, device, equipment and storage medium
CN112948573A (en) Text label extraction method, device, equipment and computer storage medium
JP2024003750A (en) Language model training method and device, electronic device, and storage medium
WO2018171499A1 (en) Information detection method, device and storage medium
CN116070642A (en) Text emotion analysis method and related device based on expression embedding
CN115577109A (en) Text classification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19919794

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19919794

Country of ref document: EP

Kind code of ref document: A1