WO2018184518A1 - Microblog data processing method and device, computer device and storage medium - Google Patents

Microblog data processing method and device, computer device and storage medium

Info

Publication number
WO2018184518A1
Authority
WO
WIPO (PCT)
Prior art keywords
microblog
sentiment
emotional
feature
stock market
Application number
PCT/CN2018/081697
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
黄章成
吴天博
肖京
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2018184518A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01: Social networking

Definitions

  • the present application relates to the field of computer processing, and in particular, to a microblog data processing method, apparatus, computer device, and storage medium.
  • a microblog data processing method, apparatus, computer device, and storage medium are provided.
  • a microblog data processing method comprising:
  • a time series of emotional index charts is generated based on the emotional index values.
  • a microblog data processing device includes:
  • a discovery module configured to discover hotspot events in real time by monitoring microblog stream data;
  • an analysis module configured to perform sentiment orientation analysis on the microblogs containing the hot event;
  • a determining module configured to determine a corresponding emotional index value according to the sentiment orientation analysis; and
  • a generating module configured to generate a time series of emotional index charts according to the emotional index value.
  • a computer device comprising a memory and one or more processors, the memory having stored therein computer readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
  • a time series of emotional index charts is generated based on the emotional index values.
  • One or more non-transitory computer readable storage mediums storing computer readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • a time series of emotional index charts is generated based on the emotional index values.
  • FIG. 1 is an application scenario diagram of a microblog data processing method according to one or more embodiments.
  • FIG. 2 is a block diagram of the internal structure of a terminal in accordance with one or more embodiments.
  • FIG. 3 is a block diagram of a server in accordance with one or more embodiments.
  • FIG. 4 is a flow diagram of a method of microblog data processing in accordance with one or more embodiments.
  • FIG. 5 is a flow diagram of a method for discovering hotspot events in real time by monitoring microblogging stream data in accordance with one or more embodiments.
  • FIG. 6 is a flow diagram of a method for performing sentiment orientation analysis of a microblog containing a hotspot event, in accordance with one or more embodiments.
  • FIG. 7 is a flow diagram of a method for determining a corresponding sentiment index value based on sentiment orientation analysis in accordance with one or more embodiments.
  • FIG. 8 is a block diagram of a microblog data processing apparatus in accordance with one or more embodiments.
  • FIG. 9 is a block diagram of a discovery module in accordance with one or more embodiments.
  • FIG. 10 is a block diagram of an analysis module in accordance with one or more embodiments.
  • FIG. 11 is a block diagram of a determination module in accordance with one or more embodiments.
  • the microblog data processing method provided by the present application can be applied to the application scenario shown in FIG. 1 .
  • the terminal 102 communicates with the server 104 over a network.
  • the terminal 102 can be, but is not limited to, various personal computers, notebook computers, smart phones, tablets, and portable wearable devices, and the server 104 can be implemented with a stand-alone server or a server cluster composed of a plurality of servers.
  • the terminal 102 uploads the microblog stream data to the server 104.
  • the server 104 discovers hotspot events in real time by monitoring the microblog stream data, performs sentiment orientation analysis on the microblogs containing the hot event, and determines the corresponding emotional index value according to the sentiment orientation analysis.
  • the internal structure of the terminal 102 is as shown in FIG. 2, including a processor, a memory, a network interface, a display screen, and an input device connected through a system bus.
  • the processor of the terminal is used to provide calculation and control capabilities to support the operation of the entire terminal.
  • the memory of the terminal includes a non-transitory computer readable storage medium and an internal memory.
  • the non-transitory computer readable storage medium of terminal 102 stores an operating system and computer readable instructions that, when executed, cause the processor to perform a method of processing microblog data.
  • the internal memory provides an environment for the operation of an operating system and computer readable instructions in a non-transitory computer readable storage medium.
  • the network interface is used to connect to the network for communication.
  • the display screen of the terminal 102 may be a liquid crystal display or an electronic ink display screen.
  • the input device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the outer casing of the electronic device, or an external keyboard, trackpad, or mouse.
  • the terminal can be a tablet, a laptop, a desktop computer, or the like. It will be understood by those skilled in the art that the structure shown in FIG. 2 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the terminal to which the solution of the present application is applied.
  • a specific terminal may include more or fewer components than shown in the figure, may combine some components, or may have a different component arrangement.
  • the internal structure of the server 104 is as shown in FIG. 3, including a processor, a memory, and a network interface connected through a system bus.
  • the server's processor is used to provide computing and control capabilities, supporting the operation of the entire server.
  • the memory of the server includes a non-transitory computer readable storage medium and an internal memory.
  • the non-transitory computer readable storage medium can store an operating system and computer readable instructions.
  • when executed, the computer readable instructions cause the processor to perform a microblog data processing method.
  • the server's network interface is used to communicate with external servers and terminals over a network connection.
  • FIG. 3 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the server to which the solution of the present application is applied.
  • a specific server may include more or fewer components than shown in the figure, may combine some components, or may have a different component arrangement.
  • a microblog data processing method is proposed, which can be applied to a terminal or a server, and specifically includes the following steps:
  • Step 402 Discover hotspot events in real time by monitoring the microblog stream data.
  • a hotspot event refers to an event that currently attracts a high degree of attention or has a relatively large influence.
  • the current hot topic can be found in real time, and the event corresponding to the hot topic is a hot event.
  • Hot events can reflect investors' attention and emotional tendencies, so timely detection of hot events is conducive to discovering investors' attention and emotional tendencies.
  • the real-time hot topic of the microblog platform is discovered through a social media real-time hotspot discovery algorithm.
  • the real-time main microblog flow of the microblog platform is obtained by dynamically updating the microblog flow algorithm.
  • the main microblog stream refers to the most representative part of the microblog data stream sampled from a large number of real-time microblog data streams.
  • the dynamic update microblog flow algorithm actually refers to an update rule for the monitored microblog account, and the update rule is used to ensure that the microblog data stream obtained through the monitoring account can fully reflect the overall microblog flow.
  • for example, the Weibo data stream is initially built by collecting posts from 100,000 selected Weibo accounts. Over time, however, these accounts may change (for example, an account may stop updating or be hacked), or the real-time stream from these 100,000 accounts may no longer represent the entire real-time microblog data stream. The selected accounts then need to be updated dynamically, for example by deleting accounts that have not been updated for more than 5 days and adding new active accounts.
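The update rule in the example above can be sketched as follows. This is only an illustration under stated assumptions: the function name `refresh_accounts`, the account identifiers, and the idea of refilling the pool to its previous size from a candidate list are hypothetical; the source only states that accounts idle for more than 5 days are removed and new active accounts are added.

```python
from datetime import datetime, timedelta

def refresh_accounts(last_post_time, candidates, now, max_idle_days=5):
    """Drop accounts idle for more than max_idle_days and refill the pool.

    last_post_time: dict mapping account id -> datetime of its latest post.
    candidates: iterable of (account id, latest post datetime) replacements.
    """
    cutoff = now - timedelta(days=max_idle_days)
    # Keep only accounts whose latest post is not older than the cutoff.
    active = {acc: t for acc, t in last_post_time.items() if t >= cutoff}
    missing = len(last_post_time) - len(active)
    # Refill with candidates that are themselves active and not yet monitored.
    for acc, t in candidates:
        if missing == 0:
            break
        if acc not in active and t >= cutoff:
            active[acc] = t
            missing -= 1
    return active

now = datetime(2018, 4, 3, 12, 0)
pool = {
    "acct_a": now - timedelta(hours=2),
    "acct_b": now - timedelta(days=9),   # idle for more than 5 days
    "acct_c": now - timedelta(days=1),
}
fresh = [
    ("acct_d", now - timedelta(days=7)),     # candidate that is itself idle
    ("acct_e", now - timedelta(minutes=30)),
]
updated = refresh_accounts(pool, fresh, now)
print(sorted(updated))
```

Here the idle account is replaced by the first candidate that is still active, so the monitored pool keeps its size.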
  • the main microblog stream is vectorized in order to calculate the acceleration of the first-order word frequency and the second-order word frequency. Here, vectorization means that the obtained main microblog stream is expressed in the form of word vectors.
  • for example, the word2vec method can be used to represent the main microblog stream in vector form.
  • word2vec is a tool for converting words into vectors.
  • it reduces the processing of text content to vector operations in a vector space.
  • to do so, the text first needs to be converted into a mathematical representation.
  • for example, the word "microphone" is expressed as [0.792, -0.177, -0.17, 0.109, -0.542, ...].
  • in this way, the text is represented by mathematical vectors.
  • the semantic similarity of texts can then be expressed by calculating their similarity in the vector space.
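As an illustration of similarity in the vector space, the sketch below uses cosine similarity. The source does not name a specific similarity measure, so cosine similarity is an assumption here, and the `headset` and `weather` vectors are made-up values; only the first five components of the "microphone" vector come from the text.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)

microphone = [0.792, -0.177, -0.17, 0.109, -0.542]  # from the text (truncated)
headset = [0.70, -0.20, -0.10, 0.15, -0.50]         # hypothetical related word
weather = [-0.30, 0.60, 0.40, -0.20, 0.10]          # hypothetical unrelated word

close = cosine_similarity(microphone, headset)
far = cosine_similarity(microphone, weather)
print(close > far)
```

A higher value indicates that two words (or texts) point in a similar direction in the vector space, which is read as semantic closeness.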
  • the first-order word frequency refers to the frequency at which a single word appears
  • the second-order word frequency refers to the frequency at which two words appear at the same time.
  • first, the microblog stream data is acquired in real time and vectorized; second, the vectorized microblog data stream is monitored, and the frequency and number of occurrences of each word are recorded; finally, the current hot event is determined according to the frequency and number of occurrences of each word.
  • to monitor the microblog stream in real time, the microblog stream data needs to be converted into a vector representation, so after the microblog stream data is acquired, each microblog is converted into vector form. The vectorized microblog data is then monitored, and the frequency and number of occurrences of each word are recorded. When the frequency and number of occurrences of a word are relatively high, the corresponding event is determined to be a hot event.
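The first-order and second-order word frequencies, and their acceleration, might be computed along these lines. The source does not define "acceleration"; the sketch assumes it is the second difference of the counts over three consecutive time windows. It also assumes each microblog is already tokenized and counts the number of microblogs in which a word (or word pair) appears. All names and sample tokens are hypothetical.

```python
from collections import Counter
from itertools import combinations

def window_frequencies(posts):
    """Count single words and co-occurring word pairs in one time window.

    posts: list of token lists, one per microblog.
    """
    first_order = Counter()
    second_order = Counter()
    for tokens in posts:
        unique = sorted(set(tokens))  # count each word once per microblog
        first_order.update(unique)
        second_order.update(combinations(unique, 2))  # pairs in sorted order
    return first_order, second_order

def acceleration(series):
    """Second difference of the last three window counts (assumed definition)."""
    if len(series) < 3:
        return 0
    return series[-1] - 2 * series[-2] + series[-3]

windows = [
    [["rate", "cut"], ["market", "open"]],
    [["rate", "cut"], ["rate", "bank"], ["market", "close"]],
    [["rate", "cut"], ["rate", "cut", "bank"], ["rate", "hike"],
     ["cut", "rate"], ["rate", "bank"], ["rate"]],
]
per_window = [window_frequencies(w) for w in windows]
rate_counts = [first["rate"] for first, _ in per_window]
pair_counts = [second[("cut", "rate")] for _, second in per_window]
print(rate_counts, acceleration(rate_counts))
print(pair_counts, acceleration(pair_counts))
```

A word or word pair whose count is not only growing but growing faster from window to window gets a positive acceleration, which is one plausible signal of an emerging hot topic.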
  • Step 404 Perform an emotional tendency analysis on the microblog containing the hot event.
  • an emotional tendency analysis is performed for each microblog.
  • according to the latest research results in psychology, human emotions are divided into four categories: happiness, sadness, anger, and surprise.
  • before the sentiment orientation analysis model is established, the training set used to train it must first be determined.
  • the training set is a four-category microblog data set after emotion annotation.
  • the emotion annotation is a hybrid annotation method based on dynamic annotation dictionary and manual annotation.
  • the advantage of the hybrid labeling method is that, on the one hand, the error of simply using one type of labeling method can be avoided, and on the other hand, the time of manual labeling can be saved.
  • features are extracted from the microblog texts after emotional annotation for training the sentiment orientation analysis model.
  • feature extraction is performed using an LDA model and a word2vec model, wherein the LDA model is used to predict the topic distribution of each microblog; the word2vec model is used to obtain a word vector representation for each microblog.
  • principal component analysis may also be employed to extract features.
  • the sentiment orientation analysis model is used to analyze the sentiment orientation.
  • the sentiment orientation analysis model will be acquired after the training is completed.
  • the LDA topic feature corresponding to the new Weibo and the word2vec word vector feature are used together as an input vector of the sentiment orientation analysis model, and the sentiment orientation analysis model outputs the sentiment tendency corresponding to each Weibo.
  • Emotional tendencies are also divided into four categories: happiness, sadness, anger and surprise.
  • Step 406 Determine a corresponding emotional index value according to the sentiment orientation analysis.
  • the number of microblogs corresponding to each type of emotion is counted. By emotion category, the emotions are divided into four categories: happiness, sadness, anger, and surprise.
  • the emotional index value corresponding to the hot event is calculated according to the number of microblogs corresponding to each type of emotion.
  • the emotional index includes one or more of a happy emotional proportion, a sad emotional proportion, an angry emotional proportion, a surprise emotional proportion, and an emotional valence.
  • the emotional valence is a value reflecting the overall microblog sentiment, and is calculated as:
  • $\mathrm{SentimentValence} = \log\frac{1+P}{1+N}$, where P is the total number of positive emotional microblogs, and N is the total number of negative emotional microblogs.
  • the total number of positive emotional microblogs is the sum of the numbers of happy and surprised microblogs, and the total number of negative emotional microblogs is the sum of the numbers of sad and angry microblogs.
  • for example, suppose the total number of microblogs related to hot events on a given day is 100,000, of which 30,000 express happiness, 10,000 express sadness, 10,000 express anger, and 50,000 express surprise. The proportion of happy emotions is then 0.3, the proportions of sad and angry emotions are each 0.1, and the proportion of surprise emotions is 0.5.
  • the emotional valence is approximately log 4, since P = 80,000 and N = 20,000 give (1 + P)/(1 + N) ≈ 4.
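The proportions and the valence formula above can be checked with a short sketch. The base of the logarithm is not stated in the source; the natural logarithm is assumed here, and the function and label names are hypothetical.

```python
import math

POSITIVE = ("happiness", "surprise")  # positive emotions per the text
NEGATIVE = ("sadness", "anger")       # negative emotions per the text

def emotion_index(counts):
    """Return (proportion per emotion, valence) from per-emotion counts."""
    total = sum(counts.values())
    proportions = {
        label: (count / total if total else 0.0)
        for label, count in counts.items()
    }
    positive = sum(counts.get(label, 0) for label in POSITIVE)
    negative = sum(counts.get(label, 0) for label in NEGATIVE)
    # The +1 terms keep the ratio defined when either count is zero.
    valence = math.log((1 + positive) / (1 + negative))
    return proportions, valence

counts = {"happiness": 30000, "sadness": 10000, "anger": 10000, "surprise": 50000}
proportions, valence = emotion_index(counts)
print(proportions)
print(valence)
```

With the counts from the example, P = 80,000 and N = 20,000, so the ratio (1 + P)/(1 + N) is roughly 4 and the valence is positive, indicating predominantly positive sentiment.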
  • Step 408 Generate a time series of emotional index charts according to the emotional index value.
  • each emotional index value is stored together with its statistical time point or time period, so that the emotional index value corresponding to each time point in the period is recorded.
  • the time series of sentiment index charts is generated in chronological order, and the charts fully and intuitively reflect the fluctuation of investor sentiment within a given period of time.
  • the sentiment factor affecting the stock market trend can be determined.
  • the sentiment factor can be used together with other stock market data as a predictor of the stock market forecasting model to make corresponding stock market forecasts. Because investors' attention and emotion are fully taken into account, this helps improve the accuracy of stock market forecasting.
  • in this embodiment, hot events are discovered in real time by monitoring the microblog stream data, sentiment orientation analysis is performed on the microblogs containing the hot event, the corresponding sentiment index value is determined according to the sentiment orientation analysis, and a time series of sentiment index charts is generated according to the sentiment index value.
  • the method selects microblog hot events that reflect investors' attention and emotion, analyzes the emotional tendency of those events, and generates a time series of emotional index charts according to the emotional index value. The emotional index charts fully and intuitively reflect investors' attention and emotions. If applied to stock market forecasting, they help improve the accuracy of the forecast.
  • the step 402 of discovering hotspot events in real time by monitoring the microblog stream data includes:
  • Step 402A Acquire microblog stream data in real time, and vectorize the microblog stream data.
  • the microblog stream data is acquired in real time. The initially obtained microblog data stream is text data, which differs from ordinary numerical data or generic data.
  • text data is a kind of semi-structured data.
  • before analysis, text data needs to be preprocessed, and vectorized values are used to express the semi-structured text data.
  • the text data is subjected to word segmentation. After word segmentation, a piece of text data can be represented as a multi-dimensional vector composed of a plurality of keywords, which facilitates subsequent monitoring of the feature words.
  • Step 402B Monitor the vectorized microblog stream data, and record the frequency and number of occurrences of each feature word.
  • a feature word refers to a word that can represent an event. Not all words in a Weibo text need to be recorded: some common words may appear in any event, so recording them is meaningless, for example auxiliary words such as "地".
  • the vectorized microblog stream data is monitored in real time, and the frequency and number of occurrences of each feature word are monitored.
  • the frequency refers to the number of occurrences of the word in a unit time.
  • the number of times refers to the total number of occurrences of the word.
  • Step 402C Determine a current hot event according to the frequency and number of occurrences of each feature word.
  • the frequency and the number of occurrences of each feature word are monitored and recorded, and when the frequency or number of occurrences of one or more feature words reaches a preset threshold, the event corresponding to the one or more feature words is determined to be a hot event.
  • the frequency threshold and the number of times thresholds may be respectively set. In one embodiment, when the frequency of occurrence of the feature word reaches the frequency threshold or when the number of occurrences of the feature word reaches the threshold of the number of times, the event corresponding to the feature word is determined as the hot event. In another embodiment, when the frequency of occurrence of the feature word reaches the frequency threshold, and the number of times also reaches the threshold of the number of times, the event corresponding to the feature word is determined as the hot event.
  • by monitoring the frequency of one or more feature words, a hot event can be found in a short time; by monitoring the number of occurrences of one or more feature words, the hot topic over a period of time can be inspected, thereby determining the popularity and duration of the topic. This method facilitates the timely discovery and recording of hotspot events.
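The two threshold embodiments (frequency or count, versus frequency and count) can be sketched as follows. The class name, the sliding-window interpretation of "frequency" as occurrences within the last `window_seconds`, the threshold values, and the sample words are all assumptions for illustration; timestamps are assumed to arrive in non-decreasing order for each word.

```python
from collections import defaultdict, deque

class FeatureWordMonitor:
    """Track per-word frequency (recent window) and total occurrence count."""

    def __init__(self, window_seconds, freq_threshold, count_threshold,
                 require_both=False):
        self.window_seconds = window_seconds
        self.freq_threshold = freq_threshold
        self.count_threshold = count_threshold
        self.require_both = require_both
        self.total = defaultdict(int)      # word -> total occurrences
        self.recent = defaultdict(deque)   # word -> timestamps inside the window

    def observe(self, word, timestamp):
        self.total[word] += 1
        times = self.recent[word]
        times.append(timestamp)
        # Discard occurrences that fell out of the sliding window.
        while times and times[0] <= timestamp - self.window_seconds:
            times.popleft()

    def is_hot(self, word):
        freq_hit = len(self.recent.get(word, ())) >= self.freq_threshold
        count_hit = self.total.get(word, 0) >= self.count_threshold
        if self.require_both:
            return freq_hit and count_hit
        return freq_hit or count_hit

def replay(monitor):
    for t in (0, 10, 20):                  # burst within one window
        monitor.observe("earthquake", t)
    for t in (0, 100, 200, 300, 400):      # slow but persistent
        monitor.observe("merger", t)
    monitor.observe("weather", 0)
    return {w: monitor.is_hot(w) for w in ("earthquake", "merger", "weather")}

either = replay(FeatureWordMonitor(60, 3, 5))
both = replay(FeatureWordMonitor(60, 3, 5, require_both=True))
print(either)
print(both)
```

In the "either" mode a short burst and a slow persistent topic both qualify, while the stricter "both" mode only flags words that are bursting and have also accumulated enough total mentions.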
  • the step of performing a sentiment orientation analysis on the microblog containing the hot event includes:
  • Step 404A Extract the acquired LDA topic feature and the word2vec word vector feature of each microblog containing the hot event.
  • after the hotspot event is found by monitoring the microblog stream data, all microblogs containing the hotspot event are obtained, and the corresponding LDA topic feature and word2vec word vector feature are extracted from each of them.
  • the LDA topic model and the word2vec word vector model are pre-trained. The topic distribution of a microblog can be predicted by inputting its vectorized text into the LDA topic model; for example, the probability distribution over the first 250 dimensions is taken as the 250-dimensional feature of that microblog.
  • the vector representation of any word in a microblog can be obtained by inputting the vectorized text of the microblog into the word2vec model.
  • step 404B the LDA topic feature and the word2vec word vector feature are substituted into the sentiment orientation analysis model, and the sentiment tendency of each microblog is output.
  • the acquired feature is substituted into the sentiment orientation analysis model, and the sentiment tendency of each microblog is output.
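One way to combine the two features into a single input vector is sketched below. The source says the LDA topic feature and the word2vec word vector feature are used together as the model input but does not say how per-word vectors are pooled into one vector per microblog; averaging them is an assumption here. The lookup table `word_vectors`, the tiny dimensions, and the function names are hypothetical stand-ins for the outputs of pre-trained models.

```python
def average_word_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens that have a known embedding."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim  # no known word: fall back to a zero vector
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

def build_input_vector(topic_distribution, tokens, word_vectors, dim):
    """Concatenate the topic distribution with the pooled word vector."""
    return list(topic_distribution) + average_word_vector(tokens, word_vectors, dim)

# Toy sizes for readability; the text mentions a 250-dimensional topic feature.
word_vectors = {"stock": [0.2, 0.4], "rally": [0.6, 0.0]}
topic_distribution = [0.7, 0.2, 0.1]
features = build_input_vector(
    topic_distribution, ["stock", "rally", "today"], word_vectors, dim=2
)
print(len(features))
```

The resulting vector has one block of topic probabilities followed by one block of embedding components, so every microblog maps to an input of the same fixed length.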
  • the sentiment orientation analysis model can be established with the Boosted tree algorithm, which is mainly used for prediction in multi-class classification problems.
  • the emotional propensity is divided into four categories, namely happiness, sadness, anger, and surprise.
  • the LDA topic feature and the word2vec word vector feature are substituted into the initialized sentiment orientation analysis model to obtain the corresponding model parameters, and the final sentiment orientation analysis model is obtained.
  • the initial sentiment orientation analysis model is established by using the Boosted tree method.
  • each feature needs to be standardized before training, using $x' = (x - \bar{X}) / \sigma$.
  • here $\bar{X}$ is the mean of all values of the feature column, and $\sigma$ is the standard deviation of all values of the feature column.
  • the formula is used to standardize the features of each dimension, which facilitates the subsequent machine learning that yields the sentiment orientation analysis model.
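A minimal sketch of per-column standardization, subtracting the column mean and dividing by the column standard deviation, is given below. Whether the source intends the population or the sample standard deviation is not stated; the population form is assumed, and a constant column is mapped to zeros to avoid division by zero. The function name and sample rows are hypothetical.

```python
import math

def standardize_columns(rows):
    """Standardize each feature column to zero mean and unit variance."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    # Population standard deviation of each column (assumption).
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(dims)
    ]
    return [
        [(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(dims)]
        for r in rows
    ]

rows = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
standardized = standardize_columns(rows)
print(standardized)
```

After this step every non-constant feature has mean 0 and standard deviation 1, so features with large raw scales do not dominate the others.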
  • the step of determining the corresponding emotional index value according to the sentiment orientation analysis comprises:
  • Step 406A According to the sentiment orientation analysis, the microblogs containing the hot event are classified according to the emotion category, and the number of microblogs corresponding to each type of emotion is separately counted.
  • the microblog containing the hot event is classified according to the sentiment category, and the number of microblogs corresponding to each type of emotion is separately counted.
  • the emotional categories are divided into four categories, namely happiness, sadness, anger and surprise.
  • the number of microblogs corresponding to each type of emotion is separately counted, which facilitates the subsequent calculation of the sentiment index.
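Counting the microblogs per emotion category from the model's output labels could look like this. The label strings and the function name are hypothetical; the sketch only assumes that each microblog has been assigned exactly one of the four categories.

```python
from collections import Counter

CATEGORIES = ("happiness", "sadness", "anger", "surprise")

def count_by_emotion(labels):
    """Count predicted labels, reporting every category even when it is empty."""
    counts = Counter(labels)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unexpected emotion labels: {sorted(unknown)}")
    return {category: counts.get(category, 0) for category in CATEGORIES}

predicted = ["happiness", "surprise", "surprise", "anger", "happiness", "surprise"]
per_emotion = count_by_emotion(predicted)
print(per_emotion)
```

Returning a count for all four categories, including zeros, keeps the later proportion and valence calculations free of missing-key checks.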
  • Step 406B Determine an emotional index value corresponding to the hot event according to the counted number of microblogs corresponding to each type of emotion.
  • the sentiment index includes one or more of a happy emotional proportion, a sad emotional proportion, an angry emotional proportion, a surprise emotional proportion, and an emotional valence.
  • the emotional valence is a value reflecting the overall microblog sentiment.
  • the total number of positive emotional microblogs is the sum of the numbers of happy and surprised microblogs, and the total number of negative emotional microblogs is the sum of the numbers of sad and angry microblogs.
  • the emotional index value is the value of the corresponding emotional index.
  • for example, suppose the total number of microblogs related to hot events on a given day is 100,000, of which 30,000 express happiness, 10,000 express sadness, 10,000 express anger, and 50,000 express surprise. The emotional index values then include: a happy emotional proportion of 0.3, sad and angry emotional proportions of 0.1 each, a surprise emotional proportion of 0.5, and an emotional valence of approximately log 4, since (1 + 80,000)/(1 + 20,000) ≈ 4. Subsequent stock market calculations are based on the calculated emotional index values. It should be noted that "hot event" here does not refer to a single hot event, but to all hot events within a period of time.
  • the method further comprises: determining a sentiment factor affecting the stock market data according to the sentiment index chart, and using the sentiment factor together with other stock market data as predictors of the stock market forecasting model to make the corresponding stock market forecast.
  • the emotional factors affecting the stock market data are determined by the emotional index charts.
  • the sentiment indices that are correlated with the stock market are extracted as forecast factors of the stock market, that is, the sentiment factors affecting the stock market data are determined.
  • the emotional index includes the emotional valence, happy emotional proportion, sad emotional proportion, angry emotional proportion, surprise emotional proportion, and so on. Therefore, each of these factors needs to be analyzed in order to screen out the one or more factors that have an impact on the stock market data, which together serve as the sentiment factors of the stock market data.
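The screening step might be approximated by a simple correlation filter, as sketched below. The source does not specify how the impact of each index on the stock market data is measured; Pearson correlation against a market series and the 0.5 cut-off are assumptions, and all series values are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

def screen_factors(index_series, market_series, min_abs_corr=0.5):
    """Keep the emotional indices whose |correlation| reaches the cut-off."""
    selected = {}
    for name, series in index_series.items():
        r = pearson(series, market_series)
        if abs(r) >= min_abs_corr:
            selected[name] = r
    return selected

# Hypothetical daily values over five trading days.
market_returns = [0.01, -0.02, 0.015, -0.01, 0.02]
index_series = {
    "valence": [0.5, -0.8, 0.6, -0.3, 0.9],
    "sad_share": [0.1, 0.3, 0.1, 0.25, 0.05],
    "surprise_share": [0.25, 0.25, 0.25, 0.25, 0.25],
}
factors = screen_factors(index_series, market_returns)
print(sorted(factors))
```

Both strongly positive and strongly negative correlations are kept, since an index that moves opposite to the market is just as informative as one that moves with it.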
  • the determined sentiment factor and other traditional stock market factors are used as predictors of the stock market forecasting model to make corresponding stock market forecasts.
  • Other stock market data include opening price, closing price, highest price, lowest price, turnover, and yield.
  • the correlated sentiment indices are extracted as forecasting features of the stock market, and a machine learning algorithm is trained on these emotional features to obtain the stock market forecasting model.
  • according to the calculated sentiment index value (that is, the sentiment index value corresponding to the sentiment factor) and the other stock market data obtained, the stock market forecasting model is used to perform the corresponding stock market forecast.
  • the stock market forecast here can predict the profit status of the overall market, and can also predict the profit status of a single stock.
  • the forecast target differs with the forecasting purpose. Taking investors' attention and emotions into account when making stock market forecasts helps improve the accuracy and reliability of the forecasts and provides a reliable basis for decision makers.
  • by monitoring the microblog stream data, it is also possible to find hot events that have a great impact on society and last for a long time, and analyzing such highly influential hot events is beneficial to stock selection. Highly influential hot events attract excessive attention from the investing public, causing irrational fluctuations in current prices that later return to fundamental value, so buying opportunities arise in between.
  • the event influence of a hot event is determined; when the event influence is greater than a preset threshold, the stocks or industries whose stock name or industry classification name includes an event keyword are selected to form a candidate stock pool;
  • stocks with good fundamentals are selected from the candidate stock pool according to the available information; the influence on the relevant stocks is determined according to the event period and the market trend, and is then fed back to the stock selection timing model for stock selection.
  • for example, for an event such as the US general election, the right time to buy at the bottom can be chosen.
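Forming the candidate stock pool by keyword matching can be sketched as follows. The record layout (`code`, `name`, `industry`), the case-insensitive substring match, the influence scale, and all sample data are hypothetical; the source only states that stocks or industries whose names contain an event keyword are selected once the event influence exceeds a preset threshold.

```python
def build_candidate_pool(event, stocks, influence_threshold):
    """Return codes of stocks whose name or industry contains an event keyword."""
    if event["influence"] <= influence_threshold:
        return []  # the event is not influential enough to act on
    keywords = [kw.lower() for kw in event["keywords"]]
    pool = []
    for stock in stocks:
        fields = (stock["name"].lower(), stock["industry"].lower())
        if any(kw in field for kw in keywords for field in fields):
            pool.append(stock["code"])
    return pool

stocks = [
    {"code": "000001", "name": "Harbor Oil Transport", "industry": "Shipping"},
    {"code": "000002", "name": "Northern Foods", "industry": "Oil and gas services"},
    {"code": "000003", "name": "City Retail", "industry": "Retail"},
]
event = {"keywords": ["oil"], "influence": 0.9}

pool = build_candidate_pool(event, stocks, influence_threshold=0.8)
print(pool)
```

The fundamentals screening and the timing model described in the text would then operate on this pool; they are not modeled here.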
  • a microblog data processing apparatus comprising:
  • the discovery module 802 is configured to discover hotspot events in real time by monitoring the microblog stream data.
  • the analysis module 804 is configured to perform sentiment orientation analysis on the microblog containing the hot event.
  • the determining module 806 is configured to determine a corresponding emotional index value according to the sentiment orientation analysis.
  • the generating module 808 is configured to generate a time series of sentiment index charts according to the emotion index values.
  • the discovery module 802 includes:
  • the obtaining module 802A is configured to acquire microblog stream data in real time, and vectorize the microblog stream data.
  • the recording module 802B is configured to monitor the vectorized microblog stream data, and record the frequency and number of occurrences of each feature word.
  • the hotspot event determining module 802C is configured to determine a current hotspot event according to the frequency and number of occurrences of each feature word.
  • the analysis module 804 includes:
  • the extracting module 804A is configured to extract the acquired LDA topic feature and the word2vec word vector feature of each microblog containing the hot event.
  • the output module 804B is configured to substitute the LDA topic feature and the word2vec word vector feature into the sentiment orientation analysis model, and output the sentiment tendency of each microblog.
  • the determining module 806 includes:
  • the statistics module 806A is configured to classify the microblogs containing the hot event by emotion category according to the sentiment orientation analysis, and separately count the number of microblogs corresponding to each type of emotion.
  • the sentiment index value determining module 806B is configured to determine an emotional index value corresponding to the hotspot event according to the counted number of microblogs corresponding to each type of sentiment.
  • the microblog data processing apparatus further includes: a stock market prediction module, configured to determine a sentiment factor affecting the stock market data according to the sentiment index chart, and use the sentiment factor together with other stock market data as predictors of the stock market forecasting model to make the corresponding stock market forecast.
  • the network interface may be an Ethernet card or a wireless network card.
  • the above modules may be embedded in hardware in the processor of the server, or may be stored in the memory of the server, so that the processor can call and execute the operations corresponding to the above modules.
  • the processor can be a central processing unit (CPU), a microprocessor, a microcontroller, or the like.
  • the microblog data processing apparatus described above can be implemented in the form of computer readable instructions that can be run on a terminal or server as shown in FIG. 2 or FIG. 3.
  • a computer device is proposed.
  • the internal structure of the computer device may correspond to the structure shown in FIG. 2 or FIG. 3, that is, the computer device may be either a server or a terminal. It includes a series of computer readable instructions stored on the memory; when the computer readable instructions are executed by the processor, the microblog data processing method proposed by the embodiments of the present application may be implemented.
  • the embodiment of the present application proposes a computer device.
  • the computer device includes a series of computer readable instructions stored on a memory, and when the computer readable instructions are executed by the processor, the microblog data processing method proposed by the embodiments of the present application can be implemented.
  • the computer device includes a memory, a processor, and computer readable instructions stored on the memory and executable on the processor, the processor executing the computer readable instructions to implement the following steps: discovering hot events in real time by monitoring microblog stream data; performing sentiment orientation analysis on the microblogs containing the hot event; determining corresponding sentiment index values according to the sentiment orientation analysis; and generating a time-series sentiment index chart according to the sentiment index values.
  • the step, performed by the processor, of discovering hot events in real time by monitoring microblog stream data includes: acquiring microblog stream data in real time and vectorizing the microblog stream data; monitoring the vectorized microblog stream data and recording the frequency and number of occurrences of each feature word; and determining the current hot event according to the frequency and number of occurrences of each feature word.
  • the step, performed by the processor, of performing sentiment orientation analysis on the microblogs containing the hot event includes: extracting the LDA topic feature and the word2vec word vector feature of each acquired microblog containing the hot event; and feeding the LDA topic feature and the word2vec word vector feature into the sentiment orientation analysis model, which outputs the sentiment orientation of each microblog.
  • the step, performed by the processor, of determining the corresponding sentiment index value according to the sentiment orientation analysis includes: classifying the microblogs containing the hot event by sentiment category according to the sentiment orientation analysis, and separately counting the number of microblogs corresponding to each type of sentiment; and determining the sentiment index value corresponding to the hot event according to the counted number of microblogs corresponding to each type of sentiment.
  • the processor is further configured to: determine a sentiment factor affecting the stock market data according to the sentiment index chart, and use the sentiment factor together with other stock market data as predictors of a stock market forecasting model to perform the corresponding stock market forecast.
  • one or more non-transitory computer readable storage media storing computer-executable instructions are provided, the computer-executable instructions, when executed by one or more processors, causing the one or more processors to perform the following steps: discovering hot events in real time by monitoring microblog stream data; performing sentiment orientation analysis on the microblogs containing the hot event; determining corresponding sentiment index values according to the sentiment orientation analysis; and generating a time-series sentiment index chart according to the sentiment index values.
  • the step, performed by the processor, of discovering hot events in real time by monitoring microblog stream data includes: acquiring microblog stream data in real time and vectorizing the microblog stream data; monitoring the vectorized microblog stream data and recording the frequency and number of occurrences of each feature word; and determining the current hot event according to the frequency and number of occurrences of each feature word.
  • the step, performed by the processor, of performing sentiment orientation analysis on the microblogs containing the hot event includes: extracting the LDA topic feature and the word2vec word vector feature of each acquired microblog containing the hot event; and feeding the LDA topic feature and the word2vec word vector feature into the sentiment orientation analysis model, which outputs the sentiment orientation of each microblog.
  • the step, performed by the processor, of determining the corresponding sentiment index value according to the sentiment orientation analysis includes: classifying the microblogs containing the hot event by sentiment category according to the sentiment orientation analysis, and separately counting the number of microblogs corresponding to each type of sentiment; and determining the sentiment index value corresponding to the hot event according to the counted number of microblogs corresponding to each type of sentiment.
  • the processor is further configured to: determine a sentiment factor affecting the stock market data according to the sentiment index chart, and use the sentiment factor together with other stock market data as predictors of a stock market forecasting model to perform the corresponding stock market forecast.
  • the storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a read-only memory (ROM), or it may be a random access memory (RAM).

Abstract

A microblog data processing method comprises: detecting a hot event in real-time by monitoring microblog streaming data (402); analyzing emotional tendencies shown in microblogs containing the hot event (404); determining corresponding emotional index values according to the analysis on the emotional tendencies (406); and generating a trend chart of time series of emotional indexes according to the emotional index values (408).

Description

Microblog data processing method, apparatus, computer device, and storage medium
This application claims priority to Chinese patent application No. 2017102256815, filed with the Chinese Patent Office on April 7, 2017 and entitled "Microblog data processing method, apparatus, computer device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer processing, and in particular to a microblog data processing method, apparatus, computer device, and storage medium.
Background
With the development of social media, social networking sites, online communities, and microblogs have gradually become an indispensable part of people's lives and a main channel of information dissemination today. The dramatic change in how information spreads has challenged investors' established patterns of information use and investment beliefs, and has profoundly affected information transmission and the financial ecology of capital markets. Behavioral finance theory holds that investors' investment decisions are jointly influenced by factors such as investor attention and sentiment. Although traditional stock market forecasting models also involve sentiment orientation analysis, most of that analysis is based on news web pages, which do not truly reflect investor attention and sentiment, so stock market forecasts often show large deviations.
Summary
According to various embodiments of the present application, a microblog data processing method, apparatus, computer device, and storage medium are provided.
A microblog data processing method, comprising:
discovering hot events in real time by monitoring microblog stream data;
performing sentiment orientation analysis on the microblogs containing the hot event;
determining corresponding sentiment index values according to the sentiment orientation analysis; and
generating a time-series sentiment index chart according to the sentiment index values.
A microblog data processing apparatus, comprising:
a discovery module, configured to discover hot events in real time by monitoring microblog stream data;
an analysis module, configured to perform sentiment orientation analysis on the microblogs containing the hot event;
a determining module, configured to determine corresponding sentiment index values according to the sentiment orientation analysis; and
a generating module, configured to generate a time-series sentiment index chart according to the sentiment index values.
A computer device, comprising a memory and one or more processors, the memory storing computer readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
discovering hot events in real time by monitoring microblog stream data;
performing sentiment orientation analysis on the microblogs containing the hot event;
determining corresponding sentiment index values according to the sentiment orientation analysis; and
generating a time-series sentiment index chart according to the sentiment index values.
One or more non-volatile computer readable storage media storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
discovering hot events in real time by monitoring microblog stream data;
performing sentiment orientation analysis on the microblogs containing the hot event;
determining corresponding sentiment index values according to the sentiment orientation analysis; and
generating a time-series sentiment index chart according to the sentiment index values.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the present application will become apparent from the description, the drawings, and the claims.
Brief Description of the Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is an application scenario diagram of a microblog data processing method according to one or more embodiments.
FIG. 2 is a block diagram of the internal structure of a terminal according to one or more embodiments.
FIG. 3 is a block diagram of a server according to one or more embodiments.
FIG. 4 is a flowchart of a microblog data processing method according to one or more embodiments.
FIG. 5 is a flowchart of a method for discovering hot events in real time by monitoring microblog stream data according to one or more embodiments.
FIG. 6 is a flowchart of a method for performing sentiment orientation analysis on microblogs containing a hot event according to one or more embodiments.
FIG. 7 is a flowchart of a method for determining a corresponding sentiment index value according to the sentiment orientation analysis according to one or more embodiments.
FIG. 8 is a block diagram of a microblog data processing apparatus according to one or more embodiments.
FIG. 9 is a block diagram of a discovery module according to one or more embodiments.
FIG. 10 is a block diagram of an analysis module according to one or more embodiments.
FIG. 11 is a block diagram of a determining module according to one or more embodiments.
Detailed Description
To make the objects, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present application and are not intended to limit it.
The microblog data processing method provided by the present application can be applied in the application scenario shown in FIG. 1. The terminal 102 communicates with the server 104 over a network. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet, or portable wearable device, and the server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers. First, the terminal 102 uploads microblog stream data to the server 104. The server 104 then discovers hot events in real time by monitoring the microblog stream data, performs sentiment orientation analysis on the microblogs containing the hot event, determines corresponding sentiment index values according to the sentiment orientation analysis, and generates a time-series sentiment index chart according to the sentiment index values.
In one embodiment, the internal structure of the terminal 102 is as shown in FIG. 2 and includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus. The processor of the terminal provides computing and control capabilities and supports the operation of the entire terminal. The memory of the terminal includes a non-volatile computer readable storage medium and an internal memory. The non-volatile computer readable storage medium of the terminal 102 stores an operating system and computer readable instructions which, when executed, cause the processor to perform a microblog data processing method. The internal memory provides an environment for running the operating system and the computer readable instructions stored in the non-volatile computer readable storage medium. The network interface is used to connect to the network for communication. The display screen of the terminal 102 may be a liquid crystal display, an electronic ink display, or the like. The input device may be a touch layer covering the display screen; a button, trackball, or touchpad provided on the housing of the electronic device; or an external keyboard, touchpad, or mouse. The terminal may be a tablet, a notebook computer, a desktop computer, or the like. Those skilled in the art will understand that the structure shown in FIG. 2 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the terminal to which the solution is applied. A specific terminal may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, the internal structure of the server 104 is as shown in FIG. 3 and includes a processor, a memory, and a network interface connected through a system bus. The processor of the server provides computing and control capabilities and supports the operation of the entire server. The memory of the server includes a non-volatile computer readable storage medium and an internal memory. The non-volatile computer readable storage medium can store an operating system and computer readable instructions which, when executed, cause the processor to perform a microblog data processing method. The network interface of the server is used to communicate with external servers and terminals over a network connection. Those skilled in the art will understand that the structure shown in FIG. 3 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the server to which the solution is applied. A specific server may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
As shown in FIG. 4, in one embodiment, a microblog data processing method is proposed. The method can be applied in a terminal or a server and specifically includes the following steps:
Step 402: discover hot events in real time by monitoring microblog stream data.
In one embodiment, a hot event is an event that currently attracts a high degree of attention or has a large influence. By monitoring the data stream of the microblog platform, current hot topics can be discovered in real time, and the event corresponding to a hot topic is a hot event. Hot events reflect investors' attention and sentiment orientation, so discovering them promptly helps reveal that attention and orientation.
In one embodiment, real-time hot topics on the microblog platform are discovered by a social media real-time hotspot discovery algorithm. Specifically, first, a dynamic microblog stream update algorithm is used to obtain the real-time main microblog stream of the platform. The main microblog stream is the most representative portion sampled from the large volume of real-time microblog data. The dynamic update algorithm is in effect a rule for updating the monitored microblog accounts, ensuring that the stream obtained from those accounts fully reflects the overall microblog stream. For example, the stream may initially be built by collecting posts from 100,000 selected accounts, but over time these accounts may change: some stop updating or are hijacked, or their real-time posts no longer represent the whole real-time stream well. The selected accounts then need to be updated dynamically, for example by removing accounts that have not posted for more than 5 days and adding new active accounts. Second, the main microblog stream is vectorized, and the acceleration of the first-order and second-order word frequencies is calculated. Vectorization means representing the main microblog stream as word vectors; specifically, the word2vec method can be used. word2vec is a tool that converts words into vectors, so that processing text content reduces to vector operations in a vector space. To turn a natural language understanding problem into a machine learning problem, the text must first be expressed mathematically. For example, the word "microphone" may be represented as [0.792, -0.177, -0.107, 0.109, -0.542, ...]. Similarity computed in the vector space can then represent semantic similarity between texts. The first-order word frequency is the frequency with which a single word appears, and the second-order word frequency is the frequency with which two words appear together. Finally, hot topics are discovered through real-time monitoring against preset acceleration thresholds. For example, acceleration thresholds can be set for the first-order and second-order word frequencies; when a threshold is reached, the corresponding topic is determined to be a hot topic, and the corresponding event is a hot event. In addition, shifts of attention within a hot topic can be tracked by following changes in keywords, and by monitoring how the focus of a hot event migrates, the affected stocks related to it can be identified in real time.
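The acceleration test described above can be sketched in Python. This is a minimal illustration, not the applicant's algorithm: the text does not define "acceleration", so the sketch assumes it is the second difference of counts over three consecutive time windows, and the names `window_counts`, `acceleration`, and `hot_terms` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def window_counts(posts):
    """Count words (first order) and co-occurring word pairs (second order).

    `posts` is a list of token lists for one time window. Each word or pair
    is counted once per post in which it appears.
    """
    first, second = Counter(), Counter()
    for tokens in posts:
        unique = sorted(set(tokens))
        first.update(unique)
        second.update(combinations(unique, 2))
    return first, second


def acceleration(prev2, prev1, current):
    """Assumed definition: second difference of counts over three windows."""
    keys = set(prev2) | set(prev1) | set(current)
    # Counter returns 0 for missing keys, so new words are handled naturally.
    return {k: current[k] - 2 * prev1[k] + prev2[k] for k in keys}


def hot_terms(windows, first_threshold, second_threshold):
    """Return the words and word pairs whose acceleration reaches a threshold.

    Only the last three windows are used.
    """
    counts = [window_counts(w) for w in windows[-3:]]
    first_acc = acceleration(counts[0][0], counts[1][0], counts[2][0])
    second_acc = acceleration(counts[0][1], counts[1][1], counts[2][1])
    hot_words = {w for w, a in first_acc.items() if a >= first_threshold}
    hot_pairs = {p for p, a in second_acc.items() if a >= second_threshold}
    return hot_words, hot_pairs
```

Word pairs are stored as alphabetically sorted tuples so that the same pair is counted consistently regardless of word order within a post.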
In another embodiment, first, microblog stream data is acquired in real time and vectorized; second, the vectorized microblog data stream is monitored, and the frequency and number of occurrences of each word are recorded; finally, the current hot event is determined according to the frequency and number of occurrences of each word. In this embodiment, to monitor the microblog stream in real time, the stream data must be converted into a vector representation, so each microblog is converted into vector form once it is acquired. The vectorized microblog data is then monitored, and the frequency and number of occurrences of each word are recorded. When both the frequency and the number of occurrences of a word are relatively high, the corresponding event is determined to be a hot event.
Step 404: perform sentiment orientation analysis on the microblogs containing the hot event.
In one embodiment, to capture investors' sentiment orientation, sentiment orientation analysis is performed on each microblog once the microblogs containing the hot event have been acquired. For sentiment classification, following a recent result from psychological research, human emotions are divided into four categories: happiness, sadness, anger, and surprise. Specifically, to analyze the sentiment orientation of each microblog, a sentiment orientation analysis model must be built. Before the model is built, a training set must first be determined for training it. The training set is a four-class microblog data set with sentiment labels, where the labeling is a hybrid method that combines machine labeling based on a dynamic sentiment lexicon with manual labeling. The advantage of the hybrid method is that it avoids the errors of relying on a single labeling method while also saving manual labeling time. Second, features are extracted from the labeled microblog texts to train the sentiment orientation analysis model. In one embodiment, features are extracted with an LDA model and a word2vec model: the LDA model predicts the topic distribution of each microblog, and the word2vec model produces the word vector representation of each microblog. In another embodiment, principal component analysis may be used for feature extraction instead. Finally, the sentiment orientation analysis model is used to perform the analysis. If the model was trained on the extracted LDA topic features and word2vec word vector features, then after training, the LDA topic feature and the word2vec word vector feature of each newly acquired microblog are used together as the input vector of the model, and the model outputs the sentiment orientation of each microblog. The sentiment orientation likewise falls into four categories: happiness, sadness, anger, and surprise.
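How the two feature types might be combined into one input vector can be sketched as follows. The text does not say how per-word vectors are aggregated into a per-microblog feature, so averaging is an assumption here, and `average_vector` and `build_features` are hypothetical names. The topic distribution and the word vectors are assumed to come from already trained LDA and word2vec models.

```python
def average_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens found in `word_vectors`.

    Returns a zero vector of length `dim` when no token is known.
    Averaging is an assumed aggregation; the source does not specify one.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def build_features(topic_distribution, tokens, word_vectors, dim):
    """Concatenate the LDA topic distribution with the averaged word vector."""
    return list(topic_distribution) + average_vector(tokens, word_vectors, dim)
```

The resulting list can be passed to any four-class classifier standing in for the sentiment orientation analysis model; the choice of classifier is left open by the text.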
Step 406: determine the corresponding sentiment index value according to the sentiment orientation analysis.
In one embodiment, after sentiment orientation analysis has been performed on the microblogs containing the hot event using the sentiment orientation analysis model, the number of microblogs in each sentiment category is counted. The four categories are happiness, sadness, anger, and surprise. The sentiment index value of the hot event is calculated from the counted number of microblogs in each category. The sentiment index includes one or more of the share of happy sentiment, the share of sad sentiment, the share of angry sentiment, the share of surprised sentiment, and the sentiment valence. The sentiment valence is a value reflecting the overall mood of the microblogs and is calculated as:
SentimentValence = log((1 + P) / (1 + N)), where P is the total number of positive-sentiment microblogs and N is the total number of negative-sentiment microblogs. P is the sum of the happy and surprised microblogs, and N is the sum of the sad and angry microblogs. For example, suppose 100,000 microblogs relate to the day's hot event, of which 30,000 are happy, 10,000 are sad, 10,000 are angry, and 50,000 are surprised. The share of happy sentiment is then 0.3, the shares of sad and angry sentiment are each 0.1, and the share of surprised sentiment is 0.5. With P = 80,000 and N = 20,000, the sentiment valence is log(80001/20001), which is approximately log 4.
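The calculation can be checked with a short Python sketch. The function name `sentiment_index` and the English category labels are illustrative, and the natural logarithm is an assumption, since the text does not state the base of the logarithm.

```python
import math

POSITIVE = ("happiness", "surprise")  # categories counted toward P
NEGATIVE = ("sadness", "anger")       # categories counted toward N


def sentiment_index(counts):
    """Return (share of each category, sentiment valence) from category counts."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no microblogs were counted")
    shares = {label: n / total for label, n in counts.items()}
    p = sum(counts.get(label, 0) for label in POSITIVE)
    n = sum(counts.get(label, 0) for label in NEGATIVE)
    # Natural logarithm assumed; the source only writes "log".
    valence = math.log((1 + p) / (1 + n))
    return shares, valence
```

For the counts in the example (30,000 happy, 10,000 sad, 10,000 angry, 50,000 surprised), P is 80,000 and N is 20,000, so the ratio inside the logarithm is 80001/20001, very close to 4.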
Step 408: generate a time-series sentiment index chart according to the sentiment index values.
In one embodiment, after the sentiment index value has been determined through sentiment orientation analysis, it is stored together with the corresponding statistical time point or time period. Once the sentiment index value for each time point over a period has been recorded, a time-series sentiment index chart is generated in chronological order. The chart gives a full and intuitive picture of how investor sentiment fluctuates over that period. In one embodiment, a correlation analysis between the trend of the sentiment index and the trend of the overall stock market identifies the sentiment factor that affects the market. The sentiment factor can then be used, together with other stock market data, as a predictor in a stock market forecasting model. Because investors' attention and sentiment are fully taken into account, this helps improve the accuracy of stock market forecasts.
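A minimal sketch of the storage and correlation steps follows. The text calls for a correlation analysis without naming a measure, so the Pearson coefficient is an assumption, and `build_series` and `pearson` are hypothetical helper names.

```python
import math


def build_series(records):
    """Sort (timestamp, value) records chronologically and split them.

    Timestamps are assumed to sort correctly as given, for example ISO dates.
    """
    ordered = sorted(records)
    return [t for t, _ in ordered], [v for _, v in ordered]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (dev_x * dev_y)
```

The ordered values from `build_series` are what a plotting tool would draw as the sentiment index chart, and `pearson` can compare them with a market index sampled at the same time points.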
在其中一个实施例中,通过监控微博流数据实时发现热点事件,对含有热点事件的微博进行情感倾向分析,进而根据情感倾向分析确定相应的情感指数值,根据所述情感指数值生成时间序列的情感指数走势图。该微博数据处理方法选出了能够反应投资者注意力和情绪的微博热点事件,通过对该微博热点事件进行情感倾向分析,并根据情感指数值生成时间序列的情感指数走势图,该情感指数走势图能够充分直观的反映投资者的注意力和情绪,若应用于股市预测有利于提升股市预测的准确率。In one embodiment, the hot spot event is detected in real time by monitoring the microblog stream data, and the sentiment tendency analysis is performed on the microblog containing the hot event, and then the corresponding sentiment index value is determined according to the sentiment orientation analysis, and the time is generated according to the sentiment index value. The emotional index chart of the sequence. The microblog data processing method selects a microblog hotspot event that can reflect the investor's attention and emotion, analyzes the emotional tendency of the microblog hot event, and generates a time series emotional index chart according to the emotional index value, The emotional index chart can fully and intuitively reflect the investor's attention and emotions. If applied to the stock market forecast, it will help improve the accuracy of the stock market forecast.
As shown in FIG. 5, in one embodiment, step 402 of discovering hot events in real time by monitoring microblog stream data includes:
Step 402A: acquire microblog stream data in real time and vectorize the microblog stream data.
In one embodiment, microblog stream data is acquired in real time so that investor attention and sentiment can be captured promptly. The stream initially arrives as text. Unlike ordinary numerical or categorical data, text is semi-structured, so it must be preprocessed before analysis and expressed as numeric vectors. In one embodiment, the text is first segmented into words before vectorization. After word segmentation, a piece of text can be represented as a multi-dimensional vector over a number of keywords, which facilitates subsequent monitoring of feature words.
Step 402B: monitor the vectorized microblog stream data and record the frequency and count of each feature word.
In one embodiment, a feature word is a word that can represent an event. Not every word in a microblog text needs to be recorded: some common words, such as the auxiliary particles "的" and "地", may appear in any event, so recording them is meaningless. After the microblog stream data is vectorized, the vectorized stream is monitored in real time, and the frequency and count of each feature word are tracked. Frequency is the number of occurrences of the word per unit time. Count is the total number of occurrences of the word. Monitoring frequency reveals hot topics that emerge within a short time, while recording counts shows how much attention a topic receives over a longer period.
Step 402C: determine the current hot event according to the frequency and count of each feature word.
In one embodiment, the frequency and count of each feature word are monitored and recorded, and when the frequency or count of one or more feature words reaches a preset threshold, the event corresponding to those feature words is determined to be a hot event. A frequency threshold and a count threshold can be set separately. In one embodiment, the event corresponding to a feature word is determined to be a hot event when the word's frequency reaches the frequency threshold or its count reaches the count threshold. In another embodiment, the event is determined to be a hot event only when the frequency reaches the frequency threshold and the count also reaches the count threshold. The monitored frequency of one or more feature words reveals hot events within a short time, while the monitored count allows hot topics over a period to be examined, so that the popularity and duration of a topic can be determined. This method helps discover and record hot events promptly.
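Steps 402A to 402C can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The class name `HotEventMonitor`, the sliding-window definition of frequency, the stop-word list, and the assumption that each microblog arrives already segmented into tokens are all hypothetical choices made for illustration. The `require_both` flag switches between the "or" embodiment and the "and" embodiment described above.

```python
from collections import Counter, deque

# Hypothetical stop-word list; a real system would use a much fuller one.
STOP_WORDS = {"的", "地"}


class HotEventMonitor:
    """Tracks, per feature word, a frequency (hits inside a sliding
    time window) and a count (total hits since monitoring began)."""

    def __init__(self, window_seconds, freq_threshold, count_threshold,
                 require_both=False):
        self.window_seconds = window_seconds
        self.freq_threshold = freq_threshold
        self.count_threshold = count_threshold
        self.require_both = require_both
        self.total = Counter()      # total occurrences per word
        self.in_window = Counter()  # occurrences inside the window
        self.recent = deque()       # (timestamp, word), oldest first

    def observe(self, timestamp, tokens):
        """Record one microblog given as a list of segmented tokens."""
        for word in tokens:
            if word in STOP_WORDS:
                continue
            self.total[word] += 1
            self.in_window[word] += 1
            self.recent.append((timestamp, word))
        # Expire occurrences that have left the sliding window.
        cutoff = timestamp - self.window_seconds
        while self.recent and self.recent[0][0] <= cutoff:
            _, old_word = self.recent.popleft()
            self.in_window[old_word] -= 1

    def hot_words(self):
        """Return the feature words that currently meet the thresholds."""
        hot = []
        for word, count in self.total.items():
            freq_hit = self.in_window[word] >= self.freq_threshold
            count_hit = count >= self.count_threshold
            if self.require_both:
                is_hot = freq_hit and count_hit
            else:
                is_hot = freq_hit or count_hit
            if is_hot:
                hot.append(word)
        return sorted(hot)
```

Keeping the windowed tally separate from the running total mirrors the distinction the text draws between frequency, which surfaces short-lived bursts, and count, which tracks sustained attention.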
As shown in FIG. 6, in one embodiment, the step of performing sentiment orientation analysis on the microblogs containing a hot event includes:
Step 404A: extract the LDA topic features and word2vec word vector features of each acquired microblog containing the hot event.
In one embodiment, after a hot event is discovered by monitoring microblog stream data, all microblogs containing the hot event are acquired, and the corresponding LDA topic features and word2vec word vector features are extracted from each of them. An LDA topic model and a word2vec word vector model are trained in advance. Inputting the vectorized text of a microblog into the LDA topic model predicts the topic distribution of that microblog. For example, the probability distribution over the first 250 dimensions is taken as a 250-dimensional feature of the microblog. Inputting the vectorized text of a microblog into the word2vec model yields a vector representation of any word in the microblog. The first 250 dimensions can likewise be selected as features, and through simple same-dimension addition a 500-dimensional vector representation of the microblog is obtained. word2vec is an efficient tool for representing words as real-valued vectors. Drawing on ideas from deep learning, it can be trained to reduce the processing of text content to vector operations in a k-dimensional vector space.
Step 404B: input the LDA topic features and word2vec word vector features into the sentiment orientation analysis model and output the sentiment orientation of each microblog.
In one embodiment, after the LDA topic features and word2vec word vector features of each microblog are obtained, they are input into the sentiment orientation analysis model, which outputs the sentiment orientation of each microblog. The sentiment orientation analysis model can be built with the boosted tree algorithm, which is mainly used for predicting multi-class classification problems. In one embodiment, sentiment orientation is divided into four categories: happiness, sadness, anger, and surprise. The model must be built before it can be used. Specifically, a microblog dataset is annotated with the four sentiment labels and used as the training set. LDA topic features and word2vec word vector features are extracted from each microblog in the training set and fed into an initialized sentiment orientation analysis model for training, which yields the model parameters and thus the final model. The initialized model is built with the boosted tree method. Before training, each feature must be standardized. The standardization formula is ZXt = (Xt - X)/σ, where Xt is the value of a given feature dimension, ZXt is the corresponding standardized value, X is the mean of all values in that feature column, and σ is the standard deviation of all values in that feature column. Standardizing every feature dimension with this formula facilitates the subsequent machine learning that produces the sentiment orientation analysis model. Therefore, when sentiment orientation analysis is performed on each microblog containing a hot event, the LDA topic features and word2vec word vectors of the microblog are extracted as the input vector of the model, which then outputs the corresponding sentiment orientation, for example happy, sad, angry, or surprised.
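The per-feature standardization ZXt = (Xt - X)/σ can be sketched in a few lines of Python. The function name `standardize_columns` is hypothetical. The use of the population standard deviation and the mapping of constant columns to 0.0 are assumptions, since the text specifies neither.

```python
import math


def standardize_columns(rows):
    """Apply ZXt = (Xt - X) / sigma to every feature column.

    `rows` is a non-empty list of equal-length feature vectors,
    one per microblog.
    """
    n = len(rows)
    dims = len(rows[0])
    # X: mean of each feature column.
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    # sigma: population standard deviation of each column (assumption).
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in rows) / n)
        for j in range(dims)
    ]
    return [
        [
            (row[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
            for j in range(dims)
        ]
        for row in rows
    ]
```

In practice the means and standard deviations computed on the training set would normally be stored and reused when standardizing new microblogs, so that training and prediction see features on the same scale.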
As shown in FIG. 7, in one embodiment, the step of determining the corresponding sentiment index value according to the sentiment orientation analysis includes:
Step 406A: classify the microblogs containing the hot event by sentiment category according to the sentiment orientation analysis, and count the number of microblogs in each sentiment category.
In one embodiment, after sentiment analysis has been performed on each microblog containing a hot event, the microblogs are classified by sentiment category and the number of microblogs in each category is counted. There are four sentiment categories: happiness, sadness, anger, and surprise. Counting the microblogs in each category captures investor sentiment and prepares for the subsequent calculation of the sentiment index.
Step 406B: determine the sentiment index value corresponding to the hot event according to the counted number of microblogs in each sentiment category.
In one embodiment, the sentiment index includes one or more of the happiness proportion, the sadness proportion, the anger proportion, the surprise proportion, and the sentiment valence. Sentiment valence reflects the overall mood of the microblogs and is calculated as SentimentValence = log{(1+P)/(1+N)}, where P is the total number of positive microblogs and N is the total number of negative microblogs. P is the sum of the happy and surprised microblogs, and N is the sum of the sad and angry microblogs. A sentiment index value is the value of the corresponding sentiment index. For example, suppose 100,000 microblogs relate to the day's hot events, of which 30,000 are happy, 10,000 are sad, 10,000 are angry, and 50,000 are surprised. The sentiment index values are then: happiness proportion 0.3, sadness proportion 0.1, anger proportion 0.1, surprise proportion 0.5, and, with P = 80,000 and N = 20,000, a sentiment valence of log(80,001/20,001), which is approximately log 4. The calculated sentiment index values are subsequently used for stock market prediction. Note that "hot event" here does not refer to a single hot event but to all hot events within a period.
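The calculation in steps 406A and 406B can be checked with a short Python sketch. The function name `sentiment_indices`, the English category labels, and the choice of the natural logarithm are assumptions, since the text writes only "log" without a base.

```python
import math

# Category grouping as described in the text.
POSITIVE = ("happy", "surprised")
NEGATIVE = ("sad", "angry")


def sentiment_indices(counts):
    """Return (proportions, valence) from per-category microblog counts.

    valence = log((1 + P) / (1 + N)); natural log is an assumption.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no microblogs to analyze")
    proportions = {label: n / total for label, n in counts.items()}
    positive = sum(counts.get(label, 0) for label in POSITIVE)
    negative = sum(counts.get(label, 0) for label in NEGATIVE)
    valence = math.log((1 + positive) / (1 + negative))
    return proportions, valence


# Worked example using the figures from the text.
day_counts = {"happy": 30000, "sad": 10000, "angry": 10000, "surprised": 50000}
day_proportions, day_valence = sentiment_indices(day_counts)
```

The added 1 in the numerator and denominator keeps the valence finite when a day has no positive or no negative microblogs.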
In one embodiment, after the step of generating the time-series sentiment index chart according to the sentiment index values, the method further includes: determining the sentiment factors that affect stock market data according to the sentiment index chart, and using the sentiment factors together with other stock market data as predictors of a stock market prediction model to perform the corresponding stock market prediction.
In one embodiment, after the time-series sentiment index chart is obtained, the sentiment factors that affect stock market data are determined from it. Correlation analysis is performed between the time-series sentiment index chart and the market's daily return, price, and transaction amount, and the sentiment indices that show a correlation are extracted as predictors for the overall market. That is, the sentiment factors affecting stock market data are determined. The sentiment indices include sentiment valence, happiness proportion, sadness proportion, anger proportion, surprise proportion, and so on. Each of these must therefore be analyzed, and the one or more that influence stock market data are selected and used together as the sentiment factors. The determined sentiment factors are then used, along with other traditional stock market factors, as predictors of the stock market prediction model. Other stock market data includes the opening price, closing price, highest price, lowest price, turnover, return, and similar data. Specifically, correlation analysis is first performed between the sentiment indices and the market's daily return, price, and transaction amount. The correlated sentiment indices are extracted as predictive features for the overall market, and a machine learning algorithm is used to train a stock market prediction model that includes the sentiment features. The model is then applied to the calculated sentiment index values (that is, the values of the sentiment indices selected as sentiment factors) and the other acquired stock market data to make the prediction. The prediction may target the return of the overall market or of a single stock, and different prediction targets use different data. Taking investor attention and sentiment into account helps improve the accuracy and reliability of stock market prediction and gives decision makers a reliable basis.
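One simple way to carry out the correlation screening is sketched below in Python. The text does not name a correlation measure, so the Pearson coefficient, the threshold `min_abs_r`, and the function names are assumptions chosen for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def select_sentiment_factors(index_series, market_series, min_abs_r=0.5):
    """Keep the sentiment indices whose correlation with the market
    series is at least `min_abs_r` in absolute value (hypothetical cutoff).
    """
    selected = {}
    for name, series in index_series.items():
        r = pearson(series, market_series)
        if abs(r) >= min_abs_r:
            selected[name] = r
    return selected
```

The same screening would be repeated against each market series of interest (daily return, price, transaction amount), and the surviving indices would be passed to the prediction model alongside the conventional market data.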
In addition, in one embodiment, monitoring microblog stream data can also reveal hot events that have a large social impact and last a long time, and analyzing such high-impact hot events helps with stock selection. A high-impact hot event draws excessive attention from the investing public and causes irrational fluctuations in current prices, which return to fundamental value after some time, so buying opportunities arise in between. Specifically, the event influence of a hot event is first determined. When the event influence exceeds a preset threshold, the stocks or industries whose stock name or industry classification name contains an event keyword are selected to form a candidate stock pool. Stocks with good fundamentals are selected from the candidate pool according to stock information. The degree of influence on the relevant stocks is determined from the event cycle and the market trend and is fed back to a stock selection and timing model for choosing stocks. For example, the US general election caused short-term irrational fluctuations in the prices of gold, oil, and similar assets, and a suitable moment could be chosen to buy at the bottom.
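The candidate-pool step can be sketched as a keyword filter in Python. The record fields `name` and `industry`, the function name `candidate_pool`, and the numeric influence score are hypothetical. The text does not say how event influence is measured, and the later fundamentals screening and timing model are not shown.

```python
def candidate_pool(stocks, event_keywords, influence, influence_threshold):
    """Return the stocks whose name or industry classification name
    contains an event keyword, provided the event influence exceeds
    the preset threshold. Each stock is a dict with "name" and
    "industry" keys (hypothetical schema).
    """
    if influence <= influence_threshold:
        return []
    return [
        stock
        for stock in stocks
        if any(
            keyword in stock["name"] or keyword in stock["industry"]
            for keyword in event_keywords
        )
    ]
```

For instance, an event with keywords such as "gold" or "oil" would pull in every stock whose name or industry label contains either word, and the resulting pool would then be narrowed by fundamentals.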
It should be understood that although the steps in the flowcharts of FIGS. 4 to 7 are displayed in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 4 to 7 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be performed at different times. Nor are they necessarily performed sequentially: they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
As shown in FIG. 8, in one embodiment, a microblog data processing apparatus is provided, the apparatus including:
A discovery module 802, configured to discover hot events in real time by monitoring microblog stream data.
An analysis module 804, configured to perform sentiment orientation analysis on the microblogs containing the hot event.
A determining module 806, configured to determine the corresponding sentiment index value according to the sentiment orientation analysis.
A generating module 808, configured to generate a time-series sentiment index chart according to the sentiment index values.
As shown in FIG. 9, in one embodiment, the discovery module 802 includes:
An acquiring module 802A, configured to acquire microblog stream data in real time and vectorize the microblog stream data.
A recording module 802B, configured to monitor the vectorized microblog stream data and record the frequency and count of each feature word.
A hot event determining module 802C, configured to determine the current hot event according to the frequency and count of each feature word.
As shown in FIG. 10, in one embodiment, the analysis module 804 includes:
An extracting module 804A, configured to extract the LDA topic features and word2vec word vector features of each acquired microblog containing the hot event.
An output module 804B, configured to input the LDA topic features and word2vec word vector features into the sentiment orientation analysis model and output the sentiment orientation of each microblog.
As shown in FIG. 11, in one embodiment, the determining module 806 includes:
A statistics module 806A, configured to classify the microblogs containing the hot event by sentiment category according to the sentiment orientation analysis and count the number of microblogs in each sentiment category.
A sentiment index value determining module 806B, configured to determine the sentiment index value corresponding to the hot event according to the counted number of microblogs in each sentiment category.
In one embodiment, the microblog data processing apparatus further includes a stock market prediction module, configured to determine the sentiment factors that affect stock market data according to the sentiment index chart and to use the sentiment factors together with other stock market data as predictors of a stock market prediction model to perform the corresponding stock market prediction.
Each module of the microblog data processing apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The network interface may be an Ethernet card, a wireless network card, or the like. The modules may be embedded in hardware form in, or be independent of, the processor of the server, or may be stored in software form in the memory of the server so that the processor can invoke them and perform the operations corresponding to each module. The processor may be a central processing unit (CPU), a microprocessor, a single-chip microcomputer, or the like.
The above microblog data processing apparatus may be implemented in the form of computer-readable instructions, which can run on a terminal or server as shown in FIG. 2 or 3.
In one embodiment, a computer device is provided. Its internal structure may correspond to the structure shown in FIG. 2 or 3. That is, the computer device may be either a server or a terminal. It includes a series of computer-readable instructions stored in a memory which, when executed by a processor, implement the microblog data processing method of the embodiments of the present application. The computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and the processor, when executing the computer-readable instructions, implements the following steps: discovering hot events in real time by monitoring microblog stream data; performing sentiment orientation analysis on the microblogs containing the hot event; determining the corresponding sentiment index value according to the sentiment orientation analysis; and generating a time-series sentiment index chart according to the sentiment index values.
In one embodiment, the discovering of hot events in real time by monitoring microblog stream data, as performed by the processor, includes: acquiring microblog stream data in real time and vectorizing the microblog stream data; monitoring the vectorized microblog stream data and recording the frequency and count of each feature word; and determining the current hot event according to the frequency and count of each feature word.
In one embodiment, the performing of sentiment orientation analysis on the microblogs containing the hot event, as performed by the processor, includes: extracting the LDA topic features and word2vec word vector features of each acquired microblog containing the hot event; and inputting the LDA topic features and word2vec word vector features into the sentiment orientation analysis model and outputting the sentiment orientation of each microblog.
In one embodiment, the determining of the corresponding sentiment index value according to the sentiment orientation analysis, as performed by the processor, includes: counting the number of microblogs in each sentiment category; and determining the sentiment index value corresponding to the hot event according to the counted number of microblogs in each sentiment category.
In one embodiment, the processor is further configured to perform the following steps: determining the sentiment factors that affect stock market data according to the sentiment index chart, and using the sentiment factors together with other stock market data as predictors of a stock market prediction model to perform the corresponding stock market prediction.
In one embodiment, one or more non-volatile computer-readable storage media storing computer-executable instructions are provided. When executed by one or more processors, the computer-executable instructions cause the one or more processors to perform the following steps: discovering hot events in real time by monitoring microblog stream data; performing sentiment orientation analysis on the microblogs containing the hot event; determining the corresponding sentiment index value according to the sentiment orientation analysis; and generating a time-series sentiment index chart according to the sentiment index values.
In one embodiment, the discovering of hot events in real time by monitoring microblog stream data, as performed by the processor, includes: acquiring microblog stream data in real time and vectorizing the microblog stream data; monitoring the vectorized microblog stream data and recording the frequency and count of each feature word; and determining the current hot event according to the frequency and count of each feature word.
In one embodiment, the performing of sentiment orientation analysis on the microblogs containing the hot event, as performed by the processor, includes: extracting the LDA topic features and word2vec word vector features of each acquired microblog containing the hot event; and inputting the LDA topic features and word2vec word vector features into the sentiment orientation analysis model and outputting the sentiment orientation of each microblog.
In one embodiment, the determining of the corresponding sentiment index value according to the sentiment orientation analysis, as performed by the processor, includes: counting the number of microblogs in each sentiment category; and determining the sentiment index value corresponding to the hot event according to the counted number of microblogs in each sentiment category.
In one embodiment, the processor is further configured to perform the following steps: determining the sentiment factors that affect stock market data according to the sentiment index chart, and using the sentiment factors together with other stock market data as predictors of a stock market prediction model to perform the corresponding stock market prediction.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a read-only memory (ROM), or a random access memory (RAM), or the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art may make a number of variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be determined by the appended claims.

Claims (20)

  1. A microblog data processing method, comprising:
    discovering hot events in real time by monitoring microblog stream data;
    performing sentiment orientation analysis on the microblogs containing the hot event;
    determining a corresponding sentiment index value according to the sentiment orientation analysis; and
    generating a time-series sentiment index chart according to the sentiment index value.
  2. The method according to claim 1, wherein the discovering hot events in real time by monitoring microblog stream data comprises:
    acquiring microblog stream data in real time and vectorizing the microblog stream data;
    monitoring the vectorized microblog stream data and recording the frequency and count of each feature word; and
    determining the current hot event according to the frequency and count of each feature word.
  3. The method according to claim 1, wherein performing sentiment orientation analysis on the microblog posts containing the hot event comprises:
    extracting an LDA topic feature and a word2vec word vector feature from each acquired microblog post containing the hot event; and
    feeding the LDA topic feature and the word2vec word vector feature into a sentiment orientation analysis model, and outputting the sentiment orientation of each microblog post.
  4. The method according to claim 1, wherein determining a corresponding sentiment index value according to the sentiment orientation analysis comprises:
    classifying the microblog posts containing the hot event by sentiment category according to the sentiment orientation analysis, and separately counting the number of microblog posts in each sentiment category; and
    determining the sentiment index value corresponding to the hot event according to the counted number of microblog posts in each sentiment category.
  5. The method according to claim 1, further comprising, after generating the time-series sentiment index trend chart according to the sentiment index value:
    determining a sentiment factor that affects stock market data according to the sentiment index trend chart; and
    using the sentiment factor together with other stock market data as predictors of a stock market prediction model to make a corresponding stock market prediction.
  6. A microblog data processing device, comprising:
    a discovery module, configured to discover hot events in real time by monitoring microblog stream data;
    an analysis module, configured to perform sentiment orientation analysis on microblog posts containing the hot event;
    a determining module, configured to determine a corresponding sentiment index value according to the sentiment orientation analysis; and
    a generating module, configured to generate a time-series sentiment index trend chart according to the sentiment index value.
  7. The device according to claim 6, wherein the discovery module comprises:
    an acquiring module, configured to acquire microblog stream data in real time and vectorize the microblog stream data;
    a recording module, configured to monitor the vectorized microblog stream data and record the frequency and number of occurrences of each feature word; and
    a hot event determining module, configured to determine the current hot event according to the frequency and number of occurrences of each feature word.
  8. The device according to claim 6, wherein the analysis module comprises:
    an extracting module, configured to extract an LDA topic feature and a word2vec word vector feature from each acquired microblog post containing the hot event; and
    an output module, configured to feed the LDA topic feature and the word2vec word vector feature into a sentiment orientation analysis model and output the sentiment orientation of each microblog post.
  9. The device according to claim 6, wherein the determining module comprises:
    a statistics module, configured to classify the microblog posts containing the hot event by sentiment category according to the sentiment orientation analysis, and to separately count the number of microblog posts in each sentiment category; and
    a sentiment index value determining module, configured to determine the sentiment index value corresponding to the hot event according to the counted number of microblog posts in each sentiment category.
  10. The device according to claim 6, wherein the device further comprises:
    a stock market prediction module, configured to determine a sentiment factor that affects stock market data according to the sentiment index trend chart, and to use the sentiment factor together with other stock market data as predictors of a stock market prediction model to make a corresponding stock market prediction.
  11. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    discovering hot events in real time by monitoring microblog stream data;
    performing sentiment orientation analysis on microblog posts containing the hot event;
    determining a corresponding sentiment index value according to the sentiment orientation analysis; and
    generating a time-series sentiment index trend chart according to the sentiment index value.
  12. The computer device according to claim 11, wherein discovering hot events in real time by monitoring microblog stream data comprises:
    acquiring microblog stream data in real time and vectorizing the microblog stream data;
    monitoring the vectorized microblog stream data and recording the frequency and number of occurrences of each feature word; and
    determining the current hot event according to the frequency and number of occurrences of each feature word.
  13. The computer device according to claim 11, wherein performing sentiment orientation analysis on the microblog posts containing the hot event comprises:
    extracting an LDA topic feature and a word2vec word vector feature from each acquired microblog post containing the hot event; and
    feeding the LDA topic feature and the word2vec word vector feature into a sentiment orientation analysis model, and outputting the sentiment orientation of each microblog post.
  14. The computer device according to claim 11, wherein determining a corresponding sentiment index value according to the sentiment orientation analysis comprises:
    classifying the microblog posts containing the hot event by sentiment category according to the sentiment orientation analysis, and separately counting the number of microblog posts in each sentiment category; and
    determining the sentiment index value corresponding to the hot event according to the counted number of microblog posts in each sentiment category.
  15. The computer device according to claim 11, wherein, after generating the time-series sentiment index trend chart according to the sentiment index value, the one or more processors further perform the following steps:
    determining a sentiment factor that affects stock market data according to the sentiment index trend chart; and
    using the sentiment factor together with other stock market data as predictors of a stock market prediction model to make a corresponding stock market prediction.
  16. One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    discovering hot events in real time by monitoring microblog stream data;
    performing sentiment orientation analysis on microblog posts containing the hot event;
    determining a corresponding sentiment index value according to the sentiment orientation analysis; and
    generating a time-series sentiment index trend chart according to the sentiment index value.
  17. The storage medium according to claim 16, wherein discovering hot events in real time by monitoring microblog stream data comprises:
    acquiring microblog stream data in real time and vectorizing the microblog stream data;
    monitoring the vectorized microblog stream data and recording the frequency and number of occurrences of each feature word; and
    determining the current hot event according to the frequency and number of occurrences of each feature word.
  18. The storage medium according to claim 16, wherein performing sentiment orientation analysis on the microblog posts containing the hot event comprises:
    extracting an LDA topic feature and a word2vec word vector feature from each acquired microblog post containing the hot event; and
    feeding the LDA topic feature and the word2vec word vector feature into a sentiment orientation analysis model, and outputting the sentiment orientation of each microblog post.
  19. The storage medium according to claim 16, wherein determining a corresponding sentiment index value according to the sentiment orientation analysis comprises:
    classifying the microblog posts containing the hot event by sentiment category according to the sentiment orientation analysis, and separately counting the number of microblog posts in each sentiment category; and
    determining the sentiment index value corresponding to the hot event according to the counted number of microblog posts in each sentiment category.
  20. The storage medium according to claim 16, wherein, after generating the time-series sentiment index trend chart according to the sentiment index value, the one or more processors further perform the following steps:
    determining a sentiment factor that affects stock market data according to the sentiment index trend chart; and
    using the sentiment factor together with other stock market data as predictors of a stock market prediction model to make a corresponding stock market prediction.
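
Illustrative sketch (not part of the claims). The steps recited above for hot-event discovery from feature-word counts, the per-category sentiment index, and the time-series index can be sketched in Python roughly as follows. The claims do not fix any formula or threshold, so everything concrete here is an assumption made for illustration: the function names, the thresholds `min_count` and `min_freq`, the label names "positive" and "negative", the index formula (positive minus negative, divided by total), and the hourly bucketing. Tokenization and the sentiment classifier itself (LDA and word2vec features fed to a model) are assumed to have run already.

```python
from collections import Counter, defaultdict
from datetime import datetime


def hot_terms(token_lists, min_count=3, min_freq=0.05):
    """Return feature words whose raw count and relative frequency both
    pass the (assumed) thresholds, sorted alphabetically."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(
        word for word, c in counts.items()
        if c >= min_count and c / total >= min_freq
    )


def sentiment_index(labels):
    """Assumed index: (positive - negative) / total, a value in [-1, 1].
    Returns 0.0 for an empty list."""
    n = len(labels)
    if n == 0:
        return 0.0
    c = Counter(labels)
    return (c["positive"] - c["negative"]) / n


def index_series(labelled_posts):
    """Group (timestamp, label) pairs into hourly buckets (an assumed
    granularity) and return [(hour, index), ...] in time order."""
    buckets = defaultdict(list)
    for ts, label in labelled_posts:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(label)
    return [(hour, sentiment_index(ls)) for hour, ls in sorted(buckets.items())]


if __name__ == "__main__":
    posts = [["stock", "crash"], ["stock", "rally"], ["stock", "crash"], ["weather"]]
    print(hot_terms(posts, min_count=2, min_freq=0.25))
    labelled = [
        (datetime(2017, 4, 7, 9, 5), "positive"),
        (datetime(2017, 4, 7, 9, 40), "negative"),
        (datetime(2017, 4, 7, 10, 1), "positive"),
    ]
    print(index_series(labelled))
```

The list returned by `index_series` is the kind of data a trend chart would plot, with time on one axis and the index value on the other. A different index definition, for example one weighting more than two sentiment categories, could be substituted in `sentiment_index` without changing the rest.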
PCT/CN2018/081697 2017-04-07 2018-04-03 Microblog data processing method and device, computer device and storage medium WO2018184518A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710225681.5 2017-04-07
CN201710225681.5A CN107797983A (en) 2017-04-07 2017-04-07 Microblog data processing method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2018184518A1 true WO2018184518A1 (en) 2018-10-11

Family

ID=61531049

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/081697 WO2018184518A1 (en) 2017-04-07 2018-04-03 Microblog data processing method and device, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN107797983A (en)
WO (1) WO2018184518A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107797983A (en) * 2017-04-07 2018-03-13 平安科技(深圳)有限公司 Microblog data processing method, device, computer equipment and storage medium
CN108595519A (en) * 2018-03-26 2018-09-28 平安科技(深圳)有限公司 Focus incident sorting technique, device and storage medium
CN108733782A (en) * 2018-05-08 2018-11-02 平安科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium of assets trend analysis
CN108629693A (en) * 2018-05-08 2018-10-09 平安科技(深圳)有限公司 Automatically generate method, apparatus, computer equipment and the storage medium of suggestion for investment
CN108959479B (en) * 2018-06-21 2022-03-25 成都睿码科技有限责任公司 Event emotion classification method based on text similarity
CN109344248B (en) * 2018-07-27 2021-10-22 中山大学 Academic topic life cycle analysis method based on scientific and technological literature abstract clustering
CN109213934A (en) * 2018-08-23 2019-01-15 阿里巴巴集团控股有限公司 A kind of processing method of resource, device and equipment
CN110968696B (en) * 2019-11-20 2023-06-06 国元证券股份有限公司 Financial blog text analysis method
CN111047353A (en) * 2019-11-27 2020-04-21 泰康保险集团股份有限公司 Data processing method and system and electronic equipment
CN111159166A (en) * 2019-12-27 2020-05-15 沃民高新科技(北京)股份有限公司 Event prediction method and device, storage medium and processor
CN113190682B (en) * 2021-06-30 2021-09-28 平安科技(深圳)有限公司 Method and device for acquiring event influence degree based on tree model and computer equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104537097A (en) * 2015-01-09 2015-04-22 成都布林特信息技术有限公司 Microblog public opinion monitoring system
CN105589941A (en) * 2015-12-15 2016-05-18 北京百分点信息科技有限公司 Emotional information detection method and apparatus for web text
CN105740228A (en) * 2016-01-25 2016-07-06 云南大学 Internet public opinion analysis method
CN107797983A (en) * 2017-04-07 2018-03-13 平安科技(深圳)有限公司 Microblog data processing method, device, computer equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103116605B (en) * 2013-01-17 2016-02-10 上海交通大学 A kind of microblog hot event real-time detection method based on monitoring subnet and system
CN103500175B (en) * 2013-08-13 2017-09-15 中国人民解放军国防科学技术大学 A kind of method based on sentiment analysis on-line checking microblog hot event
CN104239383A (en) * 2014-06-09 2014-12-24 合肥工业大学 MicroBlog emotion visualization method
CN104598632B (en) * 2015-02-05 2017-12-01 北京航空航天大学 Focus incident detection method and device
CN106250513B (en) * 2016-08-02 2021-04-23 西南石油大学 Event modeling-based event personalized classification method and system

Also Published As

Publication number Publication date
CN107797983A (en) 2018-03-13

Similar Documents

Publication Publication Date Title
WO2018184518A1 (en) Microblog data processing method and device, computer device and storage medium
Albalawi et al. Using topic modeling methods for short-text data: A comparative analysis
Chan et al. Sentiment analysis in financial texts
Safa et al. Automatic detection of depression symptoms in twitter using multimodal analysis
Gupta et al. Prediction of research trends using LDA based topic modeling
CN110941953A (en) Automatic identification method and system for network false comments considering interpretability
CN115795030A (en) Text classification method and device, computer equipment and storage medium
Darwiesh et al. Business intelligence for risk management: A review
Farooqui et al. Sentiment analysis of twitter accounts using natural language processing
Majeed et al. Deep-EmoRU: mining emotions from roman urdu text using deep learning ensemble
Alzazah et al. Predict market movements based on the sentiment of financial video news sites
Addepalli et al. A proposed framework for measuring customer satisfaction and product recommendation for ecommerce
Fehrera et al. Improving decision analytics with deep learning: The case of financial disclosures
Ruposh et al. A computational approach of recognizing emotion from Bengali texts
CN113392920A (en) Method, apparatus, device, medium, and program product for generating cheating prediction model
Trisal et al. K-RCC: A novel approach to reduce the computational complexity of KNN algorithm for detecting human behavior on social networks
CN111859955A (en) Public opinion data analysis model based on deep learning
Yenkikar et al. Sentimlbench: Benchmark evaluation of machine learning algorithms for sentiment analysis
JP6026036B1 (en) DATA ANALYSIS SYSTEM, ITS CONTROL METHOD, PROGRAM, AND RECORDING MEDIUM
Voronov et al. Forecasting popularity of news article by title analyzing with BN-LSTM network
CN115048536A (en) Knowledge graph generation method and device, computer equipment and storage medium
KR102155692B1 (en) Methods for performing sentiment analysis of messages in social network service based on part of speech feature and sentiment analysis apparatus for performing the same
KR102215259B1 (en) Method of analyzing relationships of words or documents by subject and device implementing the same
Harshvardhan et al. Topic modelling Twitterati sentiments using Latent Dirichlet allocation during demonetization
Jardim et al. A Multilingual Lexicon-based Approach for Sentiment Analysis in Social and Cultural Information System Data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18781088

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15/01/2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18781088

Country of ref document: EP

Kind code of ref document: A1