CN109087205B

CN109087205B - Public opinion index prediction method and device, computer equipment and readable storage medium

Info

Publication number: CN109087205B
Application number: CN201810909879.XA
Authority: CN
Inventors: 邓江东; 李磊; 马维英
Original assignee: Beijing ByteDance Network Technology Co Ltd
Current assignee: Beijing ByteDance Network Technology Co Ltd
Priority date: 2018-08-10
Filing date: 2018-08-10
Publication date: 2020-09-18
Anticipated expiration: 2038-08-10
Also published as: CN109087205A

Abstract

The application relates to a public sentiment index prediction method, which comprises the following steps: acquiring stock public opinion information; performing word segmentation on the stock public opinion information to obtain an initial word sequence, wherein the initial word sequence comprises at least one word segmentation word; performing part-of-speech tagging on the initial word sequence to obtain a tagged word sequence and a part-of-speech characteristic sequence corresponding to the tagged word sequence; obtaining a word vector sequence according to the tagged word sequence and the part-of-speech characteristic sequence; and inputting the word vector sequence and the part of speech characteristic sequence into a preset public opinion model to obtain a stock public opinion index. The public opinion index prediction method can assist the user in predicting the price of future stocks, and improves the accuracy of prediction of the user. The application also relates to a public sentiment index prediction device, a computer device and a computer readable storage medium.

Description

Public opinion index prediction method and device, computer equipment and readable storage medium

Technical Field

The present application relates to the field of information processing technologies, and in particular, to a public sentiment index prediction method and apparatus, a computer device, and a computer-readable storage medium.

Background

Nowadays, financial investment has become a common way for ordinary users to manage their money, for example through stock trading. However, most users are retail investors who are not as professional as financial institutions and have no technical means to assist their decision making, so when buying and selling stocks they often rely on subjective judgment of K-line fluctuations.

Financial products currently on the market provide only a K-line graph of the stock price, so when selecting stocks to invest in, a user can only forecast the future price of a stock by analyzing cold historical trading data.

However, predictions made in this way have low accuracy, which exposes the user's investment to great risk.

Disclosure of Invention

In view of the above, it is necessary to provide a public opinion index prediction method and apparatus, a computer device, and a computer readable storage medium, which can assist the user prediction and improve the prediction accuracy, for solving the problem of low accuracy of the user prediction.

A method of predicting a public sentiment index, the method comprising:

acquiring stock public opinion information;

performing word segmentation on the stock public opinion information to obtain an initial word sequence, wherein the initial word sequence comprises at least one word segmentation word;

performing part-of-speech tagging on the initial word sequence to obtain a tagged word sequence and a part-of-speech characteristic sequence corresponding to the tagged word sequence;

obtaining a word vector sequence according to the tagged word sequence and the part-of-speech characteristic sequence;

and inputting the word vector sequence and the part of speech characteristic sequence into a preset public opinion model to obtain a stock public opinion index.

In one embodiment, after the step of inputting the word vector sequence and the part-of-speech feature sequence into a preset public opinion model to obtain a stock public opinion index, the method further comprises the following steps:

acquiring a historical public opinion index corresponding to the stock;

and drawing a stock public opinion K line graph according to the stock public opinion index and the historical public opinion index.

In one embodiment, the step of segmenting the stock public opinion information to obtain an initial word sequence includes:

acquiring financial seed words, and performing near-synonym expansion on the financial seed words to obtain financial keywords, wherein the financial seed words comprise words related to the stocks;

classifying the stock public opinion information according to the financial seed words and the financial key words to obtain a stock category corresponding to each stock public opinion information;

and segmenting the stock public opinion information according to the stock category to obtain the initial word sequence.

In one embodiment, the step of performing part-of-speech tagging on the initial word sequence to obtain a tagged word sequence and a part-of-speech feature sequence corresponding to the tagged word sequence includes:

performing part-of-speech tagging on the initial word sequence to obtain an initial part-of-speech characteristic sequence;

acquiring interference words and part-of-speech characteristics of the interference words, matching the interference words and the part-of-speech characteristics of the interference words with the initial part-of-speech characteristic sequence, and acquiring an interference word sequence corresponding to stock public opinion information containing the interference words and an interference part-of-speech characteristic sequence corresponding to the interference word sequence;

and removing the interference word sequence from the initial word sequence to obtain the tagged word sequence, and removing the interference part-of-speech characteristic sequence from the initial part-of-speech characteristic sequence to obtain the part-of-speech characteristic sequence.

In one embodiment, the step of obtaining a word vector sequence according to the tagged word sequence and the part-of-speech feature sequence includes:

extracting key words in the tagged word sequence according to the part-of-speech characteristic sequence, and removing duplication of the tagged word sequence according to the key words to obtain a standard word sequence;

vectorizing the word segmentation words in the standard word sequence to obtain a word vector sequence.

In one embodiment, the step of inputting the word vector sequence and the part-of-speech feature sequence into a preset public opinion model to obtain a stock public opinion index comprises:

combining the part-of-speech characteristic sequences to obtain sentence level characteristics;

summarizing the sentence level characteristics to obtain chapter level characteristics;

and inputting the word vector sequence and the chapter level characteristics into a preset public opinion model to obtain a stock public opinion index.

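In one embodiment, the step of performing word segmentation on the stock public opinion information to obtain an initial word sequence comprises:

performing word segmentation on the stock public opinion information to obtain a first word sequence;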

performing sequence tagging on the first word sequence to obtain a tagged word sequence;

and performing off-line processing on the tagged word sequence to obtain the initial word sequence.

In one embodiment, before the step of inputting the word vector sequence and the chapter-level features into a preset public opinion model to obtain a stock public opinion index, the method further comprises:

acquiring a public opinion information sample corresponding to each stock;

performing word segmentation on the public opinion information sample to obtain an initial word sequence sample, wherein the initial word sequence sample comprises at least one word segmentation word;

performing part-of-speech tagging on the initial word sequence sample to obtain a part-of-speech characteristic sequence sample and a tagged word sequence sample corresponding to the part-of-speech characteristic sequence sample;

extracting a keyword sample in the tagged word sequence sample according to the part-of-speech feature sequence sample, and removing duplication of the tagged word sequence sample according to the keyword sample to obtain a standard word sequence sample;

vectorizing word segmentation words in the standard word sequence sample to obtain a word vector sequence sample;

and obtaining the public opinion model according to the word vector sequence sample and the part of speech characteristic sequence sample.

A prediction apparatus of public opinion index, the prediction apparatus comprising:

the acquisition module is used for acquiring stock public opinion information;

the word segmentation module is used for segmenting the stock public opinion information to obtain an initial word sequence, and the initial word sequence comprises at least one word segmentation word;

the part-of-speech tagging module is used for performing part-of-speech tagging on the initial word sequence to obtain a tagged word sequence and a part-of-speech feature sequence corresponding to the tagged word sequence;

the vectorization module is used for obtaining a word vector sequence according to the tagged word sequence and the part-of-speech characteristic sequence;

and the scoring module is used for inputting the word vector sequence and the part of speech characteristic sequence into a preset public opinion model to obtain a stock public opinion index.

A computer device comprising a memory storing a computer program and a processor implementing the steps of any of the methods described above when the processor executes the computer program.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any of the above.

The public opinion index prediction method and apparatus, the computer device, and the computer-readable storage medium described above obtain a part-of-speech characteristic sequence and a corresponding tagged word sequence by performing word segmentation and part-of-speech tagging on stock public opinion information. The tagged word sequence is then vectorized to obtain a word vector sequence, and the stock public opinion index is obtained from the word vector sequence and the part-of-speech characteristic sequence. Because the importance of the stock public opinion information is taken into account, the information is quantized and scored, which helps the user analyze the stock market according to the stock public opinion index. This provides the user with more reliable real-time reference data for predicting future stock prices and improves the accuracy of the user's prediction. It can also serve as a reference for professionals and save them time in analyzing related news and public opinion.

Drawings

FIG. 1 is a schematic diagram illustrating an application scenario architecture of a public sentiment index prediction method in an embodiment;

FIG. 2 is a flow chart illustrating a method for predicting a public sentiment index according to an embodiment;

FIG. 3 is a flow chart illustrating a method for predicting a public sentiment index according to another embodiment;

FIG. 4 is a flow diagram illustrating obtaining an initial word sequence according to financial seed words in one embodiment;

FIG. 5 is a flow chart illustrating an embodiment of obtaining a stock public opinion index based on a part-of-speech feature sequence;

FIG. 6 is a flow diagram illustrating an embodiment of obtaining an initial word sequence according to stock public opinion information;

FIG. 7 is a flowchart illustrating a method for obtaining a public sentiment model according to an embodiment;

FIG. 8 is a block diagram of a public sentiment index prediction apparatus according to an embodiment;

FIG. 9 is a diagram of an internal structure of a computer device in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The public opinion index prediction method provided by the application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The server 104 may provide a corresponding user-oriented web platform and may transmit the stock public opinion index to the terminal 102. The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer, or a portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.

Referring to fig. 2, an embodiment of the present application provides a public sentiment index prediction method, which is applied to the server 104 in fig. 1 for illustration. The prediction method specifically comprises the following steps:

S202, obtaining stock public opinion information.

The stock public opinion information includes public opinion information related to stocks. For example, the stock public opinion information includes financial information articles (e.g., stock information articles collected on-site and stock information articles crawled from off-site sources).

Specifically, the server acquires stock public opinion information. Further, the server marks the provenance, that is, the origin or source, of the stock public opinion information (such as financial information articles). According to this provenance, the server can make effective use of, or increase the weight of, stock public opinion information issued by an authoritative source.

S204, performing word segmentation on the stock public opinion information to obtain an initial word sequence, wherein the initial word sequence comprises at least one word segmentation word.

Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a certain standard.

Specifically, the server performs a word segmentation operation on the stock public opinion information to obtain an initial word sequence. There may be one or more initial word sequences, and each word sequence may include one or more segmented words. For example, for the sentence "Company A will complete the merger of Company B on May 20", the server segments the sentence to obtain the initial word sequence "Company A / will / on May 20 / complete / of / Company B / merger". Optionally, the word segmentation operation may be performed by a word segmentation model.
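As a rough illustration of S204, the sketch below segments text by forward maximum matching against a lexicon. This is only one simple segmentation technique and is not stated in the application; the function name `segment`, the lexicon contents, and the Chinese sample sentence are all hypothetical.

```python
def segment(text, lexicon, max_len=6):
    """Forward maximum matching: at each position take the longest lexicon entry."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical lexicon and sentence ("digital currency price rises").
LEXICON = {"数字", "货币", "数字货币", "价格", "上涨"}
initial_word_sequence = segment("数字货币价格上涨", LEXICON)
print("/".join(initial_word_sequence))
```

Because the longest match wins, a lexicon that contains the compound keeps it as one word, while a lexicon without it falls back to the shorter entries.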

S206, performing part-of-speech tagging on the initial word sequence to obtain a tagged word sequence and a part-of-speech characteristic sequence corresponding to the tagged word sequence.

Part-of-speech tagging, or tagging for short, refers to the procedure of labeling each segmented word in the segmentation result with its correct part of speech, that is, determining whether each word is a noun, a verb, an adjective, or another part of speech. The tagged word sequence combines the initial word sequence with its corresponding part-of-speech feature sequence.

Specifically, after obtaining the initial word sequence, the server determines the part of speech of each segmented word through part-of-speech tagging. Analyzing the parts of speech of the segmented words helps the server better capture the article features of a financial information article and subsequently extract the article's keywords.

Optionally, the server mines the part of speech of each segmented word by using a hierarchical bidirectional recurrent neural network to obtain a tagged word sequence and a part-of-speech feature sequence corresponding to the tagged word sequence. The hierarchical bidirectional recurrent neural network can mine deeper part-of-speech features, which helps ensure tagging accuracy, and it constrains the part of speech predicted for the current segmented word by the parts of speech of its context.

S208, obtaining a word vector sequence according to the tagged word sequence and the part-of-speech characteristic sequence. A word vector is a vector of real numbers to which a word is mapped. A word vector sequence is a sequence of one or more word vectors.
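The relationship between the two outputs of S206 can be sketched as follows. A plain dictionary lookup stands in here for the hierarchical bidirectional recurrent neural network, which is not reproduced; `POS_LEXICON`, `pos_tag`, and the tag names are hypothetical.

```python
# Hypothetical part-of-speech lexicon; a real system would use a trained tagger.
POS_LEXICON = {
    "Company A": "noun",
    "will": "adverb",
    "complete": "verb",
    "merger": "noun",
}


def pos_tag(initial_word_sequence, pos_lexicon, default="other"):
    """Return (tagged word sequence, part-of-speech feature sequence)."""
    pos_feature_sequence = [pos_lexicon.get(w, default) for w in initial_word_sequence]
    # The tagged word sequence pairs each word with its part of speech.
    tagged_word_sequence = list(zip(initial_word_sequence, pos_feature_sequence))
    return tagged_word_sequence, pos_feature_sequence


tagged_word_sequence, pos_feature_sequence = pos_tag(
    ["Company A", "will", "complete", "merger"], POS_LEXICON
)
print(tagged_word_sequence)
```

Unlike the lookup, a recurrent tagger would also use the surrounding words to disambiguate the tag of each word.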

Specifically, the server obtains a preset word vector model, and inputs the tagged word sequence and the part-of-speech feature sequence to obtain a word vector sequence. Optionally, word vector tools such as word2vec and GloVe can be used to process the tagged word sequence and the part-of-speech feature sequence to obtain a word vector sequence.
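To make the shape of a word vector sequence concrete, the sketch below maps each word to a fixed-length vector of real numbers. It derives the numbers from a hash purely for illustration; it is not word2vec or GloVe and carries no semantic information. The names `word_vector` and `vectorize` and the dimension 4 are assumptions.

```python
import hashlib


def word_vector(word, dim=4):
    """Map a word to a fixed-length vector of reals in [0, 1) derived from its hash."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return [digest[i] / 256 for i in range(dim)]


def vectorize(words, dim=4):
    """Turn a word sequence into a word vector sequence of the same length."""
    return [word_vector(w, dim) for w in words]


word_vector_sequence = vectorize(["stock", "rises"])
print(len(word_vector_sequence), len(word_vector_sequence[0]))
```

In practice the lookup would be replaced by vectors taken from a trained embedding model, with the same sequence-in, sequence-out interface.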

S210, inputting the word vector sequence and the part-of-speech characteristic sequence into a preset public opinion model to obtain a stock public opinion index.

Specifically, the server inputs both the word vector sequence and the part-of-speech feature sequence into the preset public opinion model. The server collects the part-of-speech features in the part-of-speech feature sequence, builds corresponding part-of-speech feature information, and uses it as a feature set. The feature set comprises feature data such as the number of nouns and the number of adjectives. Analyzing the part-of-speech feature sequence in this way facilitates modeling the article features of financial information articles.

Further, to ensure the accuracy of the result, an odd number of public opinion models are set in the server, and the models vote on the scoring result of a financial information article to determine whether its public opinion result is positive or negative. Because each stock has corresponding financial information articles, the articles corresponding to a stock can be quantized through this voting. Optionally, the quantization may include the server processing data such as the information source and the numbers of likes, views, and comments of a published financial information article, so as to measure the importance of the article. The server sends the importance and the positive or negative public opinion result of the article to the public opinion model, and finally obtains a quantifiable stock public opinion index.

The public opinion index prediction method obtains a part-of-speech feature sequence and a corresponding tagged word sequence by performing word segmentation and part-of-speech tagging on stock public opinion information. The tagged word sequence is then vectorized to obtain a word vector sequence, and the stock public opinion index is obtained from the word vector sequence and the part-of-speech feature sequence. Because the importance of the stock public opinion information is taken into account, the information is quantized and scored, which helps the user analyze the stock market according to the stock public opinion index. This provides the user with more reliable real-time reference data for predicting future stock prices and improves the accuracy of the user's prediction. It can also serve as a reference for professionals and save them time in analyzing related news and public opinion.
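One possible reading of the voting and importance weighting is sketched below. The application does not give formulas, so the log-damped importance score and the importance-weighted mean used as the index are assumptions, as are all function and field names.

```python
import math


def majority_vote(votes):
    """votes: +1 (positive) or -1 (negative), one per model, from an odd number of models."""
    if len(votes) % 2 == 0:
        raise ValueError("use an odd number of models so the vote cannot tie")
    return 1 if sum(votes) > 0 else -1


def importance(source_weight, likes, views, comments):
    # Hypothetical formula: engagement damped by a logarithm, scaled by source authority.
    return source_weight * math.log1p(likes + views + comments)


def opinion_index(articles):
    """Importance-weighted mean polarity, in the range [-1, 1]."""
    total = sum(a["importance"] for a in articles)
    if total == 0:
        return 0.0
    return sum(a["importance"] * a["polarity"] for a in articles) / total


articles = [
    {"importance": 3.0, "polarity": majority_vote([1, 1, -1])},
    {"importance": 1.0, "polarity": majority_vote([-1, -1, 1])},
]
print(opinion_index(articles))
```

An odd number of voters guarantees a strict majority for two-way polarity, which is presumably why the text asks for it.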

Referring to FIG. 3, in one embodiment, the stock public opinion index is combined with a historical public opinion index to draw a stock public opinion K-line graph. Specifically, the following steps are performed after S210:

S212, acquiring a historical public opinion index corresponding to the stock;

S214, drawing a stock public opinion K-line graph according to the stock public opinion index and the historical public opinion index.

Specifically, the server may obtain the historical public opinion index corresponding to the stock from a stock public opinion graph, analyze the historical public opinion information of the stock, and combine it with the stock public opinion index to obtain the current public opinion score of the whole network for the stock. The server then draws the score into a stock public opinion K-line graph for users' investment reference.
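The application does not specify how the K-line graph is built from the index values. A minimal sketch, assuming the conventional open/high/low/close candle per day computed from that day's index readings, might look like this; `opinion_candles` and the sample data are hypothetical.

```python
def opinion_candles(daily_scores):
    """daily_scores: {date: [index values in time order]} -> list of OHLC candles."""
    candles = []
    for day in sorted(daily_scores):
        scores = daily_scores[day]
        if not scores:
            continue  # no readings that day, so no candle
        candles.append({
            "date": day,
            "open": scores[0],
            "high": max(scores),
            "low": min(scores),
            "close": scores[-1],
        })
    return candles


# Historical index values plus the newly computed readings for the current day.
history = {"2018-08-09": [0.4], "2018-08-10": [0.2, 0.5, 0.1, 0.3]}
for candle in opinion_candles(history):
    print(candle)
```

The resulting list can be passed to any candlestick plotting routine.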

Referring to fig. 4, in one embodiment, the specific process of classifying stock public opinion information is involved. In this embodiment, S204 specifically includes the following steps:

S2042, acquiring financial seed words, and performing near-synonym expansion on the financial seed words to obtain financial keywords, wherein the financial seed words comprise words related to the stocks;

S2044, classifying the stock public opinion information according to the financial seed words and the financial keywords to obtain a stock category corresponding to each piece of stock public opinion information; and

S2046, segmenting the stock public opinion information according to the stock categories to obtain the initial word sequence.

Specifically, in S2042, the financial seed words include words directly related to stocks, which may include stocks, households and stakeholders. The server may generate more financial keywords through a near-synonym mining algorithm. For example, financial keywords such as "stock investment" and "concept stock" are obtained from the seed word "stock".

In S2044, the stock category refers to a division of stocks into different classes. Preferably, stocks are classified according to the listing area into five major categories: A shares, B shares, H shares, S shares and N shares. The server performs semantic similarity matching on the related stock public opinion information according to the financial seed words and the financial keywords, thereby classifying the stock public opinion information, obtaining the stock category corresponding to each piece of stock public opinion information, and mapping stocks to the related stock public opinion information. For example, if the A-share category includes the financial keyword "tradable shares", the server matches financial information articles that contain "tradable shares", confirms that such an article belongs to the A-share category, and maps the article into the A-share category.

In S2046, each stock category has its own specialized vocabulary, such as RMB ordinary shares, tradable shares, state-owned enterprise shares, high forward, black swan, and discount rate. Therefore, the server performs word segmentation on the stock public opinion information according to the stock category, which helps ensure the accuracy of the word segmentation. Optionally, the server stores the specialized vocabulary in the word segmentation lexicon in advance for subsequent extraction.
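The expansion and classification steps can be sketched as below. Substring matching stands in for the semantic similarity matching described in the text, and the near-synonym table, the category keywords, and the function names are hypothetical.

```python
# Hypothetical near-synonym table and per-category keywords.
NEAR_SYNONYMS = {"stock": ["stock investment", "concept stock"]}
CATEGORY_KEYWORDS = {
    "A shares": {"Shanghai", "Shenzhen"},
    "H shares": {"Hong Kong"},
}


def expand_seed_words(seed_words, near_synonyms):
    """Seed words plus their near-synonyms form the financial keywords."""
    keywords = set(seed_words)
    for seed in seed_words:
        keywords.update(near_synonyms.get(seed, []))
    return keywords


def classify(article, category_keywords):
    """Pick the category whose keywords occur most often; None if nothing matches."""
    text = article.lower()
    hits = {
        category: sum(keyword.lower() in text for keyword in keywords)
        for category, keywords in category_keywords.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


print(sorted(expand_seed_words(["stock"], NEAR_SYNONYMS)))
print(classify("Shares listed in Hong Kong rose today", CATEGORY_KEYWORDS))
```

A production system would replace the substring test with an embedding-based or otherwise semantic similarity measure.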

One embodiment relates to how interference information is removed. In this embodiment, S206 specifically comprises the following steps:

S2062, performing part-of-speech tagging on the initial word sequence to obtain an initial part-of-speech characteristic sequence;

S2064, obtaining interference words and part-of-speech characteristics of the interference words, matching the interference words and the part-of-speech characteristics of the interference words with the initial part-of-speech characteristic sequence, and obtaining interference word sequences corresponding to stock public opinion information containing the interference words and interference part-of-speech characteristic sequences corresponding to the interference word sequences; and

S2066, removing the interference word sequence from the initial word sequence to obtain the tagged word sequence, and removing the interference part-of-speech characteristic sequence from the initial part-of-speech characteristic sequence to obtain the part-of-speech characteristic sequence.

Specifically, after performing word segmentation and part-of-speech tagging on the stock public opinion information, the server can extract interference words from the initial part-of-speech feature sequence. For example, most rumor articles or false articles contain exaggerated or false adjectives to attract users' attention, so the server treats such adjectives as interference words, which helps the server identify rumor or false stock public opinion information.

Further, the server may also use a dependency parsing technique to parse the structure of the rumor article or the fake article to achieve better recognition of the rumor article or the fake article. After identifying and removing rumor articles or false articles in stock public opinion information, the server obtains a part-of-speech characteristic sequence and a tagged word sequence corresponding to the part-of-speech characteristic sequence.
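A minimal sketch of S2064 and S2066 follows, assuming each article is represented as a list of (word, part of speech) pairs and that an article containing any interference pair is dropped as a whole. The interference list and the function name are hypothetical, and the dependency parsing step is not modeled.

```python
# Hypothetical interference words paired with their part of speech.
INTERFERENCE = {("guaranteed", "adjective"), ("miraculous", "adjective")}


def remove_interference(tagged_sequences, interference):
    """Drop every sequence that contains an interference (word, part of speech) pair."""
    kept = [
        seq for seq in tagged_sequences
        if not any(pair in interference for pair in seq)
    ]
    # The part-of-speech feature sequences follow from the sequences that remain.
    pos_sequences = [[pos for _, pos in seq] for seq in kept]
    return kept, pos_sequences


sequences = [
    [("shares", "noun"), ("rise", "verb")],
    [("guaranteed", "adjective"), ("profit", "noun")],
]
tagged, pos = remove_interference(sequences, INTERFERENCE)
print(tagged, pos)
```

Matching on the pair rather than the word alone keeps a word that is only suspicious when used as, say, an adjective.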

One embodiment relates to de-duplicating stock public opinion information by extracting keywords. In this embodiment, S208 specifically comprises the following steps:

extracting keywords in the tagged word sequence according to the part-of-speech characteristic sequence, and de-duplicating the tagged word sequence according to the keywords to obtain a standard word sequence; and

vectorizing the segmented words in the standard word sequence to obtain a word vector sequence.

Specifically, much of the content of an information article does not help in understanding the meaning of the article as a whole, and the server can summarize that meaning through a few keywords of the article. For example, for an information article reporting the restructuring of company C, the server extracts keywords from the tagged word sequence of the article according to the part-of-speech feature sequence, such as "mixed-ownership reform of central enterprises" and "entry of XXX company", which summarize the main idea of the article well. Further, the server matches the keywords of information articles against each other; if the proportion of segmented words that are identical or semantically similar to the keywords of an article reaches a preset threshold (e.g., above 90%), the server de-duplicates the tagged word sequences of these highly similar articles according to the keywords to obtain a standard word sequence.

Preferably, the server adopts a sequence-to-sequence generation algorithm: the information article is used as input and the keywords corresponding to the article are used as output, and an end-to-end deep learning model is trained to obtain a generation model from information articles to keywords. During keyword extraction, the article input needs to be fused with the information of the part-of-speech feature sequence. Because generated keywords are in most cases combinations of a few parts of speech, using the parts of speech of the segmented words helps avoid generating keywords whose adjacent parts of speech do not fit together.

Furthermore, the segmented word is the smallest component unit of an information article. The server vectorizes the segmented words in the standard word sequence to obtain a word vector sequence. The word vector sequence includes at least one word vector, where a word vector is a fixed-length, continuous, dense vector representing a word. The server pre-trains character and word vectors by using word vector tools such as word2vec and GloVe, and sends the pre-trained word vectors to the public opinion model. The server retrains the pre-trained word vectors during training of the public opinion model, and over multiple iterations overwrites the initial word vectors with the finally obtained word vectors.
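The threshold-based de-duplication can be sketched as follows. Exact set overlap stands in for the "same or similar semantics" comparison, and the overlap measure, the 0.9 default, and the names `keyword_overlap` and `deduplicate` are assumptions.

```python
def keyword_overlap(a, b):
    """Share of the smaller keyword set that also appears in the other set."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def deduplicate(articles, threshold=0.9):
    """articles: list of (tagged_word_sequence, keywords); keep the first of each near-duplicate group."""
    kept = []
    for sequence, keywords in articles:
        # Keep the article only if it is below the threshold against everything kept so far.
        if all(keyword_overlap(keywords, other) < threshold for _, other in kept):
            kept.append((sequence, keywords))
    return [sequence for sequence, _ in kept]


articles = [
    (["article-1"], {"merger", "company C", "reform"}),
    (["article-2"], {"merger", "company C", "reform"}),
    (["article-3"], {"earnings", "dividend"}),
]
print(deduplicate(articles))
```

The comparison is quadratic in the number of kept articles, which is acceptable for a sketch but would need indexing at scale.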

Referring to fig. 5, in one embodiment, the method relates to a specific process of obtaining a stock public opinion index according to a part-of-speech feature sequence. Wherein, S210 specifically includes the following steps:

S2102, combining the part-of-speech feature sequences to obtain sentence-level features;

S2104, summarizing the sentence-level features to obtain chapter-level features;

S2106, inputting the word vector sequence and the chapter-level features into a preset public opinion model to obtain a stock public opinion index.

Specifically, the server may input the obtained part-of-speech feature sequence data into a preset feature model. Through the feature model, the server analyzes the parts of speech of the input segmented words, combines the part-of-speech feature sequences to obtain sentence-level features, and then abstracts and integrates the sentence-level features to obtain chapter-level features. The server then inputs the word vector sequence and the chapter-level features into the preset public opinion model to obtain the stock public opinion index.
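The two-level aggregation can be illustrated with a simple hand-written feature model. Here a sentence-level feature is the proportion of each part of speech in the sentence, and the chapter-level feature is the mean over sentences; the text instead contemplates a learned feature model, so this is only an assumed stand-in, and the tag inventory is hypothetical.

```python
from collections import Counter

POS_TAGS = ("noun", "verb", "adjective")  # hypothetical tag inventory


def sentence_features(pos_sequence):
    """Proportion of each tag in one sentence's part-of-speech sequence."""
    counts = Counter(pos_sequence)
    n = len(pos_sequence) or 1
    return [counts[tag] / n for tag in POS_TAGS]


def chapter_features(sentences):
    """Average the sentence-level features over all sentences of the article."""
    features = [sentence_features(s) for s in sentences]
    if not features:
        return [0.0] * len(POS_TAGS)
    return [sum(column) / len(features) for column in zip(*features)]


article = [["noun", "verb"], ["noun", "noun"]]
print(chapter_features(article))
```

The chapter-level vector would then be passed to the public opinion model together with the word vector sequence.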

Preferably, the server inputs the word segmentation word samples into a machine learning model (such as a deep neural network) to perform data modeling on the financial information article, so as to obtain the feature model.

Referring to FIG. 6, in one embodiment, a specific process for word segmentation is involved. Wherein, S204 further comprises the following steps:

s2041, performing word segmentation on the stock public opinion information to obtain a first word sequence;

s2043, performing sequence tagging on the first word sequence to obtain a tagged word sequence;

and S2045, performing offline processing on the tagged word sequence to obtain the initial word sequence.

Specifically, the server may perform word segmentation on the stock public opinion information according to a preset word segmentation lexicon to obtain a first word sequence. However, it is clear that with the development of the times, new words such as digital currency, block chains, etc. which have appeared recently, often appear due to the diversification of financial vocabularies. Therefore, the server firstly corrects the first word sequence in an online direct prediction mode, for example, the server performs bonding processing on some separated words through a sequence tagging algorithm to form a new word, and obtains a tagged word sequence.

For example, the "digital currency" is a new word, and when the server performs word segmentation on the "digital currency" according to an existing word segmentation word bank, because only two words of "number" and "currency" exist in the existing word segmentation word bank, and the new word of the "digital currency" does not exist, the first word sequence obtained by the server is "number/currency". However, the server can recognize the digital currency as a new word by means of sequence labeling, and the semantic accuracy can be better ensured by judging the digital currency and the context by using the language model, so that the server combines the digital currency into a word to ensure the semantic accuracy of article word segmentation. Therefore, the server corrects the two words of the number and the currency by using a sequence tagging algorithm, and the finally obtained tagged word sequence is the number currency and is stored in the existing word segmentation word bank, so that the word segmentation accuracy and the integrity of the word segmentation word bank are ensured.
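The joining step can be sketched as follows, assuming the sequence tagging algorithm emits a label per word in which "B" begins a word and "I" continues the previous one. The labeling scheme, the function name, and the `joiner` parameter are assumptions; the tagger that produces the labels is not shown.

```python
def merge_by_labels(words, labels, joiner=""):
    """Join each word labeled 'I' onto the word before it; 'B' starts a new word."""
    merged = []
    for word, label in zip(words, labels):
        if label == "I" and merged:
            merged[-1] = merged[-1] + joiner + word
        else:
            merged.append(word)
    return merged


first_word_sequence = ["digital", "currency", "rises"]
labels = ["B", "I", "B"]  # hypothetical output of a sequence tagger
tagged_word_sequence = merge_by_labels(first_word_sequence, labels, joiner=" ")
print(tagged_word_sequence)
```

For Chinese text the default empty `joiner` is appropriate, since words are written without spaces.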

Further, word segmentation by online direct prediction may produce some misjudgments: the language model may identify some words inaccurately, and these are new words bonded together by online direct prediction. The server splits such new words again to prevent semantic errors. The server then judges whether the word segmentation is accurate by means of offline batch prediction; that is, the server evaluates the new words offline by counting word frequency and calculating information entropy and mutual information, confirms through manual verification whether each new word is segmented accurately, and stores the accurately segmented words in the word segmentation lexicon.
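The offline statistics mentioned above can be sketched as follows. This is only a rough illustration under stated assumptions: the application does not give formulas, so pointwise mutual information and right-neighbour entropy are used here as common stand-ins, probabilities are estimated by dividing counts by the total token count, and the function name `candidate_stats` is hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def candidate_stats(corpus: List[List[str]], left: str, right: str) -> Tuple[int, float, float]:
    """Return (frequency, pointwise mutual information, right-neighbour entropy)
    for the candidate new word formed by `left` immediately followed by `right`."""
    unigrams = Counter(tok for sent in corpus for tok in sent)
    total = sum(unigrams.values())
    pair_count = 0
    followers: Counter = Counter()
    for sent in corpus:
        for i in range(len(sent) - 1):
            if sent[i] == left and sent[i + 1] == right:
                pair_count += 1
                nxt = sent[i + 2] if i + 2 < len(sent) else "<eos>"
                followers[nxt] += 1
    if pair_count == 0:
        return 0, float("-inf"), 0.0
    # Rough probability estimates: counts divided by the total number of tokens.
    p_pair = pair_count / total
    p_left = unigrams[left] / total
    p_right = unigrams[right] / total
    pmi = math.log2(p_pair / (p_left * p_right))
    # Entropy of what follows the candidate: varied contexts suggest a free-standing word.
    entropy = -sum((c / pair_count) * math.log2(c / pair_count) for c in followers.values())
    return pair_count, pmi, entropy


corpus = [
    ["数字", "货币", "上涨"],
    ["数字", "货币", "下跌"],
    ["货币", "政策"],
]
freq, pmi, entropy = candidate_stats(corpus, "数字", "货币")
```

A candidate with high frequency, high mutual information and varied neighbouring context is a plausible new word; the manual verification described above would still make the final decision.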

Referring to fig. 7, one embodiment relates to the process of establishing the public opinion model. Specifically, the following steps are performed before S210:

S2091, obtaining public opinion information samples corresponding to each stock;

S2092, performing word segmentation on the public opinion information sample to obtain an initial word sequence sample, wherein the initial word sequence sample comprises at least one word segmentation word;

S2093, performing part-of-speech tagging on the initial word sequence sample to obtain a part-of-speech feature sequence sample and a tagged word sequence sample corresponding to the part-of-speech feature sequence sample;

S2094, extracting a keyword sample in the tagged word sequence sample according to the part-of-speech feature sequence sample, and removing duplication of the tagged word sequence sample according to the keyword sample to obtain a standard word sequence sample;

S2095, vectorizing the word segmentation words in the standard word sequence sample to obtain a word vector sequence sample;

S2096, obtaining the public opinion model according to the word vector sequence sample and the part-of-speech feature sequence sample.

Specifically, in the process of training the public opinion model, the server obtains public opinion information samples corresponding to each stock, where the public opinion information samples include information article samples. The server then segments the public opinion information samples to obtain initial word sequence samples, and performs part-of-speech tagging on the initial word sequence samples to obtain part-of-speech feature sequence samples and the tagged word sequence samples corresponding to them. Next, the server extracts keyword samples from the tagged word sequence samples according to the part-of-speech feature sequence samples, and deduplicates the tagged word sequence samples according to the keyword samples to obtain standard word sequence samples.

It should be noted that most keyword extraction algorithms rely on traditional BM25 or graph-based algorithms. However, in this embodiment it is observed that an information article is largely characterized by a few keywords, and many of these keywords appear in the content of the article itself. Therefore, the server adopts a sequence-to-sequence generation algorithm: the information article samples are used as input and the keyword samples corresponding to the information article samples are used as output, and a generation model from information article samples to keyword samples is finally obtained through end-to-end deep learning training. When generating keyword samples, the part-of-speech feature sequence samples need to be fused with the information article samples as input. In most cases a generated keyword sample is a combination of only a few parts of speech, so using the parts of speech of the segmented words helps avoid keyword samples whose adjacent words have incompatible parts of speech.

The server then vectorizes the word segmentation words in the standard word sequence sample to obtain a word vector sequence sample, and obtains the public opinion model according to the word vector sequence sample and the part-of-speech feature sequence sample.

Further, the server removes rumor or fake information article samples according to the initial part-of-speech feature sequence samples and a dependency parsing technique. Because dependency parsing of an information article sample relies heavily on the part-of-speech information of the article to construct the grammatical structure of a sentence, the server trains the dependency parsing model and the part-of-speech tagging model simultaneously in a co-training manner; that is, the part-of-speech tagging result is used as the input of the dependency parsing model, so that a better dependency parsing model and a better part-of-speech tagging model can be obtained.

It should be understood that although the steps in the flow charts of fig. 2-7 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the order in which the steps are performed is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in fig. 2-7 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 8, there is provided a public opinion index prediction apparatus 800, including: an obtaining module 802, a word segmentation module 804, a part of speech tagging module 806, a vectorization module 808, and a scoring module 810, wherein:

an obtaining module 802, configured to obtain stock public opinion information;

a word segmentation module 804, configured to segment the stock public opinion information to obtain an initial word sequence, where the initial word sequence includes at least one segmented word;

a part-of-speech tagging module 806, configured to perform part-of-speech tagging on the initial word sequence to obtain a tagged word sequence and a part-of-speech feature sequence corresponding to the tagged word sequence;

a vectorization module 808, configured to obtain a word vector sequence according to the tagged word sequence and the part-of-speech feature sequence;

and the scoring module 810 is used for inputting the word vector sequence and the part of speech characteristic sequence into a preset public opinion model to obtain a stock public opinion index.

The public opinion index prediction device performs word segmentation and part-of-speech tagging on the stock public opinion information to obtain the corresponding part-of-speech feature sequence and tagged word sequence. It then vectorizes the tagged word sequence to obtain a word vector sequence, and obtains the stock public opinion index according to the word vector sequence and the part-of-speech feature sequence. Because the importance of the stock public opinion information is taken into account and the information is quantified and scored, the user can analyze the stock market according to the stock public opinion index. This gives the user more reliable real-time reference data for predicting future stock prices and improves the accuracy of the user's prediction; it also provides a reference for professionals and saves them time in analyzing related news and public opinion.
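The way the five modules hand data to one another can be sketched as a simple pipeline. This is a structural illustration only: the class name `OpinionIndexPipeline`, the callable signatures, and the toy stand-in functions are all assumptions made for the example, not the implementation of the device.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Words = List[str]
Tags = List[str]
Vectors = List[List[float]]


@dataclass
class OpinionIndexPipeline:
    """Chains the five modules; each stage is an injected, hypothetical callable."""
    acquire: Callable[[str], str]                # obtaining module 802
    segment: Callable[[str], Words]              # word segmentation module 804
    tag: Callable[[Words], Tuple[Words, Tags]]   # part-of-speech tagging module 806
    vectorize: Callable[[Words, Tags], Vectors]  # vectorization module 808
    score: Callable[[Vectors, Tags], float]      # scoring module 810

    def run(self, stock_code: str) -> float:
        text = self.acquire(stock_code)
        initial_words = self.segment(text)
        tagged_words, pos_features = self.tag(initial_words)
        vectors = self.vectorize(tagged_words, pos_features)
        return self.score(vectors, pos_features)


# Toy stand-ins so the sketch runs; none of them is a real model.
pipeline = OpinionIndexPipeline(
    acquire=lambda code: "profit rises",
    segment=lambda text: text.split(),
    tag=lambda words: (words, ["n" if w == "profit" else "v" for w in words]),
    vectorize=lambda words, tags: [[float(len(w))] for w in words],
    score=lambda vectors, tags: sum(v[0] for v in vectors) / len(vectors),
)
index = pipeline.run("000001")
```

Injecting each stage as a callable mirrors the statement that the modules may be realized in software, hardware, or a combination of the two, since any stage can be swapped without touching the others.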

In one embodiment, the prediction apparatus 800 further comprises:

the first acquisition module is used for acquiring a historical public opinion index corresponding to the stock;

and the second acquisition module is used for drawing a stock public opinion K line graph according to the stock public opinion index and the historical public opinion index.
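One way to prepare data for such a K-line graph is sketched below. This section does not state how the candles are formed, so the open/high/low/close aggregation per day is an assumption borrowed from ordinary price candlesticks, and `to_candles` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

Candle = Dict[str, float]


def to_candles(points: List[Tuple[str, float]]) -> Dict[str, Candle]:
    """Aggregate (day, index) points, already in time order, into one candle per day."""
    candles: Dict[str, Candle] = {}
    for day, value in points:
        if day not in candles:
            candles[day] = {"open": value, "high": value, "low": value, "close": value}
        else:
            candle = candles[day]
            candle["high"] = max(candle["high"], value)
            candle["low"] = min(candle["low"], value)
            candle["close"] = value  # the latest point of the day so far
    return candles


# Historical public opinion indices followed by the newly computed ones.
history = [("2018-08-09", 0.2), ("2018-08-09", 0.6), ("2018-08-09", 0.4)]
current = [("2018-08-10", 0.5), ("2018-08-10", 0.1)]
candles = to_candles(history + current)
```

The resulting per-day records can then be passed to any charting tool to draw the candlesticks.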

In one embodiment, the word segmentation module 804 is further configured to obtain financial seed words, perform near word expansion on the financial seed words, and obtain financial keywords, where the financial seed words include words related to the stock; classifying the stock public opinion information according to the financial seed words and the financial key words to obtain a stock category corresponding to each stock public opinion information; and segmenting the stock public opinion information according to the stock category to obtain the initial word sequence.
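The seed-word expansion and classification can be illustrated with a small sketch. It assumes that "near word" means a word whose vector is close to a seed word's vector; the toy vectors, the 0.9 threshold, the category names, and the whitespace split used in place of real word segmentation are all hypothetical.

```python
import math
from typing import Dict, List, Set

# Hypothetical toy word vectors; a real system would use trained embeddings.
VECTORS: Dict[str, List[float]] = {
    "bank": [1.0, 0.0],
    "loan": [0.9, 0.1],
    "chip": [0.0, 1.0],
    "wafer": [0.1, 0.9],
}


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def expand(seeds: Set[str], threshold: float = 0.9) -> Set[str]:
    """Return the seed words plus every known word close enough to some seed."""
    known_seeds = [s for s in seeds if s in VECTORS]
    near = {
        word
        for word, vec in VECTORS.items()
        if any(cosine(vec, VECTORS[s]) >= threshold for s in known_seeds)
    }
    return seeds | near


def classify(text: str, categories: Dict[str, Set[str]]) -> str:
    """Pick the category whose keywords occur most often in the text."""
    words = text.lower().split()  # placeholder for real word segmentation
    scores = {cat: sum(w in kws for w in words) for cat, kws in categories.items()}
    return max(scores, key=scores.get)


categories = {
    "banking": expand({"bank"}),
    "semiconductor": expand({"chip"}),
}
category = classify("new wafer plant announced", categories)
```

In this toy run the article is assigned to the "semiconductor" category only because of the expanded keyword "wafer", which shows why the expansion step precedes classification.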

In an embodiment, the part-of-speech tagging module 806 is further configured to perform part-of-speech tagging on the initial word sequence to obtain an initial part-of-speech feature sequence; acquire interference words and part-of-speech features of the interference words, match the interference words and their part-of-speech features against the initial part-of-speech feature sequence, and obtain an interference word sequence corresponding to the stock public opinion information containing the interference words and an interference part-of-speech feature sequence corresponding to the interference word sequence; and remove the interference word sequence from the initial word sequence to obtain the tagged word sequence, and remove the interference part-of-speech feature sequence from the initial part-of-speech feature sequence to obtain the part-of-speech feature sequence.
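A token-level reading of this removal step is sketched below. It is one possible interpretation only: interference entries are modelled as (word, part of speech) pairs, and the function name `remove_interference` and the sample data are assumptions.

```python
from typing import List, Set, Tuple


def remove_interference(
    words: List[str],
    pos_tags: List[str],
    interference: Set[Tuple[str, str]],
) -> Tuple[List[str], List[str]]:
    """Drop every (word, part-of-speech) pair that matches an interference entry,
    keeping the word sequence and the part-of-speech sequence aligned."""
    kept = [(w, p) for w, p in zip(words, pos_tags) if (w, p) not in interference]
    tagged_words = [w for w, _ in kept]
    pos_features = [p for _, p in kept]
    return tagged_words, pos_features


initial_words = ["click", "here", "profit", "rises"]
initial_pos = ["v", "adv", "n", "v"]
interference = {("click", "v"), ("here", "adv")}  # hypothetical interference list
tagged_words, pos_features = remove_interference(initial_words, initial_pos, interference)
```

Matching on the pair rather than on the word alone keeps a word that shares its spelling with an interference word but is used with a different part of speech.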

In an embodiment, the vectorization module 808 is further configured to extract a keyword in the tagged word sequence according to the part-of-speech feature sequence, and deduplicate the tagged word sequence according to the keyword to obtain a standard word sequence; vectorizing the word segmentation words in the standard word sequence to obtain a word vector sequence.

In one embodiment, the scoring module 810 is further configured to combine the part-of-speech feature sequences to obtain sentence-level features; summarize the sentence-level features to obtain chapter-level features; and input the word vector sequence and the chapter-level features into a preset public opinion model to obtain a stock public opinion index.
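The two aggregation levels can be sketched as follows. The application does not say how the features are combined, so this example assumes a sentence-level feature is the proportion of each part of speech in the sentence and a chapter-level feature is the average over sentences; the tag inventory `POS_TAGS` is hypothetical.

```python
from collections import Counter
from typing import List

POS_TAGS = ["n", "v", "adj"]  # hypothetical tag inventory


def sentence_features(pos_sequence: List[str]) -> List[float]:
    """Proportion of each part of speech within one sentence."""
    counts = Counter(pos_sequence)
    length = len(pos_sequence) or 1  # avoid division by zero for empty sentences
    return [counts[tag] / length for tag in POS_TAGS]


def chapter_features(sentences: List[List[str]]) -> List[float]:
    """Average the sentence-level features over the whole article."""
    rows = [sentence_features(s) for s in sentences]
    return [sum(column) / len(rows) for column in zip(*rows)]


article_pos = [["n", "v"], ["n", "n", "adj", "v"]]
chapter = chapter_features(article_pos)
```

The chapter-level vector has a fixed length regardless of article length, which makes it convenient to feed into a model alongside the word vector sequence.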

In one embodiment, the word segmentation module 804 is further configured to segment the stock public opinion information to obtain a first word sequence; performing sequence tagging on the first word sequence to obtain a tagged word sequence; and performing off-line processing on the tagged word sequence to obtain the initial word sequence.

For specific limitations of the public opinion index prediction device, reference may be made to the above limitations of the public opinion index prediction method, which are not repeated here. All or part of the modules in the public opinion index prediction device can be implemented by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor of the computer device in hardware form, or can be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used for storing relevant data generated by predicting the stock public opinion index. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to realize a public sentiment index prediction method.

Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

acquiring stock public opinion information;

The computer device performs word segmentation and part-of-speech tagging on the stock public opinion information to obtain the corresponding part-of-speech feature sequence and tagged word sequence. It then vectorizes the tagged word sequence to obtain a word vector sequence, and obtains the stock public opinion index according to the word vector sequence and the part-of-speech feature sequence. Because the importance of the stock public opinion information is taken into account and the information is quantified and scored, the user can analyze the stock market according to the stock public opinion index. This gives the user more reliable real-time reference data for predicting future stock prices and improves the accuracy of the user's prediction; it also provides a reference for professionals and saves them time in analyzing related news and public opinion.

In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring a historical public opinion index corresponding to the stock; and drawing a stock public opinion K line graph according to the stock public opinion index and the historical public opinion index.

In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring financial seed words, and performing near word expansion on the financial seed words to obtain financial keywords, wherein the financial seed words comprise words related to the stocks; classifying the stock public opinion information according to the financial seed words and the financial key words to obtain a stock category corresponding to each stock public opinion information; and segmenting the stock public opinion information according to the stock category to obtain the initial word sequence.

In one embodiment, the processor, when executing the computer program, further performs the steps of: performing part-of-speech tagging on the initial word sequence to obtain an initial part-of-speech feature sequence; acquiring interference words and part-of-speech features of the interference words, matching the interference words and their part-of-speech features against the initial part-of-speech feature sequence, and obtaining an interference word sequence corresponding to the stock public opinion information containing the interference words and an interference part-of-speech feature sequence corresponding to the interference word sequence; and removing the interference word sequence from the initial word sequence to obtain the tagged word sequence, and removing the interference part-of-speech feature sequence from the initial part-of-speech feature sequence to obtain the part-of-speech feature sequence.

In one embodiment, the processor, when executing the computer program, further performs the steps of: extracting key words in the tagged word sequence according to the part-of-speech characteristic sequence, and removing duplication of the tagged word sequence according to the key words to obtain a standard word sequence; vectorizing the word segmentation words in the standard word sequence to obtain a word vector sequence.

In one embodiment, the processor, when executing the computer program, further performs the steps of: combining the part-of-speech feature sequences to obtain sentence-level features; summarizing the sentence-level features to obtain chapter-level features; and inputting the word vector sequence and the chapter-level features into a preset public opinion model to obtain a stock public opinion index.

In one embodiment, the processor, when executing the computer program, further performs the steps of: performing word segmentation on the stock public opinion information to obtain a first word sequence; performing sequence tagging on the first word sequence to obtain a tagged word sequence; and performing off-line processing on the tagged word sequence to obtain the initial word sequence.

In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring a public opinion information sample corresponding to each stock; performing word segmentation on the public opinion information sample to obtain an initial word sequence sample, wherein the initial word sequence sample comprises at least one word segmentation word; performing part-of-speech tagging on the initial word sequence sample to obtain a part-of-speech characteristic sequence sample and a tagged word sequence sample corresponding to the part-of-speech characteristic sequence sample; extracting a keyword sample in the tagged word sequence sample according to the part-of-speech feature sequence sample, and removing duplication of the tagged word sequence sample according to the keyword sample to obtain a standard word sequence sample; vectorizing word segmentation words in the standard word sequence sample to obtain a word vector sequence sample; and obtaining the public opinion model according to the word vector sequence sample and the part of speech characteristic sequence sample.

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:

acquiring stock public opinion information;

With the computer-readable storage medium, word segmentation and part-of-speech tagging are performed on the stock public opinion information to obtain the corresponding part-of-speech feature sequence and tagged word sequence. The tagged word sequence is then vectorized to obtain a word vector sequence, and the stock public opinion index is obtained according to the word vector sequence and the part-of-speech feature sequence. Because the importance of the stock public opinion information is taken into account and the information is quantified and scored, the user can analyze the stock market according to the stock public opinion index. This gives the user more reliable real-time reference data for predicting future stock prices and improves the accuracy of the user's prediction; it also provides a reference for professionals and saves them time in analyzing related news and public opinion.

In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring a historical public opinion index corresponding to the stock; and drawing a stock public opinion K line graph according to the stock public opinion index and the historical public opinion index.

In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring financial seed words, and performing near word expansion on the financial seed words to obtain financial keywords, wherein the financial seed words comprise words related to the stocks; classifying the stock public opinion information according to the financial seed words and the financial key words to obtain a stock category corresponding to each stock public opinion information; and segmenting the stock public opinion information according to the stock category to obtain the initial word sequence.

In one embodiment, the computer program when executed by the processor further performs the steps of: performing part-of-speech tagging on the initial word sequence to obtain an initial part-of-speech feature sequence; acquiring interference words and part-of-speech features of the interference words, matching the interference words and their part-of-speech features against the initial part-of-speech feature sequence, and obtaining an interference word sequence corresponding to the stock public opinion information containing the interference words and an interference part-of-speech feature sequence corresponding to the interference word sequence; and removing the interference word sequence from the initial word sequence to obtain the tagged word sequence, and removing the interference part-of-speech feature sequence from the initial part-of-speech feature sequence to obtain the part-of-speech feature sequence.

In one embodiment, the computer program when executed by the processor further performs the steps of: extracting key words in the tagged word sequence according to the part-of-speech characteristic sequence, and removing duplication of the tagged word sequence according to the key words to obtain a standard word sequence; vectorizing the word segmentation words in the standard word sequence to obtain a word vector sequence.

In one embodiment, the computer program when executed by the processor further performs the steps of: combining the part-of-speech feature sequences to obtain sentence-level features; summarizing the sentence-level features to obtain chapter-level features; and inputting the word vector sequence and the chapter-level features into a preset public opinion model to obtain a stock public opinion index.

In one embodiment, the computer program when executed by the processor further performs the steps of: performing word segmentation on the stock public opinion information to obtain a first word sequence; performing sequence tagging on the first word sequence to obtain a tagged word sequence; and performing off-line processing on the tagged word sequence to obtain the initial word sequence.

In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring a public opinion information sample corresponding to each stock; performing word segmentation on the public opinion information sample to obtain an initial word sequence sample, wherein the initial word sequence sample comprises at least one word segmentation word; performing part-of-speech tagging on the initial word sequence sample to obtain a part-of-speech characteristic sequence sample and a tagged word sequence sample corresponding to the part-of-speech characteristic sequence sample; extracting a keyword sample in the tagged word sequence sample according to the part-of-speech feature sequence sample, and removing duplication of the tagged word sequence sample according to the keyword sample to obtain a standard word sequence sample; vectorizing word segmentation words in the standard word sequence sample to obtain a word vector sequence sample; and obtaining the public opinion model according to the word vector sequence sample and the part of speech characteristic sequence sample.

It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered to fall within the scope of this specification.

The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A public opinion index prediction method is characterized by comprising the following steps:

acquiring stock public opinion information;

inputting the word vector sequence and the part of speech characteristic sequence into a preset public opinion model to obtain a stock public opinion index;

the method for inputting the word vector sequence and the part of speech characteristic sequence into a preset public opinion model to obtain a stock public opinion index comprises the following steps:

setting an odd number of public opinion models to vote on scoring results of the financial information articles to determine positive and negative public opinion results of the financial information articles;

processing data of information sources, praise numbers, browsing numbers or comment numbers of published financial information articles so as to measure the importance degree of the financial information articles; and sending the importance degree of the information article and the positive and negative public opinion results to a public opinion model to obtain a quantifiable stock public opinion index.
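The voting and importance weighting recited above can be illustrated with a minimal sketch. The claim does not give formulas, so the log-damped engagement weight and the importance-weighted average that stands in for the final public opinion model are assumptions, as are the names `Article`, `majority_vote`, `importance` and `opinion_index`.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Article:
    votes: List[int]      # +1 (positive) or -1 (negative), one per model
    source_weight: float  # hypothetical credibility weight of the information source
    likes: int
    views: int
    comments: int


def majority_vote(votes: List[int]) -> int:
    """Return +1 or -1; an odd number of voters guarantees there is no tie."""
    if len(votes) % 2 == 0:
        raise ValueError("use an odd number of models so the vote cannot tie")
    return 1 if sum(votes) > 0 else -1


def importance(article: Article) -> float:
    """Assumed weighting: source weight times log-damped engagement."""
    engagement = article.likes + article.views + article.comments
    return article.source_weight * math.log1p(engagement)


def opinion_index(articles: List[Article]) -> float:
    """Importance-weighted average polarity in [-1, 1] (stand-in for the final model)."""
    total = sum(importance(a) for a in articles)
    if total == 0:
        return 0.0
    return sum(importance(a) * majority_vote(a.votes) for a in articles) / total


articles = [
    Article(votes=[1, 1, -1], source_weight=1.0, likes=10, views=100, comments=5),
    Article(votes=[-1, -1, -1], source_weight=0.5, likes=1, views=10, comments=0),
]
index = opinion_index(articles)
```

In this toy run the widely read positive article outweighs the less prominent negative one, so the index comes out positive.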

2. The public opinion index prediction method as set forth in claim 1, wherein the step of inputting the word vector sequence and the part-of-speech feature sequence into a preset public opinion model to obtain a stock public opinion index is followed by the steps of:

acquiring a historical public opinion index corresponding to the stock;

3. The public opinion index prediction method as set forth in claim 1, wherein the step of segmenting the stock public opinion information to obtain an initial word sequence comprises:

4. The method for predicting the public opinion index according to claim 1, wherein the step of performing part-of-speech tagging on the initial word sequence to obtain a tagged word sequence and a part-of-speech feature sequence corresponding to the tagged word sequence comprises:

5. The method for predicting a public opinion index according to claim 1, wherein the step of obtaining a word vector sequence according to the tagged word sequence and the part-of-speech feature sequence comprises:

6. The public opinion index prediction method as set forth in claim 1, wherein the step of inputting the word vector sequence and the part-of-speech feature sequence into a preset public opinion model to obtain a stock public opinion index comprises:

7. The public opinion index prediction method as set forth in claim 1, wherein the step of segmenting the stock public opinion information to obtain an initial word sequence comprises:

8. The method for predicting a public opinion index according to claim 1, wherein the step of inputting the word vector sequence and the chapter-level features into a preset public opinion model to obtain a stock public opinion index comprises:

acquiring a public opinion information sample corresponding to each stock;

9. A public opinion index prediction apparatus, comprising:

the acquisition module is used for acquiring stock public opinion information;

the scoring module is used for inputting the word vector sequence and the part of speech characteristic sequence into a preset public opinion model to obtain a stock public opinion index;

the scoring module is used for setting an odd number of public opinion models to vote on scoring results of the financial information articles so as to determine positive and negative public opinion results of the financial information articles; processing data of information sources, praise numbers, browsing numbers or comment numbers of published financial information articles so as to measure the importance degree of the financial information articles; and sending the importance degree of the information article and the positive and negative public opinion results to a public opinion model to obtain a quantifiable stock public opinion index.

10. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program.

11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.