CN113505207B

CN113505207B - Machine reading understanding method and system for financial public opinion research report

Info

Publication number: CN113505207B
Application number: CN202110748656.1A
Authority: CN
Inventors: 成昊; 龚慧敏; 敖翔
Original assignee: Zhongke Suzhou Intelligent Computing Technology Research Institute
Current assignee: Zhongke Suzhou Intelligent Computing Technology Research Institute
Priority date: 2021-07-02
Filing date: 2021-07-02
Publication date: 2024-02-20
Anticipated expiration: 2041-07-02
Also published as: CN113505207A

Abstract

The invention discloses a machine reading comprehension method and system for financial public opinion research reports. The method mainly comprises four steps: data formulation and collection, training data annotation, deep learning model construction, and answer organization. Specifically, a set of user questions is predefined according to the needs of the financial vertical domain, and public opinion data associated with the question set is collected; data matching the questions in the predefined question set is found in the public opinion data through keyword matching, sentences containing answers to the questions are screened from that data using a supervised model, and the data is annotated; vector representations of characters are obtained using a BERT model pre-trained on the financial domain, and the data and the questions then interact through an attention mechanism from natural language processing to obtain a fused vector representation that a computer can process; and more than two answers returned by the deep learning model are logically combined. By using a supervised model built on annotated data, the technical solution improves the accuracy and processing efficiency of machine reading comprehension.

Description

Machine reading understanding method and system for financial public opinion research report

Technical Field

The invention relates to technology by which a computer interprets the semantics of an article and answers related questions, and in particular to a machine reading comprehension method and system for the financial field based on a supervised deep learning algorithm.

Background

Machine reading comprehension (MRC) is a technique that uses algorithms to enable a computer to interpret the semantics of an article and answer related questions. Since both articles and questions are expressed in human language, machine reading comprehension falls within natural language processing (NLP) and is one of its newest topics. In recent years, with the development of machine learning, and deep learning in particular, machine reading comprehension research has advanced greatly and has begun to make its mark in practical applications.

Before 2016, statistical learning methods were mostly used; these required extensive feature engineering, which was time-consuming and labor-intensive. After 2016, following the release of the SQuAD dataset, matching models based on attention mechanisms, such as BiDAF and LSTM, emerged. A variety of models with relatively complex network structures followed, through which related work attempted to capture the matching relationship between questions and passages. After 2018, with the appearance of various pre-trained language models, the performance of reading comprehension models improved substantially again: because the representation layer had become very powerful, task-specific network structures began to become simpler.

In applications of machine reading comprehension technology, there are four common tasks, as follows:

1. Cloze test: given an article C, one word or entity a (a ∈ C) is hidden as the blank to be filled, and the cloze task requires filling the blank with the correct word or entity a by maximizing the conditional probability P(a | C − {a}).

2. Multiple choice: given an article C, a question Q, and a set of candidate answers A, the multiple-choice task selects the correct answer to question Q from the candidate set A by maximizing the conditional probability.

3. Fragment extraction: given an article C (containing n words) and a question Q, the fragment extraction task extracts a contiguous subsequence a of the article as the correct answer to the question by maximizing the conditional probability P(a | C, Q).

4. Free answer: given an article C and a question Q, the correct answer a is not necessarily a subsequence of the article C, i.e. either a ⊆ C or a ⊄ C. The free-answer task predicts the correct answer a to question Q by maximizing the conditional probability P(a | C, Q).
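For the fragment extraction task in item 3, the maximization is often carried out over pairs of start and end positions. The following is a minimal sketch, assuming (this is not stated in the text) that the model outputs independent start and end probabilities per word and that the span score is their product; the function name `best_span` and the `max_len` limit are illustrative.

```python
from typing import List, Tuple


def best_span(start_probs: List[float], end_probs: List[float],
              max_len: int = 30) -> Tuple[int, int, float]:
    """Return (start, end, score) for the span with the highest
    start_probs[start] * end_probs[end], subject to start <= end
    and a span length of at most max_len words."""
    best_start, best_end, best_score = 0, 0, -1.0
    for i, p_start in enumerate(start_probs):
        # Only consider end positions at or after the start position.
        for j in range(i, min(i + max_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best_score:
                best_start, best_end, best_score = i, j, score
    return best_start, best_end, best_score


# Hypothetical probabilities for a six-word article.
words = ["the", "index", "rose", "two", "percent", "today"]
start = [0.05, 0.05, 0.10, 0.60, 0.10, 0.10]
end = [0.05, 0.05, 0.05, 0.10, 0.70, 0.05]
s, e, _ = best_span(start, end)
answer = " ".join(words[s:e + 1])
```

The exhaustive double loop is quadratic in the article length; the `max_len` cap keeps it practical for long articles.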

Free answering is the most difficult of the four tasks and the one that attracts the most interest in industry. Its answer format is very flexible, it tests natural language understanding well, and it is the closest to practical applications; however, datasets for this task are relatively hard to construct, and how to evaluate model performance effectively still requires further study.

As shown in fig. 1, a typical machine reading comprehension system generally includes four modules: embedding encoding, feature extraction, article-question interaction, and answer prediction, as follows:

Embedding encoding: this module converts the article and the question, both input as natural language, into fixed-dimension vectors for subsequent processing by the machine. Earlier work commonly used traditional word representations such as one-hot representations and distributed word vectors; in the last two years, context-based word representations pre-trained on large-scale corpora, such as ELMo, GPT, and BERT, have also been widely used. In addition, to better capture semantic and syntactic information, the word vectors can be combined with linguistic features such as part-of-speech tags, named entities, and question types to obtain a finer-grained representation.

Feature extraction: the word vector representations of the article and the question produced by the embedding encoding layer are passed to the feature extraction module to extract more contextual information. Common neural network models in this module are recurrent neural networks (RNNs), convolutional neural networks (CNNs), and Transformer structures based on multi-head self-attention.

Article-question interaction: the machine can use the interaction information between the article and the question to infer which parts of the article matter most for answering the question. To this end, the article-question interaction module uses a unidirectional or bidirectional attention mechanism to emphasize the parts of the original text most relevant to the question. To mine the relationship between the article and the question further, this interaction may be performed several times, simulating the way a human rereads a passage during reading comprehension.

Answer prediction: this module makes the final answer prediction based on the information accumulated by the preceding three modules. Because common machine reading comprehension tasks are categorized by answer type, the implementation of this module is highly task dependent.

However, the accuracy of existing machine reading comprehension models cannot meet the relatively complex requirements of the financial industry, their response speed cannot meet the requirements of real-time question answering, and they cannot identify unanswerable questions. As a result, the answers given in such cases do not match the question or stray far from it, and have little reference value.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention aims to provide a machine reading comprehension method and system for financial public opinion research reports that address the insufficient accuracy and practicality, and the low efficiency, of machine reading comprehension in the financial field.

The technical solution for achieving the above purpose is a machine reading comprehension method for financial public opinion research reports, characterized by comprising the following steps:

data formulation and collection: predefining a set of user questions according to the needs of the financial vertical domain, and collecting public opinion data associated with the question set;

training data annotation: finding, through keyword matching, data in the public opinion data that matches the questions in the predefined question set, screening sentences containing answers to the questions from that data using a supervised model, and annotating the data;

deep learning model construction: obtaining vector representations of characters using a BERT model pre-trained on the financial domain, and then letting the data and the questions interact through an attention mechanism from natural language processing to obtain a fused vector representation that a computer can process;

answer organization: logically combining more than two answers returned by the deep learning model.

Another technical solution for achieving the above purpose is a machine reading comprehension system for financial public opinion research reports, characterized by comprising:

a data formulation and collection unit, used for predefining a set of user questions according to the needs of the financial vertical domain and collecting public opinion data associated with the question set;

a training data annotation unit, used for finding, through keyword matching, data in the public opinion data that matches the questions in the predefined question set, screening sentences containing answers to the questions from that data using a supervised model, and annotating the data;

a deep learning model construction unit, used for obtaining vector representations of characters using a BERT model pre-trained on the financial domain, and then letting the data and the questions interact through an attention mechanism from natural language processing to obtain a fused vector representation that a computer can process;

an answer organization unit, used for logically combining more than two answers returned by the deep learning model.

Applying the technical solution of the invention yields significant progress: the method and system use a supervised model built on high-quality annotated data, which improves the accuracy of machine reading comprehension; for input data of several thousand words, the processing time is reduced to 500 ms per run; the emphasis is placed on judging whether the collected data contains content that can be used to answer the question; and the effect of expert-rule-based question answering can be achieved at lower cost.

Drawings

Fig. 1 is a schematic diagram of the topology of a typical machine reading comprehension system.

Fig. 2 is a schematic diagram of the main steps of the machine reading comprehension method of the present invention.

Fig. 3 is a detailed flow diagram of the machine reading comprehension method of the present invention.

Detailed Description

The following detailed description of the invention is given with reference to the accompanying drawings, so that the technical scheme of the invention is easier to understand and grasp, and the protection scope of the invention is more clearly defined.

In view of the current state of machine reading comprehension technology and its inability to meet the needs of the financial field, the invention proposes a machine reading comprehension method and system for the financial field based on a supervised deep learning algorithm, so as to address the insufficient accuracy and practicality, and the low efficiency, of machine reading comprehension in the financial field.

The machine reading comprehension method for the financial field is shown in fig. 2 and comprises four main steps: data formulation and collection, training data annotation, deep learning model construction, and answer organization. The detailed flow is shown in fig. 3.

Each step can be understood in outline as follows. Data formulation and collection means that, according to the needs of the financial vertical domain, the questions a user may ask are predefined and divided into key questions and common questions by a screening threshold related to how often each question is asked; meanwhile, a web crawler retrieves public opinion data related to the questions, such as news and research reports.
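The threshold-based screening can be illustrated with a short sketch. It assumes, which the text does not specify, that questions asked at least `threshold` times count as key questions and the rest as common questions; `split_questions` and the sample log are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def split_questions(asked: Iterable[str],
                    threshold: int) -> Tuple[List[str], List[str]]:
    """Split predefined questions into (key, common) by how often each was asked."""
    counts = Counter(asked)
    # Assumption: a question asked at least `threshold` times is a key question.
    key = sorted(q for q, n in counts.items() if n >= threshold)
    common = sorted(q for q, n in counts.items() if n < threshold)
    return key, common


# Hypothetical log of user questions.
question_log = ["market trend"] * 3 + ["fund fees"] + ["bond yields"] * 2
key_questions, common_questions = split_questions(question_log, threshold=2)
```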

Training data annotation refers to finding, through keyword matching, data in the collected public opinion data that is close to a predefined key question, and handing that data over for manual annotation.

Deep learning model construction refers to constructing, for the prepared training data, a model suited to solving the problems described above. Conventional machine learning models do not handle such document data well; a deep learning model with a large number of parameters and a large structure is required. First, vector representations of characters are obtained using a BERT (Bidirectional Encoder Representations from Transformers) model pre-trained on the financial domain; this model handles financial-domain text well, is small, and is efficient. Second, the data and the key questions interact through an attention mechanism from natural language processing to obtain a fused vector representation that a computer can process.

Sentences containing the answers to each key question are then screened from the data, relying on the stability of the (supervised) deep learning model. Note that when a piece of data contains no answer to a key question, the corresponding article is labeled as a zero answer set, "noananswer", that is, data without an answer label; this is the key to identifying unanswerable questions. Because this step strongly affects the deep learning model, the annotation results must also be manually screened to avoid errors.
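The keyword matching and zero-answer labeling described above might look roughly like the sketch below. The record layout (`question`, `context`, `answer`, `answer_start`) is an assumption borrowed from common extractive question-answering datasets, not something the text defines, and the helper names are hypothetical; only the label string is taken from the text.

```python
from typing import Dict, List, Optional

NO_ANSWER = "noananswer"  # zero-answer label, spelled as it appears in the text


def matches_question(article: str, keywords: List[str]) -> bool:
    """Keyword matching: keep an article if it contains any question keyword."""
    return any(kw in article for kw in keywords)


def make_record(question: str, article: str,
                answer_sentence: Optional[str]) -> Dict[str, object]:
    """Build one annotated training record (hypothetical layout)."""
    if answer_sentence and answer_sentence in article:
        return {
            "question": question,
            "context": article,
            "answer": answer_sentence,
            "answer_start": article.index(answer_sentence),
        }
    # No sentence answers the question: label the article as a zero answer set.
    return {"question": question, "context": article,
            "answer": NO_ANSWER, "answer_start": -1}
```

Keeping zero-answer records in the training set is what lets a model learn to decline a question instead of returning an unrelated span.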

Answer organization refers to the process of returning an answer given the constructed public opinion database and the trained deep learning model. The model's task is reading comprehension, so its input takes the form (data, question). This form does not match the way humans intuitively comment or summarize, so an answer organization strategy is needed that logically combines multiple answers. More specifically, the answer organization flow is:

I. select one of several keyword text similarity matching algorithms and use it to recall the top ten pieces of data for a given question;

II. for these ten pieces of data, query each sub-question or keyword of the question one by one through the constructed deep learning model, obtaining the best answer to every sub-question for each piece of data;

III. rank the answers to each sub-question and compare the ranking with the ranking of the recalled data;

IV. take the concatenation of the first two non-empty answers to a sub-question as the part of the final answer corresponding to that sub-question.

Answers organized in this way read more naturally to a human reader.
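Steps I to IV can be sketched as follows. This is one possible reading, not the patented implementation: it assumes the reader model returns an (answer text, confidence) pair with an empty string for "no answer", and it interprets step III as ranking by confidence with ties resolved in recall order. The names `organize_answers`, `similarity`, and `reader` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# (answer text, confidence); an empty string stands for "no answer" (assumption).
Answer = Tuple[str, float]


def organize_answers(
    question: str,
    sub_questions: List[str],
    articles: List[str],
    similarity: Callable[[str, str], float],
    reader: Callable[[str, str], Answer],
    top_k: int = 10,
) -> Dict[str, str]:
    """Return one combined answer string per sub-question."""
    # Step I: recall the top_k articles most similar to the question.
    recalled = sorted(articles, key=lambda a: similarity(question, a),
                      reverse=True)[:top_k]
    final: Dict[str, str] = {}
    for sub in sub_questions:
        # Step II: query the reader model once per recalled article.
        answers = [reader(sub, article) for article in recalled]
        # Step III: rank by confidence; sorted() is stable, so equally
        # confident answers stay in recall order.
        ranked = sorted(answers, key=lambda pair: pair[1], reverse=True)
        # Step IV: concatenate the first two non-empty answers.
        non_empty = [text for text, _ in ranked if text]
        final[sub] = " ".join(non_empty[:2])
    return final
```

Passing `similarity` and `reader` as callables keeps the sketch independent of any particular similarity measure or model.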

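The keyword text similarity matching algorithm can be chosen from several options. It operates on the question word vector $Q \in \mathbb{R}^{k}$ of the question consulted by the user and the set of article word vectors $P \in \mathbb{R}^{d \times k}$ contained in the public opinion data, where d represents the number of articles recalled and k represents the word vector dimension; $P_i$ denotes the word vector of the i-th article.

The optional keyword text similarity matching algorithms include:

1. Euclidean distance:

$$\mathrm{dist}(Q, P_i) = \sqrt{\sum_{j=1}^{k} (Q_j - P_{ij})^2}$$

2. Cosine distance, computed from the cosine of the angle between the two vectors:

$$\cos(Q, P_i) = \frac{Q \cdot P_i}{\lVert Q \rVert \, \lVert P_i \rVert}$$

3. Jaccard similarity coefficient:

$$J(Q, P) = \frac{|Q \cap P|}{|Q \cup P|}$$

where Q represents the original text of the question and P represents the original text of the article, each treated as a set of terms;

4. Pearson correlation coefficient:

$$\rho(Q, P_i) = \frac{\sum_{j=1}^{k} (Q_j - \bar{Q})(P_{ij} - \bar{P}_i)}{\sqrt{\sum_{j=1}^{k} (Q_j - \bar{Q})^2} \, \sqrt{\sum_{j=1}^{k} (P_{ij} - \bar{P}_i)^2}}$$

where $\bar{Q}$ and $\bar{P}_i$ are the means of the components of $Q$ and $P_i$.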
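The four measures named above have standard textbook definitions, sketched here in plain Python. The function names are illustrative, and returning 0.0 for degenerate inputs (zero vectors, empty sets, constant vectors) is a convention chosen for this sketch rather than something the text prescribes.

```python
import math
from typing import Sequence, Set


def euclidean_distance(q: Sequence[float], p: Sequence[float]) -> float:
    """Straight-line distance between two word vectors of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, p)))


def cosine_similarity(q: Sequence[float], p: Sequence[float]) -> float:
    """Cosine of the angle between two word vectors."""
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in p))
    return sum(a * b for a, b in zip(q, p)) / norm if norm else 0.0


def jaccard_similarity(q_terms: Set[str], p_terms: Set[str]) -> float:
    """Size of the intersection over size of the union of two term sets."""
    union = q_terms | p_terms
    return len(q_terms & p_terms) / len(union) if union else 0.0


def pearson_correlation(q: Sequence[float], p: Sequence[float]) -> float:
    """Pearson correlation between the components of two word vectors."""
    mean_q, mean_p = sum(q) / len(q), sum(p) / len(p)
    num = sum((a - mean_q) * (b - mean_p) for a, b in zip(q, p))
    den = (math.sqrt(sum((a - mean_q) ** 2 for a in q))
           * math.sqrt(sum((b - mean_p) ** 2 for b in p)))
    return num / den if den else 0.0
```

Note that Euclidean distance ranks the most relevant article lowest, while the other three measures rank it highest, so the recall step has to sort accordingly.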

Corresponding to the machine reading comprehension method described above, the system is implemented by computer programming. The main body of the system architecture formed by this programming comprises the following four parts. The data formulation and collection unit is used for predefining a set of user questions according to the needs of the financial vertical domain and collecting public opinion data associated with the question set. Through the manual input interface of the computer, the user enters questions related to the financial field into the background database, where they are stored in a formatted manner; a screening threshold can be set for screening key questions and common questions from the predefined question set. Through a network input interface, the unit accesses Internet cloud data, collects the various information and research reports related to the question set, and stores them in a separate database as individual records of varying length.

The training data annotation unit is used for finding, through keyword matching, data in the public opinion data that is close to the key questions in the predefined question set, screening sentences containing answers to the questions from that data using the supervised model, and annotating the data. The mass of data processed by this unit is labeled and classified, providing finer-grained support for the subsequent training of the deep learning model.

The deep learning model construction unit is used for realizing the interaction between data and questions, described as follows:

The former part of this unit obtains text vector representations through a BERT model pre-trained on the financial domain. Input: the question consulted by the user, and the related articles, each of which belongs to the collection of articles. Output: the question word vector representation Q, and the article word vector representations, each of which belongs to the set of article word vectors P.

The process is as follows: the identifiers [CLS] and [SEP] are initialized, and the encoding program flow is then executed.
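The original program listing is not available in this text. As an illustration only, the sketch below shows the conventional way a question and an article are packed into a single BERT input using the [CLS] and [SEP] identifiers, with character-level tokens as the description suggests. The function name, the truncation policy, and the default `max_len` of 512 are assumptions; a real system would use the tokenizer that ships with the pre-trained model.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_bert_input(question: str, article: str,
                     max_len: int = 512) -> Tuple[List[str], List[int]]:
    """Pack a question and an article as [CLS] question [SEP] article [SEP].

    Returns the token list and the segment ids (0 for the question part,
    1 for the article part). Only the article is truncated; a question
    longer than max_len - 3 characters is not handled by this sketch.
    """
    q_tokens = list(question)                    # character-level tokens
    budget = max_len - len(q_tokens) - 3         # room left after 3 identifiers
    a_tokens = list(article)[:max(budget, 0)]    # truncate the article only
    tokens = [CLS] + q_tokens + [SEP] + a_tokens + [SEP]
    segment_ids = [0] * (len(q_tokens) + 2) + [1] * (len(a_tokens) + 1)
    return tokens, segment_ids
```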

The latter part of this unit lets the data and the question interact through the attention mechanism from natural language processing. Input: the hidden-layer output of BERT. Output: the start and end positions, within the article, of the answer to the question.

The process is as follows: the outputs Q and P of the former part are obtained, and the interaction program flow is then executed.
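The original listing for this step is likewise not available. The sketch below shows one generic form of article-to-question attention followed by start and end prediction, in plain Python so that it runs without dependencies. The dot-product scoring, the concatenation used for fusion, and the weight vectors `w_start` and `w_end` are assumptions made for illustration, not the patent's actual architecture.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(article_vecs: List[Vector], question_vecs: List[Vector]) -> List[Vector]:
    """For each article token, attend over the question tokens and
    concatenate the token vector with its question-aware context."""
    fused = []
    for p in article_vecs:
        weights = softmax([dot(p, q) for q in question_vecs])
        context = [sum(w * q[d] for w, q in zip(weights, question_vecs))
                   for d in range(len(p))]
        fused.append(p + context)  # fused vector has twice the dimension
    return fused


def predict_span(fused: List[Vector], w_start: Vector,
                 w_end: Vector) -> Tuple[int, int]:
    """Pick the most probable start, then the most probable end at or after it."""
    start_probs = softmax([dot(f, w_start) for f in fused])
    end_probs = softmax([dot(f, w_end) for f in fused])
    start = max(range(len(fused)), key=start_probs.__getitem__)
    end = max(range(start, len(fused)), key=end_probs.__getitem__)
    return start, end
```

In a trained model the weight vectors would be learned; here they are plain arguments so the data flow from fused vectors to answer positions is visible.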

The answer organization unit is used for logically combining more than two answers returned by the deep learning model; the detailed logical organization process is described above and is omitted here. The result of the answer organization is presented through the computer's external output interface.

A more intuitive, concrete example: in a computer system applying the machine reading comprehension method for financial public opinion research reports, the question "how is the broad market rising or falling" is entered in the question input program. The public opinion data collected through Internet access is large in scale; the ten most relevant pieces of data in the database are recalled through a keyword matching algorithm on keywords such as "broad market", "trend", and "rise and fall"; each of the ten pieces of data is combined with the question and fed to the constructed deep learning model as input for machine reading comprehension, yielding an answer for each piece of data. Finally, the answer organization interface combines the answers to obtain a final answer suitable for human reading.

Similarly, questions such as "financial network security" and "Sci-Tech Innovation Board stock trends" can all be handled by the machine reading comprehension method described above.

In summary, as the embodiments described in detail above show, the machine reading comprehension method and system for financial public opinion research reports provided by the invention have outstanding substantive features and represent significant progress. The method and system use a supervised model built on high-quality annotated data, which improves the accuracy of machine reading comprehension; for input data of several thousand words, the processing time is reduced to 500 ms per run; the emphasis is placed on judging whether the collected data contains content that can be used to answer the question; and the effect of expert-rule-based question answering can be achieved at lower cost.

In addition to the above embodiments, other embodiments of the present invention are possible, and all technical solutions formed by equivalent substitution or equivalent transformation are within the scope of the present invention as claimed.

Claims

1. A machine reading comprehension method for financial public opinion research reports, characterized by comprising the following steps:

answer organization, which logically combines more than two answers returned by the deep learning model, the flow comprising:

I. selecting one of several keyword text similarity matching algorithms for recalling the top ten pieces of data for any question;

II. for the top ten pieces of data, querying all sub-questions or keywords of the corresponding question one by one through the constructed deep learning model, and obtaining the best answer to every sub-question for each piece of data;

III. ranking the answers to the sub-questions, and comparing them with the ranking of the recalled data;

IV. taking the concatenation of the first two non-empty answers to one of the sub-questions as the component of the final answer corresponding to that sub-question.

2. The machine reading comprehension method for financial public opinion research reports according to claim 1, characterized in that: a screening threshold is set in the data formulation and collection, and key questions and common questions are screened from the predefined question set.

3. The machine reading comprehension method for financial public opinion research reports according to claim 1, characterized in that: in the training data annotation, the portion of the data found not to be relevant to the questions in the predefined question set is labeled as a zero answer set.

4. The machine reading comprehension method for financial public opinion research reports according to claim 1 or 3, characterized in that: the training data annotation further includes manual screening of the annotated data.

5. A machine reading comprehension system for financial public opinion research reports, characterized by comprising:

the answer organization unit is used for logically combining more than two answers fed back by the deep learning model, and the process comprises the following steps:

6. The machine reading comprehension system for financial public opinion research reports according to claim 5, characterized in that: a screening threshold is set in the data formulation and collection unit for screening key questions and common questions from the predefined question set.

7. The machine reading comprehension system for financial public opinion research reports according to claim 5, characterized in that: the training data annotation unit further comprises a labeling module for labeling as a zero answer set the portion of the data found not to be relevant to the questions in the predefined question set.