CN115618875A - Public opinion scoring method, system and storage medium based on named entity recognition

Info

Publication number: CN115618875A
Application number: CN202211327706.XA
Authority: CN
Inventors: 关雨欣
Original assignee: Dalian Branch China Construction Bank Co ltd
Current assignee: Dalian Branch China Construction Bank Co ltd
Priority date: 2022-10-27
Filing date: 2022-10-27
Publication date: 2023-01-17

Abstract

The invention provides a public opinion scoring method, system and storage medium based on named entity recognition. The method comprises the following steps: acquiring text data in the financial field and processing the acquired data to form a training set; constructing a named entity recognition model and training the model; tuning the model according to evaluation indexes to obtain an optimal model, performing model prediction based on the optimal model, and obtaining the entities corresponding to each risk event from the financial-field text data; and formulating a public opinion scoring rule, determining the events related to public opinion scoring, determining event weights according to event severity, and finally calculating a comprehensive score for each entity. The technical scheme can quantitatively assess public opinion about a specific organization within a defined field. It addresses the problems that public opinion scoring is difficult to carry out manually and comprehensively in financial risk assessment and that the risk of a supervised organization is difficult to quantify, and it supports the analysis of market trends and market conditions as well as operational decision-making.

Description

Public opinion scoring method, system and storage medium based on named entity recognition

Technical Field

The invention relates to the technical field of data analysis and natural language processing, in particular to a public opinion scoring method and system based on named entity recognition and a storage medium.

Background

Named entity recognition is an important part of natural language processing in artificial intelligence. It refers to extracting the required entities from unstructured input text, where an entity is generally a span of text with a specific meaning or strong referential value, such as a person name, a place name, or, as listed in the invention, the subject of a loss event or the subject of an asset abnormality event.

Public opinion data in the financial field contains many domain-specific professional terms, and the comments and news about the financial field found on the network are mostly long texts. Traditional named entity recognition mainly applies convolutional neural networks and similar models, which cannot combine context information well enough to understand the semantics, particularly for chapter-level structures.

Financial public opinion information on the network is abundant and widely scattered, so it is difficult to obtain the required public opinion data through manual analysis and collation. Moreover, a one-sided search cannot yield comprehensive information, and the assessed risk is difficult to quantify.

Disclosure of Invention

In view of the technical problems that public opinion scoring is difficult to carry out manually and comprehensively in financial risk assessment and that the risk of a supervised organization is difficult to quantify, a public opinion scoring method, system and storage medium based on named entity recognition are provided. The invention can quantitatively assess public opinion about a specific organization within a defined field.

The technical means adopted by the invention are as follows:

A public opinion scoring method based on named entity recognition comprises the following steps:

acquiring text data in the financial field, and processing the acquired data to form a training set;

constructing a named entity recognition model, and training the model;

optimizing the model according to the evaluation index to obtain an optimal model, performing model prediction based on the obtained optimal model, and obtaining entities corresponding to each risk event from the text data in the financial field;

and formulating a public opinion scoring rule, determining events related to public opinion scoring, determining event weights according to event severity, and finally calculating the comprehensive score of each entity.

Further, the acquiring financial field text data and processing the acquired data to form a training set includes:

acquiring relevant texts in the financial field from various channels through a webpage crawler technology, wherein the texts comprise news, comments, forums and bulletins relevant to the financial field;

and removing HTML tags from the crawled financial-field data, uniformly converting the text to UTF-8 encoding for analysis, cleaning the data to remove punctuation, and manually labeling the processed data for use as a training set.

Further, the establishing a named entity recognition model, training the model, optimizing the model according to the evaluation index, and obtaining an optimal model includes:

inputting the training set into a named entity recognition model, and pre-training text data in the financial field by BERT to obtain a word granularity vector matrix;

inputting the word granularity vector matrix into a BiLSTM + GCN layer for feature extraction;

and inputting the extracted features into GlobalPointer for decoding to obtain an optimal sequence.

Further, the inputting the training set into the named entity recognition model, and the BERT pre-training the text data in the financial field to obtain a word granularity vector matrix includes:

the pre-trained model is fine-tuned using whole word masking. Chinese expresses meaning in words, whereas the named entity recognition model segments text at character granularity; masking every character of a word together makes the model recover the whole word in the masked language model pre-training task, so that the meaning of the word is retained more completely.

Further, the inputting the word granularity vector matrix into the BiLSTM + GCN layer for feature extraction to obtain the category feature of each character includes:

performing bidirectional encoding on the input text by using a BiLSTM layer;

and using the GCN layer to add context-related semantic information for long text prior to decoding.

Further, the inputting the extracted features into GlobalPointer for decoding to obtain an optimal sequence includes:

inputting the semantic vector containing the context information into GlobalPointer for decoding, and outputting the label sequence with the maximum probability so as to obtain the category of each character.

Further, the model is adjusted and optimized according to the evaluation indexes to obtain an optimal model, model prediction is carried out based on the obtained optimal model, and entities corresponding to the risk events are obtained from the text data in the financial field, wherein the evaluation indexes comprise accuracy and recall rate.
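The evaluation indexes named above can be illustrated with a minimal sketch. It assumes the indexes are computed at entity level over exact-match (start, end, type) tuples and that "accuracy" is read as entity-level precision; the source does not specify either point, and the function name and sample labels are hypothetical.

```python
def precision_recall(predicted, gold):
    """Entity-level precision and recall over sets of (start, end, type) tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)  # exact span and type match
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical labelled and predicted entities for one document.
gold = {(0, 3, "loss"), (10, 14, "bankruptcy"), (20, 23, "asset_abnormal")}
predicted = {(0, 3, "loss"), (10, 14, "asset_abnormal")}
precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

A prediction with the right span but the wrong risk type counts as an error under this reading, which matters here because the score of an entity depends on the event type assigned to it.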

Further, the making of the public opinion scoring rule, the determination of events related to the public opinion scoring, the determination of event weight according to event severity and the final settlement of entity comprehensive score comprise:

determining events related to public opinion scoring, and determining event weight scores according to the severity of the events;

and calculating the entity comprehensive score, wherein the calculation formula is as follows:

$$Y = b_1 x_1 + b_2 x_2 + \cdots + b_i x_i$$

wherein i indexes the defined risk events, $b_i$ represents the number of occurrences of event i, $x_i$ represents the score of event i, and Y represents the entity score.

Based on the above public opinion scoring method, the invention also provides a public opinion scoring system based on named entity recognition, which comprises:

the data processing module is used for acquiring text data in the financial field and processing the acquired data to form a training set;

the named entity recognition model construction training module is used for constructing a named entity recognition model and training the model;

the named entity recognition model tuning module is used for tuning the model according to the evaluation index, obtaining an optimal model, conducting model prediction based on the obtained optimal model, and obtaining entities corresponding to each risk event from the text data in the financial field;

and the public opinion scoring module is used for making a public opinion scoring rule, determining public opinion scoring events, determining event weight according to event severity and finally settling entity comprehensive scores.

A storage medium, the storage medium comprising a stored program, wherein the program is executed to perform the above public opinion scoring method based on named entity recognition.

Compared with the prior art, the invention has the following advantages:

1. The public opinion scoring method based on named entity recognition provided by the invention can quantitatively assess public opinion about a specific organization within a defined field, and addresses the problems that public opinion scoring is difficult to carry out manually and comprehensively in financial risk assessment and that the risk of a supervised organization is difficult to quantify.

2. The public opinion scoring method based on named entity recognition provided by the invention can be used for scoring the financial market quotations, is favorable for analyzing the market trend and the market quotations, and is favorable for making operation decisions.

For the above reasons, the present invention can be widely applied to the fields of data analysis, natural language processing, and the like.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a diagram of a named entity recognition model according to the present invention.

Fig. 3 is a schematic diagram of an optimal sequence provided in the embodiment of the present invention.

FIG. 4 is a block diagram of the system of the present invention.

Detailed Description

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

The relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise. Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description. Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate. Any specific values in all examples shown and discussed herein are to be construed as exemplary only and not as limiting. Thus, other examples of the exemplary embodiments may have different values. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.

In the description of the present invention, it is to be understood that the directions or positional relationships indicated by the directional terms such as "front, rear, upper, lower, left, right", "lateral, vertical, horizontal" and "top, bottom", etc., are generally based on the directions or positional relationships shown in the drawings for the convenience of description and simplicity of description, and that these directional terms, unless otherwise specified, do not indicate and imply that the device or element so referred to must have a particular orientation or be constructed and operated in a particular orientation, and therefore should not be considered as limiting the scope of the invention: the terms "inner and outer" refer to the inner and outer relative to the profile of the respective component itself.

For ease of description, spatially relative terms such as "over ...", "above ...", "on the upper surface of ...", "above", etc. may be used herein to describe the spatial relationship of one device or feature to another device or feature as shown in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if a device in the figures is turned over, devices described as "above" or "on" other devices or configurations would then be oriented "below" or "under" the other devices or configurations. Thus, the exemplary term "above ..." may include both the orientations "above ..." and "below ...". The device may be otherwise variously oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein interpreted accordingly.

It should be noted that the terms "first", "second", and the like are used to define the components, and are only used for convenience of distinguishing the corresponding components, and unless otherwise stated, the terms have no special meaning, and therefore, the scope of the present invention should not be construed as being limited.

As shown in fig. 1, the invention provides a public opinion scoring method based on named entity recognition, comprising:

S1, acquiring text data in the financial field, and processing the acquired data to form a training set;

S2, constructing a named entity recognition model, and training the model;

S3, optimizing the model according to the evaluation index to obtain an optimal model, performing model prediction based on the obtained optimal model, and obtaining entities corresponding to each risk event from the text data in the financial field;

S4, formulating a public opinion scoring rule, determining public opinion scoring events, determining event weights according to event severity, and finally calculating the comprehensive score of each entity.

In specific implementation, as a preferred embodiment of the present invention, in step S1, acquiring text data in the financial field, and processing the acquired text data to form a training set, the method includes:

performing an HTML tag removal operation on the crawled financial-field data, uniformly converting the text to UTF-8 encoding for analysis, cleaning the data to remove punctuation marks, and manually labeling the processed data for use as a training set. Sentence segmentation and word segmentation are then performed so that the text can be input into the subsequent model.
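The cleaning steps of S1 can be sketched as follows. This is a minimal illustration using only the Python standard library, under the assumptions that tags are stripped with a regular expression and that "punctuation" means every character in a Unicode punctuation category; the function name `clean_text` and the sample input are hypothetical, and a production crawler would likely use a real HTML parser.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")
SCRIPT_STYLE_RE = re.compile(r"(?is)<(script|style)\b.*?</\1>")

def clean_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode crawled bytes, strip HTML, and remove punctuation."""
    text = raw.decode(encoding, errors="replace")  # normalise to a Unicode string
    text = SCRIPT_STYLE_RE.sub(" ", text)          # drop script/style bodies
    text = TAG_RE.sub(" ", text)                   # drop remaining tags
    text = html.unescape(text)                     # resolve entities such as &amp;
    # Remove characters whose Unicode category starts with "P" (punctuation).
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return re.sub(r"\s+", " ", text).strip()

# A page fragment served in GBK; the cleaned string can then be saved as UTF-8.
page = "<p>某公司2022年亏损，评级下调！</p>".encode("gbk")
cleaned = clean_text(page, "gbk")
utf8_bytes = cleaned.encode("utf-8")
print(cleaned)
```

Decoding each source with its own encoding and re-encoding on output is what yields uniform UTF-8 text regardless of how the original site served the page.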

In specific implementation, as a preferred embodiment of the present invention, in step S2, constructing a named entity recognition model as shown in fig. 2, training the model, and tuning the model according to the evaluation index to obtain an optimal model, the method includes:

S21, inputting the training set into the named entity recognition model, and pre-training the text data in the financial field with BERT to obtain a word granularity vector matrix;

In this embodiment, the first layer pre-trains on the processed data. The invention pre-trains on financial text data using BERT (Bidirectional Encoder Representations from Transformers) to obtain a word granularity vector matrix. The model captures bidirectional semantic information, which effectively addresses polysemy in text and achieves good results on NLP tasks. The pre-training of the BERT model includes two tasks: the masked language model (MLM) and next sentence prediction (NSP). Fine-tuning is carried out using whole word masking: Chinese expresses meaning in words, whereas BERT segments text at character granularity, so the whole word is masked rather than a single character. Compared with masking single characters, this explicitly forces the model to recover the whole word in the masked language model (MLM) pre-training task, so that the meaning of the word is retained more completely.
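Whole word masking can be illustrated with a short sketch. It assumes the text has already been segmented into words by some external segmenter and that a selected word is always replaced by the mask token; practical BERT pre-training also mixes in random and unchanged replacements, which are omitted here. The function name, the masking probability and the sample words are hypothetical.

```python
import random

def whole_word_mask(words, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Mask all characters of a selected word together.

    words: list of already segmented words.
    Returns (masked_chars, labels); labels hold the original character at
    masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for word in words:
        chars = list(word)  # the model itself sees character-level tokens
        if rng.random() < mask_prob:
            masked.extend([mask_token] * len(chars))
            labels.extend(chars)  # every character of the word must be recovered
        else:
            masked.extend(chars)
            labels.extend([None] * len(chars))
    return masked, labels

words = ["建设", "银行", "发布", "公告"]
masked, labels = whole_word_mask(words, mask_prob=0.5, seed=0)
print(masked)
```

Because the masking decision is taken once per word, a word is never left half visible, which is the property that distinguishes this scheme from character-level masking.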

S22, inputting the word granularity vector matrix into the BiLSTM + GCN layer for feature extraction; specifically, the BiLSTM layer performs bidirectional encoding of the input text, and the GCN layer adds context-dependent semantic information before decoding.

In this embodiment, the second layer performs feature extraction with BiLSTM + GCN. A large part of the crawled financial-field text appears as chapter-length documents. BiLSTM works well for feature extraction on short text, but in very long text a word at a given position cannot be associated with a word in a distant context, and such associations are often significant for named entity recognition. Therefore, the graph convolutional network (GCN) is introduced to train on the text and extract features, which can capture the relations between words in long text to a certain degree.

BiLSTM, the bidirectional long short-term memory network, performs better on the training data than the traditional convolutional neural network (CNN) and the traditional recurrent neural network (RNN). BiLSTM combines a forward LSTM and a backward LSTM. LSTM addresses the short-term memory problem of RNN well, and BiLSTM overcomes the limitation that a single LSTM can only pass information from front to back. BiLSTM can therefore extract and store context information and strengthen the contextual relevance of the feature vectors.

The GCN (graph convolutional network) extends the features extracted by the BiLSTM layer. It mainly addresses the case where BiLSTM feature extraction is poor because the text is long: the features output by the BiLSTM layer are input into the GCN layer, which refines the relations among the entities in the text.
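How a graph convolution lets distant tokens exchange information can be seen in a minimal pure-Python sketch of one GCN propagation step. The source does not state how the token graph is built, so the edge list here is an assumption (for instance, links between distant mentions of the same entity), as are the symmetric normalisation, the function names and the toy feature values standing in for BiLSTM outputs.

```python
import math

def build_adjacency(n, edges):
    """Symmetric adjacency matrix with self-loops for n token nodes."""
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0
    return adj

def gcn_layer(adj, features, weight):
    """One graph convolution step: ReLU(D^-1/2 A D^-1/2 H W)."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    # Symmetrically normalised adjacency.
    norm = [[adj[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]
    in_dim, out_dim = len(weight), len(weight[0])
    # Aggregate neighbour features: norm @ features.
    agg = [[sum(norm[i][k] * features[k][d] for k in range(n)) for d in range(in_dim)]
           for i in range(n)]
    # Linear projection followed by ReLU.
    return [[max(0.0, sum(agg[i][d] * weight[d][o] for d in range(in_dim)))
             for o in range(out_dim)] for i in range(n)]

# Hypothetical BiLSTM outputs for three tokens; tokens 0 and 2 are linked
# in the graph although they are not adjacent in the text.
features = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
refined = gcn_layer(build_adjacency(3, [(0, 2)]), features, identity)
print(refined)
```

After one step, tokens 0 and 2 carry the average of each other's features while token 1 is unchanged, which shows how an edge can connect positions that a sequential encoder would only relate through many intermediate steps.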

S23, inputting the extracted features into GlobalPointer for decoding to obtain an optimal sequence. Specifically, the semantic vector containing the context information is input into GlobalPointer for decoding, and the tag sequence with the maximum probability is output, so that the category of each character is obtained.

In this embodiment, assume that the text sequence to be recognized has length n, that only one type of entity is to be recognized, that each entity to be recognized is a continuous segment of the sequence of unrestricted length, and that entities can be nested within one another (two entities may intersect). The sequence then yields n(n + 1)/2 candidate entities, that is, a sequence of length n has n(n + 1)/2 different continuous subsequences, which contain all possible entities. In this embodiment, the true entities must be picked out from the n(n + 1)/2 candidate entities, which is a multi-label classification problem of choosing k out of n(n + 1)/2. Similarly, if there are M entity types to be identified, the task becomes M multi-label classification problems of choosing k out of n(n + 1)/2. Because GlobalPointer performs multi-head recognition and can therefore handle nesting, it constructs an M × N matrix according to the sentence length N and the number of risk types M, as shown in fig. 3.
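The candidate count of n(n + 1)/2 can be checked by direct enumeration. The sketch below lists every continuous span of a length-n sequence as an inclusive (start, end) pair; the function name is hypothetical.

```python
def candidate_spans(n):
    """All continuous spans of a length-n sequence as inclusive (start, end) pairs."""
    return [(start, end) for start in range(n) for end in range(start, n)]

spans = candidate_spans(4)
print(len(spans))  # 4 * 5 / 2 candidates for a sequence of length 4
# Nested candidates coexist in the list, e.g. (0, 3) contains (1, 2).
print((0, 3) in spans and (1, 2) in spans)
```

Since every span is a separate candidate, selecting a long span does not exclude a shorter one inside it, which is why this formulation accommodates nested entities.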

In this embodiment, the third layer uses GlobalPointer to decode and solve for the optimal sequence. GlobalPointer is a decoding method based on span classification. The model uses softmax + cross entropy as the loss function and judges the start and end of an entity as a whole, which gives the network a more global view. Compared with Conditional Random Field (CRF) decoding, the model does not need to compute the denominator recursively as CRF does, and prediction requires no dynamic programming and is fully parallel. GlobalPointer is therefore both more capable and faster. The features obtained by the GCN layer are input into GlobalPointer for decoding. The main contribution of GlobalPointer is that entities can be matched more precisely to the risk types to which they belong.
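Span-based decoding of this kind can be sketched as a simple threshold over span scores. The layout assumed here, one N × N score matrix per risk type with an entity emitted wherever the score of a span with start ≤ end exceeds 0, follows the commonly published GlobalPointer formulation and is not confirmed by the source; the function name, type labels and score values are hypothetical.

```python
def decode_spans(scores, labels, threshold=0.0):
    """Return (start, end, label) for every span whose score exceeds the threshold.

    scores[t][i][j] is the score that tokens i..j (inclusive) form an entity
    of type labels[t]; entries with j < i are ignored.
    """
    entities = []
    for t, matrix in enumerate(scores):
        for i, row in enumerate(matrix):
            for j in range(i, len(row)):  # upper triangle only: start <= end
                if row[j] > threshold:
                    entities.append((i, j, labels[t]))
    return entities

labels = ["loss", "bankruptcy"]
scores = [
    [[-1.0, 2.0, -1.0], [-5.0, -1.0, -1.0], [-5.0, -5.0, -1.0]],  # type "loss"
    [[-1.0, -1.0, -1.0], [9.0, -1.0, 0.5], [-5.0, -5.0, -1.0]],   # type "bankruptcy"
]
entities = decode_spans(scores, labels)
print(entities)
```

Each span is judged independently, so no dynamic programming over the sequence is required, and the loops could be replaced by a single vectorised comparison in a tensor library.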

In specific implementation, as a preferred embodiment of the present invention, in step S3, the model is optimized according to the evaluation index, an optimal model is obtained, model prediction is performed based on the obtained optimal model, and entities corresponding to each risk event are obtained from the text data in the financial field.

In a specific implementation manner, as a preferred embodiment of the present invention, the step S4 of formulating a public opinion scoring rule, determining events related to public opinion scoring, determining event weights according to event severity, and finally settling a comprehensive score of the entity includes:

As an implementation, in this embodiment an event weight score is determined for each type of extracted entity: the subject of a loss event, the subject of a rating deterioration event, the subject of an asset abnormality event, and the subject of a bankruptcy event. For example, an extracted entity scores 1 point as the subject of a loss event, 2 points as the subject of a rating deterioration event, 3 points as the subject of an asset abnormality event, and 4 points as the subject of a bankruptcy event.

$$Y = b_1 x_1 + b_2 x_2 + \cdots + b_i x_i$$
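The scoring rule can be sketched as a weighted count over the model output. The sketch assumes the recognition step yields (entity, event type) pairs and uses the example scores of 1 to 4 points given in this embodiment; the dictionary keys, function name and sample extractions are hypothetical.

```python
from collections import Counter, defaultdict

# Example event scores x_i from this embodiment (key names are illustrative).
EVENT_SCORES = {
    "loss": 1,
    "rating_deterioration": 2,
    "asset_abnormal": 3,
    "bankruptcy": 4,
}

def entity_scores(extractions, event_scores=EVENT_SCORES):
    """Compute Y = sum(b_i * x_i) per entity from (entity, event_type) pairs."""
    counts = defaultdict(Counter)
    for entity, event in extractions:
        counts[entity][event] += 1  # b_i: occurrences of event i for this entity
    return {
        entity: sum(b * event_scores[event] for event, b in per_event.items())
        for entity, per_event in counts.items()
    }

extractions = [
    ("Company A", "loss"),
    ("Company A", "loss"),
    ("Company A", "bankruptcy"),
    ("Company B", "asset_abnormal"),
]
scores = entity_scores(extractions)
print(scores)
```

Under these example scores a higher total indicates more, or more severe, risk events attributed to the entity.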

As shown in fig. 4, corresponding to the public opinion scoring method based on named entity recognition in the present application, an embodiment of the present invention further provides a public opinion scoring system based on named entity recognition, including:

and the public opinion scoring module is used for making a public opinion scoring rule, determining events related to public opinion scoring, determining event weight according to event severity and finally settling entity comprehensive scores.

Because the system embodiment corresponds to the method embodiments above, it is described only briefly; for the related details, refer to the description of the above embodiments, which is not repeated here.

The embodiment of the application also discloses a computer-readable storage medium, wherein a computer instruction set is stored in the computer-readable storage medium, and when the computer instruction set is executed by a processor, the public opinion scoring method based on named entity recognition provided by any one of the above embodiments is implemented.

Example 1

When a bank assesses the risk of procurement and partner companies, it needs to search the internet for whether an institution carries financial public opinion risk, as an important index for scoring the company. If companies A, B and C need to be assessed, all information about the three companies within a certain period of time is retrieved from the internet by keyword search. The data is cleaned and then imported into the model to obtain the model result. For example, company A is involved in 2 losses (weight 1), 1 asset abnormality (weight 2), and 1 asset deduction (weight 1), so company A scores 5.
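As a check on the arithmetic in this example, the stated occurrence counts and weights can be substituted into the weighted-sum scoring formula, where each term is a count multiplied by the weight of that event. The symbol $Y_A$ for the score of company A is introduced here only for illustration.

```latex
$$Y_A = 2 \times 1 + 1 \times 2 + 1 \times 1 = 5$$
```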

Example 2

When analyzing market trends or quotations in the financial industry, the emotional trends of a certain product and a certain market need to be known. In view of the above requirements, the public opinion scoring method based on named entity recognition provided in the above embodiments of the present invention can be used to quantify market trends and assist companies in making investments.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and these modifications or substitutions do not depart from the spirit of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A public opinion scoring method based on named entity recognition is characterized by comprising the following steps:

acquiring financial field text data, and processing the acquired data to form a training set;

constructing a named entity recognition model, and training the model;

2. The named entity recognition-based public opinion scoring method according to claim 1, wherein the obtaining financial domain text data and processing the obtained data to form a training set comprises:

acquiring relevant texts of the financial field from various channels through a webpage crawler technology, wherein the texts comprise news, comments, forums and announcements related to the financial field;

3. The public opinion scoring method based on named entity recognition according to claim 1, wherein the constructing a named entity recognition model, training the model, optimizing the model according to an evaluation index to obtain an optimal model comprises:

4. The public opinion scoring method based on named entity recognition as claimed in claim 3, wherein the inputting of the training set to the named entity recognition model and the pretraining of the financial domain text data by BERT to obtain the word granularity vector matrix comprises:

5. The public opinion scoring method based on named entity recognition as claimed in claim 3, wherein the inputting the word granularity vector matrix into a BiLSTM + GCN layer for feature extraction to obtain the category feature of each character comprises:

6. The public opinion scoring method based on named entity recognition according to claim 3, wherein the inputting the extracted features into GlobalPointer for decoding to obtain an optimal sequence comprises:

7. The public opinion scoring method based on named entity recognition according to claim 1, wherein the model is optimized according to evaluation indexes, an optimal model is obtained, model prediction is performed based on the obtained optimal model, and entities corresponding to each risk event are obtained from text data in a financial field, wherein the evaluation indexes include accuracy and recall rate.

8. The public opinion scoring method based on named entity recognition according to claim 1, wherein the formulating a public opinion scoring rule, determining events related to public opinion scoring, determining event weights according to event severity, and finally calculating the entity comprehensive score comprises:

determining public opinion scoring events, and determining event weight scores according to event severity;

$$Y = b_1 x_1 + b_2 x_2 + \cdots + b_i x_i$$

9. A public opinion scoring system based on named entity recognition and based on the public opinion scoring method as claimed in any one of claims 1 to 8, comprising:

the named entity recognition model tuning module is used for tuning the model according to the evaluation index to obtain an optimal model, performing model prediction based on the obtained optimal model, and obtaining entities corresponding to each risk event from the text data in the financial field;

and the public opinion scoring module is used for formulating a public opinion scoring rule, determining public opinion scoring events, determining event weight according to event severity and finally settling entity comprehensive scores.

10. A storage medium comprising a stored program, wherein the program when executed performs the method of any one of claims 1 to 8.