CN115392251A

CN115392251A - Real-time entity identification method for Internet financial service

Info

Publication number: CN115392251A
Application number: CN202211065582.2A
Authority: CN
Inventors: 陈平华; 匡翊政
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2022-09-01
Filing date: 2022-09-01
Publication date: 2022-11-25

Abstract

The invention discloses a real-time entity identification method for Internet financial services, comprising the following steps. Step 1: preprocess the input financial text data X and label the data set using the BIO labeling scheme. Step 2: split the training set using five-fold splitting, perform entity recognition on the processed text with an ALBERT-CRF model to obtain an entity set, and then post-process the result using frequent pattern mining to obtain the entity set corresponding to the financial text. Step 3: construct a financial-domain knowledge graph from the obtained entities and relations, integrate the preceding steps, calculate an evaluation score using micro-averaging, and finally obtain the optimal entity set corresponding to the financial text. The method identifies entities in real-time Internet financial text data as it arrives, improving the real-time performance of financial entity identification and thereby providing better information support for institutions and individuals in the financial field.

Description

Real-time entity identification method for Internet financial service

Technical Field

The invention relates to the field of entity identification in a specific scene, in particular to a real-time entity identification method for internet financial services.

Background

With the rapid progress of the Internet and the rapid development of the global financial industry, the number of Internet financial entities has grown explosively. Because Internet financial information is updated constantly, it is difficult to identify the required Internet financial entity information accurately and in real time. Real-time identification of Internet financial entities is therefore an urgent need, and a method for real-time entity identification in Internet financial service scenarios has important practical significance and value.

By linking text to financial business entity information through named entity recognition, better intelligent financial services can be provided to users. Compared with Chinese named entity recognition in the general domain, the financial field is highly specialized: besides person and place names, named entity recognition here must also cover domain-specific entities such as financial company names, project names and product names. Named entity recognition in the financial field currently faces three problems. First, the volume of text data is large, noisy and quickly updated. Second, there is a lack of high-quality financial data sets with rich entities for experimental research. Third, the financial field contains many entities with complex structures, for example entities with deep internal nesting whose boundaries are hard to identify.

Named entity recognition was first proposed at the sixth Message Understanding Conference (MUC-6) and is a fundamental task in natural language processing. Named entities generally refer to entities with a particular meaning or strong referential value that are identified in a large amount of text to be processed, and typically include names of people, places and organizations, proper nouns, dates and times, and the like. Named entity recognition has been applied in a wide variety of vertical domains such as finance, e-commerce and social media. The technology extracts these entities from text and can identify further entity types according to business requirements, such as project names and project funds. The concept of an entity can therefore be very broad: any special text segment required by the business can be called an entity. Named entity recognition lays a foundation for natural language processing technologies such as information extraction, information retrieval, knowledge graphs, text summarization, machine translation and question answering systems.

Disclosure of Invention

To address the low speed and poor accuracy of entity identification in the existing financial field, the invention provides a real-time entity identification method for Internet financial services. It improves the real-time performance of financial entity identification and helps financial practitioners acquire information more quickly and efficiently, so that they can grasp industry dynamics early and track industry trends. The method comprises the following steps:

Step 1, in the data preprocessing module, checking the format of the input financial text data X; if the format is incorrect, performing data preprocessing, including data cleaning and data division; then defining a plurality of entity type labels and labeling the data set using the BIO labeling scheme;

Step 2, in the entity set extraction module, splitting the training set using five-fold splitting to ensure the generalization of the model; performing real-time entity recognition on the text using an ALBERT-CRF model to obtain an entity set; post-processing that entity set by using frequent pattern mining to recover entities that may have been missed and filtering out misrecognized entities, thereby obtaining the optimal entities corresponding to the financial text for the current training round;

Step 3, in the real-time processing module, constructing a financial-domain knowledge graph from the entities and relations obtained in the previous step, performing three rounds of fine-tuning on the data set using the ALBERT-CRF model, and finally introducing two parameter reduction techniques to improve the real-time performance of entity identification.

Further, in step 1, a specific method of the data preprocessing module includes:

Step 1.1, to address the noise and erroneous tags that frequently occur in financial texts, locating the noisy and mislabeled data using regular expressions;

Step 1.2, finding all symbols in the data set that are not Chinese characters, English letters or digits, such as HTML (HyperText Markup Language) tags, special symbols and meaningless characters, and filtering them out with regular expressions to clean the data; for Internet financial texts, the erroneous tags appearing in the text are located and cleaned in the same way;

Step 1.3, defining a plurality of entity type labels, such as "FIN" for financial entities, "LOC" for place names, "ORG" for organizations, "PER" for person names and "O" for non-entities;

Step 1.4, using the BIO labeling scheme to subdivide the labels into "B-LOC", "I-LOC", "B-ORG", "I-ORG", "B-PER", "I-PER", "B-FIN", "I-FIN" and "O";

Step 1.5, appending a period directly to any text whose length exceeds 510 characters or that lacks ending punctuation, then dividing the long text into several independent short texts at commas, periods, exclamation marks and question marks, in that order of priority, while saving the cut indices to facilitate later splicing.
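The cleaning and labeling in steps 1.2 to 1.4 can be sketched as follows. This is a minimal illustration, not the patented implementation: the regular expressions, the set of retained punctuation and the function names `clean_text` and `bio_tags` are assumptions introduced here.

```python
import re

# Assumed cleaning rules (the source does not give the exact patterns):
# strip HTML tags and URLs, then drop every character that is not Chinese,
# an ASCII letter, a digit, or common sentence punctuation.
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://[A-Za-z0-9./?=&_%#-]+")
KEEP_RE = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9，。！？,.!?]")


def clean_text(text):
    """Remove tags, URLs and symbols outside the retained character set."""
    text = TAG_RE.sub("", text)
    text = URL_RE.sub("", text)
    return KEEP_RE.sub("", text)


def bio_tags(text, spans):
    """Character-level BIO tags; spans are (start, end_exclusive, type)."""
    tags = ["O"] * len(text)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags
```

For instance, `bio_tags("杭州的支付宝", [(0, 2, "LOC"), (3, 6, "ORG")])` tags the first two characters as a place name and the last three as an organization, leaving the middle character as "O".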

Further, in step 2, a specific method of the entity set extraction module includes:

Step 2.1, splitting the data using five-fold splitting into a training set and a validation set, so that the information in the training data is used from multiple perspectives and the generalization of the model is ensured;
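A minimal sketch of five-fold splitting, assuming a plain shuffled partition in which every sample serves as validation data exactly once. The function name, the fixed seed and the interleaved fold assignment are illustrative choices, not details from the source.

```python
import random


def five_fold_splits(samples, k=5, seed=42):
    """Yield (train, validation) pairs; each sample is validated exactly once."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    # Fold f takes every k-th shuffled index starting at offset f.
    folds = [indices[f::k] for f in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [samples[i] for i in indices if i not in held]
        validation = [samples[i] for i in held_out]
        yield train, validation
```

Training one model per fold and comparing validation scores gives a more stable estimate of generalization than a single fixed split.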

Step 2.2, encoding the financial text to be processed using the ALBERT pre-trained language model to complete word embedding and obtain dynamic word vectors;

Step 2.3, inputting the dynamic word vectors from the previous step into a CRF layer and decoding.

Let $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$ be two sets of random variables.

The linear-chain conditional random field is defined as follows: $p(y_i \mid X, y_1, y_2, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n) = p(y_i \mid X, y_{i-1}, y_{i+1})$, $i = 1, 2, \ldots, n$,

where X is the observed state and Y is the hidden state.

The score of a predicted tag sequence of the entity recognition model is obtained with the CRF discriminative scoring formula,

where $\mathrm{mask}(X, y)$ denotes the score of a predicted tag sequence y, $P$ denotes the score matrix obtained from the ALBERT layer, $T$ denotes the transition matrix learned by the CRF, $p(y \mid X)$ denotes the probability of tag sequence y given the input sequence X, and $Y_X$ denotes all possible tag sequences corresponding to the financial text data sequence X.
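A sketch of the scoring and normalization, assuming the standard linear-chain CRF form layered on encoder emission scores. The index conventions and the omission of start and end transition terms are assumptions, not taken from the source:

```latex
\mathrm{mask}(X, y) = \sum_{i=1}^{n} P_{i,\,y_i} + \sum_{i=1}^{n-1} T_{y_i,\,y_{i+1}},
\qquad
p(y \mid X) = \frac{\exp\big(\mathrm{mask}(X, y)\big)}
                   {\sum_{y' \in Y_X} \exp\big(\mathrm{mask}(X, y')\big)}
```

Here $P_{i,\,y_i}$ is the emission score of tag $y_i$ at position $i$ and $T_{y_i,\,y_{i+1}}$ is the learned score of moving from tag $y_i$ to tag $y_{i+1}$.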

Step 2.4, further, obtaining the entities corresponding to the current sentence from the highest-scoring label sequence, and maximizing the log-probability of the correct label sequence,

where X denotes the input financial text data sequence $X = (x_0, x_1, \ldots, x_n)$, y denotes the predicted character tag sequence, $Y_X$ denotes all possible tag sequences corresponding to the financial text data sequence X, and $\mathrm{mask}(X, y)$ denotes the score of the predicted tag sequence y.
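Assuming the standard softmax normalization over all candidate tag sequences (an assumption, since the source defines only the symbols), the log-probability of a tag sequence y expands as:

```latex
\log p(y \mid X) = \mathrm{mask}(X, y) - \log \sum_{y' \in Y_X} \exp\big(\mathrm{mask}(X, y')\big)
```

Maximizing this quantity for the correct tag sequences raises their score relative to every competing sequence in $Y_X$.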

Step 2.5, decoding to obtain the highest-scoring predicted output sequence using the following formula: $y_{\max} = \arg\max_{y' \in Y_X} \mathrm{mask}(X, y')$; entity boundary and type recognition is then completed by combining the predicted tag sequence with the entity label information;

Step 2.6, post-processing the obtained entity set by using frequent pattern mining to recover missed entities and filtering out misjudged entities, thereby extracting the entity set corresponding to the financial text.
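The argmax in step 2.5 is usually computed with the Viterbi algorithm rather than by enumerating every tag sequence. The sketch below assumes that approach (the source does not name the decoding algorithm) and uses plain Python lists for the emission matrix P and transition matrix T; the function name is hypothetical.

```python
def viterbi_decode(emissions, transitions):
    """Return (best_score, best_path) for a linear-chain CRF.

    emissions[i][t]   : score of tag t at position i (matrix P).
    transitions[s][t] : score of moving from tag s to tag t (matrix T).
    """
    n_tags = len(transitions)
    scores = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for t in range(n_tags):
            # Best previous tag for reaching tag t at this position.
            prev = max(range(n_tags), key=lambda s: scores[s] + transitions[s][t])
            pointers.append(prev)
            new_scores.append(scores[prev] + transitions[prev][t] + emit[t])
        scores = new_scores
        backpointers.append(pointers)
    last = max(range(n_tags), key=lambda t: scores[t])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return scores[last], path
```

The returned path holds tag indices; mapping them back to labels such as "B-ORG" and "I-ORG" and grouping consecutive B/I tags yields the entity boundaries and types.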

Further, in step 3, a specific method of the real-time processing module includes:

Step 3.1, constructing a financial knowledge graph from the obtained entities and relations and storing it in a Dgraph database, which operates efficiently and supports running arbitrarily complex queries in real time;

Step 3.2, building a dictionary tree (trie) from the knowledge graph constructed in the previous step to benchmark the data, and then performing three rounds of fine-tuning on the financial data set using the ALBERT-CRF model, thereby improving recognition speed;

Step 3.3, to further reduce model training time and inference time, the invention adopts two techniques. The first is cross-layer parameter sharing: the model in effect learns only the parameters of the first layer and reuses them in all other layers, which reduces the number of parameters and effectively improves the stability of the model. The second is factorized embedding parameterization. Let V be the vocabulary size, E the word vector size and H the hidden layer size. In pre-trained language models such as BERT and RoBERTa, E equals H and the embedding parameter scale is O(V × H). ALBERT uses factorization to reduce the number of parameters, adding a matrix after the word embedding to perform the dimension change, so the number of parameters drops from O(V × H) to O(V × E + E × H); the reduction is significant when H is much larger than E.
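The effect of the embedding factorization can be checked with simple arithmetic. The sizes below (V = 30000, H = 768, E = 128) are illustrative assumptions, not values stated in the source, and the function name is hypothetical.

```python
def embedding_params(vocab_size, hidden_size, embed_size=None):
    """Embedding parameter count, with or without factorization.

    Without factorization the table is V x H; with it, a V x E table is
    followed by an E x H projection matrix.
    """
    if embed_size is None:
        return vocab_size * hidden_size
    return vocab_size * embed_size + embed_size * hidden_size


V, H, E = 30000, 768, 128  # assumed sizes for illustration only
unfactorized = embedding_params(V, H)    # 23,040,000
factorized = embedding_params(V, H, E)   # 3,938,304
print(unfactorized, factorized, round(unfactorized / factorized, 2))
```

With these assumed sizes the embedding parameters shrink by roughly a factor of six, and the saving grows as H increases relative to E.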

Step 3.4, integrating the real-time processing module with the entity set extraction module and calculating the evaluation score using micro-averaging, a common metric for named entity recognition, to obtain the optimal entity set corresponding to the financial text,

where n denotes the number of financial texts, TP_i the number of correctly recognized entities in the i-th text, FP_i the number of erroneously recognized entities in the i-th text, and FN_i the number of unrecognized entities in the i-th text. Through the above steps, the real-time performance of financial entity identification can be effectively improved, making it easier to find financial decision information quickly.
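Micro-averaging pools the counts over all n texts before computing precision, recall and F1. The sketch below assumes the usual definitions of these three scores; the function name and the zero-division convention are illustrative choices.

```python
def micro_prf(counts):
    """Micro-averaged precision, recall and F1.

    counts: sequence of (TP_i, FP_i, FN_i) tuples, one per text.
    """
    counts = list(counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Because the counts are summed first, texts containing many entities weigh more heavily than texts containing few, unlike macro-averaging, which averages per-text scores.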

The real-time entity identification method for Internet financial services provided by the invention realizes real-time entity identification in a specific domain. Given the lack of strong entity recognition models in the financial field, it constructs a fast and accurate named entity recognition model. Unlike traditional models that use BERT as the embedding layer, it uses ALBERT as the embedding layer for fine-tuning, effectively learns contextual semantic features of financial business text, and achieves real-time, accurate entity identification for input financial sentences. This improves the real-time performance of financial entity identification, eases the difficulty of entity identification in the financial field, and helps financial practitioners obtain information efficiently and keep up with industry dynamics, thereby providing better information support for institutions and individuals in the financial field.

Drawings

FIG. 1 is a flow chart of the real-time entity identification method for Internet financial services according to the present invention;

FIG. 2 is a flow chart of an entity set extraction model proposed by the present invention;

FIG. 3 is a cross-layer parameter sharing flow chart of the present invention.

Detailed Description

In order to make the purpose, technical solution and technical effect of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments of the present invention.

Aiming at the problems of low identification speed and poor identification accuracy in entity identification in the existing financial field, the invention provides a real-time entity identification method for Internet financial services, which comprises the following steps as shown in figure 1:

step 1, in a data preprocessing module, format judgment is carried out on input financial text data X, and if the format is incorrect, data preprocessing is carried out, wherein the data preprocessing comprises data cleaning and data division, and the method specifically comprises the following steps:

in step 1.1, this embodiment calls the official data API (application programming interface) provided by Sina Weibo directly through the requests library to obtain real-time financial text data from Sina Weibo, and, to address noise and erroneous tags in the obtained text, locates the noisy and mislabeled data using regular expressions;

in step 1.2, all symbols in the data set that are not Chinese characters, English letters or digits are found, such as hyperlink tags "<a>", paragraph tags "<p>", image tags "<img>" and some URL tags, and are then filtered out with regular expressions to clean the data;

in step 1.3, a plurality of entity type tags are first defined, such as "FIN" financial entity, "LOC" place name entity, "ORG" organization entity, "PER" person name entity, "O" non-named entity;

in step 1.4, a BIO labeling system is adopted to subdivide the labels into 'B-LOC', 'I-LOC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER', 'B-FIN', 'I-FIN', 'O';

in step 1.5, a period is appended directly to any text in the sequence X whose length exceeds 510 characters or that lacks ending punctuation; the long text is then divided into several independent short texts at commas, periods, exclamation marks and question marks, in that order of priority, and the cut indices are saved to facilitate later splicing.
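One possible reading of step 1.5 is sketched below: within each window of at most 510 characters, cut after the last occurrence of the highest-priority delimiter that is present, and fall back to a hard cut if none is found. The delimiter handling, the fallback and the function name are assumptions; the source gives only the priority order and the length limit.

```python
MAX_LEN = 510
ENDINGS = "。！？.!?"
# Priority order as listed in the text: comma, period, exclamation, question.
DELIMS = ["，", "。", "！", "？"]


def split_long_text(text, max_len=MAX_LEN):
    """Return (pieces, cut_indices); "".join(pieces) equals the padded text."""
    if len(text) > max_len or text[-1:] not in ENDINGS:
        text += "。"  # append a period as described in step 1.5
    pieces, cuts, start = [], [], 0
    while len(text) - start > max_len:
        window = text[start:start + max_len]
        cut = -1
        for delim in DELIMS:
            cut = window.rfind(delim)
            if cut != -1:
                break
        # Cut just after the delimiter, or hard-cut if no delimiter exists.
        end = start + (cut + 1 if cut != -1 else max_len)
        pieces.append(text[start:end])
        cuts.append(end)
        start = end
    pieces.append(text[start:])
    return pieces, cuts
```

The saved cut indices let per-piece predictions be shifted back to positions in the original text when the pieces are spliced together.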

Step 2, in the entity set extraction module, a plurality of entity type labels are first defined and the data set is labeled using the BIO labeling scheme; real-time entity recognition is then performed on the text using the ALBERT-CRF model to obtain an entity set; that entity set is post-processed by using frequent pattern mining to recover entities that may have been missed and filtering out misrecognized entities, thereby obtaining the optimal entities corresponding to the financial text for the current training round. Specifically:

in step 2.1, the data is split using five-fold splitting into a training set and a validation set, so that the information in the training data is used from multiple perspectives and the generalization of the model is ensured;

in step 2.2, the financial text data sequence X to be processed is encoded using the ALBERT pre-trained language model to complete word embedding and obtain dynamic word vectors. Take, for example, the passage: "In recent years Internet finance has shown a trend of explosive growth, as a set of data from Alipay makes clear at a glance. Ant Financial, under Hangzhou-based Alibaba, has advanced by leaps and bounds." From this passage, the custom financial entity "Internet finance", the organization entities "Alipay", "Alibaba" and "Ant Financial", and the location entity "Hangzhou" can be identified;

in step 2.3, the obtained dynamic word vectors are input into the CRF layer and decoded, and the score of the predicted tag sequence of the entity recognition model is then obtained using the CRF discriminative scoring formula;

in step 2.4, further, the entities corresponding to the current sentence are obtained from the highest-scoring label sequence, and the log-probability of the correct label sequence is maximized;

in step 2.5, the highest-scoring predicted output sequence is decoded using the following formula: $y_{\max} = \arg\max_{y' \in Y_X} \mathrm{mask}(X, y')$; entity boundary and type recognition is then completed by combining the predicted tag sequence with the entity label information;

in step 2.6, the obtained entity set is post-processed: missed entities are recovered using frequent pattern mining and misjudged entities are filtered out. For example, incomplete entities such as "pay Baoji (gold)/(Shanghai) energy futures trading center" are judged according to their predicted tags; some are discarded directly and others are completed according to their suffixes, thereby extracting the entity set corresponding to the financial text.
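The post-processing idea can be illustrated with a deliberately simplified stand-in for frequent pattern mining: entities predicted in at least `min_support` texts are treated as frequent and are recovered by string matching wherever the model missed them. The support threshold, the filtering rules and the function name are assumptions; the source does not specify the mining algorithm or the filtering criteria.

```python
from collections import Counter


def postprocess(texts, predicted, min_support=2):
    """Recover missed frequent entities and drop implausible predictions.

    texts: list of str; predicted: list of sets of entity strings,
    one set per text.
    """
    support = Counter(e for ents in predicted for e in ents)
    frequent = {e for e, count in support.items() if count >= min_support}
    result = []
    for text, ents in zip(texts, predicted):
        # Assumed filter: keep multi-character entities that occur in the text.
        kept = {e for e in ents if len(e) > 1 and e in text}
        # Recover frequent entities the model missed in this text.
        recovered = {e for e in frequent if e in text}
        result.append(kept | recovered)
    return result
```

A production system would more likely mine frequent patterns with an algorithm such as FP-growth and apply tag-aware rules for completing or discarding partial entities, as the example in step 2.6 suggests.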

Step 3, in the real-time processing module, a financial-domain knowledge graph is constructed from the entities and relations obtained in the previous step, three rounds of fine-tuning are performed on the data set using the ALBERT-CRF model, and finally two parameter reduction techniques are introduced to improve the real-time performance of entity identification. Specifically:

in step 3.1, a financial knowledge graph is constructed from the obtained entities and relations and stored in a Dgraph database, which operates efficiently and supports running arbitrarily complex queries in real time. The knowledge graph created in the Dgraph database is based on the property graph model: each entity has a unique identifier, nodes are grouped by labels, and each relation has a unique type. Its basic concepts are entities, labels and attributes;

in step 3.2, a dictionary tree (trie) is built from the knowledge graph constructed in the previous step to benchmark the data, and the ALBERT-CRF model is then used to perform three rounds of fine-tuning on the financial data set, thereby improving recognition speed;
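A dictionary tree built from knowledge graph entities allows known entity names to be matched in text in a single left-to-right scan. The sketch below shows one plausible use, greedy longest-match lookup; the class and method names are hypothetical, and how the patent's benchmarking step consumes the matches is not specified in the source.

```python
_END = object()  # sentinel key marking the end of a stored entity name


class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, word, label):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = label

    def longest_matches(self, text):
        """Greedy longest-match scan; returns (start, end_exclusive, label)."""
        found, i = [], 0
        while i < len(text):
            node, j, best = self.root, i, None
            while j < len(text) and text[j] in node:
                node = node[text[j]]
                j += 1
                if _END in node:
                    best = (i, j, node[_END])
            if best:
                found.append(best)
                i = best[1]  # continue after the matched entity
            else:
                i += 1
        return found
```

Lookup cost depends on the text length and the longest stored name rather than on the number of entities in the graph, which suits real-time use.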

in step 3.3, to further reduce model training time and inference time, the invention adopts two techniques. The first is cross-layer parameter sharing: the model in effect learns only the parameters of the first layer and reuses them in all other layers, which reduces the number of parameters and effectively improves the stability of the model. The second is factorized embedding parameterization. Let V be the vocabulary size, E the word vector size and H the hidden layer size. In pre-trained language models such as BERT and RoBERTa, E equals H and the embedding parameter scale is O(V × H). ALBERT uses factorization to reduce the number of parameters, adding a matrix after the word embedding to perform the dimension change, so the number of parameters drops from O(V × H) to O(V × E + E × H); the reduction is significant when H is much larger than E.
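Cross-layer parameter sharing can be illustrated with a toy encoder in which every layer position refers to the same layer object. `ToyLayer` holds a single scalar weight as a stand-in for a full Transformer block; all names here are hypothetical and the example shows only the sharing mechanism, not ALBERT itself.

```python
class ToyLayer:
    """Toy layer with one scalar weight (stand-in for a Transformer block)."""

    def __init__(self, weight):
        self.weight = weight

    def __call__(self, values):
        return [self.weight * v for v in values]


def build_encoder(num_layers, share):
    if share:
        layer = ToyLayer(0.5)
        return [layer] * num_layers  # every position reuses one layer
    return [ToyLayer(0.5) for _ in range(num_layers)]


def count_distinct_layers(layers):
    """Number of independent parameter sets in the stack."""
    return len({id(layer) for layer in layers})


def forward(layers, values):
    for layer in layers:
        values = layer(values)
    return values
```

The shared stack still applies its layer once per position, so depth is unchanged while the number of stored parameter sets falls from `num_layers` to one.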

in step 3.4, the real-time processing module and the entity set extraction module are integrated, and the evaluation score is calculated using micro-averaging, a common metric for named entity recognition, to obtain the optimal entity set corresponding to the financial text,

where n denotes the number of financial texts, TP_i the number of correctly recognized entities in the i-th text, FP_i the number of erroneously recognized entities in the i-th text, and FN_i the number of unrecognized entities in the i-th text. Through the above steps, the real-time performance of financial entity identification can be effectively improved, making it easier to find financial decision information quickly.
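For reference, the usual micro-averaged precision, recall and F1 consistent with the symbols defined above can be written as follows; this is the standard definition, stated here as an assumption about the formula intended by the source:

```latex
P_{\mathrm{micro}} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} \left(TP_i + FP_i\right)},
\qquad
R_{\mathrm{micro}} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} \left(TP_i + FN_i\right)},
\qquad
F1_{\mathrm{micro}} = \frac{2\, P_{\mathrm{micro}}\, R_{\mathrm{micro}}}{P_{\mathrm{micro}} + R_{\mathrm{micro}}}
```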

It should be understood that the described embodiments of the invention are only some of the described embodiments of the invention, and not all embodiments. The particular embodiments described above are illustrative only and not limiting. Various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A method for real-time identification of an entity of an Internet financial service, comprising the steps of:

step 2, in the entity set extraction module, splitting the training set using five-fold splitting to ensure the generalization of the model; then performing real-time entity recognition on the text using an ALBERT-CRF model to obtain an entity set; post-processing that entity set by using frequent pattern mining to recover entities that may have been missed and filtering out misrecognized entities, thereby obtaining the optimal entities corresponding to the financial text for the current training round;

step 3, in the real-time processing module, constructing a financial-domain knowledge graph from the entities and relations obtained in the previous step, performing three rounds of fine-tuning on the data set using the ALBERT-CRF model, and finally introducing two parameter reduction techniques to improve the real-time performance of entity identification.

2. The method for real-time identification of an entity of an internet financial service as claimed in claim 1, wherein said step 1 specifically comprises:

step 1.1, to address the noise and erroneous tags that frequently occur in financial texts, locating the noisy and mislabeled data using regular expressions;

step 1.4, using the BIO labeling scheme to subdivide the labels into "B-LOC", "I-LOC", "B-ORG", "I-ORG", "B-PER", "I-PER", "B-FIN", "I-FIN" and "O";

3. The method for real-time identification of an entity of an Internet financial service as claimed in claim 1, wherein said step 2 specifically comprises:

let $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$ be two sets of random variables;

the linear-chain conditional random field is defined as follows: $p(y_i \mid X, y_1, y_2, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n) = p(y_i \mid X, y_{i-1}, y_{i+1})$, $i = 1, 2, \ldots, n$,

where X is the observed state and Y is the hidden state;

where $\mathrm{mask}(X, y)$ denotes the score of the predicted tag sequence y, $P$ denotes the score matrix obtained from the ALBERT layer, $T$ denotes the transition matrix learned by the CRF, $p(y \mid X)$ denotes the probability of tag sequence y given the input sequence X, and $Y_X$ denotes all possible tag sequences corresponding to the financial text data sequence X;

where X denotes the input financial text data sequence $X = (x_0, x_1, \ldots, x_n)$ and y denotes the predicted character tag sequence;

4. The method as claimed in claim 1, wherein the step 3 specifically comprises:

step 3.3, to further reduce model training time and inference time, adopting two techniques: the first is cross-layer parameter sharing, in which the model in effect learns only the parameters of the first layer and reuses them in all other layers, reducing the number of parameters and effectively improving the stability of the model; the second is factorized embedding parameterization, in which, with V the vocabulary size, E the word vector size and H the hidden layer size, E equals H in pre-trained language models such as BERT and RoBERTa and the embedding parameter scale is O(V × H), whereas ALBERT uses factorization to reduce the number of parameters, adding a matrix after the word embedding to perform the dimension change, so that the number of parameters drops from O(V × H) to O(V × E + E × H), the reduction being significant when H is much larger than E;

where n denotes the number of financial texts, TP_i the number of correctly recognized entities in the i-th text, FP_i the number of erroneously recognized entities in the i-th text, and FN_i the number of unrecognized entities in the i-th text; through the above steps, the real-time performance of financial entity identification can be effectively improved, making it easier to find financial decision information quickly.