CN116186223A - Financial text processing method, device, equipment and storage medium

Financial text processing method, device, equipment and storage medium

Info

Publication number
CN116186223A
CN116186223A
Authority
CN
China
Prior art keywords
text
task
target
mask
prediction probability
Prior art date
Legal status
Pending
Application number
CN202310190221.9A
Other languages
Chinese (zh)
Inventor
陈嘉裕
陈淑华
陈加杰
范有文
Current Assignee
Shenzhen Qianhai Huanrong Lianyi Information Technology Service Co Ltd
Original Assignee
Shenzhen Qianhai Huanrong Lianyi Information Technology Service Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Qianhai Huanrong Lianyi Information Technology Service Co Ltd
Priority to CN202310190221.9A
Publication of CN116186223A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a financial text processing method and apparatus, computer equipment and a storage medium, which are used to solve the problem of low efficiency in entity recognition and relation extraction from financial texts. The method comprises the following steps: replacing numerical information related to a task to be processed in a financial text with a numerical mask to obtain a mask text; constructing a task question about the numerical mask according to the task to be processed; splicing the mask text, the task question and a task prompt vector to obtain a target text; and predicting a target answer to the task question through a language model according to the target text.

Description

Financial text processing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a financial text processing method, apparatus, computer device, and storage medium.
Background
With the development of the financial industry, demand for information in the financial field keeps growing. Informational texts in the financial field help practitioners obtain effective information such as funding figures and development trends of the relevant industries. However, the ever-growing body of financial information is also increasingly disordered, which makes it harder for practitioners to obtain effective information. It is therefore increasingly important to analyze and process financial texts rapidly and to extract effective information intelligently from massive volumes of financial text, helping practitioners make high-quality decisions.
When extracting effective information from a financial text, it is necessary not only to identify entity information but also to extract the relationships between entities, where entity information includes, but is not limited to, numerical information, time information, company information and the like. Because financial texts are typically rich in entity information, and in particular contain a great deal of numerical and time information with tight interactions among the various pieces of information, they have distinctive domain characteristics compared with ordinary texts. In the prior art, entity recognition and relation extraction on ordinary texts generally combine the recognized entities in pairs in order to extract the relationships between them. Applying this approach to entity recognition and relation extraction on financial texts ignores their domain characteristics, so the efficiency of information extraction is low.
Disclosure of Invention
The embodiments of the invention provide a financial text processing method and apparatus, computer equipment and a storage medium, which are used to solve the problem of low efficiency in entity recognition and relation extraction from financial texts.
In a first aspect of the present invention, there is provided a financial text processing method, including:
replacing numerical information related to a task to be processed in a financial text with a numerical mask to obtain a mask text;
constructing a task question about the numerical mask according to the task to be processed;
splicing the mask text, the task question and a task prompt vector to obtain a target text;
and predicting a target answer to the task question through a language model according to the target text.
In one possible design, before the splicing of the mask text, the task question and the task prompt vector, the method further includes:
splicing the mask text, the task question and an initial task prompt vector to obtain a training text;
inputting the training text into a language model and predicting an initial answer to the task question;
and iteratively adjusting matrix parameters of the initial task prompt vector according to the difference between an expected answer and the initial answer, until the initial answer matches the expected answer, to obtain the task prompt vector.
In one possible design, before the replacing of the numerical information related to the task to be processed in the financial text with a numerical mask to obtain the mask text, the method further includes:
identifying time information in the financial text;
judging whether the time information is complete;
and if the time information is incomplete, completing the time information.
In one possible design, the task prompt vector includes a position prompt vector, a first task prompt vector, a second task prompt vector and a third task prompt vector, and the splicing of the mask text, the task question and the initial task prompt vector to obtain the training text includes:
inserting the position prompt vector before the numerical mask of the mask text to obtain a first spliced text;
inserting the first task prompt vector before the task question to obtain a second spliced text;
inserting the second task prompt vector after the second spliced text to obtain a third spliced text;
inserting the first spliced text after the third spliced text to obtain a fourth spliced text;
and inserting the third task prompt vector after the fourth spliced text to obtain the training text.
In one possible design, the predicting of the target answer to the task question through a language model according to the target text includes:
inputting the target text into the language model and outputting a first sequence value of the language sequence of the target answer to the task question;
screening the preset segment regions in the context of the first sequence value in the target text to obtain the target segment region with the highest probability, and taking the probability corresponding to the target segment region as a segment prediction probability;
taking the word corresponding to a target prediction probability as a second sequence value of the prediction-result language sequence output by the language model;
and outputting, by the language model, the second sequence value after the first sequence value to obtain the target answer to the task question.
In one possible design, the screening of the preset segment regions in the context of the first sequence value in the target text to obtain the target segment region with the highest probability, and taking the probability corresponding to the target segment region as the segment prediction probability, includes:
calculating the similarity between the first sequence value and each text region in a preset segment region to obtain a similarity score for each text region;
adding the similarity scores of all text regions of the same preset segment region in the context as the region probability of that preset segment region;
and screening out the largest region probability as the segment prediction probability.
In one possible design, the taking of the word corresponding to the target prediction probability as the second sequence value of the prediction-result language sequence output by the language model includes:
if the target prediction probability is a first prediction probability, taking an end marker as the second sequence value;
if the target prediction probability is a second prediction probability, taking a separation marker as the second sequence value;
if the target prediction probability is a text prediction probability, taking the text word that yields the text prediction probability as the second sequence value;
if the target prediction probability is a vocabulary prediction probability, taking the vocabulary word that yields the vocabulary prediction probability as the second sequence value;
and if the target prediction probability is the segment prediction probability, taking the starting word of the preset segment region corresponding to the segment prediction probability as the second sequence value.
In a third aspect of the present invention, there is provided a training apparatus for financial task prompt vectors, including:
a replacing module, configured to replace numerical information related to a task to be processed in a financial text with a numerical mask to obtain a mask text;
a construction module, configured to construct a task question about the numerical mask according to the task to be processed;
a splicing module, configured to splice the mask text, the task question and an initial task prompt vector to obtain a training text;
a prediction module, configured to input the training text into a language model and predict an initial answer to the task question;
and an output module, configured to iteratively adjust the matrix parameters of the initial task prompt vector according to the difference between an expected answer and the initial answer, until the initial answer matches the expected answer, to obtain the task prompt vector.
In a fourth aspect of the invention, a computer device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the methods described above when executing the computer program.
In a fifth aspect of the present invention, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the methods described above.
According to the training method, the decoding method, the financial text processing method and apparatus, the computer device and the storage medium, numerical information related to the task to be processed is identified in the financial text according to the domain characteristics of financial texts, where the task to be processed includes, but is not limited to, entity recognition, relation extraction and the like. The identified numerical information is then replaced with numerical masks to obtain a mask text. This operation identifies, in one pass, all the numerical information related to the task to be processed in the financial text, so that entity recognition and relation extraction can conveniently be performed for all the numerical information at once, greatly improving the processing efficiency of financial texts. Next, by constructing task questions about the numerical masks, entity recognition and relation extraction on the financial text are converted into a text question-answering task: the number of relation categories does not need to be set in advance, and the relation category names are obtained directly from the subsequent language model. Because there is no need to preset the number of categories, the extensibility of the relation extraction method is improved, so that it can be applied to other tasks and fields. Finally, the mask text, the task questions and the task prompt vector are spliced to obtain a target text, and the target answers to the task questions, namely the entity recognition and relation extraction results for the financial text, are predicted from the target text by the language model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic view of an application environment of the methods according to an embodiment of the present invention;
FIG. 2 is a flow chart of a training method of financial task prompt vectors according to an embodiment of the invention;
FIG. 3 is a flow chart of a method for processing financial text according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a training apparatus for financial task prompt vectors according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a computer device in accordance with an embodiment of the invention.
Detailed Description
The following describes the technical solutions in the embodiments of the present invention clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
The financial text processing method and the training method provided by the embodiments of the present invention can be applied in the application environment shown in fig. 1, in which a terminal device at the client side communicates with a server at the server side over a network. The terminal device collects financial texts and transmits them to the server, and the server recognizes and processes the financial texts. The terminal device may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer or a portable wearable device. The server may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In one embodiment, as shown in fig. 2, a training method for a financial task prompt vector is provided. Taking the method as applied to the server in fig. 1 as an example, it includes the following steps:
S10: Identifying numerical information related to the task to be processed in the financial text.
Financial texts are collected by the terminal device at the client side and transmitted to the server, and the server identifies the numerical information related to the task to be processed in the financial text. Here the financial text includes, but is not limited to, enterprise research reports, financial news and the like; the ways of collecting financial texts include, but are not limited to, collection with a crawler tool, collection of user input through an interface and the like; the task to be processed includes, but is not limited to, entity recognition, relation extraction and the like; and the numerical information includes, but is not limited to, proportions, prices, period-on-period ratios and the like. Methods of identifying the numerical information include, but are not limited to, processing the financial text with regular expressions.
For example, the financial text collected by the terminal device reads "fourth-quarter GDP grew 2.9% year-on-year and annual GDP grew 3.0%, leaving enough room for a strong rise in 2023; 2023 GDP may reach 5.5% or beat expectations", and the numerical information "2.9%", "3.0%" and "5.5%" in the text is identified.
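By way of illustration, the following is a minimal Python sketch of this regular-expression-based identification in step S10; the pattern and function names are illustrative assumptions, not the embodiment's actual implementation.

```python
import re

# Matches figures such as "2.9%", "30,676.99" or "5.5" in running text.
NUMERIC_PATTERN = re.compile(r"\d[\d,]*(?:\.\d+)?%?")

def identify_numeric_info(financial_text: str) -> list[str]:
    """Return every numeric span found in the financial text."""
    return NUMERIC_PATTERN.findall(financial_text)

text = "Fourth-quarter GDP grew 2.9% year-on-year and annual GDP grew 3.0%"
print(identify_numeric_info(text))  # ['2.9%', '3.0%']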
S20: and replacing the numerical value information in the financial text with a numerical value mask to obtain a mask text.
After the numerical value information in the financial text is obtained, the numerical value information is replaced by a numerical value mask, and a mask text with the numerical value mask is obtained. Where the numerical mask refers to a marker of the numerical value that needs to be analyzed, including but not limited to a unique identifier, a type identifier, etc.
For example, the numerical information "2.9%" in the financial text "four-quarter GDP comparably increases by 2.9%", and the numerical information is replaced by < Number >, so as to obtain the mask text as "four-quarter GDP comparably increases by < Number >".
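A corresponding sketch of step S20, assuming the spans found in step S10 are replaced one at a time with a <Number> mask token (the token string follows the example above):

```python
def mask_numeric_info(text: str, values: list[str], mask: str = "<Number>") -> str:
    """Replace each identified numeric span with the numerical mask."""
    for value in values:
        text = text.replace(value, mask, 1)  # replace the first occurrence only
    return text

masked = mask_numeric_info("fourth-quarter GDP grew 2.9% year-on-year", ["2.9%"])
print(masked)  # 'fourth-quarter GDP grew <Number> year-on-year'
```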
S30: and constructing the task problem of the numerical mask according to the task to be processed.
After the mask text is obtained, a task problem related to the numerical mask is constructed according to the task to be processed. The task questions include, but are not limited to, blank questions, question answers, etc., while the scope of the task questions includes, but is not limited to, asking for the corresponding type of the mask text, asking for the subject information of the mask text, etc.
For example, if the obtained mask text is "four-quarter GDP comparably increased < Number >", and the current task to be processed is a relationship type identifying a numerical mask, the task to be processed is constructed as a gap-filling question, that is, a task problem: the relationship type of "< Number > is __".
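Step S30 can be sketched as simple template selection; the template strings below are assumptions modelled on the cloze examples in this description.

```python
# Templates are illustrative assumptions modelled on the cloze examples above.
QUESTION_TEMPLATES = {
    "relation_type": 'The relationship type of "<Number>" is __',
    "entity_and_relation": 'The type, time, relation, proportion field and subject '
                           'corresponding to "<NUMBER>" are ___',
}

def build_task_question(task_to_process: str) -> str:
    """Select the cloze-style task question for the task to be processed."""
    return QUESTION_TEMPLATES[task_to_process]

print(build_task_question("relation_type"))  # The relationship type of "<Number>" is __
```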
S40: and splicing the mask text, the task problem and the initial task prompt vector to obtain a training text.
After the task problem of the numerical mask is constructed, the mask text, the task problem and the initial task prompt vector are spliced to obtain the training text. The method for splicing the training texts comprises, but is not limited to, splicing the mask texts, the task questions and the initial task prompt vectors into the training texts according to any positions. The number of the initial task prompt vectors is one or more, the number of the initial task prompt vectors is not limited, and the plurality of task prompt vectors can be split and spliced at different positions so as to prompt the language model, so that the downstream task of the language model can be adjusted, namely matrix parameters of each task prompt vector are respectively adjusted, the downstream task is adapted to the language model, and finally, the text of the input language model is modified through the adjustment of the task prompt vectors, so that the language model obtains the expected answer of the task question.
For example, the mask text is a, the task question is B, the initial task prompt vectors are 99, the initial task prompt vectors are divided into 3 parts, C, D, E, 33 vectors in each part, and the finally spliced training text is CBDAE.
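At the embedding level, the splicing of step S40 can be sketched as follows, assuming the language model accepts token embeddings so that randomly initialized prompt vectors can be concatenated between the text segments; the hidden size and shapes are illustrative assumptions.

```python
import torch

hidden = 768                                                   # assumed hidden size of the model
prompts = torch.nn.Parameter(torch.randn(99, hidden) * 0.02)  # 99 trainable prompt vectors
C, D, E = prompts[:33], prompts[33:66], prompts[66:]           # three parts of 33 vectors each

def splice_training_input(question_emb: torch.Tensor, mask_text_emb: torch.Tensor) -> torch.Tensor:
    """Concatenate prompt parts and text embeddings in the CBDAE order of the example."""
    # C + question (B) + D + mask text (A) + E
    return torch.cat([C, question_emb, D, mask_text_emb, E], dim=0)
```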
S50: and inputting the training text into a language model to predict an initial answer of the task question.
After the training text is spliced, the training text is input into a pre-training Language Model (Language Model), and an initial answer of the task question is predicted. The Pre-Training language model refers to modeling sentence probability distribution of Training text, and includes, but is not limited to, CPM-2 (Large-Scale efficient Pre-Training language model, which is called as Large-Scale Cost-Effective Pre-Trained Language Models), GPT-2 (Large-Scale unsupervised language model, which is called as generating Pre-Training), and the like. The initial answer of the task question refers to the answer of the pre-training language model to the task question constructed in step S30.
For example, the present embodiment inputs the trained text into CPM-2, resulting in a sequence output for each instant of CPM-2.
S60: and iteratively adjusting matrix parameters of the initial task prompt vector according to the difference between the expected answer and the initial answer until the initial answer accords with the expected answer, so as to obtain the task prompt vector.
Judging whether the answer accords with the expected answer according to the initial answer output by the pre-training language model, if the answer does not accord with the expected answer, adjusting matrix parameters of the initial task prompt vector in the training text, inputting the training text carrying the adjusted task prompt vector into the pre-training language model again until the answer of the pre-training language model accords with the expected answer, and then training the task prompt vector in the text to be the task prompt vector required by the step S60.
For example, if the expected answer is a and the initial answer is B, it is determined whether the initial answer B is the expected answer a. When the initial answer B is not equal to the expected answer a, matrix parameters of the task prompt vector in the training text are adjusted, the language model is input again until the output result of the language model is a, and the task prompt vector at this time is the task prompt vector required in step S60.
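The training loop of steps S50-S60 can be sketched as below, assuming a PyTorch model with a HuggingFace-style inputs_embeds/labels interface (an assumption, not the embodiment's stated framework); only the prompt matrix receives gradients, while the language model is frozen.

```python
import torch

def tune_prompt_vectors(model, prompts, dataloader, epochs=10, lr=1e-3):
    """prompts: torch.nn.Parameter of shape (num_prompt_tokens, hidden)."""
    for p in model.parameters():
        p.requires_grad = False                     # freeze the large language model
    optimizer = torch.optim.Adam([prompts], lr=lr)  # train only the prompt matrix
    for _ in range(epochs):
        for inputs_embeds, labels in dataloader:    # spliced embeddings and expected answers
            loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
            loss.backward()                         # driven by expected-vs-initial answer difference
            optimizer.step()
            optimizer.zero_grad()
    return prompts
```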
It should be noted that, in this embodiment, numerical information related to the task to be processed is identified in the financial text according to the domain characteristics of financial texts, where the task to be processed includes, but is not limited to, entity recognition, relation extraction and the like. The identified numerical information is then replaced with numerical masks to obtain a mask text. This operation identifies all the numerical information related to the task to be processed in one pass, so that entity recognition and relation extraction can conveniently be performed for all the numerical information at once, greatly improving the processing efficiency of financial texts. Then, by constructing task questions about the numerical masks, entity recognition and relation extraction are converted into a text question-answering task: the number of relation categories does not need to be set in advance, and the relation category names are obtained directly from the subsequent language model. Because there is no need to preset the number of categories, the extensibility of the relation extraction method is improved, so that it can be applied to other tasks and fields. Finally, the mask text, the task question and the initial task prompt vector are spliced to obtain the training text; the training text is input into the language model to predict an initial answer to the task question; and the matrix parameters of the initial task prompt vector are iteratively adjusted according to the difference between the expected answer and the initial answer, until the initial answer matches the expected answer, yielding the task prompt vector. This training method is based on prompt learning: it freezes the model parameters of the language model and completes training by adjusting only the matrix parameters of the task prompt vector. Since the scale of the task prompt vector is far smaller than that of the language model, the method saves training time and improves training efficiency.
In one embodiment, as shown in fig. 3, a financial text processing method is provided. Taking the method as applied to the server at the server side in fig. 1 as an example, it includes the following steps:
S21: Replacing the numerical information related to the task to be processed in the financial text with numerical masks to obtain a mask text.
After the financial text is transmitted to the server, the server identifies the numerical information in the financial text according to the task to be processed and replaces it with numerical masks to obtain the mask text. Here the financial text includes, but is not limited to, enterprise research reports, financial news and the like; the ways of collecting financial texts include, but are not limited to, collection with a crawler tool, collection of user input through an interface and the like; the task to be processed includes, but is not limited to, entity recognition, relation extraction and the like; and the numerical information includes, but is not limited to, proportions, prices, period-on-period ratios and the like. A numerical mask is a marker for a numerical value that needs to be analyzed, including but not limited to a unique identifier, a type identifier and the like.
For example, the terminal device collects the financial text "the accounts receivable of the common debtor are 30,676.99 ten-thousand yuan, 46,124.28 ten-thousand yuan, 46,290.36 ten-thousand yuan and 46,945.89 ten-thousand yuan respectively, accounting for 0.35%, 0.36%, 0.27% and 0.26% of total assets respectively". The value relevant to the task to be processed, "0.35%", is identified and replaced with <NUMBER>, so the mask text is "the accounts receivable of the common debtor are 30,676.99 ten-thousand yuan, 46,124.28 ten-thousand yuan, 46,290.36 ten-thousand yuan and 46,945.89 ten-thousand yuan respectively, accounting for <NUMBER>, 0.36%, 0.27% and 0.26% of total assets respectively".
S22: and constructing the task problem of the numerical mask according to the task to be processed.
And constructing related task problems of the numerical mask according to the task to be processed. The task questions include, but are not limited to, blank questions, question answers, etc., while the scope of the task questions includes, but is not limited to, asking for the corresponding type of the mask text, asking for the subject information of the mask text, etc.
In this embodiment, the task to be processed is a task identification entity and a relationship extraction, so when a task problem is constructed, the task problem of the entity identification entity and the task problem of the relationship extraction can be respectively constructed, or the entity identification entity and the relationship extraction can be constructed into the same task problem, so that the subsequent language model can predict the answer of the task problem.
For example, the mask text is "the accounts receivable of the common debtor is 30,676.99 ten thousand yuan, 46,124.28 ten thousand yuan, 46,290.36 ten thousand yuan and 46,945.89 ten thousand yuan respectively, and the proportion of the mask text to the total assets is < NUMBER >, 0.36%, 0.27% and 0.26%", and at this time, the entity and the relation to be identified are extracted and constructed into the same filling problem, that is, the task problem: the type, time, relationship, duty cycle field, subject corresponding to "< NUMBER > is ___".
S23: and splicing the mask text, the task problem and the task prompt vector to obtain a target text, wherein the task prompt vector is trained according to the training method of the financial task prompt vector in the steps.
And modifying the data input into the language model so that the language model obtains correct task question answers through the input data, namely splicing the task questions, masking texts and task prompt vectors trained by the financial task prompt vectors.
S24: and predicting a target answer of the task question through a language model according to the target text.
The input data modified in step S23 and the above-mentioned language model prediction method are used to obtain the predicted answers of the task questions, where the predicted answers include, but are not limited to, the recognition results of the entities, the extraction results of the relationships, and the like.
For example, the mask text is "the accounts receivable of the common debtor is 30,676.99 ten thousand yuan, 46,124.28 ten thousand yuan, 46,290.36 ten thousand yuan and 46,945.89 ten thousand yuan, the proportion of the total assets is < NUMBER >, 0.36%, 0.27% and 0.26%", the task problem is "< NUMBER > corresponding type, time, relation, duty field and main body are ___". And (3) splicing and reforming the mask text, the task questions and the task prompt vectors, inputting the mask text, the task questions and the task prompt vectors into a language model, and obtaining a predicted answer of the final task questions as "< START > duty ratio type < SEP >2019 < SEP > accounts receivable < SEP > total asset < SEP > common debtor < END >", wherein < START > is a START marker, < SEP > is a separation marker and < END > is an END marker. From the predicted answer, the results of entity identification and relationship extraction can be seen.
It should be noted that, in this embodiment, entity recognition and relation extraction are converted into a text generation task, that is, the answer to the task question is obtained through the language model. There is therefore no need to define a set of relation categories or to fix a number of relation categories in the model; the names of the relation categories and the entity types are generated directly as the language model's answer. When a new relation type and corresponding data are added, the results of entity recognition and relation extraction can be obtained simply by constructing a task question as in this embodiment and inputting it to the language model, without changing the model structure or the original data.
In one embodiment, before step S21, that is, before the numerical information in the financial text is replaced with numerical masks to obtain the mask text, the method further includes the following steps:
S11: Identifying time information in the financial text.
S12: Judging whether the time information is complete.
S13: If the time information is incomplete, completing the time information.
In this embodiment, all the time information in the financial text is completed.
Specifically, in step S11, all time information in the financial text is first identified, where time information refers to any text that can express a time or period, including but not limited to years, months, shorthand or abbreviated dates, time spans, relative periods and the like, for example: "February 2, 2019", "last quarter", "this month", etc.
Next, in steps S12-S13, it is judged whether each piece of time information identified in step S11 is complete, where complete time information is time information that a machine can interpret directly; incomplete time information includes, but is not limited to, time entities with missing components and time descriptions that contain no specific time point. If the time information is incomplete, it is completed; methods of completion include, but are not limited to, filling in the missing time components and replacing a relative time description with a specific time point.
For example, the financial text contains the following: "the cumulative box office of the first quarter in 2019-2021 and 2022 was 6,116.7 ten-thousand, 5,601.2 ten-thousand, 5,001.4 ten-thousand and 400 ten-thousand respectively. The numbers of imported films in the last three years and the current period were 136, 62, 73 and 59 respectively." Here "2019-2021" is identified as time information with missing time entities, and "the last three years and the current period" is a time description containing no specific time point. After the recognition result is obtained, the time information is completed in step S13: "2019-2021" is completed to "2019, 2020, 2021", and since the last three years refer to 2020-2022, "the last three years and the current period" is completed to "2020, 2021, 2022 and the first quarter of 2023".
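The two completion rules in this example (expanding a year range, and resolving a relative phrase against a reference year) can be sketched as follows; the rules and the reference-year parameter are illustrative assumptions rather than the embodiment's full rule set.

```python
import re

def complete_time_info(text: str, reference_year: int) -> str:
    """Complete incomplete time information according to simple rules."""
    def expand_range(m: re.Match) -> str:
        start, end = int(m.group(1)), int(m.group(2))
        return ", ".join(str(y) for y in range(start, end + 1))
    # Rule 1: "2019-2021" -> "2019, 2020, 2021"
    text = re.sub(r"\b(\d{4})-(\d{4})\b", expand_range, text)
    # Rule 2: resolve "the last three years" against the reference year
    years = ", ".join(str(reference_year - i) for i in (3, 2, 1))
    return text.replace("the last three years", years)

print(complete_time_info("box office in 2019-2021", reference_year=2023))
# 'box office in 2019, 2020, 2021'
```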
It should be noted that, owing to the particularity of financial texts, they contain a great deal of time information, such as "the last three years" or "2019-2023", that is ambiguous to a machine and easily missed, yet such time information is very important in financial analysis. To prevent information from being missed during entity recognition or relation extraction because of ambiguous time information, this embodiment completes the missing information according to certain rules, effectively alleviating the technical problem that entity and relation extraction is incomplete and prone to omissions.
In one embodiment, before step S41, that is, before the position prompt vector is inserted before the numerical mask of the mask text, the training method for financial task prompt vectors further includes recording the subscript position of the numerical information related to the task to be processed in the financial text. The subscript position is the position at which the position prompt vector needs to be inserted.
In an embodiment, step S23, that is, the splicing of the mask text, the task question and the task prompt vector to obtain the target text, specifically includes the following steps:
S41: Inserting the position prompt vector before the numerical mask of the mask text to obtain a first spliced text.
S42: Inserting the first task prompt vector before the task question to obtain a second spliced text.
S43: Inserting the second task prompt vector after the second spliced text to obtain a third spliced text.
S44: Inserting the first spliced text after the third spliced text to obtain a fourth spliced text.
S45: Inserting the third task prompt vector after the fourth spliced text to obtain the target text.
In one embodiment, the mask text, the task question and the task prompt vectors are spliced as follows.
Specifically, a number of task prompt vectors, including but not limited to a position prompt vector, a first task prompt vector, a second task prompt vector and a third task prompt vector, are randomly initialized, and the position prompt vector is inserted before the numerical mask of the mask text to mark the numerical position that the language model needs to analyze. Then the first task prompt vector, the task question, the second task prompt vector, the mask text with the inserted position prompt vector, and the third task prompt vector are spliced in sequence to obtain the target text. The numbers of position prompt vectors and of first, second and third task prompt vectors are not limited. The position prompt vector mainly marks the position of the mask so that the subsequent language model can recognize it accurately and conveniently. The first, second and third task prompt vectors are all randomly initialized hidden vectors; one or more of them modify the text input to the language model, so that an accurate target answer can ultimately be obtained from the input without adjusting the language model itself.
For example, in this embodiment 100 task prompt vectors are randomly initialized, of which 33 form the first task prompt vector A, 33 the second task prompt vector B, 33 the third task prompt vector C, and 1 the position prompt vector D. The position prompt vector is inserted before the numerical mask of the mask text E to obtain a first spliced text F, and the remaining prompt vectors are spliced with the first spliced text F and the task question Q, finally giving the target text AQBFC.
It should be noted that this embodiment, based on prompt learning, randomly initializes a number of task prompt vectors and modifies the data input to the model, that is, splices the task prompt vectors into the input data so that the downstream task of the language model can subsequently be trained; the target answer to the task question is obtained by adjusting the matrix parameters of the task prompt vectors. Throughout this process there is no need to adjust the parameters or the structure of the language model, which is usually a large-scale model. Training the task prompt vectors directly, rather than the language model, therefore effectively saves training time and improves training efficiency.
In one embodiment, step S24, that is, predicting the target answer to the task question through the language model according to the target text, specifically includes the following steps:
S70: Replacing the words to be predicted in the text to be processed with mask marks to obtain a mask text.
The text to be processed is collected through the terminal device at the client side, the words to be predicted in it are identified, and they are replaced with mask marks to obtain the mask text. The text to be processed is the text containing the characters or words to be analyzed, including but not limited to industry research reports, news articles and the like; ways of collecting it include, but are not limited to, collection with a crawler tool and collection of user input through an interface; a word to be predicted is a character or word that the language model needs to analyze, including but not limited to words, numbers and the like; and a mask mark is a marker for the character or word to be analyzed, including but not limited to a unique identifier, a type identifier and the like.
For example, the terminal device collects the text to be processed "I like watching movies". If the word to be predicted is "movies", it is replaced with the mask mark <NOUN>, and the mask text "I like watching <NOUN>" is obtained.
S80: and adding the mask mark into a task vocabulary.
After the mask text is constructed, the mask mark is added to the task vocabulary. The task word list refers to a list preservation result for vectorizing all words to be predicted, including but not limited to the words to be predicted, mask marks of the words to be predicted, and vectors of the words to be predicted.
S90: and inputting the mask text into a language model, and outputting a first sequence value of the predicted result language sequence.
The mask text is input into a language model, the language model will obtain a prediction result of the mask text, the prediction result will be sequentially output in the form of a language sequence, including but not limited to words, numbers, start markers, end markers, separation markers, and the like, the prediction result language sequence includes but not limited to a first sequence value, a second sequence value, and the like, wherein in the embodiment, the first sequence value and the second sequence value are sequentially ordered according to the prediction result language sequence, that is, a sequence output first in the prediction result is the first sequence value, and a sequence output later is the second sequence value.
For example, "i like to see < noise >, because < noise > is __", the language model will derive a predicted answer for the text, the predicted answer being "< START > is interesting < END >", so the language model will sequentially output the language sequences of the predicted answer, i.e., sequentially output the four language sequences of "< START >", "interesting" and "< END >", in accordance with the order of the language sequences of the predicted results, "< START >" is the current output sequence value of the predicted result, i.e., the first sequence value, and "have" is the sequence value that the predicted result is about to output, i.e., the second sequence value. In this embodiment, in step S90, after the language model outputs "< START >" the prediction of the next sequence value is performed according to "< START >", that is, steps S100 to S160.
S100: and obtaining the probability of the appearance of the end marker after the first sequence value, obtaining a first prediction probability, and obtaining the probability of the appearance of the separation marker after the first sequence value, obtaining a second prediction probability.
Since the predicted result output by the language model must include a start marker and an end marker, the first sequence value of the predicted result must be the start marker. In addition, since each input text may need to output predicted answers to a plurality of questions, each predicted answer may be separated using a separation marker, so that all the predicted answers are accurately output at one time.
Therefore, the probability that the end marker and the separation marker appear after the first sequence value needs to be preset, and in step S100, the probability that the end marker and the separation marker appear after the current first sequence value, that is, the first prediction probability and the second prediction probability, is obtained according to the preset probability.
S110: and respectively predicting the probability that each word in the task vocabulary appears after the first sequence value, and taking the maximum probability as vocabulary prediction probability.
And outputting the first sequence value through the language model, respectively predicting the probability of each word appearing after the first sequence value in the task vocabulary, obtaining a plurality of expected probabilities, taking the maximum expected probability as the vocabulary prediction probability, and taking the word with the maximum expected probability as the vocabulary word.
S120: and respectively predicting the probability that each word in the text to be processed appears after the first sequence value, and taking the maximum probability as the text prediction probability.
And outputting the first sequence value through the language model, respectively predicting the probability that each word in the text to be processed appears after the first sequence value, obtaining a plurality of expected probabilities, taking the maximum expected probability as the text prediction probability, and taking the word with the maximum expected probability as the text word.
S130: and screening a preset fragment area in the context of the first sequence value in the text to be processed to obtain a target fragment area with the maximum probability, and taking the probability corresponding to the target fragment area as the fragment prediction probability.
The segment regions in the context of the first sequence value are preset, including but not limited to specifying a preset number of segment regions, specifying a preset range of segment regions of text, and the like. Then, the prediction probability of each segment region is calculated, and the segment region with the largest prediction probability is screened out and used as the segment prediction probability. The probability calculation method of the preset fragment area includes, but is not limited to, similarity calculation with the first sequence value, compactness calculation with the first sequence value and the like.
For example, the text to be processed is "ABCD", and the current language model outputs to C the sequence value, so C is taken as the first sequence value, at which time the segment regions in the context of the preset first threshold are "AB" and "D", where the similarity of "AB" to "C" is 30% and the similarity of "C" to "D" is 50%. Therefore, the segment region "D" is selected as the target segment region, and the predicted probability value thereof is 50% as the segment predicted probability, that is, the segment predicted probability is 50%.
S140: and selecting the maximum probability among the first prediction probability, the second prediction probability, the text prediction probability, the word list prediction probability and the fragment prediction probability as the target prediction probability.
The maximum probability is selected from steps S100-S130, that is, the probability with the maximum probability value is selected from the first prediction probability, the second prediction probability, the text prediction probability, the vocabulary prediction probability, and the segment prediction probability, as the target prediction probability. Wherein the target prediction probability is the probability of the sequence value that most appears after the first sequence value.
For example, the first prediction probability is 2%, the second prediction probability is 9%, the text prediction probability is 14%, the vocabulary prediction probability is 7%, and the segment prediction probability is 6%, and at this time, the probability value of the text prediction probability is 14% of the maximum value, and the text prediction probability is finally regarded as the target prediction probability, that is, 14%.
S150: and taking the word corresponding to the target prediction probability as a second sequence value of the language model.
And after the target prediction probability is obtained, taking the word corresponding to the target prediction probability as a second sequence value output by the language model.
For example, the first prediction probability is 2%, the second prediction probability is 9%, the text word with the largest prediction probability in the text to be processed is A, the text prediction probability is 14%, the vocabulary word with the largest prediction probability in the task vocabulary is B, the vocabulary prediction probability is 7%, the corresponding word in the area with the largest prediction probability in the preset segment area is C, and the segment prediction probability is 6%, at this time, the text prediction probability is taken as the target prediction probability because the probability value of the text prediction probability is 14% of the maximum value. The word corresponding to the target prediction probability is a text word, namely A. Therefore, the second sequence value output by the language model is a.
S160: and the language model outputs a second sequence value after the first sequence value until the second sequence value is the end marker, so as to obtain a prediction result of the language model.
The language model will output a second sequence value after the first sequence value, repeat the steps of steps S100-160 with the second sequence value as the first sequence value until the second sequence value is an end marker. At this time, the final result output by the language model is its final predicted result.
For example, the language model will output "< START >", then "< START >" as the first sequence value, predict that the second sequence value is "yes", then the language model will output the second sequence value after the first sequence value, i.e., the language model will output "< START > yes", then "yes" as the first sequence value, predict that the second sequence value is "interesting", the language model will output "interesting" after "having" and the currently output result is "< START > interesting". The loop is repeated until the second sequence value is the END marker < END >, and the prediction result "< START > of the language model is interesting < END >".
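The decoding loop of steps S100-S160 can be sketched as follows, assuming scoring callbacks that return the best candidate word and its probability for each of the three data-dependent sources; the callback signatures are illustrative assumptions, and the marker strings follow the examples above.

```python
END_MARKER, SEP_MARKER = "<END>", "<SEP>"

def decode(first_value, end_prob, sep_prob, score_text, score_vocab, score_segment):
    """Each score_* callback returns (best_word, probability) for the current sequence value."""
    output = [first_value]            # the first sequence value is always the start marker
    current = first_value
    while current != END_MARKER:
        text_word, text_p = score_text(current)     # best word from the text to be processed
        vocab_word, vocab_p = score_vocab(current)  # best word from the task vocabulary
        seg_word, seg_p = score_segment(current)    # starting word of the best segment region
        candidates = [
            (end_prob, END_MARKER),   # first prediction probability
            (sep_prob, SEP_MARKER),   # second prediction probability
            (text_p, text_word),      # text prediction probability
            (vocab_p, vocab_word),    # vocabulary prediction probability
            (seg_p, seg_word),        # segment prediction probability
        ]
        _, current = max(candidates, key=lambda c: c[0])  # target prediction probability
        output.append(current)        # steps S100-S160 repeat with the new sequence value
    return output
```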
It should be noted that, in the constrained-decoding prediction method of steps S100-S160, when a single entity is extracted, the remaining segments of that entity's context are searched (as in step S130), enabling cross-region extraction of entity segments and hence extraction of discontinuous entities in the text to be processed. This effectively addresses the extraction omissions that the discontinuity of fields in financial texts causes during entity extraction.
In one embodiment, step S150, that is, taking the word corresponding to the target prediction probability as the second sequence value of the language model, specifically includes the following steps:
S151: and if the target prediction probability is the first prediction probability, the ending marker is taken as the second sequence value.
S152: and if the target prediction probability is the second prediction probability, the segmentation marker is used as the second sequence value.
S153: and if the target prediction probability is the text prediction probability, taking the text word which obtains the text prediction probability as the second sequence value.
S154: and if the target prediction probability is the vocabulary prediction probability, obtaining the vocabulary words of the vocabulary prediction probability as the second sequence value.
S155: and if the target prediction probability is the segment prediction probability, using a starting word of a preset segment region corresponding to the segment prediction probability as the second sequence value.
In one embodiment, the word corresponding to the target prediction probability is selected as follows.
Specifically, in steps S151-S152, if the target prediction probability is the first or the second prediction probability, the corresponding end marker or separation marker is directly selected as the second sequence value.
In steps S153-S154, if the target prediction probability is the text prediction probability or the vocabulary prediction probability, the text word with the largest prediction probability in the text to be processed, or the vocabulary word with the largest prediction probability in the task vocabulary, is found and taken as the second sequence value.
In step S155, when the target prediction probability is the segment prediction probability, the starting word of the corresponding preset segment region, that is, the first word of that region, is taken as the second sequence value. For example, the text to be processed is "we are not the same"; if the current first sequence value is "not", the preset segment regions in the context are "we are" and "the same", with prediction probabilities of 10% and 50% respectively, so the starting word of the segment region "the same", namely "the", is selected as the second sequence value.
It should be noted that, by computing these prediction probabilities, this embodiment fully considers all possibilities for the next sequence value and then selects the most likely word as the second sequence value, which effectively improves the accuracy of the language model's prediction result.
In an embodiment, step S130, that is, screening the preset segment regions in the context of the first sequence value to obtain the target segment region with the highest probability and taking the probability corresponding to the target segment region as the segment prediction probability, specifically includes the following steps:
S131: Calculating the similarity between the first sequence value and each text region in a preset segment region to obtain a similarity score for each text region.
S132: Adding the similarity scores of all text regions of the same preset segment region in the context as the region probability of that preset segment region.
S133: Screening out the largest region probability as the segment prediction probability.
In one embodiment, a method for calculating the segment prediction probability is provided as follows.
Specifically, in step S131, since a preset segment region contains one or more text regions, the similarity between the first sequence value and each text region is calculated to obtain the similarity score of each text region. Methods of calculating similarity include, but are not limited to, cosine similarity, the Pearson correlation coefficient and the like. In steps S132-S133, the scores of all text regions within the same preset segment region are added to give the region probability of that region, and the largest region probability is selected as the segment prediction probability.
For example, the text to be processed is the lunar date "bingyin year, twelfth lunar month, twenty-eighth", and the first sequence value is "month". The preset segment regions are the preceding region "bingyin year, twelfth" and the following region "twenty-eighth". The preceding region comprises four text regions, "bing", "yin", "year" and "twelfth"; the similarity between each of these and the first sequence value "month" is calculated by cosine similarity to obtain a similarity score for each text region, and the scores are added, giving a region probability of 34% for the preceding region. By the same method, the region probability of the following region "twenty-eighth" is 54%. The largest region probability, 54%, is screened out and taken as the segment prediction probability.
It should be noted that, by computing the similarity between the first sequence value and the preset segment regions in its context, the preset segment region most relevant to the first sequence value can be extracted and its starting word used as the next sequence value. On the one hand, this effectively improves the accuracy of the language model's prediction result; on the other hand, because the context regions of the first sequence value are searched and their relevance is computed, the answer to the task question can be recovered even when it is discontinuous in the text to be processed, solving the problem of information omission caused by discontinuous entity extraction.
In summary, the numerical information of the financial text is replaced with mask marks, and different task questions are constructed for the masks, so that multiple related entities of different types can be extracted in one pass, effectively improving the efficiency of entity and relation extraction. Further, the input data of the language model is modified through the trained task prompt vectors, so that no large-scale language model needs to be trained: training only the small-scale task prompt vectors is enough for the language model to produce the answers to the task questions, that is, to extract entities and relations accurately, which effectively improves the efficiency of entity recognition and relation extraction. In addition, this embodiment provides a language model prediction method that modifies the output decoding of the language model; when extracting a single entity, it can extract entity segments across regions according to the entity's context segments, effectively solving the problem of incomplete entity extraction caused by discontinuous entities in the text. Furthermore, the invention completes the time information in the financial text according to certain rules based on the characteristics of the financial field, alleviating omissions in entity and relation extraction caused by incomplete information. Finally, the invention provides a technical solution that converts entity recognition and relation extraction into a text generation task: unlike traditional solutions, it does not require a set of relation categories or a fixed number of relation categories, but asks the language model directly about the relations and entities, obtaining the relation and entity names directly through the constructed task questions. Because no relation category set or category count has to be configured, the method can be applied flexibly to many fields and different models, with good extensibility.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not be construed as limiting the implementation process of the embodiments of the present invention.
In one embodiment, a financial text processing apparatus is provided, and the apparatus corresponds one-to-one to the financial text processing method in the above embodiment. As shown in fig. 4, the financial text processing apparatus includes a replacing module 10, a construction module 20, a splicing module 30, and an output module 40. The functional modules are described in detail as follows:
a replacing module 10, configured to replace numerical information related to a task to be processed in a financial text with a numerical mask, so as to obtain a mask text;
a construction module 20, configured to construct a task question for the numerical mask according to the task to be processed;
the splicing module 30 is configured to splice the mask text, the task question, and the task prompt vector to obtain a target text;
and the output module 40 is used for predicting a target answer of the task question through a language model according to the target text.
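As a non-authoritative sketch of how these modules could be composed (the class and method names are hypothetical, the language model is assumed to expose a generate interface, and string concatenation stands in for the embedding-level splicing of the trained task prompt vector):

```python
import re

class FinancialTextProcessor:
    """Illustrative composition of modules 10-40 under the stated
    assumptions; not the definitive implementation of the apparatus."""

    NUM_MASK = "[NUM]"  # hypothetical numerical mask token

    def __init__(self, language_model, task_prompt="[PROMPT]"):
        self.model = language_model
        self.task_prompt = task_prompt  # stands in for the trained prompt vector

    def replace(self, financial_text: str) -> str:   # replacing module 10
        return re.sub(r"\d+(?:\.\d+)?", self.NUM_MASK, financial_text)

    def construct(self, task: str) -> str:           # construction module 20
        return f"Which entity does {self.NUM_MASK} describe for {task}?"

    def splice(self, mask_text: str, question: str) -> str:  # splicing module 30
        # The disclosure splices trained prompt vectors at the embedding
        # level; string concatenation stands in for that here.
        return f"{self.task_prompt} {question} {mask_text}"

    def output(self, target_text: str) -> str:       # output module 40
        return self.model.generate(target_text)

    def process(self, financial_text: str, task: str) -> str:
        mask_text = self.replace(financial_text)
        question = self.construct(task)
        return self.output(self.splice(mask_text, question))
```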
For specific limitations of the financial text processing apparatus, reference may be made to the limitations of the financial text processing method above, which are not repeated here. The modules in the financial text processing apparatus may be implemented in whole or in part by software, hardware, or a combination thereof.
The above modules may be embedded in hardware form in, or independent of, a processor in the computer device, or may be stored in software form in a memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server or a terminal, and whose internal structure may be as shown in fig. 5. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing text data collected by the terminal device and data generated in the foregoing method embodiments. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the methods of the foregoing method embodiments.
It will be appreciated by those skilled in the art that the structure shown in fig. 5 is merely a block diagram of a portion of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps in the above-described method embodiments.
Those skilled in the art will appreciate that all or part of the processes of the methods in the above embodiments may be implemented by a computer program stored on a non-volatile computer-readable storage medium; when executed, the computer program may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, optical memory, and the like. The volatile memory may include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static random access memory (Static Random Access Memory, SRAM) and dynamic random access memory (Dynamic Random Access Memory, DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this specification.
The foregoing embodiments represent only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that various modifications and improvements can be made by those skilled in the art without departing from the concept of the present application, and these all fall within the protection scope of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (10)

1. A financial text processing method, comprising:
replacing numerical information related to the task to be processed in the financial text with a numerical mask to obtain a mask text;
constructing a task question for the numerical mask according to the task to be processed;
splicing the mask text, the task questions and the task prompt vector to obtain a target text;
and predicting a target answer of the task question through a language model according to the target text.
2. The financial text processing method of claim 1, wherein before the splicing the mask text, the task question, and the task prompt vector, the method further comprises:
splicing the mask text, the task question, and an initial task prompt vector to obtain a training text;
inputting the training text into a language model, and predicting an initial answer to the task question;
and iteratively adjusting matrix parameters of the initial task prompt vector according to the difference between the expected answer and the initial answer until the initial answer accords with the expected answer, so as to obtain the task prompt vector.
3. The financial text processing method of claim 1, wherein before the replacing numerical information related to the task to be processed in the financial text with a numerical mask to obtain the mask text, the method further comprises:
identifying time information in the financial text;
judging whether the time information is complete or not;
and if the time information is incomplete, complementing the time information.
4. The method of claim 1, wherein the task prompt vector includes a position prompt vector, a first task prompt vector, a second task prompt vector, and a third task prompt vector, and the splicing the mask text, the task question, and the task prompt vector to obtain the target text comprises:
before the numerical mask of the mask text, inserting the position prompt vector to obtain a first spliced text;
Before the task problem, inserting the first task prompt vector to obtain a second spliced text;
inserting the second task prompt vector after the second spliced text to obtain a third spliced text;
after the third spliced text, the first spliced text is inserted to obtain a fourth spliced text;
and after the fourth spliced text, inserting the third task prompt vector to obtain the target text.
5. The method of claim 1, wherein the predicting a target answer of the task question through a language model according to the target text comprises:
inputting the target text into a language model, and outputting a first sequence value of a target answer language sequence of the task question;
screening preset segment regions in the context of the first sequence value in the target text to obtain a target segment region with the maximum probability, and taking the probability corresponding to the target segment region as a segment prediction probability;
taking the word corresponding to the target prediction probability as a second sequence value of a prediction result language sequence output by the language model;
and outputting, by the language model, the second sequence value after the first sequence value to obtain a target answer of the task question.
6. The method of claim 5, wherein the screening preset segment regions in the context of the first sequence value in the target text to obtain a target segment region with the maximum probability, and taking the probability corresponding to the target segment region as the segment prediction probability comprises:
respectively calculating the similarity between the first sequence value and each text region in each preset segment region to obtain a similarity score of each text region;
adding the similarity scores of all the text regions in a preset segment region to obtain the region probability of that preset segment region;
and screening out the largest region probability as the segment prediction probability.
7. The method of claim 5, wherein the taking the word corresponding to the target prediction probability as the second sequence value of the prediction result language sequence output by the language model comprises:
if the target prediction probability is the first prediction probability, taking the ending marker as the second sequence value;
if the target prediction probability is the second prediction probability, taking the segmentation marker as the second sequence value;
if the target prediction probability is the text prediction probability, taking the text word that obtains the text prediction probability as the second sequence value;
if the target prediction probability is the vocabulary prediction probability, taking the vocabulary word that obtains the vocabulary prediction probability as the second sequence value;
and if the target prediction probability is the segment prediction probability, taking a starting word of the preset segment region corresponding to the segment prediction probability as the second sequence value.
8. A financial text processing apparatus, comprising:
the replacing module is used for replacing numerical information related to the task to be processed in the financial text with a numerical mask to obtain a mask text;
the construction module is used for constructing a task question for the numerical mask according to the task to be processed;
the splicing module is used for splicing the mask text, the task question, and the task prompt vector to obtain a target text;
and the output module is used for predicting a target answer of the task question through a language model according to the target text.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.
CN202310190221.9A 2023-02-22 2023-02-22 Financial text processing method, device, equipment and storage medium Pending CN116186223A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310190221.9A CN116186223A (en) 2023-02-22 2023-02-22 Financial text processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116186223A true CN116186223A (en) 2023-05-30

Family

ID=86444176

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310190221.9A Pending CN116186223A (en) 2023-02-22 2023-02-22 Financial text processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116186223A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117371428A (en) * 2023-09-25 2024-01-09 百度国际科技(深圳)有限公司 Text processing method and device based on large language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination