CN113821635A - Text abstract generation method and system for financial field - Google Patents

Text abstract generation method and system for financial field

Info

Publication number
CN113821635A
CN113821635A (application CN202110879065.8A)
Authority
CN
China
Prior art keywords
text
abstract
data set
generating
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110879065.8A
Other languages
Chinese (zh)
Inventor
龙雄伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202110879065.8A priority Critical patent/CN113821635A/en
Publication of CN113821635A publication Critical patent/CN113821635A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/31 Indexing; Data structures therefor; Storage structures
    • G06F 16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and a system for generating a text abstract in the financial field. The method comprises the following steps: collecting a text data set of the financial field and constructing a domain corpus of the financial industry; preprocessing the text data set according to the domain corpus to generate a training sample, a test sample and a verification sample; constructing a text abstract generation model, wherein the model consists of an encoding structure and a decoding structure, the encoding structure comprises a bidirectional LSTM network and adopts one-hot encoding, the decoding structure comprises a unidirectional LSTM network to which a scaling-and-translation layer is added, and the scaling-and-translation layer adopts a pointer generation network and is also provided with a coverage vector; training the text abstract generation model with the training sample, the test sample and the verification sample to obtain a target text abstract generation model; and inputting the text to be detected into the target text abstract generation model and outputting the text abstract. The invention can improve the precision and accuracy of current automatic text summarization and improve the efficiency of text abstract generation.

Description

Text abstract generation method and system for financial field
Technical Field
The invention relates to the technical field of deep learning, in particular to a text abstract generation method and system for the financial field.
Background
With the explosive growth of text information in recent years, people are exposed every day to massive amounts of text such as news, blogs, chats, microblogs, reports and papers. Extracting the important content from this flood of text has become an urgent need, and automatic text summarization provides an efficient solution.
Despite the enormous demand for automatic text summarization, development in this area has been relatively slow. Generating a summary is a challenging task for a computer: to produce a qualified abstract from one or more texts, the computer must first read and understand the original text, then select, trim and splice its content according to relative importance, and finally produce a fluent short text. An industrial-grade, low-cost and high-precision automatic text summarization system would therefore benefit the development of many applications, such as news headline generation, paper abstract generation and report summarization.
Summary generation methods in the prior art include the following:
1. Extractive summarization
This approach mainly uses an algorithm to select and extract key sentences from the source documents as summary sentences. Its fluency is generally better than that of abstractive summarization, but it introduces too much redundant information and cannot reflect the characteristics of a true abstract.
The extractive method selects key sentences from the original text to form the summary. It has a naturally low error rate in grammar and syntax and guarantees a certain level of quality. Traditional extractive methods use graph-based algorithms, clustering and similar techniques to perform unsupervised extraction. A popular supervised approach extracts sentence-level features, such as the length and position of a sentence and the TF-IDF values of its words, and then selects sentences with a machine learning algorithm. Neural-network-based extraction instead models the problem as two tasks: sequence labeling and sentence ranking.
The traditional method for extracting the abstract mainly comprises the following steps: lead-3, TextRank, clustering.
Lead-3: the author of an article usually states the topic in the title and at the beginning of the article, so the simplest method is to extract the first few sentences of the article as the summary. Lead-3 extracts the first three sentences of the article as its summary. Although simple and direct, the Lead-3 method is a very effective one.
TextRank: the TextRank algorithm is similar to PageRank. Sentences are used as nodes, and undirected weighted edges are constructed from the similarity between sentences. The node values are iteratively updated using the edge weights, and finally a specific number of the highest-scoring nodes are selected as the summary.
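As an illustration only (not part of the original disclosure), a minimal TextRank-style sketch follows; the TF-IDF similarity, the networkx PageRank call and the assumption that the input sentences are already whitespace-tokenized strings are all illustrative choices:

```python
# Hypothetical TextRank-style extractive summarizer (illustrative only).
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, num_sentences=3):
    # Build sentence vectors and a pairwise similarity matrix.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    # Sentences are nodes; similarities are undirected weighted edges.
    graph = nx.from_numpy_array(sim)
    scores = nx.pagerank(graph)  # iterative node-score update
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top-scoring sentences, restored to document order.
    chosen = sorted(ranked[:num_sentences])
    return [sentences[i] for i in chosen]
```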
Clustering: each sentence in the article is treated as a node, and the summary is produced by clustering. The sentences are first converted into sentence-level vector representations, a clustering algorithm is used to obtain a specific number of clusters, and the sentence closest to the centroid of each cluster is selected, giving N sentences as the final summary.
The deep-learning-based extractive method is as follows:
Extractive summarization can be modeled as a sequence labeling task. The core idea is to assign a binary label (0 or 1) to each sentence of the original text, where 0 means the sentence does not belong to the summary and 1 means it does. The final summary consists of all sentences labeled 1.
Training such a model requires supervision, but existing data sets often lack sentence-level labels, which therefore have to be derived with heuristic rules. The specific method is: first select the sentence of the original text with the highest ROUGE score against the reference summary and add it to a candidate set, then keep selecting sentences from the original text as long as the ROUGE score of the selected set increases, until no further improvement is possible. The sentences in the resulting candidate set are labeled 1 and the rest are labeled 0.
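A minimal sketch of the greedy labeling heuristic just described, for illustration only; the unigram-recall scorer below is a simplified stand-in for a full ROUGE implementation, and all names are assumptions:

```python
# Hypothetical greedy construction of sentence-level 0/1 labels (illustrative).
def unigram_rouge(candidate_tokens, reference_tokens):
    # Recall-style unigram overlap, used here as a stand-in for ROUGE.
    remaining = list(reference_tokens)
    hits = 0
    for tok in candidate_tokens:
        if tok in remaining:
            remaining.remove(tok)
            hits += 1
    return hits / max(len(reference_tokens), 1)

def greedy_labels(sentences, reference_tokens):
    # sentences: list of strings; reference_tokens: characters/tokens of the reference summary.
    labels = [0] * len(sentences)
    selected_tokens, best_score = [], 0.0
    while True:
        best_i, best_gain = None, best_score
        for i, sent in enumerate(sentences):
            if labels[i]:
                continue
            score = unigram_rouge(selected_tokens + list(sent), reference_tokens)
            if score > best_gain:
                best_i, best_gain = i, score
        if best_i is None:          # no remaining sentence improves the score
            break
        labels[best_i] = 1
        selected_tokens += list(sentences[best_i])
        best_score = best_gain
    return labels
```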
2. Abstractive summarization
Abstractive summarization is based on NLG technology: instead of extracting sentences from the original text, the algorithm generates a natural language description from the content of the source document. Much current work on abstractive summarization is based on the Seq2Seq model in deep learning, with an additional attention mechanism added to improve the effect. Since the appearance of large pre-trained models such as BERT, many NLG tasks have been built on pre-trained models, although autoregressive models represented by GPT and auto-encoding models represented by BERT are used differently. In addition, because labeled summary data is often lacking in real-world settings, much work has focused on unsupervised abstractive summarization using auto-encoders or other ideas.
3. Compressive summarization
Compressive summarization is somewhat similar to abstractive summarization in form, but differs in purpose. Its main objective is to filter out the redundant information in the source document and compress the original text to obtain the corresponding summary content.
The abstract generation method in the prior art has the technical problems of poor readability and high repeatability of the abstract.
The prior art is therefore still subject to further development.
Disclosure of Invention
In view of the above technical problems, embodiments of the present invention provide a method and a system for generating a text abstract in the financial field, which can solve the technical problems of poor readability and high repeatability of the abstract in the abstract generation method in the prior art.
The first aspect of the embodiments of the present invention provides a method for generating a text summary in the financial field, including:
acquiring a text data set of a financial field, and constructing a field corpus of the financial industry according to the text data set;
dividing the text data set into a training data set, a testing data set and a verification data set, preprocessing the training data set, the testing data set and the verification data set according to the domain corpus, and generating a training sample, a testing sample and a verification sample after carrying out a PADDING (length-padding) operation;
constructing a text abstract generation model, wherein the text abstract generation model consists of an encoding structure and a decoding structure, the encoding structure comprises an embedding layer and a bidirectional LSTM network and adopts one-hot encoding, the decoding structure comprises an embedding layer and a unidirectional LSTM network, a scaling-and-translation layer is added to the decoding structure, the scaling-and-translation layer adopts a pointer generation network, and the scaling-and-translation layer is also provided with a coverage vector;
training the text abstract generation model by using a training sample, and updating the text abstract generation model by adopting an Adam optimizer and a cross entropy loss function;
evaluating the updated text abstract generating model through the verification sample to obtain a text abstract generating model with the optimal evaluation result, and recording as an optimized text abstract generating model;
testing the optimized text abstract generation model through the test sample, and recording the tested optimized text abstract generation model as a target text abstract generation model;
inputting the text to be detected into a target text abstract generation model, and outputting the text abstract.
Optionally, the acquiring a text data set of a financial field and constructing a field corpus of a financial industry according to the text data set includes:
acquiring a text data set in the financial field, and obtaining a text and an abstract corresponding to the text according to the text data set;
and constructing a field corpus of the financial industry based on the text and the abstract corresponding to the text.
Optionally, the preprocessing of the training data set, the testing data set and the verification data set according to the domain corpus and the generation of the training sample, the testing sample and the verification sample after the PADDING (length-padding) operation include:
performing character indexing operation on texts and abstracts in a training data set, a testing data set and a verification data set according to a domain corpus to respectively obtain a text list and an abstract list;
and after PADDING complementing operation is carried out on the text list and the abstract list, generating a training sample, a testing sample and a verification sample.
Optionally, before the constructing the text abstract generating model, the method further includes:
and modifying the structure of the text abstract generation model through a unified language model mechanism.
Optionally, the training of the text summarization generation model by using the training samples and the updating of the text summarization generation model by using the Adam optimizer and the cross entropy loss function include:
inputting a training sample into a constructed text abstract generation model;
decoding by beam search, retaining only a specific number of candidate results at each step, generating a summary of the text, comparing it with the input summary, and calculating the cross entropy loss function;
and on the basis of minimizing the loss function as a training target, repeating the training process by adopting an Adam optimizer, and storing the model and the corresponding parameters obtained in the whole training process to obtain an updated text abstract generation model.
A second aspect of an embodiment of the present invention provides a system for generating a text summary in a financial field, where the system includes: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor implementing the steps of:
acquiring a text data set of a financial field, and constructing a field corpus of the financial industry according to the text data set;
dividing the text data set into a training data set, a testing data set and a verification data set, preprocessing the training data set, the testing data set and the verification data set according to the domain corpus, and generating a training sample, a testing sample and a verification sample after carrying out a PADDING (length-padding) operation;
constructing a text abstract generation model, wherein the text abstract generation model consists of an encoding structure and a decoding structure, the encoding structure comprises an embedding layer and a bidirectional LSTM network and adopts one-hot encoding, the decoding structure comprises an embedding layer and a unidirectional LSTM network, a scaling-and-translation layer is added to the decoding structure, the scaling-and-translation layer adopts a pointer generation network, and the scaling-and-translation layer is also provided with a coverage vector;
training the text abstract generation model by using a training sample, and updating the text abstract generation model by adopting an Adam optimizer and a cross entropy loss function;
evaluating the updated text abstract generating model through the verification sample to obtain a text abstract generating model with the optimal evaluation result, and recording as an optimized text abstract generating model;
testing the optimized text abstract generation model through the test sample, and recording the tested optimized text abstract generation model as a target text abstract generation model;
inputting the text to be detected into a target text abstract generation model, and outputting the text abstract.
Optionally, the computer program when executed by the processor further implements the steps of:
acquiring a text data set in the financial field, and obtaining a text and an abstract corresponding to the text according to the text data set;
and constructing a field corpus of the financial industry based on the text and the abstract corresponding to the text.
Optionally, the computer program when executed by the processor further implements the steps of:
performing character indexing operation on texts and abstracts in a training data set, a testing data set and a verification data set according to a domain corpus to respectively obtain a text list and an abstract list;
and after PADDING complementing operation is carried out on the text list and the abstract list, generating a training sample, a testing sample and a verification sample.
Optionally, the computer program when executed by the processor further implements the steps of:
inputting a training sample into a constructed text abstract generation model;
decoding by beam search, retaining only a specific number of candidate results at each step, generating a summary of the text, comparing it with the input summary, and calculating the cross entropy loss function;
and on the basis of minimizing the loss function as a training target, repeating the training process by adopting an Adam optimizer, and storing the model and the corresponding parameters obtained in the whole training process to obtain an updated text abstract generation model.
A third aspect of embodiments of the present invention provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer-executable instructions, and when the computer-executable instructions are executed by one or more processors, the one or more processors may be configured to execute the above-mentioned method for generating a text abstract for a financial field.
According to the technical scheme provided by the embodiments of the invention, a text data set of the financial field is collected and a domain corpus of the financial industry is constructed; the text data set is preprocessed according to the domain corpus to generate a training sample, a test sample and a verification sample; a text abstract generation model is constructed, wherein the model consists of an encoding structure and a decoding structure, the encoding structure comprises a bidirectional LSTM network and adopts one-hot encoding, the decoding structure comprises a unidirectional LSTM network to which a scaling-and-translation layer is added, and the scaling-and-translation layer adopts a pointer generation network and is also provided with a coverage vector; the text abstract generation model is trained with the training sample, the test sample and the verification sample to obtain a target text abstract generation model; the text to be detected is input into the target text abstract generation model and the text abstract is output. The implementation of the invention can improve the precision and accuracy of current automatic text summarization, improve the efficiency of text abstract generation, and assist the application of automatic text summarization in the financial field.
Drawings
Fig. 1 is a flowchart illustrating an embodiment of a method for generating a text abstract in the financial field according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a self-attention mechanism of an embodiment of a method for generating a text summary in the financial field according to the present invention;
FIG. 3 is a schematic diagram of a pointer generation network according to an embodiment of a method for generating a text abstract in the financial field;
fig. 4 is a schematic hardware configuration diagram of another embodiment of a text summary generation system for the financial field according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The following detailed description of embodiments of the invention refers to the accompanying drawings.
Referring to fig. 1, fig. 1 is a flowchart illustrating an embodiment of a method for generating a text abstract in the financial field according to the present invention. As shown in fig. 1, the method includes:
Step S100, collecting a text data set of the financial field, and constructing a domain corpus of the financial industry according to the text data set;
Step S200, dividing the text data set into a training data set, a testing data set and a verification data set, preprocessing the training data set, the testing data set and the verification data set according to the domain corpus, and generating a training sample, a testing sample and a verification sample after a PADDING (length-padding) operation;
Step S300, constructing a text abstract generation model, wherein the text abstract generation model consists of an encoding structure and a decoding structure, the encoding structure comprises an embedding layer and a bidirectional LSTM network and adopts one-hot encoding, the decoding structure comprises an embedding layer and a unidirectional LSTM network, a scaling-and-translation layer is added to the decoding structure, the scaling-and-translation layer adopts a pointer generation network, and the scaling-and-translation layer is also provided with a coverage vector;
Step S400, training the text abstract generation model with the training sample, and updating the text abstract generation model with an Adam optimizer and a cross entropy loss function;
Step S500, evaluating the updated text abstract generation model with the verification sample, obtaining the text abstract generation model with the best evaluation result, and recording it as the optimized text abstract generation model;
Step S600, testing the optimized text abstract generation model with the test sample, and recording the tested optimized text abstract generation model as the target text abstract generation model;
Step S700, inputting the text to be detected into the target text abstract generation model, and outputting the text abstract.
In a specific implementation, in step S100, data such as news reports, newsletters, news broadcast manuscripts and listed-company announcement texts in the financial field are collected, and the collected electronic versions of the data are sorted to obtain a text data set of the financial field. A domain corpus of the financial industry is constructed based on the text data set of the financial field, wherein the text data set includes, but is not limited to, texts and their corresponding abstracts.
In step S200, natural language text structuring is performed on the database texts. The raw data is divided into a training data set, a testing data set and a verification data set, and each data set is divided into different batches for processing. Based on the established domain corpus of the financial industry, a character-to-index conversion is performed on the sample texts and abstracts of each batch to obtain a list corresponding to each text and each abstract. The text uses an "ordinary conversion", i.e. without adding <START> and <END> identifiers; the abstract uses a "special conversion", in which <START> and <END> identifiers are added to the beginning and end of the sentence. A PADDING (length-padding) operation is then applied to the lists obtained from the sample data of each batch.
Based on the corpus constructed above, a neural network model is trained with the TensorFlow framework to perform automatic text summarization, and a mask mechanism is introduced to handle the padded sequences.
In step S300, the TensorFlow framework is used to train the neural network model for automatic text summarization. A mask mechanism is introduced to handle the padded sequences, and a unified language model is adopted to optimize and train the language model. A Seq2Seq encoder-decoder structure is built: the encoding stage creates an embedding layer and a bidirectional LSTM network, and the decoding stage creates an embedding layer and a unidirectional LSTM network. A multi-head attention mechanism is used, through the interaction between the encoding stage and the masked decoding stage, to capture the features of the text in the encoder. The text is processed with one-hot encoding, a scaling-and-translation layer is added, and a pointer generation network is introduced so that content can be copied from the original text while the ability to generate new words is retained, improving the accuracy of automatic text summarization. A coverage mechanism is introduced to track and control attention over the original text, preventing repeated characters from being generated and effectively eliminating repetition. Finally, the model is trained with an Adam optimizer and a cross entropy loss function to complete automatic text summarization.
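As an illustration of the encoder-decoder skeleton described above (shared embedding, bidirectional LSTM encoder, unidirectional LSTM decoder, Adam with cross entropy), a minimal Keras sketch follows; all layer sizes and names are assumptions, and the attention, scaling-and-translation, pointer and coverage components described later are omitted here:

```python
# Hypothetical Seq2Seq skeleton for the described model (illustrative only).
import tensorflow as tf

VOCAB_SIZE, EMB_DIM, HID_DIM = 20000, 256, 256   # assumed sizes

# Shared embedding layer (both sides are Chinese, so word vectors are shared).
embedding = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)

# Encoder: embedding + bidirectional LSTM.
enc_inputs = tf.keras.Input(shape=(None,), dtype="int32")
enc_emb = embedding(enc_inputs)
enc_outputs, fh, fc, bh, bc = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(HID_DIM, return_sequences=True, return_state=True)
)(enc_emb)
enc_h = tf.keras.layers.Concatenate()([fh, bh])
enc_c = tf.keras.layers.Concatenate()([fc, bc])

# Decoder: embedding + unidirectional LSTM initialized from the encoder state.
dec_inputs = tf.keras.Input(shape=(None,), dtype="int32")
dec_emb = embedding(dec_inputs)
dec_outputs, _, _ = tf.keras.layers.LSTM(
    2 * HID_DIM, return_sequences=True, return_state=True
)(dec_emb, initial_state=[enc_h, enc_c])

# Projection to the vocabulary (the prior and pointer mechanisms described
# later in the text would modify this distribution).
logits = tf.keras.layers.Dense(VOCAB_SIZE)(dec_outputs)

model = tf.keras.Model([enc_inputs, dec_inputs], logits)
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```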
One-hot encoding is a process of converting categorical variables into a form that is easy for machine learning algorithms to use. One-hot encoding, also known as one-bit-effective encoding, mainly uses a multi-bit status register to encode multiple states, each state being represented by its own independent register bit with only one bit active at any time. One-hot encoding represents a categorical variable as a binary vector: the categorical values are first mapped to integer values, and each integer value is then represented as a binary vector that is all zeros except for a 1 at the position of that integer index. One-hot encoding outputs a vector of vocabulary size to mark whether a word appears in the article.
Example:
given a sentence "government procurement time implementation regulation 3.1.days", assuming that the sentence constitutes a whole set of words, the steps of encoding each word using unique hot coding are as follows:
a vocabulary is created as in table 1 below:
TABLE 1
Word: the 1st through 15th characters of the example sentence, in order of appearance
Index: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
The one-hot code of each word is obtained by the following steps:
firstly, an all-zero vector whose dimension equals the total size of the vocabulary is created;
secondly, the dimension corresponding to the index of each word in the vocabulary is set to 1 while the other dimensions are left unchanged, giving the final one-hot encoded vector.
Taking the example sentence above, Table 2 gives the one-hot encoded representation of each character, resulting in a vector representation for each character.
TABLE 2
Word | Index | One-hot encoding
1st character | 0 | (100000000000000)
2nd character | 1 | (010000000000000)
3rd character | 2 | (001000000000000)
4th character | 3 | (000100000000000)
5th character | 4 | (000010000000000)
6th character | 5 | (000001000000000)
7th character | 6 | (000000100000000)
8th character | 7 | (000000010000000)
9th character | 8 | (000000001000000)
10th character | 9 | (000000000100000)
11th character | 10 | (000000000010000)
12th character | 11 | (000000000001000)
13th character | 12 | (000000000000100)
14th character | 13 | (000000000000010)
15th character | 14 | (000000000000001)
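A minimal Python sketch of the one-hot construction illustrated in Tables 1 and 2; the helper names and the stand-in tokens are assumptions, not part of the original text:

```python
# Hypothetical character-level one-hot encoding (illustrative).
def build_vocab(tokens):
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)      # index in order of first appearance
    return vocab

def one_hot(token, vocab):
    vec = [0] * len(vocab)               # all-zero vector, one dimension per vocabulary entry
    vec[vocab[token]] = 1                # mark the token's index position with 1
    return vec

# Stand-in tokens for the 15 characters of the example sentence in Table 1.
tokens = ["c%02d" % i for i in range(15)]
vocab = build_vocab(tokens)
print(one_hot(tokens[0], vocab))         # [1, 0, 0, ..., 0], as in Table 2
```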
The scaling-and-translation layer is specifically as follows: introducing prior knowledge generally accelerates convergence. Since both the input and output languages are Chinese, the embedding layers of the encoder and decoder can share parameters (i.e. use the same set of word vectors), which substantially reduces the number of parameters of the model.
In addition, there is another useful piece of prior knowledge: most of the words in the abstract appear in the article (they merely appear, not necessarily consecutively, and the abstract is certainly not simply contained in the article, otherwise the task would reduce to an ordinary sequence labeling problem). Therefore, the set of words appearing in the article can be used as a prior distribution and added to the classification model in the decoding process, so that the model is more inclined to select words that already exist in the article when decoding and producing output.
The original classification scheme: at each prediction step, an overall vector X is obtained and passed through a fully connected layer, finally giving a vector y = (y1, y2, ..., y|V|) of size |V|, where |V| is the number of words in the vocabulary. y is then normalized to obtain the original probability:
p_i = exp(y_i) / Σ_j exp(y_j)
The scheme with the prior distribution introduced: for each article, a 0/1 vector x = (x1, x2, ..., x|V|) of size |V| is obtained, where x_i = 1 indicates that the i-th word appears in the article and x_i = 0 otherwise. Passing this 0/1 vector through the scaling-and-translation layer yields a vector x̂ with components
x̂_i = s · x_i + t
where s and t are training parameters. The vector x̂ is then averaged with the original y and the result is normalized as follows:
p_i = exp((y_i + x̂_i) / 2) / Σ_j exp((y_j + x̂_j) / 2)
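A minimal sketch of the scaling-and-translation prior described above, for illustration only; treating s and t as trainable scalars and writing the layer with Keras are assumptions about the exact realization:

```python
# Hypothetical scaling-and-translation prior over the vocabulary (illustrative).
import tensorflow as tf

class ScaleShiftPrior(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # s and t are the trainable parameters of the scaling-and-translation layer.
        self.s = self.add_weight(name="s", shape=(), initializer="ones")
        self.t = self.add_weight(name="t", shape=(), initializer="zeros")

    def call(self, logits, article_mask):
        # article_mask: 0/1 vector of size |V|, 1 where the word occurs in the article.
        x_hat = self.s * article_mask + self.t      # x̂_i = s * x_i + t
        mixed = (logits + x_hat) / 2.0               # average with the original y
        return tf.nn.softmax(mixed, axis=-1)         # normalize to probabilities
```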
A unified language model mechanism is adopted to modify the structure of the pre-trained language model so that the text generation task can be completed. The unified language model mechanism treats Seq2Seq directly as sentence completion. In the automatic abstract generation task the input is a text and the output is its abstract, so the unified language model mechanism can combine the two into one sequence: [CLS] text [SEP] abstract [SEP]. After such a conversion, a language model can be trained that takes "[CLS] text [SEP]" as input and predicts the "abstract" word by word until "[SEP]" appears. Considering that only the "abstract" part actually needs to be predicted, the mask on the "text" part can be removed. In this way the attention over the input part is bidirectional while the attention over the output part is unidirectional, which satisfies the requirements of Seq2Seq without any additional constraint; the masked language model weights of BERT can be used for pre-training, and convergence is faster.
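A minimal sketch of the Seq2Seq attention mask implied by this unified-language-model arrangement (bidirectional over "[CLS] text [SEP]", left-to-right over the abstract); the exact segment layout and the 0/1 convention are assumptions:

```python
# Hypothetical UniLM-style attention mask for "[CLS] text [SEP] abstract [SEP]" (illustrative).
import numpy as np

def unilm_mask(text_len, abstract_len):
    total = text_len + abstract_len
    mask = np.zeros((total, total), dtype=np.float32)
    # Input part ([CLS] text [SEP]): every position may attend to the whole input.
    mask[:, :text_len] = 1.0
    # Output part (abstract [SEP]): each position may additionally attend only
    # to output positions at or before itself (unidirectional attention).
    for i in range(text_len, total):
        mask[i, text_len:i + 1] = 1.0
    # Input positions cannot see the output (already zero, kept explicit for clarity).
    mask[:text_len, text_len:] = 0.0
    return mask  # 1 = attention allowed, 0 = masked out
```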
In the self-attention mechanism, each word has three vectors: a query vector (Q), a key vector (K) and a value vector (V), each of length 64. These three vectors are obtained by multiplying the embedding vector X by three different weight matrices, each of dimension 512 x 64.
The calculation process of the self-attention mechanism can be divided into the following steps:
(1) converting the input word into an embedding vector;
(2) obtaining the three vectors Q, K and V from the embedding vector;
(3) multiplying the vector Q with the vectors K to obtain a score for each position;
(4) dividing each score by a scaling factor so that the gradients are more stable;
(5) normalizing the scaled scores;
(6) multiplying the normalized scores by the vectors V to obtain the weighted value vectors;
(7) summing the weighted value vectors and outputting a vector Z that incorporates the self-attention mechanism.
The self-attention mechanism is shown in fig. 2:
The multi-head attention mechanism is equivalent to an ensemble of several different self-attention mechanisms. The input is split into multiple heads, each applying self-attention in its own subspace, so that the model can attend to different aspects of the information; the outputs of the heads are finally combined.
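A minimal single-head sketch of the scaled dot-product attention steps listed above; the 512-to-64 projection sizes follow the description, while everything else (random inputs, numpy implementation) is illustrative:

```python
# Hypothetical single-head scaled dot-product self-attention (illustrative).
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, 512) embedding vectors; W*: (512, 64) weight matrices.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # step 2: obtain Q, K, V
    scores = Q @ K.T                            # step 3: pairwise scores
    scores = scores / np.sqrt(Q.shape[-1])      # step 4: scale for stable gradients
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # step 5: normalize (softmax)
    return weights @ V                          # steps 6-7: weight the values and sum

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 512))
Wq, Wk, Wv = (rng.normal(size=(512, 64)) for _ in range(3))
Z = self_attention(X, Wq, Wk, Wv)               # (10, 64) output vectors Z
```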
The structure of the pointer generation network is shown in fig. 3:
Suppose the decoder needs to predict the next word. The decoder state vector at the current moment is combined, based on the multi-head attention mechanism, with the encoder hidden vectors of every time step to obtain the degree of attention of the current decoder time step over all encoder time steps; after normalization, the attention probability distribution a_t is obtained. The hidden state of each encoder time step is multiplied by the attention probability distribution and the products are summed to obtain the context vector h*_t. After being concatenated with the hidden state of the current step, the probability distribution P_vocab over the predicted words is obtained through a fully connected layer and normalization.
In the pointer generation network, a generation probability p_gen ∈ [0, 1] is proposed, with the calculation formula:
p_gen = σ(w_h^T h*_t + w_s^T s_t + w_x^T x_t + b)
where h*_t is the context vector, s_t is the state vector of the current step and x_t is the input vector of the current step; w_h^T, w_s^T and w_x^T are the transposes of the corresponding weight vectors, b is a bias term, and σ is the sigmoid activation function. p_gen determines whether the current step copies a word from the source text as the predicted word or generates a word from the vocabulary as the predicted word. Integrating the copy probability and the generation probability gives the predicted word of the current step:
P(w) = p_gen · P_vocab(w) + (1 − p_gen) · Σ_{i: w_i = w} a_t^i
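A minimal sketch of how the copy and generation probabilities are combined into the final distribution P(w), following the formulas above; the tensor shapes and parameter names are assumptions:

```python
# Hypothetical pointer-generator final distribution (illustrative, numpy).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def final_distribution(p_vocab, attention, src_ids,
                       w_h, w_s, w_x, b, h_star, s_t, x_t):
    # p_gen = sigmoid(w_h . h*_t + w_s . s_t + w_x . x_t + b)
    p_gen = sigmoid(w_h @ h_star + w_s @ s_t + w_x @ x_t + b)
    # Generation part: scale the vocabulary distribution.
    p_final = p_gen * p_vocab
    # Copy part: scatter the attention weights onto the source-word ids.
    for i, word_id in enumerate(src_ids):
        p_final[word_id] += (1.0 - p_gen) * attention[i]
    # P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: w_i = w} a_t^i
    return p_final
```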
the covering mechanism is specifically as follows: adding a coverage vector ctIts value is the sum of the decoder's past time step attention distributions:
Figure BDA0003191386560000096
this vector represents the distribution of the degree of coverage of the source word by the past time step. When the attention degree of the current time step of the decoder to all the time steps of the encoder is calculated through a multi-head attention mechanism, the distribution c is divided intotTaking into consideration the degree of attention of the t-th time step of the decoder
Figure BDA0003191386560000097
C is totCarry over calculation
Figure BDA0003191386560000098
In the equation (2), when the attention of the current time step to the source text is calculated, the attention of the previous time step is also taken into consideration (if the sum of the attention of the previous time step to a word in the source text is already high, the attention of the current step to the word is less), so as to reduce repeated generation.
Calculate out
Figure BDA0003191386560000101
Then, through normalization, the probability distribution a of attention can be obtainedt
In addition, at each time step, a penalty term needs to be added to the previously generated word, and the form of the penalty term is as follows:
Figure BDA0003191386560000102
if the degree of attention received by a word in the past is already large, that is, the word is focused on
Figure BDA0003191386560000103
Is great and the attention to this time is also great, i.e.
Figure BDA0003191386560000104
Also, it is large, and this situation is penalized faster. Finally, the loss function is of the form:
Figure BDA0003191386560000105
at each time step, there is a loss value losstThe sum and the derivative are added, and then the back propagation can be carried out.
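A minimal sketch of the coverage update and per-step loss described above, for illustration only; the value of λ and the toy attention distributions are assumptions:

```python
# Hypothetical coverage tracking and coverage loss (illustrative, numpy).
import numpy as np

def coverage_step(coverage, attention, p_target_word, lam=1.0):
    # Coverage penalty: sum_i min(a_t^i, c_t^i); large only when a source word
    # that was already heavily attended is attended again at this step.
    cov_loss = np.minimum(attention, coverage).sum()
    # Per-step loss: negative log-likelihood of the reference word plus the penalty.
    loss_t = -np.log(p_target_word + 1e-12) + lam * cov_loss
    # Coverage vector: running sum of the past attention distributions.
    new_coverage = coverage + attention
    return loss_t, new_coverage

coverage = np.zeros(5)                       # 5 source positions
total_loss = 0.0
for attention, p_ref in [(np.array([0.7, 0.1, 0.1, 0.05, 0.05]), 0.6),
                         (np.array([0.6, 0.2, 0.1, 0.05, 0.05]), 0.4)]:
    loss_t, coverage = coverage_step(coverage, attention, p_ref)
    total_loss += loss_t                     # summed, then back-propagated
```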
In step S400, the model is trained with an Adam optimizer and a cross entropy loss function based on the constructed database and neural network structure. Large-scale texts and their corresponding abstracts are input to train the built model; decoding is carried out by beam search, retaining only a specific number of candidate results at each step; a summary of the text is generated, compared with the input summary, and the cross entropy loss function is calculated. With minimization of the loss function as the training objective, the training process is repeated with the Adam optimizer, and the models and corresponding parameters obtained throughout the training process are saved.
In step S500, the automatically generated summary of the text is compared with the corresponding summary, and the evaluation index value is calculated, where a higher score indicates a better effect of the automatic text summary. In addition, on the basis of algorithm evaluation, the established financial industry field expert group carries out manual examination and evaluation on the automatic text summarization result based on the accumulation of the professional industry field of the expert group, carries out iterative optimization on the model, and finally outputs the automatic text summarization model with the highest precision.
The full name of the ROUGE metric is Recall-Oriented Understudy for Gisting Evaluation. It evaluates summaries mainly based on the co-occurrence of N-grams between summaries and is a recall-oriented N-gram evaluation method. ROUGE is an automatic evaluation metric for article summaries: several experts produce manual summaries to form a set of reference (standard) summaries, the automatically generated summary is compared against this reference set, and the overlapping basic units between them are counted to obtain a score, which measures the similarity between the automatically generated summary and the reference summaries and thus the quality of the summary.
ROUGE-N: the recall is calculated on an N-gram language model, where N can take values such as 1 or 2. The formula is as follows:
ROUGE-N = Σ_{S ∈ {ReferenceSummaries}} Σ_{gram_N ∈ S} Count_match(gram_N) / Σ_{S ∈ {ReferenceSummaries}} Σ_{gram_N ∈ S} Count(gram_N)
where N denotes the length of the N-gram, an N-gram is a sequence of N consecutive words, {ReferenceSummaries} is the set of reference summaries, i.e. the standard summaries obtained in advance, Count_match(gram_N) is the number of N-grams that occur simultaneously in the candidate summary and the reference summary, and Count(gram_N) is the number of N-grams occurring in the reference summary.
The numerator is the number of N-grams shared by the reference summary and the summary automatically generated by the model, and the denominator is the number of all N-grams in the reference summary. Suppose the summary automatically generated by the model is "reorganization after trading suspension" and the reference summary is roughly "trading suspension and reorganization only for share issuance"; counting over the characters of the original Chinese example, with N equal to 1 and 2 the results are ROUGE-1 = 4/7 and ROUGE-2 = 2/6. The calculation process is shown in Table 3:
TABLE 3
N | Total number of N-grams in the reference | Matched content | Number of matches | Score
1 | 7 | trading-suspension reorganization | 4 | 4/7
2 | 6 | trading-suspension reorganization | 2 | 2/6
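A minimal sketch of the ROUGE-N recall computation used in the example above; character-level n-grams and a single reference summary are assumed:

```python
# Hypothetical ROUGE-N (recall) over character n-grams (illustrative).
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n):
    cand_counts = Counter(ngrams(list(candidate), n))
    ref_counts = Counter(ngrams(list(reference), n))
    # Numerator: n-grams shared by reference and candidate (clipped counts).
    overlap = sum(min(cand_counts[g], c) for g, c in ref_counts.items())
    # Denominator: all n-grams in the reference summary.
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

# With a 7-character reference sharing 4 characters with the candidate, ROUGE-1 = 4/7;
# with 6 reference bigrams of which 2 are shared, ROUGE-2 = 2/6 (as in Table 3).
```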
Step S600 tests the optimized text abstract generation model with the test sample, and the tested optimized text abstract generation model is recorded as the target text abstract generation model.
In step S700, based on the established financial-field dictionary and the trained automatic text summarization model, intelligent automatic text summarization is implemented and the corresponding summary is output for a newly input text. The specific process is as follows: a text is input and preprocessed, the text is segmented based on the dictionary, the segmented text undergoes character-to-index conversion, the summary is predicted with the trained model, and the summary corresponding to the text is output.
Further, collecting a text data set of a financial field, and constructing a field corpus of the financial industry according to the text data set, comprising:
acquiring a text data set in the financial field, and obtaining a text and an abstract corresponding to the text according to the text data set;
and constructing a field corpus of the financial industry based on the text and the abstract corresponding to the text.
Specifically, data such as news reports, news broadcast manuscripts and the original texts of listed-company announcements in the financial field are collected, the collected electronic versions of the data are sorted, and the final data are obtained: the texts and the abstracts corresponding to the texts.
A domain corpus of the financial industry is constructed based on the financial-field texts and their corresponding abstracts. The corpus comprises three modules. In the first module, the keys are the characters appearing in the corpus and the value of each key is the number of times the character appears in the texts or abstracts. In the second module, the keys are the characters appearing in the corpus and the value of each key is the index position of the character. In the third module, the keys are index positions and the values are the corresponding characters appearing in the corpus; the key-value pairs of the third module are exactly the reverse of those of the second module. In addition, the third module reserves 4 positions. The position with index 0 corresponds to PADDING, whose function is to pad sequences of different lengths so that they have the same length; the position with index 1 corresponds to UNKNOWN, which represents rare words that do not appear in the dictionary; the position with index 2 corresponds to START, and a <START> identifier added to a text marks its starting position; the position with index 3 corresponds to END, and an <END> identifier added to a text marks its ending position.
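A minimal sketch of the three corpus modules and the four reserved positions described above; the token spellings and helper names are assumptions:

```python
# Hypothetical construction of the three corpus modules (illustrative).
from collections import Counter

RESERVED = ["<PAD>", "<UNK>", "<START>", "<END>"]   # reserved indices 0-3

def build_corpus_modules(texts_and_abstracts):
    # Module 1: character -> occurrence count across texts and abstracts.
    char_counts = Counter(ch for doc in texts_and_abstracts for ch in doc)
    # Module 2: character -> index position (reserved tokens occupy 0-3).
    char_to_index = {tok: i for i, tok in enumerate(RESERVED)}
    for ch in char_counts:
        char_to_index.setdefault(ch, len(char_to_index))
    # Module 3: index position -> character (inverse of module 2).
    index_to_char = {i: ch for ch, i in char_to_index.items()}
    return char_counts, char_to_index, index_to_char
```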
Further, the preprocessing of the training data set, the testing data set and the verification data set according to the domain corpus and the generation of the training sample, the testing sample and the verification sample after the PADDING (length-padding) operation include:
performing character indexing operation on texts and abstracts in a training data set, a testing data set and a verification data set according to a domain corpus to respectively obtain a text list and an abstract list;
and after PADDING complementing operation is carried out on the text list and the abstract list, generating a training sample, a testing sample and a verification sample.
In a specific implementation, natural language text structuring is performed on the database texts. The raw data is divided into a training data set, a testing data set and a verification data set, and each data set is divided into different batches for processing. Based on the constructed dictionary, a character-to-index conversion is performed on the sample texts and abstracts of each batch to obtain a list corresponding to each text and each abstract. The text uses an "ordinary conversion", i.e. without adding <START> and <END> identifiers; the abstract uses a "special conversion", in which <START> and <END> identifiers are added to the beginning and end of the sentence. A PADDING operation is then applied to the lists obtained from the sample data of each batch.
The ordinary conversion scheme is as follows: the first characters of the text are extracted up to a specific number, assumed here to be 512. For each extracted character, the second module of the financial-industry corpus is searched with the character as the key, and the corresponding value, i.e. the index corresponding to the character, is returned; if it is not found, 1 is returned. Each character yields one value, and all the values form a list of length 512.
The special conversion proceeds as follows: the first 510 characters of the abstract are extracted, and for each character the second module of the financial-industry corpus is searched with the character as the key, and the corresponding value, i.e. the index of the character, is returned; if it is not found, 1 is returned. Each character yields one value, and all the values form a list of length 510. The first position of the list is then filled with <START> and the last position with <END>.
The PADDING operation is as follows: assuming that each batch processes 128 pieces of data, 128 pieces of text data are extracted; after the character-to-index conversion, each piece of text data yields a list whose length corresponds to the number of characters in the text, so the 128 pieces of text data correspond to 128 lists of different lengths. Because lists of unequal length easily lose part of the data during model training, the lists are processed as follows:
the length of the integer list corresponding to each text is calculated; the maximum length of the 128 lists is determined and recorded as ML; and, based on the maximum length ML, every list shorter than ML is filled with 0, so that the 128 lists all have the same length, equal to ML.
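A minimal sketch of the ordinary conversion, special conversion and PADDING steps described above; the hard-coded reserved indices (0 for padding, 1 for unknown, 2 for <START>, 3 for <END>) follow the corpus description, and everything else is illustrative:

```python
# Hypothetical character-to-index conversion and batch padding (illustrative).
def convert_text(text, char_to_index, max_chars=512):
    # "Ordinary conversion": no <START>/<END>, unknown characters map to index 1.
    return [char_to_index.get(ch, 1) for ch in text[:max_chars]]

def convert_abstract(abstract, char_to_index, max_chars=510):
    # "Special conversion": prepend <START> (2) and append <END> (3).
    return [2] + [char_to_index.get(ch, 1) for ch in abstract[:max_chars]] + [3]

def pad_batch(index_lists):
    # PADDING: fill every list with 0 up to the batch maximum length ML.
    ml = max(len(lst) for lst in index_lists)
    return [lst + [0] * (ml - len(lst)) for lst in index_lists]
```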
Further, before constructing the text abstract generation model, the method further comprises:
and modifying the structure of the text abstract generation model through a unified language model mechanism.
In a specific implementation, a unified language model mechanism is adopted to modify the structure of the pre-trained language model, which helps complete the text generation task. The unified language model mechanism treats Seq2Seq directly as sentence completion. In the automatic abstract generation task the input is a text and the output is its abstract, so the unified language model mechanism can combine the two into one sequence: [CLS] text [SEP] abstract [SEP]. After such a conversion, a language model can be trained that takes "[CLS] text [SEP]" as input and predicts the "abstract" word by word until "[SEP]" appears. Considering that only the "abstract" part actually needs to be predicted, the mask on the "text" part can be removed. In this way the attention over the input part is bidirectional while the attention over the output part is unidirectional, which satisfies the requirements of Seq2Seq without any additional constraint; the masked language model weights of BERT can be used for pre-training, and convergence is faster.
Further, training the text abstract generation model by using a training sample, and updating the text abstract generation model by adopting an Adam optimizer and a cross entropy loss function, wherein the method comprises the following steps of:
inputting a training sample into a constructed text abstract generation model;
decoding by beam search, retaining only a specific number of candidate results at each step, generating a summary of the text, comparing it with the input summary, and calculating the cross entropy loss function;
and on the basis of minimizing the loss function as a training target, repeating the training process by adopting an Adam optimizer, and storing the model and the corresponding parameters obtained in the whole training process to obtain an updated text abstract generation model.
In a specific implementation, the model is trained with an Adam optimizer and a cross entropy loss function based on the constructed database and neural network structure. Large-scale texts and their corresponding abstracts are input to train the built model; decoding is carried out by beam search, retaining only a specific number of candidate results at each step; a summary of the text is generated, compared with the input summary, and the cross entropy loss function is calculated. With minimization of the loss function as the training objective, the training process is repeated with the Adam optimizer, and the models and corresponding parameters obtained throughout the training process are saved.
The specific process of beam search is as follows:
Assume that the beam width of the beam search is K. When predicting the next word based on the current word, the following operations are performed: in the first time step, the K words with the largest current conditional probability are selected as the first words of the candidate output sequences. In the second time step, the output sequences from the first step are combined with all words in the vocabulary to obtain a number of new sequences, the score of each sequence is calculated, and the K sequences with the highest scores are selected as the new current sequences. This process is repeated at every time step until an end symbol is encountered or the maximum length is reached, and finally the K sequences with the highest scores are output.
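A minimal sketch of the beam search loop described above; the `next_token_log_probs` callable stands in for the trained decoder and is an assumption:

```python
# Hypothetical beam search decoder (illustrative).
def beam_search(next_token_log_probs, start_id, end_id, beam_width=4, max_len=50):
    # Each candidate is (token sequence, cumulative log-probability score).
    beams = [([start_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:                 # finished sequences are kept as-is
                candidates.append((seq, score))
                continue
            for token_id, logp in next_token_log_probs(seq):  # expand with the vocabulary
                candidates.append((seq + [token_id], score + logp))
        # Keep only the K highest-scoring sequences for the next step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_id for seq, _ in beams):
            break
    return beams   # the K best sequences and their scores
```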
As can be seen from the above method embodiments, compared with the prior art, the embodiments of the present invention have the following advantages:
a domain corpus of the financial industry is constructed. Currently, there are many corpora in general fields and the application is also wide. However, the corpus of the financial industry is relatively overwhelmed, the application of artificial intelligence technology is relatively few, and the deficient corpus hinders the intelligent reform of the financial industry. The corpus of the financial industry constructed by the embodiment of the invention can assist the intelligent question answering, intelligent search and other multi-application ground of the industry.
A multi-head attention mechanism is introduced into automatic text summarization. Current feed-forward and recurrent networks are limited by computing power and by the optimization algorithm. Limitation of computing power: when a large amount of "information" has to be remembered, the model becomes more complex, yet computing power remains a bottleneck limiting the development of neural networks. Limitation of the optimization algorithm: although operations such as local connection, weight sharing and pooling can simplify the neural network and effectively ease the contradiction between model complexity and expressive power, the information "memory" capability of existing recurrent neural networks is still not high. The embodiment of the invention introduces a multi-head attention mechanism that, in the manner of the human brain handling information overload, dynamically generates the weights of different connections, thereby handling longer information sequences, improving the processing capability of the neural network and greatly improving the precision of automatically generated summaries of long texts.
A pointer generation network mechanism is introduced. Problems in the current abstract generation model: erroneous information may be generated. The model of the embodiment of the invention can copy information from the source text and can generate new information at the same time. Avoiding the generation of erroneous information.
The invention introduces a pointer generation network mechanism, which can conveniently copy words from the original text by pointing, improving accuracy and handling out-of-vocabulary (OOV) words while retaining the ability to generate new words. This network can be regarded as a balance between the abstractive and the extractive summarization methods.
A coverage mechanism is introduced. A problem of current summary generation models is that they produce repetitive information, which is especially noticeable when multiple sentences are generated. The embodiment of the invention introduces a coverage mechanism for tracking and controlling attention over the original text. A coverage vector is maintained that is the sum of the attention values of the decoding stage, i.e. a distribution over the words of the original text representing how much coverage these words have received from the attention mechanism. This ensures that the current decision of the attention mechanism takes the previous information into account, avoids attending repeatedly to the same positions, prevents the generation of repeated words and effectively eliminates repetition.
A unified language model mechanism is introduced. Language model and training is a technique to "teach" machine learning systems to contextualize text representations by letting them predict words according to context, however current pre-trained models are bi-directional in relation (use of left and right context of words to form predictions) and are not suitable for generating natural language tasks with a large number of modifications.
The embodiment of the invention also introduces a unified language model mechanism to pre-train on a large amount of text and optimize the language model. Different self-attention masks are used so that context is aggregated for different types of language models; in addition, because the pre-training procedure is unified, the network can share parameters, making the learned text representations more general and reducing overfitting to any single task.
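To illustrate how different self-attention masks can configure one network for different language-model types, the following is a minimal sketch assuming PyTorch; the exact mask layout used by the embodiment is not specified, so this is illustrative only.

```python
import torch

def unilm_attention_mask(src_len, tgt_len, mode="seq2seq"):
    """Build a self-attention mask over a [source ; target] token sequence.

    A True entry means "this position may attend to that position".
    """
    total = src_len + tgt_len
    if mode == "bidirectional":
        # every token attends to every token (encoder-style modelling)
        return torch.ones(total, total).bool()
    if mode == "left_to_right":
        # causal mask (standard left-to-right language modelling)
        return torch.tril(torch.ones(total, total)).bool()
    if mode == "seq2seq":
        # source tokens attend bidirectionally within the source;
        # target tokens attend to the full source and causally to the target
        mask = torch.zeros(total, total).bool()
        mask[:, :src_len] = True
        mask[src_len:, src_len:] = torch.tril(torch.ones(tgt_len, tgt_len)).bool()
        return mask
    raise ValueError(f"unknown mode: {mode}")
```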
It should be noted that the above steps do not necessarily have to be performed in a fixed order; as those skilled in the art will understand from the description of the embodiments of the present invention, the above steps may have different execution orders in different embodiments, for example they may be executed in parallel or interchanged, and the like.
Having described the method for generating a text abstract for the financial field in the embodiment of the present invention above, a system for generating a text abstract for the financial field in the embodiment of the present invention is described below. Please refer to fig. 4, which is a schematic hardware structure diagram of another embodiment of a system for generating a text abstract for the financial field in the embodiment of the present invention. As shown in fig. 4, the system 10 includes: a memory 101, a processor 102, and a computer program stored on the memory and executable on the processor, the computer program implementing the following steps when executed by the processor 102:
acquiring a text data set of a financial field, and constructing a field corpus of the financial industry according to the text data set;
dividing the text data set into a training data set, a testing data set and a verification data set, preprocessing the training data set, the testing data set and the verification data set according to the domain corpus, and generating a training sample, a testing sample and a verification sample after carrying out a PADDING (padding to a fixed length) operation;
constructing a text abstract generation model, wherein the text abstract generation model consists of an encoding structure and a decoding structure, the encoding structure comprises an embedding layer and a bidirectional LSTM network and adopts a one-hot encoding processing mode, the decoding structure comprises an embedding layer and a unidirectional LSTM network, a zooming translation layer is added in the decoding structure, the zooming translation layer adopts a pointer-generator network, and the zooming translation layer is also provided with a coverage vector;
training the text abstract generation model by using a training sample, and updating the text abstract generation model by adopting an Adam optimizer and a cross entropy loss function;
evaluating the updated text abstract generation model through the verification sample to obtain the text abstract generation model with the best evaluation result, and recording it as the optimized text abstract generation model;
testing the optimized text abstract generation model through the test sample, and recording the tested optimized text abstract generation model as a target text abstract generation model;
inputting the text to be summarized into the target text abstract generation model, and outputting the text abstract.
The specific implementation steps are the same as those of the method embodiments, and are not described herein again.
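As a concrete illustration of the encoder-decoder structure in the model construction step above, the following is a minimal sketch assuming PyTorch; the vocabulary size, embedding dimension and hidden size are illustrative values, and the pointer-generator and coverage components discussed elsewhere in this description are omitted for brevity.

```python
import torch
import torch.nn as nn

class SummaryModel(nn.Module):
    """Skeleton of the described encoder-decoder: an embedding layer with a
    bidirectional LSTM on the encoder side and an embedding layer with a
    unidirectional LSTM on the decoder side."""
    def __init__(self, vocab_size=5000, emb_dim=128, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.tgt_emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.encoder = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, 2 * hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, src_ids, tgt_ids):
        enc_out, (h, c) = self.encoder(self.src_emb(src_ids))
        # merge the forward and backward final states to initialise the decoder
        h0 = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)   # (1, batch, 2*hidden)
        c0 = torch.cat([c[0], c[1]], dim=-1).unsqueeze(0)
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), (h0, c0))
        return self.out(dec_out)                            # (batch, tgt_len, vocab)
```

Concatenating the final forward and backward encoder states to initialise the decoder is one common way to connect a bidirectional encoder to a unidirectional decoder.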
Optionally, the computer program when executed by the processor 102 further implements the steps of:
acquiring a text data set in the financial field, and obtaining a text and an abstract corresponding to the text according to the text data set;
and constructing a field corpus of the financial industry based on the text and the abstract corresponding to the text.
The specific implementation steps are the same as those of the method embodiments, and are not described herein again.
Optionally, the computer program when executed by the processor 102 further implements the steps of:
performing character indexing operation on texts and abstracts in a training data set, a testing data set and a verification data set according to a domain corpus to respectively obtain a text list and an abstract list;
and generating a training sample, a testing sample and a verification sample after a PADDING (padding to a fixed length) operation is carried out on the text list and the abstract list.
The specific implementation steps are the same as those of the method embodiments, and are not described herein again.
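As an illustration of the character indexing and PADDING steps above, the following is a minimal sketch; the dictionary char2id, the special token ids and the example characters are hypothetical.

```python
def texts_to_padded_ids(texts, char2id, max_len, pad_id=0, unk_id=1):
    """Map each text to a fixed-length list of character indices.

    Characters outside the domain corpus vocabulary fall back to unk_id;
    sequences are truncated or right-padded with pad_id to max_len.
    """
    batch = []
    for text in texts:
        ids = [char2id.get(ch, unk_id) for ch in text[:max_len]]
        ids += [pad_id] * (max_len - len(ids))   # PADDING complement
        batch.append(ids)
    return batch

# Tiny usage example with a toy vocabulary
char2id = {"<pad>": 0, "<unk>": 1, "股": 2, "市": 3}
print(texts_to_padded_ids(["股市", "股"], char2id, max_len=4))
# [[2, 3, 0, 0], [2, 0, 0, 0]]
```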
Optionally, the computer program when executed by the processor 102 further implements the steps of:
and modifying the structure of the text abstract generation model through a unified language model mechanism.
The specific implementation steps are the same as those of the method embodiments, and are not described herein again.
Optionally, the computer program when executed by the processor 102 further implements the steps of:
inputting a training sample into a constructed text abstract generation model;
decoding in a beam search mode, retaining only a specific number of candidate results at each step, generating an abstract of the text, comparing the generated abstract with the input reference abstract, and calculating the cross entropy loss function;
and taking minimization of the loss function as the training target, repeating the training process with an Adam optimizer, and storing the models and corresponding parameters obtained during the whole training process to obtain the updated text abstract generation model.
The specific implementation steps are the same as those of the method embodiments, and are not described herein again.
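As an illustration of the training update described above (cross entropy loss against the reference abstract, optimized with Adam), the following is a minimal teacher-forcing sketch assuming PyTorch and a model with the encoder-decoder interface sketched earlier; it is not the embodiment's exact procedure.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, src_ids, tgt_ids, pad_id=0):
    """One teacher-forced training step: predict each abstract token from the
    previous reference tokens, compute cross entropy, and apply an Adam update."""
    model.train()
    optimizer.zero_grad()
    logits = model(src_ids, tgt_ids[:, :-1])          # predict tokens 1..T
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tgt_ids[:, 1:].reshape(-1),
        ignore_index=pad_id,                           # padded positions do not count
    )
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```

At inference time, decoding would instead use beam search, keeping only a fixed number of candidate abstracts at each step as described above.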
Embodiments of the present invention provide a non-transitory computer-readable storage medium storing computer-executable instructions for execution by one or more processors, for example, to perform method steps S100-S700 in fig. 1 described above.
By way of example, non-volatile storage media can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SynchLink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The disclosed memory components or memory of the operating environment described in embodiments of the invention are intended to comprise one or more of these and/or any other suitable types of memory.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for generating a text abstract in the financial field is characterized by comprising the following steps:
acquiring a text data set of a financial field, and constructing a field corpus of the financial industry according to the text data set;
dividing the text data set into a training data set, a testing data set and a verification data set, preprocessing the training data set, the testing data set and the verification data set according to the domain corpus, and generating a training sample, a testing sample and a verification sample after carrying out a PADDING (padding to a fixed length) operation;
constructing a text abstract generation model, wherein the text abstract generation model consists of an encoding structure and a decoding structure, the encoding structure comprises an embedding layer and a bidirectional LSTM network and adopts a one-hot encoding processing mode, the decoding structure comprises an embedding layer and a unidirectional LSTM network, a zooming translation layer is added in the decoding structure, the zooming translation layer adopts a pointer-generator network, and the zooming translation layer is also provided with a coverage vector;
training the text abstract generation model by using a training sample, and updating the text abstract generation model by adopting an Adam optimizer and a cross entropy loss function;
evaluating the updated text abstract generation model through the verification sample to obtain the text abstract generation model with the best evaluation result, and recording it as the optimized text abstract generation model;
testing the optimized text abstract generation model through the test sample, and recording the tested optimized text abstract generation model as a target text abstract generation model;
inputting the text to be summarized into the target text abstract generation model, and outputting the text abstract.
2. The method for generating the text abstract for the financial field according to claim 1, wherein the collecting text data sets of the financial field and constructing a field corpus of the financial industry according to the text data sets comprises:
acquiring a text data set in the financial field, and obtaining a text and an abstract corresponding to the text according to the text data set;
and constructing a field corpus of the financial industry based on the text and the abstract corresponding to the text.
3. The method for generating the text abstract for the financial field according to claim 2, wherein the generating of the training sample, the testing sample and the verification sample after preprocessing the training data set, the testing data set and the verification data set according to the domain corpus and carrying out a PADDING (padding to a fixed length) operation comprises:
performing character indexing operation on texts and abstracts in a training data set, a testing data set and a verification data set according to a domain corpus to respectively obtain a text list and an abstract list;
and generating a training sample, a testing sample and a verification sample after a PADDING (padding to a fixed length) operation is carried out on the text list and the abstract list.
4. The method for generating a text abstract for the financial field as claimed in claim 1, wherein before constructing the text abstract generating model, the method further comprises:
and modifying the structure of the text abstract generation model through a unified language model mechanism.
5. The method for generating the text abstract for the financial field according to claim 1, wherein the training of the text abstract generation model by using the training samples and the updating of the text abstract generation model by using an Adam optimizer and a cross entropy loss function comprises:
inputting a training sample into a constructed text abstract generation model;
decoding in a beam search mode, retaining only a specific number of candidate results at each step, generating an abstract of the text, comparing the generated abstract with the input reference abstract, and calculating the cross entropy loss function;
and taking minimization of the loss function as the training target, repeating the training process with an Adam optimizer, and storing the models and corresponding parameters obtained during the whole training process to obtain the updated text abstract generation model.
6. A system for generating a text abstract for the financial field, the system comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor implementing the steps of:
acquiring a text data set of a financial field, and constructing a field corpus of the financial industry according to the text data set;
dividing the text data set into a training data set, a testing data set and a verification data set, preprocessing the training data set, the testing data set and the verification data set according to the domain corpus, and generating a training sample, a testing sample and a verification sample after carrying out a PADDING (padding to a fixed length) operation;
constructing a text abstract generation model, wherein the text abstract generation model consists of an encoding structure and a decoding structure, the encoding structure comprises an embedding layer and a bidirectional LSTM network and adopts a one-hot encoding processing mode, the decoding structure comprises an embedding layer and a unidirectional LSTM network, a zooming translation layer is added in the decoding structure, the zooming translation layer adopts a pointer-generator network, and the zooming translation layer is also provided with a coverage vector;
training the text abstract generation model by using a training sample, and updating the text abstract generation model by adopting an Adam optimizer and a cross entropy loss function;
evaluating the updated text abstract generation model through the verification sample to obtain the text abstract generation model with the best evaluation result, and recording it as the optimized text abstract generation model;
testing the optimized text abstract generation model through the test sample, and recording the tested optimized text abstract generation model as a target text abstract generation model;
inputting the text to be summarized into the target text abstract generation model, and outputting the text abstract.
7. The system for generating a text abstract for the financial field according to claim 6, wherein the computer program, when executed by the processor, further performs the steps of:
acquiring a text data set in the financial field, and obtaining a text and an abstract corresponding to the text according to the text data set;
and constructing a field corpus of the financial industry based on the text and the abstract corresponding to the text.
8. The system for generating a text abstract for the financial field according to claim 7, wherein the computer program, when executed by the processor, further performs the steps of:
performing character indexing operation on texts and abstracts in a training data set, a testing data set and a verification data set according to a domain corpus to respectively obtain a text list and an abstract list;
and after PADDING complementing operation is carried out on the text list and the abstract list, generating a training sample, a testing sample and a verification sample.
9. The system for generating a text abstract for the financial field according to claim 6, wherein the computer program, when executed by the processor, further performs the steps of:
inputting a training sample into a constructed text abstract generation model;
decoding in a beam search mode, retaining only a specific number of candidate results at each step, generating an abstract of the text, comparing the generated abstract with the input reference abstract, and calculating the cross entropy loss function;
and taking minimization of the loss function as the training target, repeating the training process with an Adam optimizer, and storing the models and corresponding parameters obtained during the whole training process to obtain the updated text abstract generation model.
10. A non-transitory computer-readable storage medium storing computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the method for generating a text abstract for the financial field of any one of claims 1-5.
CN202110879065.8A 2021-08-02 2021-08-02 Text abstract generation method and system for financial field Withdrawn CN113821635A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110879065.8A CN113821635A (en) 2021-08-02 2021-08-02 Text abstract generation method and system for financial field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110879065.8A CN113821635A (en) 2021-08-02 2021-08-02 Text abstract generation method and system for financial field

Publications (1)

Publication Number Publication Date
CN113821635A true CN113821635A (en) 2021-12-21

Family

ID=78924199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110879065.8A Withdrawn CN113821635A (en) 2021-08-02 2021-08-02 Text abstract generation method and system for financial field

Country Status (1)

Country Link
CN (1) CN113821635A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114398478A (en) * 2022-01-17 2022-04-26 重庆邮电大学 Generating type automatic abstracting method based on BERT and external knowledge
CN115062596A (en) * 2022-06-07 2022-09-16 南京信息工程大学 Method and device for generating weather report, electronic equipment and storage medium
CN115795028A (en) * 2023-02-09 2023-03-14 山东政通科技发展有限公司 Intelligent document generation method and system
CN116595164A (en) * 2023-07-17 2023-08-15 浪潮通用软件有限公司 Method, system, equipment and storage medium for generating bill abstract information
CN116595164B (en) * 2023-07-17 2023-10-31 浪潮通用软件有限公司 Method, system, equipment and storage medium for generating bill abstract information
CN117313892A (en) * 2023-09-26 2023-12-29 上海悦普网络科技有限公司 Training device and method for text processing model

Similar Documents

Publication Publication Date Title
CN108733792B (en) Entity relation extraction method
CN113821635A (en) Text abstract generation method and system for financial field
Gasmi et al. LSTM recurrent neural networks for cybersecurity named entity recognition
CN110427623A (en) Semi-structured document Knowledge Extraction Method, device, electronic equipment and storage medium
CN111222318B (en) Trigger word recognition method based on double-channel bidirectional LSTM-CRF network
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN110532395B (en) Semantic embedding-based word vector improvement model establishing method
CN114896388A (en) Hierarchical multi-label text classification method based on mixed attention
CN110852089B (en) Operation and maintenance project management method based on intelligent word segmentation and deep learning
Jian et al. [Retracted] LSTM‐Based Attentional Embedding for English Machine Translation
CN111209362A (en) Address data analysis method based on deep learning
CN113255366A (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN115687609A (en) Zero sample relation extraction method based on Prompt multi-template fusion
CN115422939A (en) Fine-grained commodity named entity identification method based on big data
CN114239584A (en) Named entity identification method based on self-supervision learning
Manias et al. An evaluation of neural machine translation and pre-trained word embeddings in multilingual neural sentiment analysis
CN112231476A (en) Improved graph neural network scientific and technical literature big data classification method
CN113051886B (en) Test question duplicate checking method, device, storage medium and equipment
CN115017260A (en) Keyword generation method based on subtopic modeling
CN110633363B (en) Text entity recommendation method based on NLP and fuzzy multi-criterion decision
CN114328894A (en) Document processing method, document processing device, electronic equipment and medium
CN114595324A (en) Method, device, terminal and non-transitory storage medium for power grid service data domain division
Sejwal et al. Sentiment Analysis Using Hybrid CNN-LSTM Approach
Zhu English lexical analysis system of machine translation based on simple recurrent neural network
Wu et al. A Text Emotion Analysis Method Using the Dual‐Channel Convolution Neural Network in Social Networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
WW01 Invention patent application withdrawn after publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20211221