CN113821635A - Text abstract generation method and system for financial field - Google Patents

Text abstract generation method and system for financial field

Info

Publication number
CN113821635A
CN113821635A (application CN202110879065.8A)
Authority
CN
China
Prior art keywords
text
abstract
data set
generating
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110879065.8A
Other languages
Chinese (zh)
Inventor
龙雄伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202110879065.8A priority Critical patent/CN113821635A/en
Publication of CN113821635A publication Critical patent/CN113821635A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/31 Indexing; Data structures therefor; Storage structures
    • G06F 16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and a system for generating a text abstract in the financial field. The method comprises the following steps: collecting a text data set of the financial field and constructing a domain corpus of the financial industry; preprocessing the text data set according to the domain corpus to generate a training sample, a test sample and a verification sample; constructing a text abstract generation model, wherein the model consists of an encoding structure and a decoding structure, the encoding structure comprises a bidirectional LSTM network and adopts one-hot encoding, the decoding structure comprises a unidirectional LSTM network to which a scaling-and-translation layer is added, and the scaling-and-translation layer adopts a pointer generation network and is also provided with a coverage vector; training the text abstract generation model with the training sample, the test sample and the verification sample to obtain a target text abstract generation model; and inputting the text to be detected into the target text abstract generation model and outputting the text abstract. The invention can improve the precision and accuracy of current automatic text summarization and improve the efficiency of text abstract generation.

Description

Text abstract generation method and system for financial field
Technical Field
The invention relates to the technical field of deep learning, in particular to a text abstract generation method and system for the financial field.
Background
With the explosive growth of text information in recent years, people are exposed every day to massive amounts of text such as news, blogs, chats, microblogs, reports and papers. Extracting the important content from this flood of text has become an urgent need, and automatic text summarization provides an efficient solution.
Despite the enormous demand for automatic text summarization, development in this area has been relatively slow. Generating a summary is a challenging task for a computer: to produce a qualified abstract from one or more texts, the computer must first read and understand the original text, then select, trim and splice its content according to relative importance, and finally produce a fluent short text. An industrial-grade, low-cost and high-precision automatic text summarization system would therefore benefit the development of many applications, such as news headline generation, paper abstract generation and report summarization.
Summary generation methods in the prior art include the following:
1. Extractive summarization
This approach mainly uses an algorithm to select and extract key sentences from the source documents as summary sentences. Its fluency is generally better than that of abstractive summarization, but it introduces too much redundant information and cannot reflect the characteristics of a true abstract.
The extractive method selects key sentences from the original text to form the summary. It has a naturally low error rate in grammar and syntax and guarantees a certain level of quality. Traditional extractive methods use graph-based algorithms, clustering and similar techniques to perform unsupervised extraction. A popular supervised approach extracts sentence-level features, such as the length and position of a sentence and the TF-IDF values of its words, and then selects sentences with a machine learning algorithm. Neural-network-based extraction instead models the problem as two tasks: sequence labeling and sentence ranking.
The traditional method for extracting the abstract mainly comprises the following steps: lead-3, TextRank, clustering.
Lead-3: the author of an article usually states the topic in the title and at the beginning of the article, so the simplest method is to extract the first few sentences of the article as the summary. Lead-3 extracts the first three sentences of the article as its summary. Although simple and direct, the Lead-3 method is a very effective one.
TextRank: the TextRank algorithm is similar to PageRank. Sentences are used as nodes, and undirected weighted edges are constructed from the similarity between sentences. The node values are iteratively updated using the edge weights, and finally a specific number of the highest-scoring nodes are selected as the summary.
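As an illustration only (not part of the original disclosure), a minimal TextRank-style sketch follows; the TF-IDF similarity, the networkx PageRank call and the assumption that the input sentences are already whitespace-tokenized strings are all illustrative choices:

```python
# Hypothetical TextRank-style extractive summarizer (illustrative only).
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, num_sentences=3):
    # Build sentence vectors and a pairwise similarity matrix.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    # Sentences are nodes; similarities are undirected weighted edges.
    graph = nx.from_numpy_array(sim)
    scores = nx.pagerank(graph)  # iterative node-score update
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top-scoring sentences, restored to document order.
    chosen = sorted(ranked[:num_sentences])
    return [sentences[i] for i in chosen]
```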
Clustering: each sentence in the article is treated as a node, and the summary is produced by clustering. The sentences are first converted into sentence-level vector representations, a clustering algorithm is used to obtain a specific number of clusters, and the sentence closest to the centroid of each cluster is selected, giving N sentences as the final summary.
The deep-learning-based extractive method is as follows:
Extractive summarization can be modeled as a sequence labeling task. The core idea is to assign a binary label (0 or 1) to each sentence of the original text, where 0 means the sentence does not belong to the summary and 1 means it does. The final summary consists of all sentences labeled 1.
Training such a model requires supervision, but existing data sets often lack sentence-level labels, which therefore have to be derived with heuristic rules. The specific method is: first select the sentence of the original text with the highest ROUGE score against the reference summary and add it to a candidate set, then keep selecting sentences from the original text as long as the ROUGE score of the selected set increases, until no further improvement is possible. The sentences in the resulting candidate set are labeled 1 and the rest are labeled 0.
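A minimal sketch of the greedy labeling heuristic just described, for illustration only; the unigram-recall scorer below is a simplified stand-in for a full ROUGE implementation, and all names are assumptions:

```python
# Hypothetical greedy construction of sentence-level 0/1 labels (illustrative).
def unigram_rouge(candidate_tokens, reference_tokens):
    # Recall-style unigram overlap, used here as a stand-in for ROUGE.
    remaining = list(reference_tokens)
    hits = 0
    for tok in candidate_tokens:
        if tok in remaining:
            remaining.remove(tok)
            hits += 1
    return hits / max(len(reference_tokens), 1)

def greedy_labels(sentences, reference_tokens):
    # sentences: list of strings; reference_tokens: characters/tokens of the reference summary.
    labels = [0] * len(sentences)
    selected_tokens, best_score = [], 0.0
    while True:
        best_i, best_gain = None, best_score
        for i, sent in enumerate(sentences):
            if labels[i]:
                continue
            score = unigram_rouge(selected_tokens + list(sent), reference_tokens)
            if score > best_gain:
                best_i, best_gain = i, score
        if best_i is None:          # no remaining sentence improves the score
            break
        labels[best_i] = 1
        selected_tokens += list(sentences[best_i])
        best_score = best_gain
    return labels
```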
2. Abstractive summarization
Abstractive summarization is based on NLG technology: instead of extracting sentences from the original text, the algorithm generates a natural language description from the content of the source document. Much current work on abstractive summarization is based on the Seq2Seq model in deep learning, with an additional attention mechanism added to improve the effect. Since the appearance of large pre-trained models such as BERT, many NLG tasks have been built on pre-trained models, although autoregressive models represented by GPT and auto-encoding models represented by BERT are used differently. In addition, because labeled summary data is often lacking in real-world settings, much work has focused on unsupervised abstractive summarization using auto-encoders or other ideas.
3. Compressive summarization
Compressive summarization is somewhat similar to abstractive summarization in form, but differs in purpose. Its main objective is to filter out the redundant information in the source document and compress the original text to obtain the corresponding summary content.
The abstract generation method in the prior art has the technical problems of poor readability and high repeatability of the abstract.
The prior art is therefore still subject to further development.
Disclosure of Invention
In view of the above technical problems, embodiments of the present invention provide a method and a system for generating a text abstract in the financial field, which can solve the technical problems of poor readability and high repeatability of the abstract in the abstract generation method in the prior art.
The first aspect of the embodiments of the present invention provides a method for generating a text summary in the financial field, including:
acquiring a text data set of a financial field, and constructing a field corpus of the financial industry according to the text data set;
dividing the text data set into a training data set, a testing data set and a verification data set, preprocessing the training data set, the testing data set and the verification data set according to the domain corpus, and generating a training sample, a testing sample and a verification sample after carrying out a PADDING (length-padding) operation;
constructing a text abstract generation model, wherein the text abstract generation model consists of an encoding structure and a decoding structure, the encoding structure comprises an embedding layer and a bidirectional LSTM network and adopts one-hot encoding, the decoding structure comprises an embedding layer and a unidirectional LSTM network, a scaling-and-translation layer is added to the decoding structure, the scaling-and-translation layer adopts a pointer generation network, and the scaling-and-translation layer is also provided with a coverage vector;
training the text abstract generation model by using a training sample, and updating the text abstract generation model by adopting an Adam optimizer and a cross entropy loss function;
evaluating the updated text abstract generating model through the verification sample to obtain a text abstract generating model with the optimal evaluation result, and recording as an optimized text abstract generating model;
testing the optimized text abstract generation model through the test sample, and recording the tested optimized text abstract generation model as a target text abstract generation model;
inputting the text to be detected into a target text abstract generation model, and outputting the text abstract.
Optionally, the acquiring a text data set of a financial field and constructing a field corpus of a financial industry according to the text data set includes:
acquiring a text data set in the financial field, and obtaining a text and an abstract corresponding to the text according to the text data set;
and constructing a field corpus of the financial industry based on the text and the abstract corresponding to the text.
Optionally, the preprocessing of the training data set, the testing data set and the verification data set according to the domain corpus and the generation of the training sample, the testing sample and the verification sample after the PADDING (length-padding) operation include:
performing character indexing operation on texts and abstracts in a training data set, a testing data set and a verification data set according to a domain corpus to respectively obtain a text list and an abstract list;
and after PADDING complementing operation is carried out on the text list and the abstract list, generating a training sample, a testing sample and a verification sample.
Optionally, before the constructing the text abstract generating model, the method further includes:
and modifying the structure of the text abstract generation model through a unified language model mechanism.
Optionally, the training of the text summarization generation model by using the training samples and the updating of the text summarization generation model by using the Adam optimizer and the cross entropy loss function include:
inputting a training sample into a constructed text abstract generation model;
decoding by beam search, retaining only a specific number of candidate results at each step, generating a summary of the text, comparing it with the input summary, and calculating the cross entropy loss function;
and on the basis of minimizing the loss function as a training target, repeating the training process by adopting an Adam optimizer, and storing the model and the corresponding parameters obtained in the whole training process to obtain an updated text abstract generation model.
A second aspect of an embodiment of the present invention provides a system for generating a text summary in a financial field, where the system includes: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor implementing the steps of:
acquiring a text data set of a financial field, and constructing a field corpus of the financial industry according to the text data set;
dividing the text data set into a training data set, a testing data set and a verification data set, preprocessing the training data set, the testing data set and the verification data set according to the domain corpus, and generating a training sample, a testing sample and a verification sample after carrying out a PADDING (length-padding) operation;
constructing a text abstract generation model, wherein the text abstract generation model consists of an encoding structure and a decoding structure, the encoding structure comprises an embedding layer and a bidirectional LSTM network and adopts one-hot encoding, the decoding structure comprises an embedding layer and a unidirectional LSTM network, a scaling-and-translation layer is added to the decoding structure, the scaling-and-translation layer adopts a pointer generation network, and the scaling-and-translation layer is also provided with a coverage vector;
training the text abstract generation model by using a training sample, and updating the text abstract generation model by adopting an Adam optimizer and a cross entropy loss function;
evaluating the updated text abstract generating model through the verification sample to obtain a text abstract generating model with the optimal evaluation result, and recording as an optimized text abstract generating model;
testing the optimized text abstract generation model through the test sample, and recording the tested optimized text abstract generation model as a target text abstract generation model;
inputting the text to be detected into a target text abstract generation model, and outputting the text abstract.
Optionally, the computer program when executed by the processor further implements the steps of:
acquiring a text data set in the financial field, and obtaining a text and an abstract corresponding to the text according to the text data set;
and constructing a field corpus of the financial industry based on the text and the abstract corresponding to the text.
Optionally, the computer program when executed by the processor further implements the steps of:
performing character indexing operation on texts and abstracts in a training data set, a testing data set and a verification data set according to a domain corpus to respectively obtain a text list and an abstract list;
and after PADDING complementing operation is carried out on the text list and the abstract list, generating a training sample, a testing sample and a verification sample.
Optionally, the computer program when executed by the processor further implements the steps of:
inputting a training sample into a constructed text abstract generation model;
decoding by beam search, retaining only a specific number of candidate results at each step, generating a summary of the text, comparing it with the input summary, and calculating the cross entropy loss function;
and on the basis of minimizing the loss function as a training target, repeating the training process by adopting an Adam optimizer, and storing the model and the corresponding parameters obtained in the whole training process to obtain an updated text abstract generation model.
A third aspect of embodiments of the present invention provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer-executable instructions, and when the computer-executable instructions are executed by one or more processors, the one or more processors may be configured to execute the above-mentioned method for generating a text abstract for a financial field.
According to the technical scheme provided by the embodiments of the invention, a text data set of the financial field is collected and a domain corpus of the financial industry is constructed; the text data set is preprocessed according to the domain corpus to generate a training sample, a test sample and a verification sample; a text abstract generation model is constructed, wherein the model consists of an encoding structure and a decoding structure, the encoding structure comprises a bidirectional LSTM network and adopts one-hot encoding, the decoding structure comprises a unidirectional LSTM network to which a scaling-and-translation layer is added, and the scaling-and-translation layer adopts a pointer generation network and is also provided with a coverage vector; the text abstract generation model is trained with the training sample, the test sample and the verification sample to obtain a target text abstract generation model; the text to be detected is input into the target text abstract generation model and the text abstract is output. The implementation of the invention can improve the precision and accuracy of current automatic text summarization, improve the efficiency of text abstract generation, and assist the application of automatic text summarization in the financial field.
Drawings
Fig. 1 is a flowchart illustrating an embodiment of a method for generating a text abstract in the financial field according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a self-attention mechanism of an embodiment of a method for generating a text summary in the financial field according to the present invention;
FIG. 3 is a schematic diagram of a pointer generation network according to an embodiment of a method for generating a text abstract in the financial field;
fig. 4 is a schematic hardware configuration diagram of another embodiment of a text summary generation system for the financial field according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The following detailed description of embodiments of the invention refers to the accompanying drawings.
Referring to fig. 1, fig. 1 is a flowchart illustrating an embodiment of a method for generating a text abstract in the financial field according to the present invention. As shown in fig. 1, the method includes:
Step S100, collecting a text data set of the financial field, and constructing a domain corpus of the financial industry according to the text data set;
Step S200, dividing the text data set into a training data set, a testing data set and a verification data set, preprocessing the training data set, the testing data set and the verification data set according to the domain corpus, and generating a training sample, a testing sample and a verification sample after a PADDING (length-padding) operation;
Step S300, constructing a text abstract generation model, wherein the text abstract generation model consists of an encoding structure and a decoding structure, the encoding structure comprises an embedding layer and a bidirectional LSTM network and adopts one-hot encoding, the decoding structure comprises an embedding layer and a unidirectional LSTM network, a scaling-and-translation layer is added to the decoding structure, the scaling-and-translation layer adopts a pointer generation network, and the scaling-and-translation layer is also provided with a coverage vector;
Step S400, training the text abstract generation model with the training sample, and updating the text abstract generation model with an Adam optimizer and a cross entropy loss function;
Step S500, evaluating the updated text abstract generation model with the verification sample, obtaining the text abstract generation model with the best evaluation result, and recording it as the optimized text abstract generation model;
Step S600, testing the optimized text abstract generation model with the test sample, and recording the tested optimized text abstract generation model as the target text abstract generation model;
Step S700, inputting the text to be detected into the target text abstract generation model, and outputting the text abstract.
In a specific implementation, in step S100, data such as news reports, newsletters, news broadcast manuscripts and listed-company announcement texts in the financial field are collected, and the collected electronic versions of the data are sorted to obtain a text data set of the financial field. A domain corpus of the financial industry is constructed based on the text data set of the financial field, wherein the text data set includes, but is not limited to, texts and their corresponding abstracts.
In step S200, natural language text structuring is performed on the database texts. The raw data is divided into a training data set, a testing data set and a verification data set, and each data set is divided into different batches for processing. Based on the established domain corpus of the financial industry, a character-to-index conversion is performed on the sample texts and abstracts of each batch to obtain a list corresponding to each text and each abstract. The text uses an "ordinary conversion", i.e. without adding <START> and <END> identifiers; the abstract uses a "special conversion", in which <START> and <END> identifiers are added to the beginning and end of the sentence. A PADDING (length-padding) operation is then applied to the lists obtained from the sample data of each batch.
Based on the corpus constructed above, a neural network model is trained with the TensorFlow framework to perform automatic text summarization, and a mask mechanism is introduced to handle the padded sequences.
In step S300, the TensorFlow framework is used to train the neural network model for automatic text summarization. A mask mechanism is introduced to handle the padded sequences, and a unified language model is adopted to optimize and train the language model. A Seq2Seq encoder-decoder structure is built: the encoding stage creates an embedding layer and a bidirectional LSTM network, and the decoding stage creates an embedding layer and a unidirectional LSTM network. A multi-head attention mechanism is used, through the interaction between the encoding stage and the masked decoding stage, to capture the features of the text in the encoder. The text is processed with one-hot encoding, a scaling-and-translation layer is added, and a pointer generation network is introduced so that content can be copied from the original text while the ability to generate new words is retained, improving the accuracy of automatic text summarization. A coverage mechanism is introduced to track and control attention over the original text, preventing repeated characters from being generated and effectively eliminating repetition. Finally, the model is trained with an Adam optimizer and a cross entropy loss function to complete automatic text summarization.
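As an illustration of the encoder-decoder skeleton described above (shared embedding, bidirectional LSTM encoder, unidirectional LSTM decoder, Adam with cross entropy), a minimal Keras sketch follows; all layer sizes and names are assumptions, and the attention, scaling-and-translation, pointer and coverage components described later are omitted here:

```python
# Hypothetical Seq2Seq skeleton for the described model (illustrative only).
import tensorflow as tf

VOCAB_SIZE, EMB_DIM, HID_DIM = 20000, 256, 256   # assumed sizes

# Shared embedding layer (both sides are Chinese, so word vectors are shared).
embedding = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)

# Encoder: embedding + bidirectional LSTM.
enc_inputs = tf.keras.Input(shape=(None,), dtype="int32")
enc_emb = embedding(enc_inputs)
enc_outputs, fh, fc, bh, bc = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(HID_DIM, return_sequences=True, return_state=True)
)(enc_emb)
enc_h = tf.keras.layers.Concatenate()([fh, bh])
enc_c = tf.keras.layers.Concatenate()([fc, bc])

# Decoder: embedding + unidirectional LSTM initialized from the encoder state.
dec_inputs = tf.keras.Input(shape=(None,), dtype="int32")
dec_emb = embedding(dec_inputs)
dec_outputs, _, _ = tf.keras.layers.LSTM(
    2 * HID_DIM, return_sequences=True, return_state=True
)(dec_emb, initial_state=[enc_h, enc_c])

# Projection to the vocabulary (the prior and pointer mechanisms described
# later in the text would modify this distribution).
logits = tf.keras.layers.Dense(VOCAB_SIZE)(dec_outputs)

model = tf.keras.Model([enc_inputs, dec_inputs], logits)
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```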
One-hot encoding is a process of converting categorical variables into a form that is easy for machine learning algorithms to use. One-hot encoding, also known as one-bit-effective encoding, mainly uses a multi-bit status register to encode multiple states, each state being represented by its own independent register bit with only one bit active at any time. One-hot encoding represents a categorical variable as a binary vector: the categorical values are first mapped to integer values, and each integer value is then represented as a binary vector that is all zeros except for a 1 at the position of that integer index. One-hot encoding outputs a vector of vocabulary size to mark whether a word appears in the article.
Example:
given a sentence "government procurement time implementation regulation 3.1.days", assuming that the sentence constitutes a whole set of words, the steps of encoding each word using unique hot coding are as follows:
a vocabulary is created as in table 1 below:
TABLE 1
Word: the 1st through 15th characters of the example sentence, in order of appearance
Index: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
The one-hot code of each word is obtained by the following steps:
firstly, an all-zero vector whose dimension equals the total size of the vocabulary is created;
secondly, the dimension corresponding to the index of each word in the vocabulary is set to 1 while the other dimensions are left unchanged, giving the final one-hot encoded vector.
Taking the example sentence above, Table 2 gives the one-hot encoded representation of each character, resulting in a vector representation for each character.
TABLE 2
Word | Index | One-hot encoding
1st character | 0 | (100000000000000)
2nd character | 1 | (010000000000000)
3rd character | 2 | (001000000000000)
4th character | 3 | (000100000000000)
5th character | 4 | (000010000000000)
6th character | 5 | (000001000000000)
7th character | 6 | (000000100000000)
8th character | 7 | (000000010000000)
9th character | 8 | (000000001000000)
10th character | 9 | (000000000100000)
11th character | 10 | (000000000010000)
12th character | 11 | (000000000001000)
13th character | 12 | (000000000000100)
14th character | 13 | (000000000000010)
15th character | 14 | (000000000000001)
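A minimal Python sketch of the one-hot construction illustrated in Tables 1 and 2; the helper names and the stand-in tokens are assumptions, not part of the original text:

```python
# Hypothetical character-level one-hot encoding (illustrative).
def build_vocab(tokens):
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)      # index in order of first appearance
    return vocab

def one_hot(token, vocab):
    vec = [0] * len(vocab)               # all-zero vector, one dimension per vocabulary entry
    vec[vocab[token]] = 1                # mark the token's index position with 1
    return vec

# Stand-in tokens for the 15 characters of the example sentence in Table 1.
tokens = ["c%02d" % i for i in range(15)]
vocab = build_vocab(tokens)
print(one_hot(tokens[0], vocab))         # [1, 0, 0, ..., 0], as in Table 2
```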
The scaling-and-translation layer is specifically as follows: introducing prior knowledge generally accelerates convergence. Since both the input and output languages are Chinese, the embedding layers of the encoder and decoder can share parameters (i.e. use the same set of word vectors), which substantially reduces the number of parameters of the model.
In addition, there is another useful piece of prior knowledge: most of the words in the abstract appear in the article (they merely appear, not necessarily consecutively, and the abstract is certainly not simply contained in the article, otherwise the task would reduce to an ordinary sequence labeling problem). Therefore, the set of words appearing in the article can be used as a prior distribution and added to the classification model in the decoding process, so that the model is more inclined to select words that already exist in the article when decoding and producing output.
The original classification scheme: at each prediction step, an overall vector X is obtained and passed through a fully connected layer, finally giving a vector y = (y1, y2, ..., y|V|) of size |V|, where |V| is the number of words in the vocabulary. y is then normalized to obtain the original probability:
p_i = exp(y_i) / Σ_j exp(y_j)
The scheme with the prior distribution introduced: for each article, a 0/1 vector x = (x1, x2, ..., x|V|) of size |V| is obtained, where x_i = 1 indicates that the i-th word appears in the article and x_i = 0 otherwise. Passing this 0/1 vector through the scaling-and-translation layer yields a vector x̂ with components
x̂_i = s · x_i + t
where s and t are training parameters. The vector x̂ is then averaged with the original y and the result is normalized as follows:
p_i = exp((y_i + x̂_i) / 2) / Σ_j exp((y_j + x̂_j) / 2)
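A minimal sketch of the scaling-and-translation prior described above, for illustration only; treating s and t as trainable scalars and writing the layer with Keras are assumptions about the exact realization:

```python
# Hypothetical scaling-and-translation prior over the vocabulary (illustrative).
import tensorflow as tf

class ScaleShiftPrior(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # s and t are the trainable parameters of the scaling-and-translation layer.
        self.s = self.add_weight(name="s", shape=(), initializer="ones")
        self.t = self.add_weight(name="t", shape=(), initializer="zeros")

    def call(self, logits, article_mask):
        # article_mask: 0/1 vector of size |V|, 1 where the word occurs in the article.
        x_hat = self.s * article_mask + self.t      # x̂_i = s * x_i + t
        mixed = (logits + x_hat) / 2.0               # average with the original y
        return tf.nn.softmax(mixed, axis=-1)         # normalize to probabilities
```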
A unified language model mechanism is adopted to modify the structure of the pre-trained language model so that the text generation task can be completed. The unified language model mechanism treats Seq2Seq directly as sentence completion. In the automatic abstract generation task the input is a text and the output is its abstract, so the unified language model mechanism can combine the two into one sequence: [CLS] text [SEP] abstract [SEP]. After such a conversion, a language model can be trained that takes "[CLS] text [SEP]" as input and predicts the "abstract" word by word until "[SEP]" appears. Considering that only the "abstract" part actually needs to be predicted, the mask on the "text" part can be removed. In this way the attention over the input part is bidirectional while the attention over the output part is unidirectional, which satisfies the requirements of Seq2Seq without any additional constraint; the masked language model weights of BERT can be used for pre-training, and convergence is faster.
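A minimal sketch of the Seq2Seq attention mask implied by this unified-language-model arrangement (bidirectional over "[CLS] text [SEP]", left-to-right over the abstract); the exact segment layout and the 0/1 convention are assumptions:

```python
# Hypothetical UniLM-style attention mask for "[CLS] text [SEP] abstract [SEP]" (illustrative).
import numpy as np

def unilm_mask(text_len, abstract_len):
    total = text_len + abstract_len
    mask = np.zeros((total, total), dtype=np.float32)
    # Input part ([CLS] text [SEP]): every position may attend to the whole input.
    mask[:, :text_len] = 1.0
    # Output part (abstract [SEP]): each position may additionally attend only
    # to output positions at or before itself (unidirectional attention).
    for i in range(text_len, total):
        mask[i, text_len:i + 1] = 1.0
    # Input positions cannot see the output (already zero, kept explicit for clarity).
    mask[:text_len, text_len:] = 0.0
    return mask  # 1 = attention allowed, 0 = masked out
```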
In the self-attention mechanism, each word has three vectors: a query vector (Q), a key vector (K) and a value vector (V), each of length 64. These three vectors are obtained by multiplying the embedding vector X by three different weight matrices, each of dimension 512 x 64.
The calculation process of the self-attention mechanism can be divided into the following steps:
(1) converting the input word into an embedding vector;
(2) obtaining the three vectors Q, K and V from the embedding vector;
(3) multiplying the vector Q with the vectors K to obtain a score for each position;
(4) dividing each score by a scaling factor so that the gradients are more stable;
(5) normalizing the scaled scores;
(6) multiplying the normalized scores by the vectors V to obtain the weighted value vectors;
(7) summing the weighted value vectors and outputting a vector Z that incorporates the self-attention mechanism.
The self-attention mechanism is shown in fig. 2:
The multi-head attention mechanism is equivalent to an ensemble of several different self-attention mechanisms. The input is split into multiple heads, each applying self-attention in its own subspace, so that the model can attend to different aspects of the information; the outputs of the heads are finally combined.
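A minimal single-head sketch of the scaled dot-product attention steps listed above; the 512-to-64 projection sizes follow the description, while everything else (random inputs, numpy implementation) is illustrative:

```python
# Hypothetical single-head scaled dot-product self-attention (illustrative).
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, 512) embedding vectors; W*: (512, 64) weight matrices.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # step 2: obtain Q, K, V
    scores = Q @ K.T                            # step 3: pairwise scores
    scores = scores / np.sqrt(Q.shape[-1])      # step 4: scale for stable gradients
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # step 5: normalize (softmax)
    return weights @ V                          # steps 6-7: weight the values and sum

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 512))
Wq, Wk, Wv = (rng.normal(size=(512, 64)) for _ in range(3))
Z = self_attention(X, Wq, Wk, Wv)               # (10, 64) output vectors Z
```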
The structure of the pointer generation network is shown in fig. 3:
Suppose the decoder needs to predict the next word. The decoder state vector at the current moment is combined, based on the multi-head attention mechanism, with the encoder hidden vectors of every time step to obtain the degree of attention of the current decoder time step over all encoder time steps; after normalization, the attention probability distribution a_t is obtained. The hidden state of each encoder time step is multiplied by the attention probability distribution and the products are summed to obtain the context vector h*_t. After being concatenated with the hidden state of the current step, the probability distribution P_vocab over the predicted words is obtained through a fully connected layer and normalization.
In the pointer generation network, a generation probability p_gen ∈ [0, 1] is proposed, with the calculation formula:
p_gen = σ(w_h^T h*_t + w_s^T s_t + w_x^T x_t + b)
where h*_t is the context vector, s_t is the state vector of the current step and x_t is the input vector of the current step; w_h^T, w_s^T and w_x^T are the transposes of the corresponding weight vectors, b is a bias term, and σ is the sigmoid activation function. p_gen determines whether the current step copies a word from the source text as the predicted word or generates a word from the vocabulary as the predicted word. Integrating the copy probability and the generation probability gives the predicted word of the current step:
P(w) = p_gen · P_vocab(w) + (1 − p_gen) · Σ_{i: w_i = w} a_t^i
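A minimal sketch of how the copy and generation probabilities are combined into the final distribution P(w), following the formulas above; the tensor shapes and parameter names are assumptions:

```python
# Hypothetical pointer-generator final distribution (illustrative, numpy).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def final_distribution(p_vocab, attention, src_ids,
                       w_h, w_s, w_x, b, h_star, s_t, x_t):
    # p_gen = sigmoid(w_h . h*_t + w_s . s_t + w_x . x_t + b)
    p_gen = sigmoid(w_h @ h_star + w_s @ s_t + w_x @ x_t + b)
    # Generation part: scale the vocabulary distribution.
    p_final = p_gen * p_vocab
    # Copy part: scatter the attention weights onto the source-word ids.
    for i, word_id in enumerate(src_ids):
        p_final[word_id] += (1.0 - p_gen) * attention[i]
    # P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: w_i = w} a_t^i
    return p_final
```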
the covering mechanism is specifically as follows: adding a coverage vector ctIts value is the sum of the decoder's past time step attention distributions:
Figure BDA0003191386560000096
this vector represents the distribution of the degree of coverage of the source word by the past time step. When the attention degree of the current time step of the decoder to all the time steps of the encoder is calculated through a multi-head attention mechanism, the distribution c is divided intotTaking into consideration the degree of attention of the t-th time step of the decoder
Figure BDA0003191386560000097
C is totCarry over calculation
Figure BDA0003191386560000098
In the equation (2), when the attention of the current time step to the source text is calculated, the attention of the previous time step is also taken into consideration (if the sum of the attention of the previous time step to a word in the source text is already high, the attention of the current step to the word is less), so as to reduce repeated generation.
Calculate out
Figure BDA0003191386560000101
Then, through normalization, the probability distribution a of attention can be obtainedt
In addition, at each time step, a penalty term needs to be added to the previously generated word, and the form of the penalty term is as follows:
Figure BDA0003191386560000102
if the degree of attention received by a word in the past is already large, that is, the word is focused on
Figure BDA0003191386560000103
Is great and the attention to this time is also great, i.e.
Figure BDA0003191386560000104
Also, it is large, and this situation is penalized faster. Finally, the loss function is of the form:
Figure BDA0003191386560000105
at each time step, there is a loss value losstThe sum and the derivative are added, and then the back propagation can be carried out.
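A minimal sketch of the coverage update and per-step loss described above, for illustration only; the value of λ and the toy attention distributions are assumptions:

```python
# Hypothetical coverage tracking and coverage loss (illustrative, numpy).
import numpy as np

def coverage_step(coverage, attention, p_target_word, lam=1.0):
    # Coverage penalty: sum_i min(a_t^i, c_t^i); large only when a source word
    # that was already heavily attended is attended again at this step.
    cov_loss = np.minimum(attention, coverage).sum()
    # Per-step loss: negative log-likelihood of the reference word plus the penalty.
    loss_t = -np.log(p_target_word + 1e-12) + lam * cov_loss
    # Coverage vector: running sum of the past attention distributions.
    new_coverage = coverage + attention
    return loss_t, new_coverage

coverage = np.zeros(5)                       # 5 source positions
total_loss = 0.0
for attention, p_ref in [(np.array([0.7, 0.1, 0.1, 0.05, 0.05]), 0.6),
                         (np.array([0.6, 0.2, 0.1, 0.05, 0.05]), 0.4)]:
    loss_t, coverage = coverage_step(coverage, attention, p_ref)
    total_loss += loss_t                     # summed, then back-propagated
```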
In step S400, the model is trained with an Adam optimizer and a cross entropy loss function based on the constructed database and neural network structure. Large-scale texts and their corresponding abstracts are input to train the built model; decoding is carried out by beam search, retaining only a specific number of candidate results at each step; a summary of the text is generated, compared with the input summary, and the cross entropy loss function is calculated. With minimization of the loss function as the training objective, the training process is repeated with the Adam optimizer, and the models and corresponding parameters obtained throughout the training process are saved.
In step S500, the automatically generated summary of the text is compared with the corresponding summary, and the evaluation index value is calculated, where a higher score indicates a better effect of the automatic text summary. In addition, on the basis of algorithm evaluation, the established financial industry field expert group carries out manual examination and evaluation on the automatic text summarization result based on the accumulation of the professional industry field of the expert group, carries out iterative optimization on the model, and finally outputs the automatic text summarization model with the highest precision.
The full name of the ROUGE metric is Recall-Oriented Understudy for Gisting Evaluation. It evaluates summaries mainly based on the co-occurrence of N-grams between summaries and is a recall-oriented N-gram evaluation method. ROUGE is an automatic evaluation metric for article summaries: several experts produce manual summaries to form a set of reference (standard) summaries, the automatically generated summary is compared against this reference set, and the overlapping basic units between them are counted to obtain a score, which measures the similarity between the automatically generated summary and the reference summaries and thus the quality of the summary.
ROUGE-N: the recall is calculated on an N-gram language model, where N can take values such as 1 or 2. The formula is as follows:
ROUGE-N = Σ_{S ∈ {ReferenceSummaries}} Σ_{gram_N ∈ S} Count_match(gram_N) / Σ_{S ∈ {ReferenceSummaries}} Σ_{gram_N ∈ S} Count(gram_N)
where N denotes the length of the N-gram, an N-gram is a sequence of N consecutive words, {ReferenceSummaries} is the set of reference summaries, i.e. the standard summaries obtained in advance, Count_match(gram_N) is the number of N-grams that occur simultaneously in the candidate summary and the reference summary, and Count(gram_N) is the number of N-grams occurring in the reference summary.
The numerator is the number of N-grams shared by the reference summary and the summary automatically generated by the model, and the denominator is the number of all N-grams in the reference summary. Suppose the summary automatically generated by the model is "reorganization after trading suspension" and the reference summary is roughly "trading suspension and reorganization only for share issuance"; counting over the characters of the original Chinese example, with N equal to 1 and 2 the results are ROUGE-1 = 4/7 and ROUGE-2 = 2/6. The calculation process is shown in Table 3:
TABLE 3
N | Total number of N-grams in the reference | Matched content | Number of matches | Score
1 | 7 | trading-suspension reorganization | 4 | 4/7
2 | 6 | trading-suspension reorganization | 2 | 2/6
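A minimal sketch of the ROUGE-N recall computation used in the example above; character-level n-grams and a single reference summary are assumed:

```python
# Hypothetical ROUGE-N (recall) over character n-grams (illustrative).
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n):
    cand_counts = Counter(ngrams(list(candidate), n))
    ref_counts = Counter(ngrams(list(reference), n))
    # Numerator: n-grams shared by reference and candidate (clipped counts).
    overlap = sum(min(cand_counts[g], c) for g, c in ref_counts.items())
    # Denominator: all n-grams in the reference summary.
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

# With a 7-character reference sharing 4 characters with the candidate, ROUGE-1 = 4/7;
# with 6 reference bigrams of which 2 are shared, ROUGE-2 = 2/6 (as in Table 3).
```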
Step S600 tests the optimized text abstract generation model with the test sample, and the tested optimized text abstract generation model is recorded as the target text abstract generation model.
In step S700, based on the established financial-field dictionary and the trained automatic text summarization model, intelligent automatic text summarization is implemented and the corresponding summary is output for a newly input text. The specific process is as follows: a text is input and preprocessed, the text is segmented based on the dictionary, the segmented text undergoes character-to-index conversion, the summary is predicted with the trained model, and the summary corresponding to the text is output.
Further, collecting a text data set of a financial field, and constructing a field corpus of the financial industry according to the text data set, comprising:
acquiring a text data set in the financial field, and obtaining a text and an abstract corresponding to the text according to the text data set;
and constructing a field corpus of the financial industry based on the text and the abstract corresponding to the text.
Specifically, data such as news reports, news broadcast manuscripts and the original texts of listed-company announcements in the financial field are collected, the collected electronic versions of the data are sorted, and the final data are obtained: the texts and the abstracts corresponding to the texts.
A domain corpus of the financial industry is constructed based on the financial-field texts and their corresponding abstracts. The corpus comprises three modules. In the first module, the keys are the characters appearing in the corpus and the value of each key is the number of times the character appears in the texts or abstracts. In the second module, the keys are the characters appearing in the corpus and the value of each key is the index position of the character. In the third module, the keys are index positions and the values are the corresponding characters appearing in the corpus; the key-value pairs of the third module are exactly the reverse of those of the second module. In addition, the third module reserves 4 positions. The position with index 0 corresponds to PADDING, whose function is to pad sequences of different lengths so that they have the same length; the position with index 1 corresponds to UNKNOWN, which represents rare words that do not appear in the dictionary; the position with index 2 corresponds to START, and a <START> identifier added to a text marks its starting position; the position with index 3 corresponds to END, and an <END> identifier added to a text marks its ending position.
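A minimal sketch of the three corpus modules and the four reserved positions described above; the token spellings and helper names are assumptions:

```python
# Hypothetical construction of the three corpus modules (illustrative).
from collections import Counter

RESERVED = ["<PAD>", "<UNK>", "<START>", "<END>"]   # reserved indices 0-3

def build_corpus_modules(texts_and_abstracts):
    # Module 1: character -> occurrence count across texts and abstracts.
    char_counts = Counter(ch for doc in texts_and_abstracts for ch in doc)
    # Module 2: character -> index position (reserved tokens occupy 0-3).
    char_to_index = {tok: i for i, tok in enumerate(RESERVED)}
    for ch in char_counts:
        char_to_index.setdefault(ch, len(char_to_index))
    # Module 3: index position -> character (inverse of module 2).
    index_to_char = {i: ch for ch, i in char_to_index.items()}
    return char_counts, char_to_index, index_to_char
```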
Further, the preprocessing of the training data set, the testing data set and the verification data set according to the domain corpus and the generation of the training sample, the testing sample and the verification sample after the PADDING (length-padding) operation include:
performing character indexing operation on texts and abstracts in a training data set, a testing data set and a verification data set according to a domain corpus to respectively obtain a text list and an abstract list;
and after PADDING complementing operation is carried out on the text list and the abstract list, generating a training sample, a testing sample and a verification sample.
In a specific implementation, natural language text structuring is performed on the database texts. The raw data is divided into a training data set, a testing data set and a verification data set, and each data set is divided into different batches for processing. Based on the constructed dictionary, a character-to-index conversion is performed on the sample texts and abstracts of each batch to obtain a list corresponding to each text and each abstract. The text uses an "ordinary conversion", i.e. without adding <START> and <END> identifiers; the abstract uses a "special conversion", in which <START> and <END> identifiers are added to the beginning and end of the sentence. A PADDING operation is then applied to the lists obtained from the sample data of each batch.
The ordinary conversion scheme is as follows: the first characters of the text are extracted up to a specific number, assumed here to be 512. For each extracted character, the second module of the financial-industry corpus is searched with the character as the key, and the corresponding value, i.e. the index corresponding to the character, is returned; if it is not found, 1 is returned. Each character yields one value, and all the values form a list of length 512.
The special conversion proceeds as follows: the first 510 characters of the abstract are extracted, and for each character the second module of the financial-industry corpus is searched with the character as the key, and the corresponding value, i.e. the index of the character, is returned; if it is not found, 1 is returned. Each character yields one value, and all the values form a list of length 510. The first position of the list is then filled with <START> and the last position with <END>.
The PADDING operation is as follows: assuming that each batch processes 128 pieces of data, 128 pieces of text data are extracted; after the character-to-index conversion, each piece of text data yields a list whose length corresponds to the number of characters in the text, so the 128 pieces of text data correspond to 128 lists of different lengths. Because lists of unequal length easily lose part of the data during model training, the lists are processed as follows:
the length of the integer list corresponding to each text is calculated; the maximum length of the 128 lists is determined and recorded as ML; and, based on the maximum length ML, every list shorter than ML is filled with 0, so that the 128 lists all have the same length, equal to ML.
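A minimal sketch of the ordinary conversion, special conversion and PADDING steps described above; the hard-coded reserved indices (0 for padding, 1 for unknown, 2 for <START>, 3 for <END>) follow the corpus description, and everything else is illustrative:

```python
# Hypothetical character-to-index conversion and batch padding (illustrative).
def convert_text(text, char_to_index, max_chars=512):
    # "Ordinary conversion": no <START>/<END>, unknown characters map to index 1.
    return [char_to_index.get(ch, 1) for ch in text[:max_chars]]

def convert_abstract(abstract, char_to_index, max_chars=510):
    # "Special conversion": prepend <START> (2) and append <END> (3).
    return [2] + [char_to_index.get(ch, 1) for ch in abstract[:max_chars]] + [3]

def pad_batch(index_lists):
    # PADDING: fill every list with 0 up to the batch maximum length ML.
    ml = max(len(lst) for lst in index_lists)
    return [lst + [0] * (ml - len(lst)) for lst in index_lists]
```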
Further, before constructing the text abstract generation model, the method further comprises:
and modifying the structure of the text abstract generation model through a unified language model mechanism.
In a specific implementation, a unified language model mechanism is adopted to modify the structure of the pre-trained language model, which helps complete the text generation task. The unified language model mechanism treats Seq2Seq directly as sentence completion. In the automatic abstract generation task the input is a text and the output is its abstract, so the unified language model mechanism can combine the two into one sequence: [CLS] text [SEP] abstract [SEP]. After such a conversion, a language model can be trained that takes "[CLS] text [SEP]" as input and predicts the "abstract" word by word until "[SEP]" appears. Considering that only the "abstract" part actually needs to be predicted, the mask on the "text" part can be removed. In this way the attention over the input part is bidirectional while the attention over the output part is unidirectional, which satisfies the requirements of Seq2Seq without any additional constraint; the masked language model weights of BERT can be used for pre-training, and convergence is faster.
Further, training the text abstract generation model by using a training sample, and updating the text abstract generation model by adopting an Adam optimizer and a cross entropy loss function, wherein the method comprises the following steps of:
inputting a training sample into a constructed text abstract generation model;
decoding by beam search, retaining only a specific number of candidate results at each step, generating a summary of the text, comparing it with the input summary, and calculating the cross entropy loss function;
and on the basis of minimizing the loss function as a training target, repeating the training process by adopting an Adam optimizer, and storing the model and the corresponding parameters obtained in the whole training process to obtain an updated text abstract generation model.
In a specific implementation, the model is trained with an Adam optimizer and a cross entropy loss function based on the constructed database and neural network structure. Large-scale texts and their corresponding abstracts are input to train the built model; decoding is carried out by beam search, retaining only a specific number of candidate results at each step; a summary of the text is generated, compared with the input summary, and the cross entropy loss function is calculated. With minimization of the loss function as the training objective, the training process is repeated with the Adam optimizer, and the models and corresponding parameters obtained throughout the training process are saved.
The specific process of beam search is as follows:
Assume that the beam width of the beam search is K. When predicting the next word based on the current word, the following operations are performed: in the first time step, the K words with the largest current conditional probability are selected as the first words of the candidate output sequences. In the second time step, the output sequences from the first step are combined with all words in the vocabulary to obtain a number of new sequences, the score of each sequence is calculated, and the K sequences with the highest scores are selected as the new current sequences. This process is repeated at every time step until an end symbol is encountered or the maximum length is reached, and finally the K sequences with the highest scores are output.
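A minimal sketch of the beam search loop described above; the `next_token_log_probs` callable stands in for the trained decoder and is an assumption:

```python
# Hypothetical beam search decoder (illustrative).
def beam_search(next_token_log_probs, start_id, end_id, beam_width=4, max_len=50):
    # Each candidate is (token sequence, cumulative log-probability score).
    beams = [([start_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:                 # finished sequences are kept as-is
                candidates.append((seq, score))
                continue
            for token_id, logp in next_token_log_probs(seq):  # expand with the vocabulary
                candidates.append((seq + [token_id], score + logp))
        # Keep only the K highest-scoring sequences for the next step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_id for seq, _ in beams):
            break
    return beams   # the K best sequences and their scores
```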
As can be seen from the above method embodiments, compared with the prior art, the embodiments of the present invention have the following advantages:
a domain corpus of the financial industry is constructed. Currently, there are many corpora in general fields and the application is also wide. However, the corpus of the financial industry is relatively overwhelmed, the application of artificial intelligence technology is relatively few, and the deficient corpus hinders the intelligent reform of the financial industry. The corpus of the financial industry constructed by the embodiment of the invention can assist the intelligent question answering, intelligent search and other multi-application ground of the industry.
A multi-head attention mechanism is introduced into automatic text summarization. Current feed-forward and recurrent networks are limited by computing power and by the optimization algorithm. Limitation of computing power: when a large amount of "information" has to be remembered, the model becomes more complex, yet computing power remains a bottleneck limiting the development of neural networks. Limitation of the optimization algorithm: although operations such as local connection, weight sharing and pooling can simplify the neural network and effectively ease the contradiction between model complexity and expressive power, the information "memory" capability of existing recurrent neural networks is still not high. The embodiment of the invention introduces a multi-head attention mechanism that, in the manner of the human brain handling information overload, dynamically generates the weights of different connections, thereby handling longer information sequences, improving the processing capability of the neural network and greatly improving the precision of automatically generated summaries of long texts.
A pointer generation network mechanism is introduced. Problems in the current abstract generation model: erroneous information may be generated. The model of the embodiment of the invention can copy information from the source text and can generate new information at the same time. Avoiding the generation of erroneous information.
The invention introduces a pointer generation network mechanism, which can conveniently copy words from the original text by pointing, improving accuracy and handling out-of-vocabulary (OOV) words while retaining the ability to generate new words. This network can be regarded as a balance between the abstractive and the extractive summarization methods.
A coverage mechanism is introduced. A problem of current summary generation models is that they produce repetitive information, which is especially noticeable when multiple sentences are generated. The embodiment of the invention introduces a coverage mechanism for tracking and controlling attention over the original text. A coverage vector is maintained that is the sum of the attention values of the decoding stage, i.e. a distribution over the words of the original text representing how much coverage these words have received from the attention mechanism. This ensures that the current decision of the attention mechanism takes the previous information into account, avoids attending repeatedly to the same positions, prevents the generation of repeated words and effectively eliminates repetition.
A unified language model mechanism is introduced. Language model and training is a technique to "teach" machine learning systems to contextualize text representations by letting them predict words according to context, however current pre-trained models are bi-directional in relation (use of left and right context of words to form predictions) and are not suitable for generating natural language tasks with a large number of modifications.
The embodiment of the invention also introduces a unified language model mechanism to pre-train on a large amount of text and optimize the language model. Different self-attention masks are used so that context is aggregated for different types of language models; in addition, because the pre-training procedure is unified, the network can share parameters, making the learned text representations more general and reducing overfitting to any single task.
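To illustrate how different self-attention masks can configure one network for different language-model types, the following is a minimal sketch assuming PyTorch; the exact mask layout used by the embodiment is not specified, so this is illustrative only.

```python
import torch

def unilm_attention_mask(src_len, tgt_len, mode="seq2seq"):
    """Build a self-attention mask over a [source ; target] token sequence.

    A True entry means "this position may attend to that position".
    """
    total = src_len + tgt_len
    if mode == "bidirectional":
        # every token attends to every token (encoder-style modelling)
        return torch.ones(total, total).bool()
    if mode == "left_to_right":
        # causal mask (standard left-to-right language modelling)
        return torch.tril(torch.ones(total, total)).bool()
    if mode == "seq2seq":
        # source tokens attend bidirectionally within the source;
        # target tokens attend to the full source and causally to the target
        mask = torch.zeros(total, total).bool()
        mask[:, :src_len] = True
        mask[src_len:, src_len:] = torch.tril(torch.ones(tgt_len, tgt_len)).bool()
        return mask
    raise ValueError(f"unknown mode: {mode}")
```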
It should be noted that the above steps do not necessarily have to be performed in a fixed order; as those skilled in the art will understand from the description of the embodiments of the present invention, the above steps may have different execution orders in different embodiments, for example they may be executed in parallel or interchanged, and the like.
Having described the method for generating a text abstract for the financial field in the embodiment of the present invention above, a system for generating a text abstract for the financial field in the embodiment of the present invention is described below. Please refer to fig. 4, which is a schematic hardware structure diagram of another embodiment of a system for generating a text abstract for the financial field in the embodiment of the present invention. As shown in fig. 4, the system 10 includes: a memory 101, a processor 102, and a computer program stored on the memory and executable on the processor, the computer program implementing the following steps when executed by the processor 102:
acquiring a text data set of a financial field, and constructing a field corpus of the financial industry according to the text data set;
dividing the text data set into a training data set, a testing data set and a verification data set, preprocessing the training data set, the testing data set and the verification data set according to the domain corpus, and generating a training sample, a testing sample and a verification sample after carrying out a PADDING (padding to a fixed length) operation;
constructing a text abstract generation model, wherein the text abstract generation model consists of an encoding structure and a decoding structure, the encoding structure comprises an embedding layer and a bidirectional LSTM network and adopts a one-hot encoding processing mode, the decoding structure comprises an embedding layer and a unidirectional LSTM network, a zooming translation layer is added in the decoding structure, the zooming translation layer adopts a pointer-generator network, and the zooming translation layer is also provided with a coverage vector;
training the text abstract generation model by using a training sample, and updating the text abstract generation model by adopting an Adam optimizer and a cross entropy loss function;
evaluating the updated text abstract generation model through the verification sample to obtain the text abstract generation model with the best evaluation result, and recording it as the optimized text abstract generation model;
testing the optimized text abstract generation model through the test sample, and recording the tested optimized text abstract generation model as a target text abstract generation model;
inputting the text to be summarized into the target text abstract generation model, and outputting the text abstract.
The specific implementation steps are the same as those of the method embodiments, and are not described herein again.
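As a concrete illustration of the encoder-decoder structure in the model construction step above, the following is a minimal sketch assuming PyTorch; the vocabulary size, embedding dimension and hidden size are illustrative values, and the pointer-generator and coverage components discussed elsewhere in this description are omitted for brevity.

```python
import torch
import torch.nn as nn

class SummaryModel(nn.Module):
    """Skeleton of the described encoder-decoder: an embedding layer with a
    bidirectional LSTM on the encoder side and an embedding layer with a
    unidirectional LSTM on the decoder side."""
    def __init__(self, vocab_size=5000, emb_dim=128, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.tgt_emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.encoder = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, 2 * hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, src_ids, tgt_ids):
        enc_out, (h, c) = self.encoder(self.src_emb(src_ids))
        # merge the forward and backward final states to initialise the decoder
        h0 = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)   # (1, batch, 2*hidden)
        c0 = torch.cat([c[0], c[1]], dim=-1).unsqueeze(0)
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), (h0, c0))
        return self.out(dec_out)                            # (batch, tgt_len, vocab)
```

Concatenating the final forward and backward encoder states to initialise the decoder is one common way to connect a bidirectional encoder to a unidirectional decoder.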
Optionally, the computer program when executed by the processor 102 further implements the steps of:
acquiring a text data set in the financial field, and obtaining a text and an abstract corresponding to the text according to the text data set;
and constructing a field corpus of the financial industry based on the text and the abstract corresponding to the text.
The specific implementation steps are the same as those of the method embodiments, and are not described herein again.
Optionally, the computer program when executed by the processor 102 further implements the steps of:
performing character indexing operation on texts and abstracts in a training data set, a testing data set and a verification data set according to a domain corpus to respectively obtain a text list and an abstract list;
and generating a training sample, a testing sample and a verification sample after a PADDING (padding to a fixed length) operation is carried out on the text list and the abstract list.
The specific implementation steps are the same as those of the method embodiments, and are not described herein again.
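As an illustration of the character indexing and PADDING steps above, the following is a minimal sketch; the dictionary char2id, the special token ids and the example characters are hypothetical.

```python
def texts_to_padded_ids(texts, char2id, max_len, pad_id=0, unk_id=1):
    """Map each text to a fixed-length list of character indices.

    Characters outside the domain corpus vocabulary fall back to unk_id;
    sequences are truncated or right-padded with pad_id to max_len.
    """
    batch = []
    for text in texts:
        ids = [char2id.get(ch, unk_id) for ch in text[:max_len]]
        ids += [pad_id] * (max_len - len(ids))   # PADDING complement
        batch.append(ids)
    return batch

# Tiny usage example with a toy vocabulary
char2id = {"<pad>": 0, "<unk>": 1, "股": 2, "市": 3}
print(texts_to_padded_ids(["股市", "股"], char2id, max_len=4))
# [[2, 3, 0, 0], [2, 0, 0, 0]]
```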
Optionally, the computer program when executed by the processor 102 further implements the steps of:
and modifying the structure of the text abstract generation model through a unified language model mechanism.
The specific implementation steps are the same as those of the method embodiments, and are not described herein again.
Optionally, the computer program when executed by the processor 102 further implements the steps of:
inputting a training sample into a constructed text abstract generation model;
decoding in a beam search mode, retaining only a specific number of candidate results at each step, generating an abstract of the text, comparing the generated abstract with the input reference abstract, and calculating the cross entropy loss function;
and taking minimization of the loss function as the training target, repeating the training process with an Adam optimizer, and storing the models and corresponding parameters obtained during the whole training process to obtain the updated text abstract generation model.
The specific implementation steps are the same as those of the method embodiments, and are not described herein again.
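As an illustration of the training update described above (cross entropy loss against the reference abstract, optimized with Adam), the following is a minimal teacher-forcing sketch assuming PyTorch and a model with the encoder-decoder interface sketched earlier; it is not the embodiment's exact procedure.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, src_ids, tgt_ids, pad_id=0):
    """One teacher-forced training step: predict each abstract token from the
    previous reference tokens, compute cross entropy, and apply an Adam update."""
    model.train()
    optimizer.zero_grad()
    logits = model(src_ids, tgt_ids[:, :-1])          # predict tokens 1..T
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tgt_ids[:, 1:].reshape(-1),
        ignore_index=pad_id,                           # padded positions do not count
    )
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```

At inference time, decoding would instead use beam search, keeping only a fixed number of candidate abstracts at each step as described above.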
Embodiments of the present invention provide a non-transitory computer-readable storage medium storing computer-executable instructions for execution by one or more processors, for example, to perform method steps S100-S700 in fig. 1 described above.
By way of example, non-volatile storage media can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SynchLink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The disclosed memory components or memory of the operating environment described in embodiments of the invention are intended to comprise one or more of these and/or any other suitable types of memory.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for generating a text abstract in the financial field is characterized by comprising the following steps:
acquiring a text data set of a financial field, and constructing a field corpus of the financial industry according to the text data set;
dividing the text data set into a training data set, a testing data set and a verification data set, preprocessing the training data set, the testing data set and the verification data set according to the domain corpus, and generating a training sample, a testing sample and a verification sample after carrying out a PADDING (padding to a fixed length) operation;
constructing a text abstract generation model, wherein the text abstract generation model consists of an encoding structure and a decoding structure, the encoding structure comprises an embedding layer and a bidirectional LSTM network and adopts a one-hot encoding processing mode, the decoding structure comprises an embedding layer and a unidirectional LSTM network, a zooming translation layer is added in the decoding structure, the zooming translation layer adopts a pointer-generator network, and the zooming translation layer is also provided with a coverage vector;
training the text abstract generation model by using a training sample, and updating the text abstract generation model by adopting an Adam optimizer and a cross entropy loss function;
evaluating the updated text abstract generation model through the verification sample to obtain the text abstract generation model with the best evaluation result, and recording it as the optimized text abstract generation model;
testing the optimized text abstract generation model through the test sample, and recording the tested optimized text abstract generation model as a target text abstract generation model;
inputting the text to be summarized into the target text abstract generation model, and outputting the text abstract.
2. The method for generating the text abstract for the financial field according to claim 1, wherein the collecting text data sets of the financial field and constructing a field corpus of the financial industry according to the text data sets comprises:
acquiring a text data set in the financial field, and obtaining a text and an abstract corresponding to the text according to the text data set;
and constructing a field corpus of the financial industry based on the text and the abstract corresponding to the text.
3. The method for generating the text abstract for the financial field according to claim 2, wherein the generating of the training sample, the testing sample and the verification sample after preprocessing the training data set, the testing data set and the verification data set according to the domain corpus and carrying out a PADDING (padding to a fixed length) operation comprises:
performing character indexing operation on texts and abstracts in a training data set, a testing data set and a verification data set according to a domain corpus to respectively obtain a text list and an abstract list;
and generating a training sample, a testing sample and a verification sample after a PADDING (padding to a fixed length) operation is carried out on the text list and the abstract list.
4. The method for generating a text abstract for the financial field as claimed in claim 1, wherein before constructing the text abstract generating model, the method further comprises:
and modifying the structure of the text abstract generation model through a unified language model mechanism.
5. The method for generating the text abstract for the financial field according to claim 1, wherein the training of the text abstract generation model by using the training samples and the updating of the text abstract generation model by using an Adam optimizer and a cross entropy loss function comprises:
inputting a training sample into a constructed text abstract generation model;
decoding in a beam search mode, retaining only a specific number of candidate results at each step, generating an abstract of the text, comparing the generated abstract with the input reference abstract, and calculating the cross entropy loss function;
and taking minimization of the loss function as the training target, repeating the training process with an Adam optimizer, and storing the models and corresponding parameters obtained during the whole training process to obtain the updated text abstract generation model.
6. A system for generating a text abstract for the financial field, the system comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor implementing the steps of:
acquiring a text data set of a financial field, and constructing a field corpus of the financial industry according to the text data set;
dividing the text data set into a training data set, a testing data set and a verification data set, preprocessing the training data set, the testing data set and the verification data set according to the domain corpus, and generating a training sample, a testing sample and a verification sample after carrying out a PADDING (padding to a fixed length) operation;
constructing a text abstract generation model, wherein the text abstract generation model consists of an encoding structure and a decoding structure, the encoding structure comprises an embedding layer and a bidirectional LSTM network and adopts a one-hot encoding processing mode, the decoding structure comprises an embedding layer and a unidirectional LSTM network, a zooming translation layer is added in the decoding structure, the zooming translation layer adopts a pointer-generator network, and the zooming translation layer is also provided with a coverage vector;
training the text abstract generation model by using a training sample, and updating the text abstract generation model by adopting an Adam optimizer and a cross entropy loss function;
evaluating the updated text abstract generation model through the verification sample to obtain the text abstract generation model with the best evaluation result, and recording it as the optimized text abstract generation model;
testing the optimized text abstract generation model through the test sample, and recording the tested optimized text abstract generation model as a target text abstract generation model;
inputting the text to be summarized into the target text abstract generation model, and outputting the text abstract.
7. The system for generating a text abstract for the financial field according to claim 6, wherein the computer program, when executed by the processor, further performs the steps of:
acquiring a text data set in the financial field, and obtaining a text and an abstract corresponding to the text according to the text data set;
and constructing a field corpus of the financial industry based on the text and the abstract corresponding to the text.
8. The system for generating a text abstract for the financial field according to claim 7, wherein the computer program, when executed by the processor, further performs the steps of:
performing character indexing operation on texts and abstracts in a training data set, a testing data set and a verification data set according to a domain corpus to respectively obtain a text list and an abstract list;
and after PADDING complementing operation is carried out on the text list and the abstract list, generating a training sample, a testing sample and a verification sample.
9. The system for generating a text abstract for the financial field according to claim 6, wherein the computer program, when executed by the processor, further performs the steps of:
inputting a training sample into a constructed text abstract generation model;
decoding in a beam search mode, retaining only a specific number of candidate results at each step, generating an abstract of the text, comparing the generated abstract with the input reference abstract, and calculating the cross entropy loss function;
and taking minimization of the loss function as the training target, repeating the training process with an Adam optimizer, and storing the models and corresponding parameters obtained during the whole training process to obtain the updated text abstract generation model.
10. A non-transitory computer-readable storage medium storing computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the method for generating a text abstract for the financial field of any one of claims 1-5.
CN202110879065.8A 2021-08-02 2021-08-02 Text abstract generation method and system for financial field Withdrawn CN113821635A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110879065.8A CN113821635A (en) 2021-08-02 2021-08-02 Text abstract generation method and system for financial field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110879065.8A CN113821635A (en) 2021-08-02 2021-08-02 Text abstract generation method and system for financial field

Publications (1)

Publication Number Publication Date
CN113821635A true CN113821635A (en) 2021-12-21

Family

ID=78924199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110879065.8A Withdrawn CN113821635A (en) 2021-08-02 2021-08-02 Text abstract generation method and system for financial field

Country Status (1)

Country Link
CN (1) CN113821635A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114398478A (en) * 2022-01-17 2022-04-26 重庆邮电大学 Generating type automatic abstracting method based on BERT and external knowledge
CN115062596A (en) * 2022-06-07 2022-09-16 南京信息工程大学 Method and device for generating weather report, electronic equipment and storage medium
CN115795028A (en) * 2023-02-09 2023-03-14 山东政通科技发展有限公司 Intelligent document generation method and system
CN116595164A (en) * 2023-07-17 2023-08-15 浪潮通用软件有限公司 Method, system, equipment and storage medium for generating bill abstract information
CN116595164B (en) * 2023-07-17 2023-10-31 浪潮通用软件有限公司 Method, system, equipment and storage medium for generating bill abstract information
CN117313892A (en) * 2023-09-26 2023-12-29 上海悦普网络科技有限公司 Training device and method for text processing model

Similar Documents

Publication Publication Date Title
CN108733792B (en) Entity relation extraction method
CN113821635A (en) Text abstract generation method and system for financial field
Gasmi et al. LSTM recurrent neural networks for cybersecurity named entity recognition
CN110427623A (en) Semi-structured document Knowledge Extraction Method, device, electronic equipment and storage medium
CN111222318B (en) Trigger word recognition method based on double-channel bidirectional LSTM-CRF network
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN110532395B (en) Semantic embedding-based word vector improvement model establishing method
CN114896388A (en) Hierarchical multi-label text classification method based on mixed attention
CN110852089B (en) Operation and maintenance project management method based on intelligent word segmentation and deep learning
Jian et al. [Retracted] LSTM‐Based Attentional Embedding for English Machine Translation
CN111209362A (en) Address data analysis method based on deep learning
CN113255366A (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN115687609A (en) Zero sample relation extraction method based on Prompt multi-template fusion
CN115422939A (en) Fine-grained commodity named entity identification method based on big data
CN114239584A (en) Named entity identification method based on self-supervision learning
Manias et al. An evaluation of neural machine translation and pre-trained word embeddings in multilingual neural sentiment analysis
CN112231476A (en) Improved graph neural network scientific and technical literature big data classification method
CN113051886B (en) Test question duplicate checking method, device, storage medium and equipment
CN115017260A (en) Keyword generation method based on subtopic modeling
CN110633363B (en) Text entity recommendation method based on NLP and fuzzy multi-criterion decision
CN114328894A (en) Document processing method, document processing device, electronic equipment and medium
CN114595324A (en) Method, device, terminal and non-transitory storage medium for power grid service data domain division
Sejwal et al. Sentiment Analysis Using Hybrid CNN-LSTM Approach
Zhu English lexical analysis system of machine translation based on simple recurrent neural network
Wu et al. A Text Emotion Analysis Method Using the Dual‐Channel Convolution Neural Network in Social Networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
WW01 Invention patent application withdrawn after publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20211221