CN113268586A

CN113268586A - Text abstract generation method, device, equipment and storage medium

Info

Publication number: CN113268586A
Application number: CN202110559023.6A
Authority: CN
Inventors: 于凤英; 王健宗
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2021-05-21
Filing date: 2021-05-21
Publication date: 2021-08-17
Also published as: WO2022241950A1

Abstract

The application relates to the technical field of artificial intelligence, and particularly discloses a text abstract generating method, a device, equipment and a storage medium, wherein the method comprises the following steps: acquiring a text to be processed, inputting the text to be processed into a selector in the abstract generation model for keyword extraction, thereby obtaining a keyword sequence; inputting the keyword sequence into a first encoder in the abstract generation model for encoding processing to obtain a first vector; inputting the text to be processed into a second encoder in the abstract generation model for encoding processing to obtain a second vector; obtaining a hidden vector according to the first vector and the second vector; and inputting the hidden vector into a decoder in the abstract generation model for decoding to obtain a target abstract. Because important key words in the text to be processed are preferentially extracted, the model can more effectively identify the key information of the text to be processed, and the accuracy of the generated target abstract is ensured.

Description

Text abstract generation method, device, equipment and storage medium

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for generating a text abstract.

Background

Currently, in text summarization tasks, the sequence-to-sequence (seq2seq) model is a commonly used summarization model. It uses two recurrent neural networks to convert one language sequence into another, and it accommodates input and output sequences of different lengths.

However, in fields with many specialized terms, such as the generation of medical image reports, the summary generated by a seq2seq model is often hard to read. For example, when a specialized term is segmented incorrectly, its meaning changes and the resulting abstract becomes ambiguous, which conveys incorrect information to the reader.

Disclosure of Invention

The application provides a text abstract generating method, device, equipment and storage medium, which aim to solve the problem that the abstract generated by current abstract generation models is not accurate enough when the input text contains many specialized terms.

In order to achieve the above object, the present application provides a text summary generating method, including:

acquiring a text to be processed, inputting the text to be processed into a selector in the abstract generation model for keyword extraction, thereby obtaining a keyword sequence;

inputting the keyword sequence into a first encoder in the abstract generation model for encoding processing to obtain a first vector;

inputting the text to be processed into a second encoder in the abstract generation model for encoding processing to obtain a second vector;

obtaining a hidden vector according to the first vector and the second vector;

and inputting the hidden vector into a decoder in the abstract generation model for decoding to obtain a target abstract.

In order to achieve the above object, the present application further provides a text summary generating device, including:

the keyword extraction module is used for acquiring a text to be processed and inputting the text to be processed into a selector in the abstract generation model for keyword extraction so as to obtain a keyword sequence;

the first vector acquisition module is used for inputting the keyword sequence into a first encoder in the abstract generation model for encoding processing to obtain a first vector;

the second vector acquisition module is used for inputting the text to be processed into a second encoder in the abstract generation model for encoding processing to obtain a second vector;

the hidden vector acquisition module is used for obtaining a hidden vector according to the first vector and the second vector;

and the output module is used for inputting the hidden vector into a decoder in the abstract generation model for decoding to obtain the target abstract.

In addition, to achieve the above object, the present application also provides a computer device comprising a memory and a processor; the memory is used for storing a computer program; the processor is used for executing the computer program and realizing the text abstract generating method provided by the embodiment of the application when the computer program is executed.

In addition, to achieve the above object, the present application further provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, causes the processor to implement any one of the text abstract generating methods provided by the embodiments of the present application.

According to the text abstract generating method, the text abstract generating device, the text abstract generating equipment and the storage medium, the keyword sequence is obtained by extracting the keywords of the text to be processed, the first vector is obtained according to the keyword sequence, the second vector is obtained according to the text to be processed, and the hidden vector is obtained according to the first vector and the second vector, so that a decoder can obtain the target abstract according to the hidden vector. Because the hidden vector can represent the independent semantics of each keyword and the context semantics of the vocabulary in the text to be processed, the target abstract generated according to the hidden vector not only can embody the summary of the text to be processed, but also can highlight the importance of the keywords in the target abstract. In addition, the keywords in the text to be processed are preferentially extracted, so that the model can more effectively identify the key information of the text to be processed.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic block diagram of a summary generation model provided by an embodiment of the present application;

FIG. 2 is a flowchart of a text summary generation method according to an embodiment of the present application;

FIG. 3 is a schematic block diagram of a text summary generation apparatus provided in an embodiment of the present application;

FIG. 4 is a schematic block diagram of a computer device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation. In addition, although the division of the functional blocks is made in the device diagram, in some cases, it may be divided in blocks different from those in the device diagram.

The term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

In text summarization tasks, the sequence-to-sequence (seq2seq) network is a mature model that can generate readable summaries. However, it cannot effectively identify the key vocabulary of the source article, which degrades the readability of the generated summary.

Therefore, the application provides a text abstract generating method, a text abstract generating device, computer equipment and a storage medium to solve the above problems. The method can be applied to scenarios with many specialized terms, such as medicine, finance and law. The application scenarios are of course not limited to these, and the scheme can be applied wherever texts need such processing in practice. The method identifies the specialized terms in the text to be processed, extracts the important ones, and inputs the extracted terms together with the text to be processed into a preset seq2seq model for abstract generation. The extracted terms are written into the abstract by direct copying, which reduces errors caused by misrecognized specialized terms and makes the generated abstract easier to read.

Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.

Referring to fig. 1, fig. 1 is a schematic structural diagram of a summary generation model according to an embodiment of the present disclosure.

As shown in fig. 1, the abstract generation model 100 includes a selector 101, a first encoder 102, a second encoder 103, a filter layer 104, and a decoder 105. The selector 101 is configured to extract keywords from a text to be processed to obtain a keyword sequence. The first encoder 102 is configured to encode the keyword sequence to obtain a corresponding first vector. The second encoder 103 is configured to encode the text to be processed to obtain a corresponding second vector. The filter layer 104 is configured to obtain a hidden vector according to the first vector and the second vector. The decoder 105 is configured to obtain a target abstract according to the hidden vector.
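The data flow between the five components can be sketched as follows. This is only an illustrative wiring, not the implementation of the embodiment: the class name `AbstractGenerationModel`, the type alias `Vector`, and the callable signatures are assumptions introduced here, and each component is a placeholder to be supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]  # hypothetical stand-in for a word vector


@dataclass
class AbstractGenerationModel:
    """Wires together the selector, two encoders, filter layer and decoder.

    Every field is a placeholder callable; the real components are neural
    networks whose internals are not modelled here.
    """
    selector: Callable[[str], List[str]]                # text -> keyword sequence
    first_encoder: Callable[[List[str]], List[Vector]]  # keywords -> first vector
    second_encoder: Callable[[str], List[Vector]]       # text -> second vector
    filter_layer: Callable[[List[Vector], List[Vector]], List[Vector]]
    decoder: Callable[[List[Vector]], str]              # hidden vector -> abstract

    def generate(self, text: str) -> str:
        keywords = self.selector(text)             # keyword extraction
        first = self.first_encoder(keywords)       # encode keyword sequence
        second = self.second_encoder(text)         # encode text to be processed
        hidden = self.filter_layer(first, second)  # combine into hidden vector
        return self.decoder(hidden)                # decode to target abstract
```

Keeping the components as independent callables mirrors the statement that the two encoders are separate and can be trained separately.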

In some embodiments, the selector 101 is obtained by embedding a BERT (Bidirectional Encoder Representations from Transformers) model into a Bi-LSTM (Bi-directional Long Short-Term Memory) model and training the result with a soft-max activation function and a maximum likelihood estimation algorithm. The BERT model uses the Transformer as its main framework and comprises three embedding layers: Token embedding, Segment embedding and Position embedding. Token embedding converts input words into vectors of fixed dimension; Segment embedding marks the two samples in a spliced text so as to distinguish them; Position embedding encodes the order of characters in the spliced text. The BERT model thus builds semantic vectors for the spliced text based on the information in the three embedding layers. The soft-max activation function maps each input word to a value in the interval (0, 1), which is the probability that the word is retained. The maximum likelihood estimation algorithm uses the known sample results to infer the model parameter values most likely to have produced those results.

The selector 101 may extract words from an unknown input according to a dictionary obtained during training of the abstract generation model 100, predict the probability that each word is retained in the abstract, and select the words that satisfy a preset probability value as the keyword sequence.

The keywords extracted by the selector 101 are words that can be directly copied to the target abstract, and the correctness of the keywords appearing in the target abstract is ensured by directly copying the selected keywords to the target abstract.

After a keyword sequence corresponding to a text to be processed is obtained through the selector 101, the keyword sequence is input to a preset seq2seq model for processing, so that the preset seq2seq model outputs a target abstract in an iterative manner.

The seq2seq model is used when the output length is uncertain. It has an encoder-decoder structure formed by two recurrent neural networks, one serving as the encoding layer and the other as the decoding layer. The encoding layer compresses an input sequence into a vector of a specified length, which can be regarded as the semantics of the sequence; this process is called encoding. The decoding layer generates the specified sequence from the semantic vector; this process is called decoding.

As shown in fig. 1, the preset seq2seq model adopted in the embodiment of the present application is provided with two independent encoders at an encoding layer, and the two independent encoders are respectively trained in a training phase. It comprises a first encoder 102 for semantically encoding the keyword sequence and a second encoder 103 for semantically encoding the text to be processed. By respectively training the two encoders, the keyword sequence and the encoding information corresponding to the text to be processed can be better obtained, so that the accuracy of the context information of the corresponding keywords is ensured while the keywords are more prominent in the generated abstract. Meanwhile, a filter layer 104 is arranged between the encoding layer and the decoding layer, and the filter layer 104 is used for splicing according to the vectors output by the first encoder 102 and the second encoder 103 and controlling the information flow from the encoding stage to the decoding stage so as to achieve the effect of selecting the core information. Decoder 105 then decodes from the output of filter layer 104 to obtain an output sequence.

In the embodiment of the present application, a training process of the abstract generation model will be described in detail, where the specific training process includes:

Step A: acquiring the sample text and the sample abstract of the sample text.

Optionally, the sample text and its sample abstract may be text data crawled from the network in advance by means of a crawler or the like and then manually cleaned, so that they can be used to train the abstract generation model.

It can be understood that, in different text application scenarios, the sample text has different contents, and the content of the sample text is not specifically limited in the embodiments of the present application. Optionally, the number of the sample texts is one or more, and a plurality of sample texts may constitute one sample text set.

After each sample text is obtained, a user labels it and marks its sample abstract, which serves as the real abstract information of that sample text. The user also marks the keywords appearing in the sample text and the corresponding sample abstract, where keywords are words that represent key information in the text. For example, if the sample text includes sentence a, "the heart is one of the most important parts in the human body", the user adds identifiers such as [ CLSB ] and [ CLSE ] to it, and sentence a becomes "[ CLSB ] the heart [ CLSE ] is one of the most important parts in the human body", where the identifier [ CLSB ] indicates the starting position of a keyword and the identifier [ CLSE ] indicates its ending position.
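As an illustration of how such labeled sentences might be consumed, the following sketch recovers the marked keywords and the unmarked sentence. The function name `extract_marked_keywords` and the regular expression are assumptions made for this example; the text does not specify how the identifiers are parsed.

```python
import re
from typing import List, Tuple

# Tolerates optional spaces inside the brackets, e.g. "[ CLSB ]" or "[CLSB]".
_MARKED = re.compile(r"\[\s*CLSB\s*\]\s*(.*?)\s*\[\s*CLSE\s*\]")


def extract_marked_keywords(sentence: str) -> Tuple[str, List[str]]:
    """Return (sentence with identifiers removed, list of marked keywords)."""
    keywords = _MARKED.findall(sentence)
    # Replace each marked span by the keyword it encloses.
    plain = _MARKED.sub(lambda m: m.group(1), sentence)
    return plain, keywords


labeled = ("[ CLSB ] the heart [ CLSE ] is one of the most important "
           "parts in the human body")
plain, keywords = extract_marked_keywords(labeled)
```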

Step B: inputting the sample text into an initial abstract generation model, and acquiring a prediction abstract of the sample text through the initial abstract generation model.

The process of obtaining the prediction summary may refer to a specific implementation process of obtaining the target summary in the following embodiments, which is not described herein again.

Step C: determining a loss function value of the training process based on the sample abstract of the sample text and the prediction abstract.

In some embodiments, the loss function value of any training iteration includes a sentence loss value, which represents the error between the prediction summary and the important information of the sample text, and a generation loss value, which represents the error between the prediction summary and the sample summary.

Illustratively, the sentence loss value includes a current sentence loss value and a previous sentence loss value. The current sentence loss value represents the error between the keywords contained in the a-th sentence of the prediction summary and the keywords contained in the a-th sentence of the sample summary. The previous sentence loss value represents the error between the keywords contained in the a-th sentence of the prediction summary and the keywords contained in the first a-1 sentences of the sample summary.

During training, the current sentence loss value encourages the model to focus on the corresponding keywords when generating the current abstract sentence, and the previous sentence loss value encourages the model to take into account the keywords contained in the previous abstract sentences. This reduces the error between the current abstract sentence and the sentence at the corresponding position of the sample abstract, and avoids repeating the semantics of previous sentences.

For example, the generation loss value may be computed with a negative log-likelihood loss function, a cross-entropy loss function, or another common loss function to evaluate the degree of inconsistency between the prediction summary and the sample summary.
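A minimal sketch of a negative log-likelihood loss is given below. It assumes that the model has already assigned a probability to each token of the sample summary; the function name and the averaging over tokens are choices made for this example only.

```python
import math
from typing import Sequence


def negative_log_likelihood(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the reference tokens.

    token_probs[i] is the probability the model assigned to the i-th token
    of the sample summary; each value must lie in (0, 1].
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)
```

The loss is zero when every reference token receives probability 1 and grows as the model assigns lower probability to the reference tokens.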

Step D: when the loss function value meets the condition for stopping training, stopping the training to obtain the trained abstract generation model.

In the process, if the loss function value is smaller than the loss threshold value or the iteration times are larger than the target times, the condition of stopping training is determined to be met, the training is stopped, and the initial abstract generation model of the last iteration process is determined as the abstract generation model. Otherwise, if the loss function value is larger than or equal to the loss threshold value and the iteration number is smaller than or equal to the target number, determining that the loss function value does not meet the training stopping condition, adjusting the parameters of the initial abstract generating model based on the back propagation algorithm, putting the initial abstract generating model after the parameters are adjusted into the next iteration process, and iteratively executing the training step.
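The stopping condition can be sketched as follows. The names `should_stop`, `train` and `train_step` are hypothetical; in particular, `train_step` stands in for one iteration of forward computation and back-propagation and is assumed to return the loss function value.

```python
from typing import Callable


def should_stop(loss_value: float, iteration: int,
                loss_threshold: float, target_iterations: int) -> bool:
    """Stop when the loss is below the threshold or the iteration count
    exceeds the target number."""
    return loss_value < loss_threshold or iteration > target_iterations


def train(train_step: Callable[[], float],
          loss_threshold: float, target_iterations: int) -> int:
    """Repeat train_step until should_stop holds; return iterations run.

    train_step is a hypothetical callable that performs one training
    iteration (including the parameter update) and returns the loss value.
    """
    iteration = 0
    while True:
        iteration += 1
        loss_value = train_step()
        if should_stop(loss_value, iteration, loss_threshold, target_iterations):
            return iteration
```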

During training of the abstract generation model 100, a preset dictionary is established from the vocabulary labeled in the sample texts and sample abstracts, and the dictionary includes at least a lexicon corresponding to the keywords. For example, when the training data are medical reports, the dictionary may include a disease lexicon, a drug lexicon, a relation lexicon, and the like. The lexicons of the various vocabulary types are further divided into keywords and non-keywords; for example, the disease lexicon and the drug lexicon are keywords, and the relation lexicon is non-keywords. The specific ranges of keywords and non-keywords may be set according to the actual situation. For example, when the generated target abstract needs to highlight the disease vocabulary in a medical report, the disease lexicon may be classified as keywords and the drug lexicon and relation lexicon as non-keywords.
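One possible in-memory layout of such a dictionary is sketched below. The `Lexicon` class, the function `keyword_words`, and the sample entries are all invented for illustration; the text does not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class Lexicon:
    words: FrozenSet[str]
    is_keyword: bool  # whether this lexicon counts as keywords


def keyword_words(dictionary: Dict[str, Lexicon]) -> Set[str]:
    """Collect the words of every lexicon classified as keywords."""
    result: Set[str] = set()
    for lexicon in dictionary.values():
        if lexicon.is_keyword:
            result |= lexicon.words
    return result


# Hypothetical entries for a medical-report dictionary.
medical_dictionary = {
    "disease": Lexicon(frozenset({"pneumonia", "fracture"}), is_keyword=True),
    "drug": Lexicon(frozenset({"aspirin"}), is_keyword=True),
    "relation": Lexicon(frozenset({"caused by"}), is_keyword=False),
}
```

Reclassifying a lexicon only requires changing its `is_keyword` flag, which corresponds to setting the keyword range according to the actual situation.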

The embodiments of the present application are further described below based on a summary generation model.

Referring to fig. 2, fig. 2 is a flowchart illustrating a text summary generating method according to an embodiment of the present application. As shown in fig. 2, the method comprises the following steps S101-S105.

S101, obtaining a text to be processed, inputting the text to be processed into a selector in the abstract generation model for keyword extraction, and obtaining a keyword sequence.

The selector performs vocabulary recognition on the text to be processed, takes the recognized words as candidate words, calculates the retention probability of each candidate word, and screens out the keyword sequence according to the retention probability. Here, retention means that a word is copied directly into the target abstract. By preferentially extracting the keywords that can be copied directly to the target abstract, the key information of the text to be processed is identified, and the correctness of the key information in the target abstract is ensured.

In some embodiments, inputting the text to be processed into the selector in the abstract generation model for keyword extraction to obtain the keyword sequence includes:

preprocessing a text to be processed to obtain a vocabulary set;

calling a pre-configured dictionary to identify the vocabulary set to obtain a candidate word set;

and calculating the retention probability corresponding to each candidate word in the candidate word set, and screening the candidate word set according to the retention probability to obtain a keyword sequence.

Specifically, the text to be processed is processed in a preset preprocessing mode to obtain a vocabulary set. The words in the vocabulary set that may be copied to the target abstract are then identified according to the dictionary to obtain a candidate word set, the retention probability of each candidate word is calculated, and the candidate words that satisfy a preset probability are selected to obtain the keyword sequence.

In some embodiments, preprocessing the text to be processed to obtain a vocabulary set, includes:

extracting a plurality of continuous characters in a text to be processed to form a vocabulary preparation unit;

and performing duplicate removal processing and stop word removal processing on the vocabulary preparation unit to obtain a vocabulary set.

Recognizing a plurality of continuous characters in the text to be processed amounts to performing word segmentation on it to obtain the vocabulary preparation unit; useless words in the vocabulary preparation unit are then filtered out to obtain the vocabulary set.
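The duplicate removal and stop word removal can be sketched as below, assuming the text has already been segmented into tokens. The function name `build_vocabulary_set` and the choice to preserve first-occurrence order are assumptions of this example.

```python
from typing import Iterable, List, Set


def build_vocabulary_set(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Remove stop words and duplicates, keeping first-occurrence order."""
    seen: Set[str] = set()
    vocabulary: List[str] = []
    for token in tokens:
        if token in stop_words or token in seen:
            continue  # skip useless or already collected words
        seen.add(token)
        vocabulary.append(token)
    return vocabulary
```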

Further, the preset word segmentation modes include, but are not limited to, a third-party word segmentation tool or a word segmentation algorithm.

Common third-party word segmentation tools include, but are not limited to, the Stanford NLP word segmenter, the ICTCLAS word segmentation system, the ansj word segmentation tool, and the HanLP Chinese word segmentation tool.

Word segmentation algorithms include, but are not limited to, the forward Maximum Matching (MM) algorithm, the Reverse Maximum Matching (RMM) algorithm, the Bi-directional Maximum Matching (BM) algorithm, the Hidden Markov Model (HMM), and the N-gram model.
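As an example of one of the listed algorithms, forward maximum matching can be sketched as follows. The function name, the `max_len` parameter and the fallback to single characters are assumptions of this sketch, and the sample lexicon is invented.

```python
from typing import List, Set


def forward_maximum_matching(text: str, lexicon: Set[str],
                             max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation that prefers the longest match."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens
```

Because the longest lexicon entry wins at each position, a specialized term present in the lexicon is kept whole rather than split into shorter words.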

Extracting words from the text to be processed by word segmentation filters out some meaningless words and helps limit the range of key information extraction to the obtained vocabulary set.

After the text to be processed is preprocessed into a vocabulary set, the selector identifies the candidate words contained in the vocabulary set according to the dictionary. The dictionary includes a preset keyword lexicon containing a plurality of preset keywords defined as key information; the preset keywords are the words marked by the user in the sample texts and sample abstracts when the abstract generation model was trained. Each word in the vocabulary set is compared against the preset keyword lexicon, and the words that match preset keywords are selected as candidate words to obtain the candidate word set.

The retention probability of each candidate word in the candidate word set is then calculated; the greater the retention probability, the more likely the corresponding word is to appear in the target abstract.

The specific steps for obtaining the retention probability include:

generating a feature vector corresponding to each candidate word based on a BERT model and a Bi-LSTM model:

The BERT model uses an auto-encoder to reconstruct each word from its context and obtain a vector representation of each word, so the semantic vector that the BERT model outputs for a candidate word reflects the context information of that word. The word vectors are then used as the input of the Bi-LSTM model. The output of the Bi-LSTM model is passed through a nonlinear activation layer to obtain context feature vectors with memory advantage, which are the feature vectors corresponding to the candidate words.

Then, the word vectors are input into the fully connected layer, and the prediction probability of each word is calculated through the soft-max function to obtain the corresponding retention probability y_i. The probability calculation formula is as follows:

y_i = softmax(W_n + b_n)

where W_n is a weight value, b_n is a bias value, and i denotes the i-th word in the candidate word set. The soft-max function maps the inputs to values in (0, 1) whose sum is 1, so the mapped values can be taken as probability values. The specific values of W_n and b_n are the final parameter values obtained after the corresponding model parameters have been adjusted continuously during training until the abstract generation model converges.

Each candidate word is then labeled with its probability value, giving the retention probability of each candidate word.
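The soft-max mapping itself can be sketched as follows. The input scores stand for the outputs of the fully connected layer and are assumed values; only the standard soft-max definition is shown, not the trained layer.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Map raw scores to probabilities in (0, 1) that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical fully-connected-layer outputs for three candidate words.
probs = softmax([2.0, 1.0, 0.1])
```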

The candidate words are then filtered according to the retention probability to obtain the keyword sequence to be retained. For example, the candidate words may be ranked by retention probability and the top N candidate words extracted as keywords to obtain the keyword sequence. Alternatively, a preset probability value may be set; candidate words whose retention probability is greater than the preset probability value are retained, and candidate words whose retention probability is smaller are filtered out, to obtain the keyword sequence.
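Both filtering strategies can be sketched as follows. The function names and the sample retention probabilities are hypothetical.

```python
from typing import Dict, List


def select_top_n(retention: Dict[str, float], n: int) -> List[str]:
    """Rank candidate words by retention probability and keep the top n."""
    ranked = sorted(retention, key=lambda word: retention[word], reverse=True)
    return ranked[:n]


def select_above_threshold(retention: Dict[str, float],
                           threshold: float) -> List[str]:
    """Keep candidate words whose retention probability exceeds the threshold."""
    return [word for word, p in retention.items() if p > threshold]


# Hypothetical retention probabilities for three candidate words.
retention = {"heart": 0.6, "valve": 0.3, "normal": 0.1}
```

The top-N rule fixes the length of the keyword sequence, whereas the threshold rule lets the length vary with the text.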

S102, inputting the keyword sequence into a first encoder in the abstract generating model for encoding processing to obtain a first vector.

After the keyword sequence is input into the first encoder for processing, the keywords in the sequence are mapped into a matrix vector and a corresponding word vector sequence is generated to obtain the first vector. The first vector consists of l word vectors, where l is the number of keywords in the keyword sequence and the i-th element is the word vector corresponding to the i-th keyword in the keyword sequence.

In some embodiments, when the first encoder obtains the feature of each keyword in the keyword sequence, the first encoder may obtain the feature of the keyword by performing an embedding process on the keyword. In other words, the first encoder inputs the original text information (i.e., the character sequence of the keyword) to the character embedding layer, performs embedding processing on each character in the keyword based on the character embedding layer, to obtain a vector of each character, and determines a vector sequence formed by the vector of each character as a feature of the keyword, that is, to obtain a word vector corresponding to the keyword.

In some embodiments, a one-hot (one-hot) method may also be used to obtain a word vector corresponding to the keyword, and the obtaining method is not specifically limited in this embodiment of the application.
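A one-hot word vector can be sketched as follows, assuming a fixed, ordered vocabulary; the function name and the sample vocabulary are invented for this example.

```python
from typing import List


def one_hot(word: str, vocabulary: List[str]) -> List[int]:
    """Return a vector with a single 1 at the word's index in the vocabulary."""
    index = vocabulary.index(word)  # raises ValueError for unknown words
    return [1 if i == index else 0 for i in range(len(vocabulary))]
```

The vector length equals the vocabulary size, which makes one-hot vectors sparse compared with the dense vectors produced by an embedding layer.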

In the process, based on the embedding processing technology, the characters in the character sequence can be converted into a vector form which is easy to understand by a computer from a natural language, namely, the characters are compressed to the embedding space from the sparse space, and a dense vector is obtained, so that the subsequent encoding process is facilitated, the encoding efficiency is improved, and the calculated amount during encoding is reduced.

S103, inputting the text to be processed into a second encoder in the abstract generation model for encoding to obtain a second vector.

After the text to be processed is input into the second encoder for processing, each word of the corresponding vocabulary set in the text is mapped into a matrix vector, and a corresponding word vector sequence is generated to obtain a second vector h = {h₁, h₂, ..., h_n}, where n is the number of words in the vocabulary set and h_i is the word vector corresponding to the i-th word in the vocabulary set.

In some embodiments, when the second encoder obtains the feature of each vocabulary in the vocabulary set, the second encoder may perform an embedding process on the index information of the at least one vocabulary to obtain the vocabulary feature, and the index information is used to indicate the position of the at least one vocabulary in the text to be processed. In other words, the second encoder inputs the character sequence of the text to be processed into the sentence embedding layer, and the word vectors of the words are obtained by respectively embedding the index information of the words where the characters are located in the character sequence based on the sentence embedding layer.

In this process, based on the embedding processing technology, the index information of each word in the text to be processed can be converted from natural language into a word vector form that is easy for a computer to process. This prevents sentence-level alignment information, that is, which sentence of the text to be processed the current word belongs to, from being forgotten in the subsequent encoding and decoding process. It also improves the expressive power of the encoded vocabulary features and global features, and improves the accuracy of information acquisition.

In some embodiments, one-hot encoding may also be used to obtain the word vector corresponding to each word in the vocabulary set; the method of obtaining the word vectors is not specifically limited in the embodiments of the present application.
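
As an illustration of the one-hot alternative mentioned above, the following minimal sketch maps each word of a small vocabulary set to a vector with a single 1 at the word's index. The function name `one_hot_vectors` and the sample words are assumptions made for this example and do not come from the application.

```python
def one_hot_vectors(vocabulary):
    """Map each word of an ordered vocabulary to a one-hot vector."""
    size = len(vocabulary)
    vectors = {}
    for index, word in enumerate(vocabulary):
        vec = [0] * size
        vec[index] = 1  # the only non-zero entry marks the word's index
        vectors[word] = vec
    return vectors


# Hypothetical vocabulary set used only for illustration.
vocabulary = ["drug A", "1000mg", "twice", "oral"]
vectors = one_hot_vectors(vocabulary)
```

Unlike the dense embedding described earlier, the length of each one-hot vector grows with the size of the vocabulary set.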

In some embodiments, during the encoding process of the second encoder, each word in the set of words is input to at least one neuron in the encoding layer, and each word in the set of words is forward-encoded and backward-encoded by the at least one neuron, respectively, to obtain a word vector.
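
The forward and reverse passes can be sketched as follows. This is a toy illustration only: the recurrent cell is a hypothetical `tanh` update standing in for the neurons of the encoding layer, and the names `run_direction` and `bidirectional_encode` are assumptions made for this example.

```python
import math


def run_direction(inputs, reverse=False):
    """Run a toy recurrent cell over the sequence in one direction."""
    positions = range(len(inputs))
    order = reversed(positions) if reverse else positions
    state = [0.0] * len(inputs[0])
    states = [None] * len(inputs)
    for i in order:
        # Hypothetical cell: mixes the previous state with the current input.
        state = [math.tanh(0.5 * s + x) for s, x in zip(state, inputs[i])]
        states[i] = state
    return states


def bidirectional_encode(inputs):
    """Concatenate the forward and backward state of every position."""
    forward = run_direction(inputs)
    backward = run_direction(inputs, reverse=True)
    return [f + b for f, b in zip(forward, backward)]


# Hypothetical 2-dimensional embeddings of three words.
embeddings = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
encoded = bidirectional_encode(embeddings)
```

Each encoded word has twice the dimension of its embedding: the first half depends on the words before it, the second half on the words after it.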

In this process, each word in the vocabulary set is bidirectionally encoded: the preceding context is fully considered in the forward encoding process, and the following context is fully considered in the reverse encoding process, so that the encoding incorporates context information from both directions and the accuracy of the vocabulary features is improved. In some embodiments, each word in the vocabulary set may also be forward encoded only, so as to reduce the amount of computation in the encoding process and save computing resources.

And S104, obtaining a hidden vector according to the first vector and the second vector.

The filter layer splices and filters the first vector and the second vector obtained by the encoding layer to obtain the hidden vector. The filter layer filters the output of each encoder by performing global encoding, so as to redefine the representation of each time step in light of the global context and to control the flow of information from the encoding stage to the decoding stage, thereby selecting the core information.

In some embodiments, obtaining the hidden vector according to the first vector and the second vector comprises:

determining a plurality of dimensions of the first vector and a scalar corresponding to the dimensions;

determining a plurality of dimensions of the second vector and a scalar corresponding to the dimensions;

superposing each dimension of the first vector and each dimension of the second vector, and filling scalars of corresponding dimensions in the superposed dimensions to obtain a spliced vector;

and filtering the spliced vector according to the activation function to obtain a hidden vector.

Specifically, a plurality of dimensions of the first vector and the scalars respectively corresponding to each dimension are determined, and a plurality of dimensions of the second vector and the scalars respectively corresponding to each dimension are determined. Each dimension of the first vector and each dimension of the second vector are then superposed, and the scalars corresponding to the dimensions are filled into the superposed dimensions to obtain the corresponding spliced vector. In this way, when the spliced vector is decoded, each word in the text is represented by integrating multiple dimensions, so that the representation captures both the individual semantics of the word and the meaning of its surrounding context.

Because the expressive power of a linear model is insufficient, an activation function is used to add a nonlinear factor, which enhances the expressive power and learning ability of the network so that the network can approximate any function. The specific calculation formula is as follows:

$h'_i = h_i \odot F_i$

Here, $F_i$ is the expression matrix obtained by applying $\sigma$, the sigmoid activation function, to the splice of the received first vector $h^o$ and the second vector $h$; $W_h$ is the weight vector and $b$ is the bias term. During training, the weight vector is adjusted according to the actual output and the expected output, so as to obtain the weight vector $W_h$ that makes the actual output consistent with the expected output; the bias term $b$ is a parameter that is adjusted to better achieve this goal.

The sigmoid activation function outputs a vector, the value of each dimension of the vector is between 0 and 1, if the value is close to 0, most information on the corresponding dimension of the source representation is deleted, and if the value is close to 1, most information is reserved, so that the selection of the core information is realized, and the hidden vector h' is obtained.
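
The gating step can be sketched as below. The expression for the gate is not reproduced in this text, so the form $F_i = \sigma(W_h [h_i; h^o] + b)$ used here is an assumption chosen to agree with the symbol definitions above; treating $h^o$ as a single vector, the function name `filter_gate`, and all numeric values are likewise assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def filter_gate(h_i, h_o, W_h, b):
    """Return (h'_i, F_i) with the assumed gate F_i = sigmoid(W_h [h_i; h_o] + b)."""
    spliced = h_i + h_o  # list concatenation plays the role of splicing
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, spliced)) + bias)
        for row, bias in zip(W_h, b)
    ]
    # Element-wise product: gate values near 0 delete information, near 1 keep it.
    hidden = [g * x for g, x in zip(gate, h_i)]
    return hidden, gate


# Hypothetical values: one word vector, one keyword vector, one weight row per gate dimension.
h_i = [1.0, -2.0]
h_o = [0.5]
W_h = [[5.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
b = [0.0, 0.0]
hidden, gate = filter_gate(h_i, h_o, W_h, b)
```

With these values the first dimension of the word vector is almost fully retained and the second is almost fully suppressed, which is the selection behavior described above.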

And S105, inputting the hidden vector into a decoder in the abstract generation model for decoding to obtain a target abstract.

The hidden vector is input into the decoder as the initial state of the decoder, so that the decoder iteratively outputs the target abstract. The decoder uses an LSTM network to output the abstract iteratively. Each time step has three inputs: the output of the previous time step, the context vector, and the decoding state of the previous time step. The output with the maximum probability is calculated from these three inputs and used as the output of the current time step. The output of the current time step is then used as an input of the next time step, forming a recursive process that continues until generation of the target abstract is finished. The context vector is derived from the attention distribution, which is the distribution of probability values (i.e. attention weights) over all words in the text to be processed.

In particular, the attention distribution characterizes the importance of each piece of information and identifies the information that requires the most attention. By obtaining the target area that needs focused attention, namely the focus of attention, and then devoting more attention resources to that area, more detailed information about the target can be obtained while other useless information is suppressed. The attention distribution determines which position of the text to be processed should be used to generate the next output word of the abstract; segmented words with high probability values in the text contribute more to the word to be generated at the decoder's current output time.

In some embodiments, the target abstract is determined over a plurality of time steps, and inputting the hidden vector into the decoder to obtain the target abstract comprises:

acquiring a current decoding state;

calculating a context vector according to the hidden vector;

calculating a partial abstract output at the current time step according to the current decoding state and the context vector;

and traversing all the time steps, and obtaining the target abstract according to the partial abstract output by each time step.

And the decoder calculates the output of the current time step according to the current decoding state and the context vector corresponding to the hidden vector. The output of the current time step t may be directly copied from the text to be processed, or may be selected from a pre-configured dictionary, that is, automatically generated vocabulary.

Wherein, when the first output vocabulary of the abstract is output, a specific < start > flag is used as a signal for starting the output abstract so as to decode the current decoding state of the first output vocabulary of the abstract. Where the < start > flag may be 0. When other output words of the abstract are output, the word vector of the output word of the previous time step can be used as input to decode the current decoding state of the output word.
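
The iterative decoding described above can be sketched as a greedy loop. This is a minimal sketch under stated assumptions: `toy_step` is a hypothetical stand-in for the LSTM and attention computation, the `SCRIPT` list exists only so the example produces a fixed result, and the flag strings are illustrative.

```python
START, END = "<start>", "<end>"


def greedy_decode(step, initial_state, max_steps=20):
    """Feed each output back in as the next input until the end flag appears."""
    state, previous, summary = initial_state, START, []
    for _ in range(max_steps):
        state, probabilities = step(state, previous)
        # Take the word with the maximum probability as this step's output.
        previous = max(probabilities, key=probabilities.get)
        if previous == END:
            break
        summary.append(previous)
    return summary


# Hypothetical step function: the state is just a position in a fixed script.
SCRIPT = ["drug A", "twice", "oral", END]


def toy_step(state, previous_word):
    target = SCRIPT[state]
    probabilities = {w: (0.7 if w == target else 0.1) for w in SCRIPT}
    return state + 1, probabilities


summary = greedy_decode(toy_step, 0)
```

The `max_steps` bound is a safeguard added for the sketch so that the loop terminates even if the end flag is never predicted.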

In some embodiments, obtaining the current decoding state comprises:

acquiring the decoding state of the last time step and a partial abstract output by the last time step;

and calculating the current decoding state according to the decoding state of the last time step and the output of the last time step.

Here, $y_{t-1}$ is the word embedding of the output selected at the previous time step using soft-max, and $S_{t-1}$ is the decoding state calculated at the last time step. For any current decoding state during decoding, the association score between that decoding state and each state of the encoder is calculated. The specific calculation formula of the current decoding state is as follows:

$S_t = \mathrm{LSTM}(S_{t-1}, y_{t-1})$

where $S_t$ is the decoding state calculated at the current time step.
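
For readers unfamiliar with the LSTM update, the state transition can be sketched with a scalar LSTM cell. The application does not specify the cell's internals, so the standard gate equations below are an assumption; the state is represented as a pair (output state, cell state), and the parameter values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(state, y_prev, params):
    """One scalar LSTM update: returns the new (hidden, cell) state pair."""
    h_prev, c_prev = state
    pre = {}
    for name in "ifog":
        w_y, w_h, bias = params[name]
        pre[name] = w_y * y_prev + w_h * h_prev + bias
    i = sigmoid(pre["i"])    # input gate
    f = sigmoid(pre["f"])    # forget gate
    o = sigmoid(pre["o"])    # output gate
    g = math.tanh(pre["g"])  # candidate cell value
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return (h, c)


# Hypothetical parameters: (input weight, recurrent weight, bias) for each gate.
params = {name: (0.5, 0.5, 0.0) for name in "ifog"}
state = lstm_step((0.0, 0.0), 1.0, params)
```

In a real decoder the embedding and the state are vectors and the weights are matrices; the scalar form only shows how the previous state and previous output combine.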

In some embodiments, computing the context vector from the hidden vector comprises:

calculating attention distribution according to the hidden vector;

a context vector is calculated from the attention distribution.

An attention mechanism is added to the decoder, and the attention distribution is calculated according to the hidden vector $h'$. The specific attention distribution calculation formula is as follows:

$a = \mathrm{softmax}(h'^{T} V s^{T})$

where the soft-max function is a normalized exponential function, i.e. a nonlinear mapping that converts the scores into a probability distribution, T denotes the matrix transposition operation, V is a parameter matrix, and s is the decoding state.

The context vector is then calculated using the corresponding attention weights. The specific context vector calculation formula is as follows:

$C_t = \sum_{i=1}^{n} a_i h'_i$

where $C_t$ is the context vector at time step t, n is the length of the vocabulary set corresponding to the text to be processed, $a_i$ is the attention weight corresponding to the i-th word in the vocabulary set, and $h'_i$ is the hidden vector corresponding to the i-th word in the vocabulary set.
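
A minimal sketch of the two steps above: scoring each hidden vector against the decoding state through the parameter matrix V, normalizing the scores with soft-max, and taking the attention-weighted sum of the hidden vectors as the context vector. The weighted-sum form, the function names, and the numeric values are assumptions made for illustration.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(hidden, V, s):
    """Return (attention weights, context vector) for hidden vectors h'_1..h'_n."""
    Vs = [sum(v * x for v, x in zip(row, s)) for row in V]
    scores = [sum(h * p for h, p in zip(h_i, Vs)) for h_i in hidden]
    weights = softmax(scores)
    dim = len(hidden[0])
    # Context vector: attention-weighted sum of the hidden vectors (assumed form).
    context = [sum(w * h_i[d] for w, h_i in zip(weights, hidden)) for d in range(dim)]
    return weights, context


# Hypothetical hidden vectors for three words, identity V, and a decoding state.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
s = [2.0, 0.0]
weights, context = attention_context(hidden, V, s)
```

Here the first and third words align with the decoding state and receive equal, larger weights, so they dominate the context vector.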

The attention mechanism characterizes different degrees of attention by giving different scores to different words; the scores are passed through a soft-max layer and used to weight all hidden-layer states from the encoding stage, yielding a context vector. The context vector can be understood as a semantic representation of the input sequence: it summarizes the information of the input sequence by encoding the corresponding hidden states. The output at each time step is calculated using soft-max over all possible words.

The attention distribution of the vocabulary set at each output time is determined according to the hidden vector, and the context vector of the text at each output time is calculated according to the attention distribution and the hidden vector corresponding to each time step. The output probability of each word at each of the plurality of time steps is then determined according to the context vector and the current decoding state at that output time. Thus, the output word $y_t$ predicted at the current time step t is the word with the highest predicted probability, and prediction continues until the predicted output word is the < end > flag. For example, the target abstract is $y = \{y_1, y_2, \ldots, y_t, \ldots, y_m\}$, where m is the number of time steps of the output target abstract.

The range of each vocabulary for calculating the output probability comprises a vocabulary set corresponding to the text to be processed and a dictionary generated in the process of training the abstract generation model. Therefore, existing words in the text to be processed and words not in the text to be processed can be utilized, so that the abstract is flexibly generated according to the semantics of the text, the generated target abstract is more in line with the habit of manually summarizing the abstract by a user, and the extraction quality and the content fluency of the abstract are improved.

By preferentially identifying the special vocabulary in the text to be processed and selecting the corresponding special vocabulary as the key word according to the possibility of being copied to the target abstract, the context of the generated target abstract is ensured to be correct, and meanwhile, the key word in the text to be processed is directly copied to the target abstract, so that the error of generating the abstract caused by the error of identifying the special vocabulary is avoided, and the generated target abstract is easier to read.

The present specification further describes the implementation steps of the embodiments of the present application by taking a medical report as an example of the text to be processed.

Illustratively, the medical report input into the abstract generation model is a text A, and the text A is preprocessed to obtain a vocabulary set. If the input medical report is '… drug A 1000 mg divided into two parts for oral administration …', the vocabulary set obtained by preprocessing the input sentence is '… drug A/1000 mg/divided/twice/oral administration …'.

Then, a dictionary is called, and the selector identifies candidate words in the obtained vocabulary set to obtain a candidate word set. Whether a word is a candidate word depends on its classification in the dictionary, and the dictionary is obtained from the labeled training data in the process of training the abstract generation model. For example, the dictionary includes a disease lexicon, a drug lexicon and a relation lexicon, where candidate words are selected according to the drug lexicon, and the remaining words are non-candidate words. The vocabulary set is '… drug A/1000 mg/divided/twice/oral administration …', and matching it against the drug lexicon yields the candidate word set '… drug A …'.

The selection probability of copying each candidate word in the candidate word set into the target abstract is then calculated, and the candidate word set is screened according to the selection probability to obtain a keyword sequence. For example, the selection probability of the candidate word "drug A" is calculated to be 90%, and the preset selection probability is 85%; the candidate word "drug A" is therefore added to the keyword sequence. The keyword sequence O is obtained after the screening is finished.
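
The matching and screening steps of the selector in this example can be sketched as follows. The function name `select_keywords`, the contents of the drug lexicon, and the rule that a probability equal to the preset value is accepted are assumptions made for illustration; the 90% and 85% figures come from the example above.

```python
def select_keywords(vocabulary, drug_lexicon, selection_probability, threshold=0.85):
    """Match the vocabulary set against a lexicon, then screen by selection probability."""
    candidates = [word for word in vocabulary if word in drug_lexicon]
    # Assumption: a candidate is kept when its probability reaches the preset value.
    return [word for word in candidates if selection_probability(word) >= threshold]


# Hypothetical inputs mirroring the medical-report example.
vocabulary = ["drug A", "1000 mg", "divided", "twice", "oral administration"]
drug_lexicon = {"drug A", "drug B"}
scores = {"drug A": 0.90}
keywords = select_keywords(vocabulary, drug_lexicon, lambda word: scores.get(word, 0.0))
```

In the model itself the selection probability is computed by the selector; the lookup table here only stands in for that computation.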

The keyword sequence O is input into the first encoder, which performs word embedding on the text in the keyword sequence O to obtain the first vector $h^o$, and the text to be processed is input into the second encoder to obtain the second vector $h$.

Here, l is the number of keywords in the keyword sequence, $h = \{h_1, h_2, \ldots, h_n\}$, and n is the number of words in the vocabulary set.

The filter layer splices and filters the first vector $h^o$ and the second vector $h$ to obtain the hidden vector $h'_i$. The decoder calculates, from the hidden vector $h'_i$, the maximum-probability output of the current time step, so as to iteratively output the complete target abstract.

Referring to fig. 3, fig. 3 is a schematic block diagram of a text summary generating apparatus according to an embodiment of the present application, where the text summary generating apparatus may be configured in a server or a computer device for executing the text summary generating method.

As shown in fig. 3, the apparatus 500 includes: a keyword extraction module 501, a first vector acquisition module 502, a second vector acquisition module 503, a hidden vector acquisition module 504, and an output module 505.

A keyword extraction module 501, configured to obtain a text to be processed, and input the text to be processed into a selector in the abstract generation model to perform keyword extraction, so as to obtain a keyword sequence;

a first vector obtaining module 502, configured to input the keyword sequence into a first encoder in the abstract generating model for encoding, so as to obtain a first vector;

a second vector obtaining module 503, configured to input the text to be processed into a second encoder in the abstract generating model for encoding, so as to obtain a second vector;

a hidden vector obtaining module 504, configured to obtain a hidden vector according to the first vector and the second vector;

and an output module 505, configured to input the hidden vector to a decoder in the abstract generation model for decoding, so as to obtain a target abstract.

It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the apparatus, the modules and the units described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
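
The data flow between the five modules can be sketched as a simple composition. This is an illustrative sketch only: the class name `SummaryApparatus`, its field names, and the toy lambdas are assumptions made for this example and do not implement the actual selector, encoders, filter layer, or decoder.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SummaryApparatus:
    extract_keywords: Callable  # role of module 501
    encode_keywords: Callable   # role of module 502
    encode_text: Callable       # role of module 503
    fuse: Callable              # role of module 504
    decode: Callable            # role of module 505

    def run(self, text):
        keywords = self.extract_keywords(text)
        first = self.encode_keywords(keywords)
        second = self.encode_text(text)
        hidden = self.fuse(first, second)
        return self.decode(hidden)


# Hypothetical stand-ins that only demonstrate how outputs feed the next module.
apparatus = SummaryApparatus(
    extract_keywords=lambda text: [w for w in text.split() if w == "drugA"],
    encode_keywords=lambda keywords: [len(w) for w in keywords],
    encode_text=lambda text: [len(w) for w in text.split()],
    fuse=lambda first, second: first + second,
    decode=lambda hidden: len(hidden),
)
result = apparatus.run("drugA 1000mg twice oral")
```

The text is consumed twice, once by the keyword path and once by the full-text path, and the two encodings meet only in the fusion step.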

The methods, apparatus, and devices of the present application are operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

The above-described methods and apparatuses may be implemented, for example, in the form of a computer program that can be run on a computer device as shown in fig. 4.

Referring to fig. 4, fig. 4 is a schematic diagram of a computer device according to an embodiment of the present disclosure. The computer device may be a server or a terminal.

As shown in fig. 4, the computer device 600 includes a processor 601, memory, and a network interface 604 connected by a system bus 602, where the memory may include non-volatile storage media and internal memory 603.

Non-volatile storage media may store operating system 605 and computer programs 606. The computer program 606 includes program instructions that, when executed, cause the processor 601 to perform any of the text summary generation methods.

The processor 601 is used to provide computing and control capabilities, supporting the operation of the overall computer device 600.

The internal memory 603 provides an environment for running a computer program 606 in a non-volatile storage medium, and when the computer program 606 is executed by the processor 601, the processor 601 may be caused to execute any one of the text summary generation methods.

The network interface 604 is used for network communication, such as sending assigned tasks, etc. Those skilled in the art will appreciate that the configuration of the computer device 600 is merely a block diagram of a portion of the configuration associated with the present application and does not constitute a limitation of the computer device 600 upon which the present application is applied, and in particular that the computer device 600 may include more or less components than those shown, or combine certain components, or have a different arrangement of components.

It should be understood that Processor 601 may be a Central Processing Unit (CPU), and that Processor 601 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor 601 may be a microprocessor, or the processor 601 may be any conventional processor.

In some embodiments, the processor 601 is configured to run the computer program 606 stored in the memory to implement the following steps:

acquiring a text to be processed, inputting the text to be processed into a selector in the abstract generation model for keyword extraction, thereby obtaining a keyword sequence; inputting the keyword sequence into a first encoder in the abstract generation model for encoding processing to obtain a first vector; inputting the text to be processed into a second encoder in the abstract generation model for encoding processing to obtain a second vector; obtaining a hidden vector according to the first vector and the second vector; and inputting the hidden vector into a decoder in the abstract generation model for decoding to obtain a target abstract.

In some embodiments, inputting the text to be processed into the selector in the abstract generation model for keyword extraction to obtain the keyword sequence includes: preprocessing the text to be processed to obtain a vocabulary set; calling a pre-configured dictionary to identify the vocabulary set to obtain a candidate word set; and calculating the retention probability corresponding to each candidate word in the candidate word set, and screening the candidate word set according to the retention probability to obtain the keyword sequence.

In some embodiments, preprocessing the text to be processed to obtain a vocabulary set, includes: extracting a plurality of continuous characters in a text to be processed to form a vocabulary preparation unit; and performing duplicate removal processing and stop word removal processing on the vocabulary preparation unit to obtain a vocabulary set.

In some embodiments, obtaining the hidden vector according to the first vector and the second vector comprises: determining a plurality of dimensions of the first vector and scalars corresponding to the dimensions; determining a plurality of dimensions of the second vector and scalars corresponding to the dimensions; superposing each dimension of the first vector and each dimension of the second vector, and filling scalars of corresponding dimensions in the superposed dimensions to obtain a spliced vector; and filtering the spliced vector according to the activation function to obtain the hidden vector.

In some embodiments, the target abstract is determined over a plurality of time steps, and inputting the hidden vector into the decoder to obtain the target abstract comprises: acquiring a current decoding state; calculating a context vector according to the hidden vector; calculating a partial abstract output at the current time step according to the current decoding state and the context vector; and traversing all the time steps, and obtaining the target abstract according to the partial abstract output at each time step.

In some embodiments, obtaining the current decoding state comprises: acquiring the decoding state of the last time step and a partial abstract output by the last time step; and calculating the current decoding state according to the decoding state of the last time step and the output of the last time step.

In some embodiments, computing the context vector from the hidden vector comprises: calculating attention distribution according to the hidden vector; a context vector is calculated from the attention distribution.

The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, where the computer program includes program instructions, and the program instructions, when executed, implement any one of the text summary generation methods provided in the embodiment of the present application.

The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device.

While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A text summary generation method, the method comprising:

acquiring a text to be processed, and inputting the text to be processed into a selector in an abstract generation model for keyword extraction, thereby obtaining a keyword sequence;

inputting the keyword sequence into a first encoder in the abstract generation model for encoding to obtain a first vector;

inputting the text to be processed into a second encoder in the abstract generation model for encoding to obtain a second vector;

obtaining a hidden vector according to the first vector and the second vector;

and inputting the hidden vector into a decoder in the abstract generation model for decoding to obtain a target abstract.

2. The method according to claim 1, wherein the inputting the text to be processed into a selector in an abstract generation model for keyword extraction, thereby obtaining a keyword sequence, comprises:

preprocessing the text to be processed to obtain a vocabulary set;

calling a pre-configured dictionary to identify the vocabulary set to obtain a candidate word set;

and calculating the retention probability corresponding to each candidate word in the candidate word set, and screening the candidate word set according to the retention probability to obtain the keyword sequence.

3. The method of claim 2, wherein the pre-processing the text to be processed to obtain a vocabulary set comprises:

extracting a plurality of continuous characters in the text to be processed to form a vocabulary preparation unit;

and performing duplicate removal processing and stop word removal processing on the vocabulary preparation unit to obtain the vocabulary set.

4. The method of claim 1, wherein deriving the hidden vector from the first vector and the second vector comprises:

determining a plurality of dimensions of the first vector and scalars corresponding to the dimensions;

determining a plurality of dimensions of the second vector and scalars corresponding to the dimensions;

superposing each dimension of the first vector and each dimension of the second vector, and filling scalars corresponding to the dimensions in the superposed dimensions to obtain spliced vectors;

and filtering the spliced vector according to an activation function to obtain a hidden vector.

5. The method of claim 1, wherein the target abstract is determined over a plurality of time steps, and wherein inputting the hidden vector into a decoder to obtain the target abstract comprises:

acquiring a current decoding state;

calculating a context vector from the hidden vector;

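calculating a partial abstract output at the current time step according to the current decoding state and the context vector;

and traversing all the time steps, and obtaining the target abstract according to the partial abstract output at each time step.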

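6. The method of claim 5, wherein the obtaining the current decoding state comprises:

acquiring the decoding state of the last time step and a partial abstract output at the last time step;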

and calculating the current decoding state according to the decoding state of the last time step and the output of the last time step.

7. The method of claim 5, wherein the computing a context vector from the hidden vector comprises:

calculating attention distribution according to the hidden vector;

calculating the context vector from the attention distribution.

8. A text summary generation apparatus, comprising:

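a keyword extraction module, configured to acquire a text to be processed, and input the text to be processed into a selector in an abstract generation model for keyword extraction, thereby obtaining a keyword sequence;

a first vector obtaining module, configured to input the keyword sequence into a first encoder in the abstract generation model for encoding to obtain a first vector;

a second vector obtaining module, configured to input the text to be processed into a second encoder in the abstract generation model for encoding to obtain a second vector;

a hidden vector obtaining module, configured to obtain a hidden vector according to the first vector and the second vector;

and an output module, configured to input the hidden vector into a decoder in the abstract generation model for decoding to obtain a target abstract.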

9. A computer device, wherein the computer device comprises a memory and a processor;

the memory for storing a computer program;

the processor is configured to execute the computer program and to implement the text summary generation method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the text summary generation method according to any one of claims 1 to 7.