CN112906385A - Text abstract generation method, computer equipment and storage medium - Google Patents


Info

Publication number
CN112906385A
Authority
CN
China
Prior art keywords
text
sentence
calculating
vector
segment
Prior art date
Legal status
Granted
Application number
CN202110489771.1A
Other languages
Chinese (zh)
Other versions
CN112906385B (en)
Inventor
杨德杰 (Yang Dejie)
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202110489771.1A
Publication of CN112906385A
Application granted
Publication of CN112906385B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods


Abstract

The invention relates to the technical field of artificial intelligence and provides a text abstract generation method, computer equipment and a storage medium. The text abstract generation method comprises the following steps: performing word segmentation on the text to obtain target keywords; generating sentence vectors for the sentences in the text and segment vectors for the text segments from the word vectors of the target keywords; calculating the weight of each sentence in the text from its sentence vector; encoding the segment vectors to obtain the hidden states of the text segments; obtaining the attention weight of each target keyword in a text segment from the hidden state of the text segment and the hidden-state vector at each moment; calculating the vocabulary probability distribution at time t from the attention weights of the target keywords and the corresponding agent weights, where the vocabulary probability distribution represents the target keyword appearing at the k-th position and an agent weight is the sum of sentence weights; and generating the text abstract from the vocabulary probability distribution. The method and the device can generate text abstracts accurately, and the generated abstracts are highly readable and work well on long texts.

Description

Text abstract generation method, computer equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a text abstract generating method, computer equipment and a storage medium.
Background
Training courses are often long, typically over an hour; generating a text abstract of a training course can help a user review quickly after class.
The inventor finds that existing text abstract generation approaches fall into two categories: abstractive and extractive. Abstractive methods generate a new abstract after understanding the original document; the result is highly readable, but its correctness is low, and existing abstractive methods usually focus on short texts and perform poorly on long course texts. Extractive methods extract keywords or key sentences from the original document and combine them by importance to form the abstract, but the resulting abstract reads poorly and carries little information.
Disclosure of Invention
In view of the above, there is a need for a text abstract generation method, computer device and storage medium that can generate text abstracts accurately, with high readability and good performance on long texts.
The first aspect of the present invention provides a text summary generating method, including:
performing word segmentation processing on a text to obtain a target keyword, and acquiring a word vector of the target keyword;
generating statement vectors of statements in the text according to the word vectors, and generating segment vectors of text segments in the text according to the word vectors;
calculating statement weight of the statement in the text according to the statement vector;
coding the segment vector to obtain the hidden state of the text segment;
obtaining attention weight of a target keyword in the text segment according to the hidden state of the text segment and the hidden state vector at each moment;
calculating according to the attention weight of the target keyword and a corresponding proxy weight to obtain vocabulary probability distribution at the time t, wherein the vocabulary probability distribution represents the target keyword appearing at the kth position of the text abstract, and the proxy weight is the sum of statement weights of statements in the text segment;
and generating a text abstract according to the vocabulary probability distribution.
In an optional embodiment, the calculating the sentence weight of the sentence in the text according to the sentence vector comprises:
acquiring a first maximum boundary correlation degree of the statement;
acquiring a second maximum boundary correlation degree of other sentences except the sentence in the text;
and obtaining the sentence weight of the sentence according to the first maximum boundary correlation degree and the second maximum boundary correlation degree.
In an optional embodiment, the calculating of the first maximum boundary correlation includes:
calculating a first similarity between the sentence and the text according to the sentence vector of the sentence;
calculating a second similarity between the statement and the rest statements according to the statement vectors of the statement and the statement vectors of the rest statements;
and calculating to obtain the first maximum boundary correlation degree according to the first similarity degree and the second similarity degree.
In an optional embodiment, the calculating the first similarity of the sentence and the text according to the sentence vector of the sentence includes: calculating a first feature representation of the sentence according to the word vector in the sentence; calculating a second feature representation of the text according to the word vectors in the text; and calculating to obtain a first similarity according to the first feature representation and the second feature representation by adopting a similarity calculation model.
In an optional embodiment, the calculating the second similarity between the sentence and the rest of the sentences according to the sentence vectors of the sentences and the sentence vectors of the rest of the sentences includes: calculating third feature representations of the rest sentences according to the word vectors in the rest sentences; and calculating to obtain a second similarity according to the first characteristic representation and the third characteristic representation by adopting the similarity calculation model.
In an optional implementation manner, the calculating the first maximum boundary correlation according to the first similarity and the second similarity includes:
obtaining a first value according to a preset hyper-parameter and the first similarity;
determining a maximum value of the second similarity;
obtaining a second value according to the preset hyper-parameter and the maximum value;
and obtaining a first maximum boundary correlation degree according to the first value and the second value.
In an optional embodiment, the calculating the sentence weight of the sentence in the text according to the sentence vector comprises:
calculating a sum of the second maximum boundary correlations;
calculating a ratio of the first maximum boundary correlation to the sum;
and mapping the ratio by using a preset function to obtain the statement weight of the statement.
In an optional embodiment, the encoding the segment vector to obtain the hidden state of the text segment comprises:
coding the segment vector through a first-stage bidirectional LSTM model to obtain a hidden state of a target keyword in the text segment;
and coding the hidden state of the target keyword through a second-stage bidirectional LSTM model to obtain the hidden state of the text segment.
In an optional embodiment, the generating the text excerpt according to the lexical probability distribution includes:
for any moment, acquiring the maximum probability in the vocabulary probability distribution at that moment;
determining the target keyword corresponding to the maximum probability as a target keyword of the text abstract;
and combining the target keywords of the text abstract in time order to obtain the text abstract.
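The three decoding steps above can be sketched as follows. This is a minimal illustration in Python; the vocabulary and the per-moment probability distributions are made-up values, not values from the patent.

```python
# Sketch of the final decoding step: at each moment t there is a
# probability distribution over candidate target keywords; the keyword
# with the maximum probability is picked, and the picks are joined in
# time order to form the abstract.

def assemble_summary(vocab, step_distributions):
    """vocab: list of target keywords; step_distributions: one
    probability list per moment, aligned with vocab."""
    picked = []
    for dist in step_distributions:
        best = max(range(len(vocab)), key=lambda i: dist[i])
        picked.append(vocab[best])
    return " ".join(picked)

vocab = ["course", "summary", "training", "model"]
dists = [
    [0.1, 0.2, 0.6, 0.1],    # t=0 -> "training"
    [0.7, 0.1, 0.1, 0.1],    # t=1 -> "course"
    [0.1, 0.8, 0.05, 0.05],  # t=2 -> "summary"
]
print(assemble_summary(vocab, dists))  # training course summary
```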
A second aspect of the invention provides a computer device comprising a processor for implementing the text summary generation method when executing a computer program stored in a memory.
A third aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the text digest generation method.
In summary, with the text abstract generation method, computer device and storage medium of the present invention, for a text requiring abstract generation, the word vector of each target keyword in the text is extracted; a sentence vector is generated for each sentence and a segment vector for each text segment from the word vectors; the sentence weight of each sentence in the whole text is obtained from the sentence vectors; the segment vectors are then encoded to obtain the hidden states of the text segments; the attention weight of each target keyword in a text segment is obtained from the hidden state of the text segment and the hidden-state vector at each moment; and finally the vocabulary probability distribution at time t is obtained from the attention weights of the target keywords and the corresponding agent weights, where the vocabulary probability distribution represents the target keyword appearing at the k-th position. The generated text abstract takes into account not only the similarity between sentences and the text but also the redundancy of abstract sentences: text segments composed of sentences with highly repetitive meanings receive relatively low weights, their information is treated as less important by the decoder module, and target keywords with repeated meanings are less likely to appear in the final abstract, so the generated text abstract is more accurate, clear and concise.
Drawings
Fig. 1 is a flowchart of a text summary generating method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a network architecture of a generative model provided in an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an agent in the generative model according to the second embodiment of the present invention.
Fig. 4 is a block diagram of a text abstract generating apparatus according to a second embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a computer device according to a third embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
The text abstract generating method provided by the embodiment of the invention is executed by computer equipment, and correspondingly, the text abstract generating device runs in the computer equipment.
Fig. 1 is a flowchart of a text summary generating method according to an embodiment of the present invention. The text abstract generating method specifically comprises the following steps, and the sequence of the steps in the flowchart can be changed and some steps can be omitted according to different requirements.
S11, performing word segmentation processing on the text to obtain a target keyword, and obtaining a word vector of the target keyword.
The text refers to the expression form of written language, and may be a sentence, a paragraph, or an article. If the method is applied to a training scene, the text is a training course text, and the training course text can be derived from a training video.
In an optional implementation manner, the text is a training course text, the performing word segmentation processing on the text to obtain a target keyword, and obtaining a word vector of the target keyword includes:
denoising the training course text according to a pre-established training-course stopword lexicon to obtain a standard text;
performing word segmentation processing on the standard text to obtain a target keyword;
and extracting a word vector of the target keyword by using the trained word2vec model.
The computer device may first obtain a training video, extract the training speech from the training video, and recognize the speech text of the training speech using a speech recognition technique (e.g., the Sphinx system) to obtain the training course text.
The computer device may perform word segmentation on the text using the jieba word segmentation tool to obtain a plurality of target keywords.
Since there is considerable environmental noise during actual training, the training course text obtained contains much noise. The computer device therefore creates a training-course stopword lexicon in advance, including modal particles, personal names, interjections and other meaningless words. The training course text is matched against this stopword lexicon, and words that match successfully are filtered out, thereby denoising the training course text and obtaining the standard text.
Since the training course text contains many professional terms, training-related professional vocabulary (e.g., insurance-related terms) needs to be added to a common Chinese lexicon to obtain a course corpus lexicon. The computer device can train a word2vec model in skip-gram mode on the words in the course corpus lexicon, and the trained word2vec model represents each target keyword in the standard text as a d-dimensional word vector.
The computer device may use the TextDirectoryCorpus tool in gensim to manage the created course stopword lexicon and the course corpus lexicon.
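A minimal sketch of the denoising step just described. A plain whitespace split stands in for the jieba segmenter, and the stopword set is an illustrative stand-in for the pre-built training-course stopword lexicon (none of these tokens come from the patent).

```python
# Sketch of the stopword-filtering ("denoising") step.  A whitespace
# split stands in for the jieba segmenter used in the text, and the
# stopword set is a toy stand-in for the training-course stopword
# lexicon (modal particles, interjections, meaningless words).

STOPWORDS = {"um", "uh", "well", "oh"}

def denoise(text):
    tokens = text.split()  # stand-in for jieba.lcut(text)
    return [t for t in tokens if t not in STOPWORDS]

print(denoise("um well the policy uh covers accidental loss"))
# ['the', 'policy', 'covers', 'accidental', 'loss']
```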
S12, generating statement vectors of the sentences in the text according to the word vectors, and generating segment vectors of the text segments in the text according to the word vectors.
For any sentence, the word vectors of all target keywords in the sentence are combined to obtain its sentence vector. For example, for a sentence s_i, combining the word vectors of all target keywords in s_i gives the sentence vector V_{s_i} = (w_1, w_2, …, w_n), where w_n denotes the word vector of the n-th target keyword in sentence s_i and n is the number of target keywords contained in s_i.
Generally speaking, a training course text is a long text, and the computer device divides it into a plurality of text segments so as to better extract a text abstract of the training course text. In a specific implementation, periods of the training course text may be used as division points, dividing the training course text into a plurality of text segments, where each text segment may contain one sentence, or two or more sentences. The text segments form a text segment sequence, denoted X = (x_1, x_2, …, x_M), where x_m represents the m-th text segment in the training course text. Since a text segment consists of one or more sentences, each containing a plurality of target keywords, a text segment can be regarded as a sequence of target keywords, denoted x_m = (w_{m,1}, w_{m,2}, …, w_{m,I}), where x_m represents a text segment, w_{m,i} represents a target keyword, and I represents the number of target keywords contained in x_m. Each target keyword w_{m,i} is a d-dimensional word vector, so the segment vector consists of I d-dimensional vectors.
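A shape-level sketch of the sentence- and segment-vector assembly described above, with toy 3-dimensional word vectors standing in for the d-dimensional word2vec embeddings (the words and values are illustrative, not from the patent).

```python
# Sketch of assembling sentence vectors and segment vectors from word
# vectors.  Word vectors are toy 3-dimensional lists; in the method they
# would be d-dimensional word2vec embeddings.

word_vecs = {
    "policy": [0.1, 0.2, 0.3],
    "covers": [0.0, 0.1, 0.0],
    "loss":   [0.2, 0.0, 0.1],
}

def sentence_vector(sentence_tokens):
    """A sentence vector is the ordered list of its word vectors."""
    return [word_vecs[t] for t in sentence_tokens]

def segment_vector(sentences):
    """A segment vector combines the word vectors of all sentences
    in the segment into one sequence."""
    return [v for s in sentences for v in sentence_vector(s)]

seg = segment_vector([["policy", "covers"], ["loss"]])
print(len(seg), len(seg[0]))  # 3 3 : three word vectors, each 3-dimensional
```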
And S13, calculating the sentence weight of the sentence in the text according to the sentence vector.
In this embodiment, drawing on the idea of extractive summarization, as much text information as possible is extracted from the text, that is, sentences carrying a large amount of information are extracted from the text, so as to assist in generating the text abstract.
The computer device calculates the sentence weight of each sentence in the text from its sentence vector: the larger the sentence weight, the more important the corresponding sentence is in the text; the smaller the sentence weight, the less important the sentence.
In an alternative embodiment, calculating the sentence weight of the sentence in the text from the sentence vector comprises:
acquiring a first maximum boundary correlation degree of the statement;
acquiring a second maximum boundary correlation degree of other sentences except the sentence in the text;
and obtaining the sentence weight of the sentence according to the first maximum boundary correlation degree and the second maximum boundary correlation degree.
Wherein the calculation process of the first maximum boundary correlation degree comprises: calculating a first similarity between the sentence and the text according to the sentence vector of the sentence, calculating a second similarity between the sentence and the other sentences according to the sentence vector of the sentence and the sentence vectors of the other sentences, and calculating the first maximum boundary correlation according to the first similarity and the second similarity.
The sentence is defined relative to the remaining sentences, which are all sentences in the text other than the sentence itself. For example, when the sentence is the first sentence in the text, the remaining sentences are the second through last sentences; when the sentence is the second sentence, the remaining sentences are the first sentence and the third through last sentences.
It should be understood that, since the sentence is defined relative to the remaining sentences, the calculation of the second maximum boundary correlation is the same as that of the first maximum boundary correlation and is not elaborated here.
In an alternative embodiment, the computing, by the computer device, a first similarity of the sentence to the text from the sentence vector of the sentence comprises: calculating a first feature representation of the sentence according to the word vector in the sentence; calculating a second feature representation of the text according to the word vectors in the text; and calculating to obtain a first similarity according to the first feature representation and the second feature representation by adopting a similarity calculation model.
Illustratively, assume a sentence includes 10 target keywords, the text includes 1000 target keywords, and each target keyword's word vector is d-dimensional. The first mean vector obtained by summing and averaging the word vectors of the 10 target keywords is still d-dimensional; similarly, the second mean vector obtained by summing and averaging the word vectors of the 1000 target keywords is still d-dimensional. The first mean vector is the first feature representation of the sentence, and the second mean vector is the second feature representation of the text; that is, after the sentence and the text are each converted into feature representations, the first similarity between the sentence and the text can be calculated.
In an alternative embodiment, the computing, by the computer device, the second similarity of the sentence to the remaining sentences according to the sentence vectors of the sentence and the sentence vectors of the remaining sentences comprises: calculating third feature representations of the rest sentences according to the word vectors in the rest sentences; and calculating to obtain a second similarity according to the first characteristic representation and the third characteristic representation by adopting the similarity calculation model.
The third mean vector, obtained by summing and averaging the word vectors of the target keywords in the remaining sentences, is taken as the third feature representation of the remaining sentences.
The similarity calculation model may be a preconfigured similarity function, such as Euclidean distance or cosine similarity.
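A sketch of the feature representations and the similarity calculation model just described, using mean pooling and cosine similarity (one of the similarity functions the text mentions). All vectors are toy values.

```python
import math

# A sentence (or the whole text) is represented by the mean of its word
# vectors, and a cosine-similarity function serves as the similarity
# calculation model.  Vectors here are toy 2-dimensional values.

def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

sentence = [[1.0, 0.0], [0.0, 1.0]]          # word vectors of a sentence
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # word vectors of the text
print(round(cosine(mean_vector(sentence), mean_vector(text)), 4))  # 1.0
```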
In an optional implementation manner, the calculating the first maximum boundary correlation according to the first similarity and the second similarity includes:
obtaining a first value according to a preset hyper-parameter and the first similarity;
determining a maximum value of the second similarity;
obtaining a second value according to the preset hyper-parameter and the maximum value;
and obtaining a first maximum boundary correlation degree according to the first value and the second value.
The computer device stores a first formula in advance; the first maximum boundary correlation calculated from the first similarity and the second similarity can be expressed by the following first formula:

MMR(s_i) = λ · Sim(s_i, D) − (1 − λ) · max_{j≠i} Sim(s_i, s_j)

where MMR(s_i) denotes the maximum boundary correlation of sentence s_i, D denotes the feature representation of the text (i.e., the second mean vector), s_i here denotes the feature representation of sentence s_i (i.e., the first mean vector), s_j denotes the feature representation of sentence s_j (i.e., the third mean vector), λ is a preset hyper-parameter that is continuously optimized during subsequent training, Sim(s_i, D) denotes the first similarity between sentence s_i and the text, and Sim(s_i, s_j) denotes the second similarity between sentences s_i and s_j.

Sim(s_i, D) indicates the relevance of a single sentence to the entire text, and Sim(s_i, s_j) indicates the correlation between sentences s_i and s_j in the text. The larger Sim(s_i, s_j) is, the stronger the correlation between sentences s_i and s_j, the greater the redundancy between them, and the more redundant the generated text abstract; the smaller Sim(s_i, s_j) is, the weaker the correlation between sentences s_i and s_j, the smaller the redundancy between them, and the less redundant the generated text abstract.
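The first formula can be sketched numerically as follows. The similarity values and the hyper-parameter value lam = 0.7 are illustrative assumptions, not values from the patent.

```python
# Sketch of the maximal-marginal-relevance computation of the first
# formula: MMR(s_i) = lam * Sim(s_i, D) - (1 - lam) * max_{j != i} Sim(s_i, s_j).
# The similarities are toy precomputed values; lam is an assumed
# hyper-parameter value.

def mmr(sim_to_text, sims_to_others, lam=0.7):
    return lam * sim_to_text - (1 - lam) * max(sims_to_others)

score = mmr(sim_to_text=0.9, sims_to_others=[0.2, 0.5, 0.3])
print(round(score, 4))  # 0.48
```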
In an optional embodiment, the calculating the sentence weight of the sentence in the text according to the sentence vector comprises:
calculating a sum of the second maximum boundary correlations;
calculating a ratio of the first maximum boundary correlation to the sum;
and mapping the ratio by using a preset function to obtain the statement weight of the statement.
The computer device stores a second formula in advance; the sentence weight calculated from the first maximum boundary correlation and the second maximum boundary correlations can be expressed by the following second formula:

β_i = softmax( MMR(s_i) / Σ_{j≠i} MMR(s_j) )

where MMR(s_i) denotes the maximum boundary correlation of sentence s_i, MMR(s_j) denotes the maximum boundary correlation of sentence s_j, and β_i denotes the sentence weight of sentence s_i.
After calculating the maximum boundary correlation of each sentence in the text, for any sentence the computer device calculates the sum of the maximum boundary correlations of the other sentences, calculates the ratio of that sentence's maximum boundary correlation to the sum, and finally applies the softmax function to obtain the sentence weight (or probability value) of the sentence. The larger the ratio, the stronger the relevance of the corresponding sentence to the text and the higher the redundancy; the smaller the ratio, the weaker the relevance of the corresponding sentence to the text and the lower the redundancy.
Combining the sentence weights of all sentences in the text gives a sentence weight vector, denoted β = (β_1, β_2, …, β_N), where N is the number of sentences in the text and β_i is the predicted probability that sentence s_i is used in the text abstract.
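The ratio-then-softmax mapping of the second formula can be sketched as follows; the MMR scores are illustrative values.

```python
import math

# Sketch of the sentence-weight step of the second formula: each
# sentence's MMR score is divided by the sum of the other sentences'
# MMR scores, and the ratios are mapped through softmax so the weights
# form a probability distribution.  MMR scores are toy values.

def sentence_weights(mmr_scores):
    total = sum(mmr_scores)
    ratios = [m / (total - m) for m in mmr_scores]  # ratio to sum of others
    exps = [math.exp(r) for r in ratios]
    z = sum(exps)
    return [e / z for e in exps]

weights = sentence_weights([0.48, 0.30, 0.22])
print([round(w, 3) for w in weights])
```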
S14, encoding the segment vector to obtain the hidden state of the text segment.
The segment vectors may be encoded using an agent in an encoder module that generates a model, resulting in a hidden state of the text segment.
In this embodiment, the network architecture of the generative model is shown in fig. 2. The generative model includes an Encoder module (the Encoder part in fig. 2) at the bottom layer and a Decoder module (the Decoder part in fig. 2) at the upper layer. The Encoder module includes a plurality of mutually independent agents (Agent1, Agent2, …, AgentM in fig. 2), each containing a bidirectional LSTM (BiLSTM) model; the Decoder module includes a multi-layer Attention mechanism layer, an Agent Attention mechanism layer (the Agent Attention part in fig. 2), and an LSTM model.
The computer device configures the number of agents in the encoder module according to the number of text segments; that is, for m text segments, m agents are correspondingly configured in the encoder module, so that the text segments correspond to the agents one to one. The computer device inputs each text segment into one agent in the encoder module, encodes the text segment through that agent to obtain an intermediate vector (that is, a hidden state), and transmits the intermediate vector to the other agents, so that different agents can share global context information about different contents of the text. Each agent outputs an intermediate vector, and all intermediate vectors form an intermediate vector sequence. Finally, the decoder module decodes the intermediate vectors output by all the agents, that is, decodes the intermediate vector sequence, to obtain the probability distribution of the text abstract and thus the final text abstract. Referring to FIG. 2, the computer device inputs the text segment $x_1$ into Agent1 of the encoder module, and Agent1 encodes $x_1$ to obtain the hidden state of $x_1$; inputs the text segment $x_2$ into Agent2, and Agent2 encodes $x_2$ to obtain the hidden state of $x_2$; and inputs the text segment $x_m$ into AgentM, and AgentM encodes $x_m$ to obtain the hidden state of $x_m$.
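The sharing of global context between agents can be sketched as follows. This is a minimal illustration, not the patent's implementation: each agent's encoder output is assumed to be summarized by a single hidden vector, and the message each agent receives is the average of the other agents' outputs (the names `agent_outputs` and `agent_messages` are illustrative).

```python
import numpy as np

def agent_messages(agent_outputs: np.ndarray) -> np.ndarray:
    """For each agent a, return the average of the OTHER agents' output
    vectors, i.e. the global-context message shared between agents."""
    m, _ = agent_outputs.shape
    total = agent_outputs.sum(axis=0)
    # message_a = (sum of all outputs - agent a's own output) / (m - 1)
    return (total - agent_outputs) / (m - 1)

# 3 agents (one per text segment), hidden size 2
outputs = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0]])
z = agent_messages(outputs)
```

With these toy values, agent 1 receives the average of agents 2 and 3, namely `[4.0, 5.0]`.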
In an alternative embodiment, as shown in fig. 3, the proxy includes two-stage bidirectional LSTM models (a first-stage bidirectional LSTM model and a second-stage bidirectional LSTM model), and the encoding the segment vector to obtain the hidden state of the text segment includes:
coding the segment vector through a first-stage bidirectional LSTM model to obtain a hidden state of a target keyword in the text segment;
and coding the hidden state of the target keyword through a second-stage bidirectional LSTM model to obtain the hidden state of the text segment.
The BiLSTM in each agent is a two-stage bidirectional LSTM model. The first stage is a single-layer bidirectional LSTM (the Local Encoder in FIG. 3). The output of each Local Encoder is input to the second stage, a multi-layer bidirectional LSTM structure called the context encoder (the Context Encoder in FIG. 3).
The procedure of the first stage is as follows:
The word vectors $e_j$ of the combined text segment $x_i$ are fed as a whole into the first-stage single-layer bidirectional LSTM for encoding. The coding formulas of the first-stage single-layer bidirectional LSTM are as follows:
hidden state in the middle:
$$\overrightarrow{h}_j = \mathrm{LSTM}(e_j, \overrightarrow{h}_{j-1}), \qquad \overleftarrow{h}_j = \mathrm{LSTM}(e_j, \overleftarrow{h}_{j+1})$$
hidden state of output:
$$h_j = W\,[\overrightarrow{h}_j; \overleftarrow{h}_j], \quad h_j \in \mathbb{R}^{H}$$
wherein $h_j$ is the hidden state of the target keyword $w_j$, H is the dimension of $h_j$, and W is a parameter that is continuously optimized in the training process; $\overrightarrow{h}_j$ represents the intermediate hidden state obtained by encoding the target keyword $w_j$ with the forward LSTM, and $\overleftarrow{h}_j$ represents the intermediate hidden state obtained from the following target keyword $w_{j+1}$ with the backward LSTM.
As can be seen from the above formulas, the hidden state corresponding to each target keyword $w_j$ in the text segment comes from the intermediate hidden states of the preceding and following target keywords and from the word-vector representation of the target keyword itself.
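The bidirectional pass of the first stage can be sketched as follows. To keep the example short, a plain tanh recurrence stands in for the LSTM cell, and the random matrices `W_in` and `W_h` are stand-ins for the parameters optimized during training; only the two-direction structure and the concatenated output are faithful to the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, H = 4, 3                        # word-vector dim, hidden dim
W_in = rng.normal(size=(H, n)) * 0.1
W_h = rng.normal(size=(H, H)) * 0.1

def run_direction(embeddings):
    """One directional pass: each step mixes the current word vector
    with the previous intermediate hidden state."""
    h = np.zeros(H)
    states = []
    for e in embeddings:
        h = np.tanh(W_in @ e + W_h @ h)
        states.append(h)
    return states

def local_encode(embeddings):
    fwd = run_direction(embeddings)              # left-to-right pass
    bwd = run_direction(embeddings[::-1])[::-1]  # right-to-left pass
    # output hidden state h_j concatenates both directions
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

segment = [rng.normal(size=n) for _ in range(5)]  # 5 target keywords
hidden = local_encode(segment)
```

Each output state thus depends on the keywords before it (forward pass), after it (backward pass), and its own word vector, matching the observation above.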
The process of the second stage is as follows:
$$h_j^{(k)} = \mathrm{biLSTM}\big(h_j^{(k-1)},\ h_{j-1}^{(k)},\ h_{j+1}^{(k)},\ z\big)$$
$$z = \frac{1}{M-1}\sum_{a' \neq a} h_{a',I}^{(k-1)}$$
Each agent jointly encodes, in the k-th layer of the context encoder, the information received from the previous layer. As can be seen from the above formulas, the input to the k-th layer of the context encoder comprises three parts: the hidden states of the contextual target keywords ($h_{j-1}^{(k)}$ and $h_{j+1}^{(k)}$); the hidden state output by the previous layer ($h_j^{(k-1)}$); and the information z from the other agents, where z is the average of the (k-1)-th layer outputs of the other agents $a'$, and $h_{a',I}^{(k-1)}$ is the hidden state vector of the last target keyword generated at each moment by the (k-1)-th layer of another agent $a'$. Thus each target keyword $w_j$ in the text segment $x_i$ not only learns its own contextual target keyword information but also receives overall information from the other text segments.
The encoder module outputs $h_i = (h_{i,1}, h_{i,2}, \ldots, h_{i,n_i})$, the sequence of hidden state vectors of the $n_i$ target keywords of the text segment $x_i$.
And S15, obtaining the attention weight of the target keyword in the text segment according to the hidden state of the text segment and the hidden state vector at each moment.
In an alternative embodiment, the attention extracting mechanism layer in the decoder module of the generative model may be used to calculate the attention weight of the target keyword in the text segment according to the hidden state of the text segment and the hidden state vector at each time.
And S16, calculating the vocabulary probability distribution at the time t according to the attention weight of the target keyword and the corresponding proxy weight.
The vocabulary probability distribution at the time t can be calculated by using an agent attention mechanism layer in the decoder module according to the attention weight of the target keyword and the corresponding agent weight.
The vocabulary probability distribution represents the target keyword appearing at the t-th position of the text abstract, and the agent weight is the sum of the sentence weights of the sentences in the text segment.
The decoder module is a one-way LSTM structure, so that the last target keyword of each text segment can contain the state information of all previous target keywords. The initial input state $s_0$ of the decoder module is the hidden state vector output by the last layer of all agents of the encoder module for the last target keyword; this hidden state vector contains all the text information encoded by the encoder module.
In a specific implementation, using the hidden state vector sequence $h_i$ generated by the last hidden layer of the agent, the attention weight of each target keyword in the text segment is calculated by the following formula:
$$\alpha_{a,j}^{t} = \mathrm{softmax}\big(v^{\top}\tanh(W_h h_{a,j} + W_s s_t + b)\big)$$
wherein v, $W_h$, $W_s$ and b are all hyper-parameters, and $s_t$ represents the state vector at time t. For each agent a in the agent attention mechanism layer, the attention weight of the agent is represented by the weight of each target keyword in the text segment:
$$c_a^{t} = \sum_{j} \alpha_{a,j}^{t}\, h_{a,j}$$
The sentence weights (that is, the sentence prediction probabilities) of the sentences in the text segment $x_a$ are added to obtain the agent weight $g_a$. The attention distributions $c_a^t$ of all agents are weighted with the agent weights and summed to obtain the attention representation of the t-th word of the abstract:
$$c^{t} = \sum_{a} g_a\, c_a^{t}$$
Putting $c^t$ together with the hidden state vector at each moment into a multilayer perceptron yields the vocabulary probability distribution at time t, $P_{\mathrm{vocab}}(y_t)$, where $y_t$ represents the target keyword appearing at the t-th position of the text abstract.
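The word attention and agent-weighted combination can be sketched numerically as follows. This is a toy illustration: `v`, `W_h`, `W_s` are random stand-ins for trained parameters, the agent weights are made-up values standing in for the summed sentence weights, and the final multilayer perceptron is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
H, m, n_words = 4, 2, 3          # hidden dim, number of agents, keywords per agent
v = rng.normal(size=H)
W_h = rng.normal(size=(H, H))
W_s = rng.normal(size=(H, H))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def word_attention(hidden, s_t):
    """alpha_j = softmax(v^T tanh(W_h h_j + W_s s_t)) over one agent's keywords."""
    scores = np.array([v @ np.tanh(W_h @ h + W_s @ s_t) for h in hidden])
    return softmax(scores)

s_t = rng.normal(size=H)                           # decoder state at time t
agents = [rng.normal(size=(n_words, H)) for _ in range(m)]
g = softmax(np.array([0.7, 0.3]))                  # toy agent weights

# context at time t: agent-weighted sum of each agent's attention readout
c_t = sum(g[a] * word_attention(agents[a], s_t) @ agents[a] for a in range(m))
```

The resulting `c_t` is the vector that, per the description above, would be fed with the decoder state into a multilayer perceptron to produce the vocabulary distribution.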
Each time step t of the LSTM corresponds to the position of the t-th target keyword of the generated text abstract. The hidden state $s_t$ generated at the current time predicts the probability distribution of the new target keyword $y_t$ at the t-th position of the abstract. In this process, a layered attention mechanism is introduced inside the LSTM structure, comprising the attention weight of the target keyword (word attention) and the attention weight of the agent (agent attention).
The probability distribution of the obtained text abstract is
$$P(y_t) = (p_1, p_2, \ldots, p_V)$$
wherein $P(y_t)$ represents the probability of each word in the dictionary appearing at time t, V is the number of target keywords in the course corpus thesaurus, and each element $p_i \in [0, 1]$.
In this alternative embodiment, the weight calculated by the generative model takes into account not only the similarity between a sentence and the text but also the redundancy of the abstract sentences. This weight is combined with the text-segment weights in the decoder module of the generative model, so that, compared with a common decoder, the finally calculated probability also contains redundancy information. That is, a text segment composed of sentences with highly repetitive meanings receives a relatively low weight, its information is treated as less important by the decoder module, and target keywords with repeated meanings are less likely to appear in the final abstract, so that the generated text abstract is more accurate, and its meaning is clear and concise.
And S17, generating a text abstract according to the vocabulary probability distribution.
Since the probability distribution represents the probability of each target keyword in the text appearing in the text abstract, the higher the probability, the more likely the corresponding target keyword is to appear in the text abstract, and the lower the probability, the less likely it is to appear. The computer device may determine the target keywords appearing in the text abstract according to the probabilities in the probability distribution, thereby generating the text abstract from these target keywords.
In an optional embodiment, the generating the text excerpt according to the lexical probability distribution includes:
for any moment, acquiring the maximum probability of the probability distribution at the any moment;
determining the target keywords corresponding to the maximum probability as the target keywords in the text abstract;
and combining the target keywords in the text abstract according to a time sequence to obtain the text abstract.
Illustratively, for any time t, the target keyword corresponding to the maximum probability at time t is selected as the t-th word of the text abstract to be generated, and the text abstract is finally generated from these target keywords as $Y = (y_1, y_2, \ldots, y_T)$, where $y_t$ is the target keyword corresponding to the maximum probability at time t, and T is the number of target keywords contained in the text abstract Y.
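The greedy selection above can be sketched as follows; the vocabulary and the per-time-step distributions are made-up toy values.

```python
# At each time t pick the dictionary word with the highest probability,
# then join the picks in time order to form the abstract.
vocab = ["insurance", "training", "course", "summary"]

def greedy_summary(distributions):
    """distributions[t][i] = probability of vocab[i] at time t."""
    picks = []
    for dist in distributions:
        best = max(range(len(vocab)), key=lambda i: dist[i])
        picks.append(vocab[best])
    return " ".join(picks)

dists = [
    [0.1, 0.6, 0.2, 0.1],   # t=1
    [0.2, 0.1, 0.5, 0.2],   # t=2
    [0.1, 0.1, 0.2, 0.6],   # t=3
]
summary = greedy_summary(dists)
```

With these toy distributions the selected abstract is "training course summary".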
After the computer device generates the text abstract, the generated text abstract can be displayed, so that a learner can review the training class according to the generated text abstract and select key courses to study according to the target keywords in the text abstract.
In summary, for a text requiring text-abstract generation, the text is first segmented into words to obtain target keywords, and the word vector of each target keyword in the text is extracted. According to the word vectors, a sentence vector is generated for each sentence in the text and a segment vector for each text segment, and the sentence weight of each sentence in the whole text is obtained from the sentence vectors. The segment vectors are then encoded with a trained LSTM model to obtain the hidden states of the text segments; the attention weights of the target keywords in the text segments are obtained from the hidden states of the text segments and the hidden state vectors at each moment; and the vocabulary probability distribution at time t is obtained from the attention weights of the target keywords and the corresponding agent weights, where the vocabulary probability distribution represents the target keyword appearing at the t-th position and the agent weight is the sum of the sentence weights. Finally, the text abstract is generated according to the vocabulary probability distribution. The generated text abstract considers not only the similarity between sentences and the text but also the redundancy of the abstract sentences: a text segment composed of sentences with highly repetitive meanings receives a relatively low weight, its information is treated as less important by the decoder module, and target keywords with repeated meanings are less likely to appear in the final abstract, so that the generated text abstract is more accurate, and its meaning is clear and concise.
It is emphasized that the text may be stored in a node of the blockchain in order to further ensure privacy and security of the text.
Fig. 2 is a block diagram of a text abstract generating apparatus according to a second embodiment of the present invention.
In some embodiments, the text summary generating device 40 may include a plurality of functional modules composed of computer program segments. The computer program of each program segment in the text summary generation apparatus 40 may be stored in a memory of a computer device and executed by at least one processor to perform the functions of text summary generation (described in detail in fig. 1).
In this embodiment, the text abstract generating device 40 may be divided into a plurality of functional modules according to the functions performed by the device. The functional module may include: a word segmentation processing module 401, a vector generation module 402, a weight calculation module 403, a vector encoding module 404, an attention calculation module 405, a probability distribution module 406, and a summary generation module 407. The module referred to herein is a series of computer program segments capable of being executed by at least one processor and capable of performing a fixed function and is stored in memory. In the present embodiment, the functions of the modules will be described in detail in the following embodiments.
The word segmentation processing module 401 is configured to perform word segmentation processing on a text to obtain a target keyword, and obtain a word vector of the target keyword.
The text refers to the expression form of written language, and may be a sentence, a paragraph, or an article. If the method is applied to a training scene, the text is a training course text, and the training course text can be derived from a training video.
In an optional embodiment, the text is a training course text, the word segmentation processing module 401 performs word segmentation processing on the text to obtain a target keyword, and obtaining a word vector of the target keyword includes:
denoising the training course text according to a pre-established training course text stop-word library to obtain a standard text;
performing word segmentation processing on the standard text to obtain a target keyword;
and extracting a word vector of the target keyword by using the trained word2vec model.
A computer device may first obtain a training video, extract a training voice of the training video, and recognize a voice text of the training voice using a voice recognition technique (e.g., a Sphinx system) to obtain a training course text.
The computer device may perform word segmentation processing on the text using the jieba word segmentation tool to obtain a plurality of target keywords.
Since there is a lot of environmental noise during actual training, the obtained training course text contains a lot of noise information. The computer device therefore creates a training course text stop-word library in advance, which includes stop words such as modal particles, person names and exclamations, as well as meaningless words. The training course text is matched against the stop-word library, and the words in the training course text that successfully match entries in the stop-word library are filtered out, thereby denoising the training course text and obtaining the standard text.
Since the training course text contains many professional vocabularies, the training related professional vocabularies (e.g., insurance related vocabularies) need to be added on the basis of the common Chinese lexicon to obtain the course corpus lexicon. The computer equipment can train a word2vec model according to words in a course corpus lexicon by adopting a skip-gram training mode, and then each target keyword in the standard text is represented into an n-dimensional word vector through the trained word2vec model.
The computer device may use the TextDirectoryCorpus tool in gensim to manage the created training course text stop-word library and the course corpus thesaurus.
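The denoising step above can be sketched as follows. The tiny stop-word set and the whitespace tokenizer are illustrative stand-ins: the patent uses a jieba-style Chinese segmenter and a domain-specific stop-word library.

```python
# Drop any token found in the pre-built stop-word library.
stop_words = {"um", "ah", "well", "okay"}

def denoise(tokens):
    """Filter out tokens that match entries in the stop-word library."""
    return [t for t in tokens if t.lower() not in stop_words]

raw = "um well the policy okay covers accidental damage".split()
standard_text = denoise(raw)
```

The remaining tokens form the "standard text" from which word vectors are then extracted.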
The vector generating module 402 is configured to generate a statement vector of a statement in the text according to the word vector, and generate a segment vector of a text segment in the text according to the word vector.
For any sentence, the word vectors of all target keywords in the sentence are combined to obtain the sentence vector of the sentence. For example, for a sentence $s_i$, the word vectors of all the target keywords in $s_i$ are combined to obtain the sentence vector $S_i = (e_1, e_2, \ldots, e_q)$ of $s_i$, where $e_q$ represents the word vector of the q-th target keyword in the sentence $s_i$, and q is the number of target keywords included in the sentence $s_i$.
Generally, the training course text is a long text, and in order to better extract its text abstract, the computer device divides the training course text into a plurality of text segments. In a specific implementation, periods of the training course text may be used as division points to divide the training course text into a plurality of text segments, where each text segment may include one sentence, or two or more sentences. The plurality of text segments form a text segment sequence, denoted as $X = (x_1, x_2, \ldots, x_m)$, where $x_i$ represents the i-th text segment in the training course text. Since a text segment is composed of one or more sentences, each of which includes a plurality of target keywords, the text segment may be regarded as a sequence of target keywords, denoted as $x_i = (w_1, w_2, \ldots, w_{n_i})$, where $x_i$ represents a text segment, $w_j$ represents a target keyword, and $n_i$ represents the number of target keywords included in the text segment $x_i$. Each target keyword $w_j$ corresponds to an n-dimensional word vector $e_j$, so the segment vector is composed of $n_i$ vectors of dimension n.
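The segmentation step above can be sketched as follows. Grouping two sentences per segment is an illustrative choice; the patent only requires that each segment hold one or more sentences, with periods as division points.

```python
# Split the course text at periods, then group sentences into segments.
def split_segments(text, sentences_per_segment=2):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [sentences[i:i + sentences_per_segment]
            for i in range(0, len(sentences), sentences_per_segment)]

course = ("Premiums are due monthly. Claims need a form. "
          "Agents review claims. Payout follows.")
segments = split_segments(course)
```

Here the four sentences yield two text segments, which would then be assigned to two agents in the encoder module.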
The weight calculating module 403 is configured to calculate a sentence weight of the sentence in the text according to the sentence vector.
In this embodiment, drawing on the idea of extractive summarization, as much text information as possible may be extracted from the text, that is, the sentences carrying a large amount of information are extracted from the text, so as to generate the text abstract.
The computer device calculates the sentence weight of each sentence in the text according to the sentence vector of each sentence, the larger the sentence weight is, the more important the corresponding sentence is in the text, and the smaller the sentence weight is, the more unimportant the sentence is in the text.
In an alternative embodiment, the calculating the sentence weight of the sentence in the text by the weight calculating module 403 according to the sentence vector includes:
acquiring a first maximum boundary correlation degree of the statement;
acquiring a second maximum boundary correlation degree of other sentences except the sentence in the text;
and obtaining the sentence weight of the sentence according to the first maximum boundary correlation degree and the second maximum boundary correlation degree.
Wherein the calculation process of the first maximum boundary correlation degree comprises: calculating a first similarity between the sentence and the text according to the sentence vector of the sentence, calculating a second similarity between the sentence and the other sentences according to the sentence vector of the sentence and the sentence vectors of the other sentences, and calculating the first maximum boundary correlation according to the first similarity and the second similarity.
The sentence is considered relative to the remaining sentences, where the remaining sentences are all the sentences in the text other than the sentence itself. For example, when the sentence is the first sentence in the text, the remaining sentences are the second through the last sentences in the text; when the sentence is the second sentence in the text, the remaining sentences are the first sentence and the third through the last sentences in the text.
It should be understood that, since each sentence is considered relative to the remaining sentences, the calculation process of the second maximum boundary correlation is the same as that of the first maximum boundary correlation and is not elaborated here.
In an alternative embodiment, the computing, by the computer device, a first similarity of the sentence to the text from the sentence vector of the sentence comprises: calculating a first feature representation of the sentence according to the word vector in the sentence; calculating a second feature representation of the text according to the word vectors in the text; and calculating to obtain a first similarity according to the first feature representation and the second feature representation by adopting a similarity calculation model.
Illustratively, assume that a sentence includes 10 target keywords, the text includes 1000 target keywords, and the word vector of each target keyword is n-dimensional. The first mean vector obtained by adding the word vectors of the 10 target keywords and averaging is still n-dimensional; similarly, the second mean vector obtained by adding the word vectors of the 1000 target keywords and averaging is still n-dimensional. The first mean vector is the first feature representation of the sentence, and the second mean vector is the second feature representation of the text; that is, after the sentence and the text are respectively converted into feature representations, the first similarity between the sentence and the text can be calculated.
In an alternative embodiment, the computing, by the computer device, the second similarity of the sentence to the remaining sentences according to the sentence vectors of the sentence and the sentence vectors of the remaining sentences comprises: calculating third feature representations of the rest sentences according to the word vectors in the rest sentences; and calculating to obtain a second similarity according to the first characteristic representation and the third characteristic representation by adopting the similarity calculation model.
And taking a third mean value vector obtained by adding and averaging the word vectors of the target keywords in the rest sentences as a third feature representation of the rest sentences.
The similarity calculation model may be a preconfigured similarity calculation function, such as the Euclidean distance or the cosine angle.
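The feature representations and the similarity calculation above can be sketched as follows, assuming cosine similarity as the preconfigured similarity function; the word vectors are toy 2-dimensional values.

```python
import numpy as np

def mean_vector(word_vectors):
    """Feature representation: the mean of the word vectors."""
    return np.mean(word_vectors, axis=0)

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sentence_vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
text_vecs = [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]

s_repr = mean_vector(sentence_vecs)   # first feature representation (sentence)
d_repr = mean_vector(text_vecs)       # second feature representation (text)
first_similarity = cosine_sim(s_repr, d_repr)
```

The same `mean_vector` applied to a remaining sentence gives the third feature representation used for the second similarity.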
In an optional implementation manner, the calculating the first maximum boundary correlation according to the first similarity and the second similarity includes:
obtaining a first value according to a preset hyper-parameter and the first similarity;
determining a maximum value of the second similarity;
obtaining a second value according to the preset hyper-parameter and the maximum value;
and obtaining a first maximum boundary correlation degree according to the first value and the second value.
The computer device stores a first formula in advance, and the first maximum boundary correlation calculated from the first similarity and the second similarity can be represented by the following first formula:
$$\mathrm{MMR}_i = \lambda\,\mathrm{sim}(D, s_i) - (1 - \lambda)\max_{j \neq i}\mathrm{sim}(s_i, s_j)$$
wherein $\mathrm{MMR}_i$ represents the maximum boundary correlation of the sentence $s_i$; D represents the feature representation of the text (that is, the second mean vector); $s_i$ denotes the feature representation of the sentence (that is, the first mean vector); $s_j$ denotes the feature representation of the remaining sentence (that is, the third mean vector); $\lambda$ is a preset hyper-parameter that is continuously optimized in the subsequent training process; $\mathrm{sim}(D, s_i)$ represents the first similarity between the sentence $s_i$ and the text; and $\mathrm{sim}(s_i, s_j)$ represents the second similarity between the sentence $s_i$ and the sentence $s_j$.
$\mathrm{sim}(D, s_i)$ indicates the relevance of a single sentence to the entire text, and $\mathrm{sim}(s_i, s_j)$ represents the correlation between the sentences $s_i$ and $s_j$ in the text. The larger $\mathrm{sim}(s_i, s_j)$ is, the stronger the correlation between the sentences $s_i$ and $s_j$, indicating greater redundancy between them and a more redundant generated text abstract; the smaller $\mathrm{sim}(s_i, s_j)$ is, the weaker the correlation between the sentences $s_i$ and $s_j$, indicating less redundancy between them and a less redundant generated text abstract.
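The maximum-boundary-correlation calculation above can be sketched as follows, assuming cosine similarity and toy mean vectors; `lam` stands for the preset hyper-parameter.

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr(doc, sentences, i, lam=0.7):
    """lam * sim(doc, s_i) - (1 - lam) * max over j != i of sim(s_i, s_j)."""
    relevance = cosine_sim(doc, sentences[i])
    redundancy = max(cosine_sim(sentences[i], s)
                     for j, s in enumerate(sentences) if j != i)
    return lam * relevance - (1 - lam) * redundancy

doc = np.array([1.0, 1.0])                       # text feature representation
sents = [np.array([1.0, 0.0]),                   # sentence representations
         np.array([0.9, 0.1]),
         np.array([0.0, 1.0])]
scores = [mmr(doc, sents, i) for i in range(len(sents))]
```

Note that sentences 1 and 2 nearly duplicate each other, so the dissimilar sentence 3 gets the highest score: redundancy is penalized exactly as described above.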
In an optional embodiment, the calculating the sentence weight of the sentence in the text according to the sentence vector comprises:
calculating a sum of the second maximum boundary correlations;
calculating a ratio of the first maximum boundary correlation to the sum;
and mapping the ratio by using a preset function to obtain the statement weight of the statement.
The computer device stores a second formula in advance, and the sentence weight calculated from the first maximum boundary correlation and the second maximum boundary correlations can be represented by the following second formula:
$$\beta_i = \mathrm{softmax}\!\left(\frac{\mathrm{MMR}_i}{\sum_{j \neq i}\mathrm{MMR}_j}\right)$$
wherein $\mathrm{MMR}_i$ represents the maximum boundary correlation of the sentence $s_i$, $\mathrm{MMR}_j$ represents the maximum boundary correlation of the sentence $s_j$, and $\beta_i$ represents the sentence weight of the sentence $s_i$.
After calculating the maximum boundary correlation degree of each sentence in the text, the computer device calculates the sum of the maximum boundary correlation degrees of other sentences except the any sentence in the text for any sentence, calculates the ratio of the maximum boundary correlation degree of the any sentence to the sum, and finally performs softmax function processing to obtain the sentence weight (or called probability value) of the any sentence. The larger the ratio, the greater the relevance of the corresponding sentence to the text, and the higher the redundancy. The smaller the ratio, the weaker the correlation of the corresponding sentence with the text, and the lower the redundancy.
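The sentence-weight calculation above can be sketched as follows, assuming each sentence's weight is the softmax of the ratio between its maximum boundary correlation and the sum of the other sentences' maximum boundary correlations; the MMR scores are toy values.

```python
import numpy as np

def sentence_weights(mmr_scores):
    """softmax over the per-sentence ratios MMR_i / sum_{j != i} MMR_j."""
    mmr_scores = np.asarray(mmr_scores, dtype=float)
    total = mmr_scores.sum()
    ratios = mmr_scores / (total - mmr_scores)
    e = np.exp(ratios - ratios.max())
    return e / e.sum()

weights = sentence_weights([0.20, 0.25, 0.46])
```

The weights sum to 1, and the sentence with the highest MMR score (most relevant, least redundant) receives the largest weight.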
The vector encoding module 404 is configured to encode the segment vector to obtain a hidden state of the text segment.
The segment vectors may be encoded using an agent in an encoder module of the generative model to obtain the hidden states of the text segments.
In this embodiment, the network architecture of the generative model is shown in FIG. 2. The generative model includes an Encoder module (the Encoder part in FIG. 2) located at the bottom layer and a Decoder module (the Decoder part in FIG. 2) located at the upper layer. The Encoder module includes a plurality of mutually independent agents (Agent1, Agent2, ..., AgentM in FIG. 2), each of which contains a bidirectional LSTM (BiLSTM) model. The Decoder module includes a word attention mechanism layer (the extraction-model Attention in FIG. 2), an agent attention mechanism layer (the Agent Attention in FIG. 2), and an LSTM model.
The computer device configures the number of agents in the encoder module according to the number of text segments; that is, for m text segments, m agents are correspondingly configured in the encoder module, so that the text segments correspond to the agents one to one. The computer device inputs each text segment into one agent in the encoder module, encodes the text segment through that agent to obtain an intermediate vector (that is, a hidden state), and transmits the intermediate vector to the other agents, so that different agents can share global context information about different contents of the text. Each agent outputs an intermediate vector, and all intermediate vectors form an intermediate vector sequence. Finally, the decoder module decodes the intermediate vectors output by all the agents, that is, decodes the intermediate vector sequence, to obtain the probability distribution of the text abstract and thus the final text abstract. Referring to FIG. 2, the computer device inputs the text segment $x_1$ into Agent1 of the encoder module, and Agent1 encodes $x_1$ to obtain the hidden state of $x_1$; inputs the text segment $x_2$ into Agent2, and Agent2 encodes $x_2$ to obtain the hidden state of $x_2$; and inputs the text segment $x_m$ into AgentM, and AgentM encodes $x_m$ to obtain the hidden state of $x_m$.
In an alternative embodiment, as shown in fig. 3, the agent includes a two-stage bidirectional LSTM model (a first-stage bidirectional LSTM model and a second-stage bidirectional LSTM model), and the encoding of the segment vector by the vector encoding module 404 to obtain the hidden state of the text segment includes:
coding the segment vector through the first-stage bidirectional LSTM model to obtain the hidden states of the target keywords in the text segment;
and coding the hidden states of the target keywords through the second-stage bidirectional LSTM model to obtain the hidden state of the text segment.
The BiLSTM in each agent is a two-stage bidirectional LSTM model. The first stage is a single-layer bidirectional LSTM (the Local Encoder in FIG. 3); the output of each Local Encoder is fed into the second stage (the Context Encoder in FIG. 3), a multi-layer bidirectional LSTM structure referred to as the context encoder.
The procedure of the first stage is as follows:
The word vectors e_j of the target keywords in the text segment are combined and fed as a whole into the first-stage single-layer bidirectional LSTM for encoding. The coding formulas of the first-stage single-layer bidirectional LSTM are as follows:

intermediate hidden states: hf_j = LSTM(hf_{j-1}, e_j) (forward), hb_j = LSTM(hb_{j+1}, e_j) (backward)

hidden state of the output: h_j = [hf_j ; hb_j]

where h_j is the hidden state of target keyword w_j, H is the dimension of h_j, and the LSTM parameters are continuously optimized during training; hf_j denotes the intermediate hidden state obtained by encoding target keyword w_j with the forward LSTM, and hb_j the intermediate hidden state carried back from the following target keyword w_{j+1} by the backward LSTM.

As these formulas show, the hidden state corresponding to each target keyword w_j in the text segment is derived from the intermediate hidden states of the previous and following target keywords and from the word vector representation of the target keyword itself.
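A minimal numeric sketch of this first-stage recurrence, with a plain tanh RNN cell standing in for the LSTM cell and arbitrarily named weight matrices (both simplifications made for illustration only):

```python
import numpy as np

def bidirectional_encode(E, Wf, Wb, Uf, Ub):
    """Simplified bidirectional recurrence: the forward state at position j
    depends on the state at j-1 and the word vector e_j; the backward state
    at j depends on the state at j+1 and e_j.  A tanh cell replaces the LSTM."""
    J, H = E.shape[0], Wf.shape[0]
    fwd = np.zeros((J, H))
    bwd = np.zeros((J, H))
    prev = np.zeros(H)
    for j in range(J):                      # forward pass, left to right
        prev = np.tanh(Wf @ prev + Uf @ E[j])
        fwd[j] = prev
    nxt = np.zeros(H)
    for j in reversed(range(J)):            # backward pass, right to left
        nxt = np.tanh(Wb @ nxt + Ub @ E[j])
        bwd[j] = nxt
    # The output hidden state h_j concatenates both directions, so it reflects
    # the previous keyword, the following keyword, and e_j itself.
    return np.concatenate([fwd, bwd], axis=1)
```

Each row of the returned array is one keyword's hidden state h_j of dimension 2H, as in the concatenated output formula above.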
The process of the second stage is as follows:

hf_j^(k) = LSTM(hf_{j-1}^(k), [h_j^(k-1) ; z]) (forward), hb_j^(k) = LSTM(hb_{j+1}^(k), [h_j^(k-1) ; z]) (backward)

Each agent, at layer k of the context encoder, jointly encodes the information received from the layer above. As these formulas show, the input to the layer-k context encoder comprises three parts: the hidden states of the contextual target keywords (hf_{j-1}^(k) and hb_{j+1}^(k)); the hidden state output by the previous layer (h_j^(k-1)); and the message z from the other agents, where z is the average of the other agents' layer-(k-1) outputs:

z = (1 / (M - 1)) Σ_{a'≠a} s_{a'}

and s_{a'} is the hidden state vector that agent a' generates at layer (k-1) for its last target keyword. In this way, each target keyword in the text segment not only learns its own contextual target-keyword information but also receives overall information from the other text segments.
The encoder module outputs the hidden state vector sequence of the target keywords of the text segment, one hidden state vector per target keyword.
The attention calculating module 405 is configured to obtain the attention weight of the target keyword in the text segment according to the hidden state of the text segment and the hidden state vector at each time.
In an alternative embodiment, the attention extracting mechanism layer in the decoder module of the generative model may be used to calculate the attention weight of the target keyword in the text segment according to the hidden state of the text segment and the hidden state vector at each time.
The probability distribution module 406 is configured to calculate the vocabulary probability distribution at time t according to the attention weights of the target keywords and the corresponding agent weights.
The vocabulary probability distribution at time t can be calculated by an agent attention mechanism layer in the decoder module according to the attention weights of the target keywords and the corresponding agent weights.
The vocabulary probability distribution represents the target keyword appearing at the k-th position of the text abstract, and the agent weight is the sum of the sentence weights of the sentences in the text segment.
The decoder module has a unidirectional LSTM structure, so that the last target keyword of each text segment can contain the state information of all previous target keywords. The initial input state of the decoder module is the hidden state vector output by the last layer of the encoder module for the last target keyword across all agents; this hidden state vector contains all the text information encoded by the encoder module.
In a specific implementation, the hidden state vector sequence generated by the last hidden layer of each agent is used to calculate the attention weight α_j of each target keyword in the text segment by the following formula:

α_j = softmax( v^T tanh( W_h h_j + W_s s_t + b ) )

where v, W_h, W_s and b are all hyper-parameters and s_t represents the state vector at time t. For each agent a in the agent attention mechanism layer, the attention of the agent is represented by the weights of the target keywords in its text segment:

c_t^a = Σ_j α_j^a h_j^a

The sentence weights (i.e., the sentence prediction probabilities) of the sentences in the text segment are summed to serve as the agent weight β_a. The attention distributions of all agents are then weighted and summed with β_a to obtain the attention weight of the t-th word of the abstract:

c_t = Σ_a β_a c_t^a

Putting the hidden state vector of each moment together with c_t into a multilayer perceptron yields the vocabulary probability distribution at time t, which represents the target keyword appearing at the k-th position of the text abstract.
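A compact numeric sketch of this hierarchical attention. Two simplifications are assumed for brevity: a dot-product score stands in for the v^T tanh(W_h h_j + W_s s_t + b) form, and a single linear layer plus softmax stands in for the multilayer perceptron.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def vocab_distribution(H_list, s_t, beta, W_out):
    """Word attention inside each segment, agent weights beta (sums of
    sentence weights) across segments, then a linear layer + softmax
    standing in for the MLP over the vocabulary."""
    contexts = []
    for H_a in H_list:                  # H_a: (J, d) keyword hidden states
        scores = H_a @ s_t              # simplified attention score per keyword
        alpha = softmax(scores)         # word attention within this agent
        contexts.append(alpha @ H_a)    # per-agent context vector
    beta = np.asarray(beta, dtype=float)
    beta = beta / beta.sum()            # normalise the agent weights
    c_t = sum(b * c for b, c in zip(beta, contexts))
    return softmax(W_out @ np.concatenate([s_t, c_t]))
```

With a zero output matrix the logits are all equal, so the distribution is uniform over the toy vocabulary, which makes the shape of the computation easy to verify.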
Each time step t of the LSTM corresponds to the position of the t-th target keyword of the generated text abstract. The hidden state generated at the current time predicts the probability distribution of the new target keyword at the t-th position in the abstract. In this process, a hierarchical attention mechanism is introduced inside the LSTM structure, comprising the attention weights of the target keywords (word attention) and the attention weights of the agents (agent attention).
The obtained probability distribution of the text abstract is, at each moment t, a distribution over the dictionary, where V is the number of target keywords in the course corpus thesaurus, each element represents the probability of the corresponding word in the dictionary appearing at moment t, and each element lies between 0 and 1.
In this alternative embodiment, the weight calculated by the generative model takes into account not only the similarity between a sentence and the text but also the redundancy of the abstract sentences. This weight is combined with the text-segment weights in the decoder module of the generative model, so that, compared with a common decoder, the finally calculated probability also encodes redundancy information. That is, a text segment composed of sentences with highly repetitive meaning receives a relatively low weight, its information carries less importance in the decoder module, and target keywords with repeated meanings are less likely to appear in the final abstract; the generated text abstract is therefore more accurate, and its meaning is clear and concise.
The abstract generating module 407 is configured to generate a text abstract according to the vocabulary probability distribution.
Since the probability distribution represents the probability of each target keyword in the text appearing in the text abstract, the higher a keyword's probability, the more likely it is to appear in the text abstract, and the lower its probability, the less likely it is to appear. The computer device may determine the target keywords appearing in the text abstract according to the probabilities in the probability distribution, thereby generating the text abstract from those target keywords.
In an alternative embodiment, the digest generation module 407 generating the text digest according to the vocabulary probability distribution includes:
for any moment, acquiring the maximum probability of the probability distribution at the any moment;
determining the target keywords corresponding to the maximum probability as the target keywords in the text abstract;
and combining the target keywords in the text abstract according to a time sequence to obtain the text abstract.
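The three steps above amount to a greedy argmax decode over the per-moment distributions; a minimal sketch (names invented for illustration):

```python
def greedy_summary(prob_seq, vocabulary):
    """At each moment t, pick the target keyword with the maximum probability
    and join the picks in time order to form the text abstract."""
    summary = []
    for probs in prob_seq:                 # one distribution per time step
        k = max(range(len(probs)), key=lambda i: probs[i])
        summary.append(vocabulary[k])
    return summary
```

For example, two time steps over a three-word vocabulary pick the argmax word of each step, in order.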
Illustratively, for a certain time t, the target keyword corresponding to the maximum probability at time t is selected as the t-th target keyword of the text abstract to be generated, and the text abstract is finally generated from these target keywords: the abstract is the sequence of keywords y_1, …, y_n, where y_t is the target keyword corresponding to the maximum probability at time t and n is the number of target keywords contained in the text abstract.
After the computer device generates the text abstract, the generated text abstract can be displayed, so that a learner can review the training course according to the abstract after class and select key courses for study according to the target keywords in the abstract.
In summary, for a text requiring abstract generation, the text abstract generating device provided by the present invention performs word segmentation on the text to obtain target keywords and extracts the word vector of each target keyword. From the word vectors it generates a sentence vector for each sentence in the text and a segment vector for each text segment, and obtains from the sentence vectors the weight of each sentence within the whole text. The trained LSTM model then encodes the segment vectors to obtain the hidden states of the text segments; the attention weight of each target keyword in a text segment is obtained from the hidden state of the text segment and the hidden state vector at each time; and the vocabulary probability distribution at time t is calculated from the attention weights of the target keywords and the corresponding agent weights, where the vocabulary probability distribution represents the target keyword appearing at the k-th position and the agent weight is the sum of the sentence weights. Finally, the text abstract is generated according to the vocabulary probability distribution. The generated text abstract considers not only the similarity between sentences and the text but also the redundancy of abstract sentences: a text segment composed of sentences with highly repetitive meaning receives a relatively low weight, its information carries less importance in the decoder module, and target keywords with repeated meanings are less likely to appear in the final abstract, so the generated text abstract is more accurate, and its meaning is clear and concise.
It is emphasized that the text may be stored in a node of the blockchain in order to further ensure privacy and security of the text.
Fig. 5 is a schematic structural diagram of a computer device according to a third embodiment of the present invention. In the preferred embodiment of the present invention, the computer device 5 comprises a memory 51, at least one processor 51, at least one communication bus 53 and a transceiver 54.
It will be appreciated by those skilled in the art that the configuration of the computer device shown in fig. 5 is not limiting to the embodiments of the present invention, and may be a bus-type configuration or a star-type configuration, and that the computer device 5 may include more or less hardware or software than those shown, or a different arrangement of components.
In some embodiments, the computer device 5 is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The computer device 5 may also include a client device, which includes, but is not limited to, any electronic product capable of interacting with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, etc.
It should be noted that the computer device 5 is only an example, and other electronic products that are currently available or may come into existence in the future, such as electronic products that can be adapted to the present invention, should also be included in the scope of the present invention, and are incorporated herein by reference.
In some embodiments, the memory 51 has stored therein a computer program which, when executed by the at least one processor 51, implements all or part of the steps of the text summary generation method described above. The memory 51 includes a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-Time Programmable Read-Only Memory (OTPROM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disk memory, magnetic disk memory, tape memory, or any other computer-readable medium capable of carrying or storing data.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
In some embodiments, the at least one processor 51 is a Control Unit (Control Unit) of the computer device 5, and is connected to various components of the whole computer device 5 by various interfaces and lines, and executes various functions and processes data of the computer device 5 by running or executing programs or modules stored in the memory 51 and calling data stored in the memory 51. For example, when the at least one processor 51 executes the computer program stored in the memory, all or part of the steps of the text summary generation method according to the embodiment of the present invention are implemented; or to implement all or part of the functions of the text digest generation apparatus. The at least one processor 51 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same function or different functions, and includes one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips.
In some embodiments, the at least one communication bus 53 is arranged to enable connection communication between the memory 51 and the at least one processor 51, etc.
Although not shown, the computer device 5 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 51 through a power management device, so as to implement functions of managing charging, discharging, and power consumption through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The computer device 5 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or that the singular does not exclude the plural. A plurality of units or means recited in the present invention can also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A text summary generation method, the method comprising:
performing word segmentation processing on a text to obtain a target keyword, and acquiring a word vector of the target keyword;
generating statement vectors of statements in the text according to the word vectors, and generating segment vectors of text segments in the text according to the word vectors;
calculating statement weight of the statement in the text according to the statement vector;
coding the segment vector to obtain the hidden state of the text segment;
obtaining attention weight of a target keyword in the text segment according to the hidden state of the text segment and the hidden state vector at each moment;
calculating according to the attention weight of the target keyword and a corresponding proxy weight to obtain vocabulary probability distribution at the time t, wherein the vocabulary probability distribution represents the target keyword appearing at the kth position of the text abstract, and the proxy weight is the sum of statement weights of statements in the text segment;
and generating a text abstract according to the vocabulary probability distribution.
2. The method of generating a text summary according to claim 1, wherein the calculating sentence weights of the sentences in the text according to the sentence vectors comprises:
acquiring a first maximum boundary correlation degree of the statement;
acquiring a second maximum boundary correlation degree of other sentences except the sentence in the text;
and obtaining the sentence weight of the sentence according to the first maximum boundary correlation degree and the second maximum boundary correlation degree.
3. The text summary generation method of claim 2, wherein the calculation of the first maximum boundary relevance includes:
calculating a first similarity between the sentence and the text according to the sentence vector of the sentence;
calculating a second similarity between the statement and the rest statements according to the statement vectors of the statement and the statement vectors of the rest statements;
and calculating to obtain the first maximum boundary correlation degree according to the first similarity degree and the second similarity degree.
4. The text summary generation method of claim 3, wherein calculating a first similarity of the sentence to the text from the sentence vector of the sentence comprises: calculating a first feature representation of the sentence according to the word vector in the sentence; calculating a second feature representation of the text according to the word vectors in the text; calculating to obtain a first similarity according to the first feature representation and the second feature representation by adopting a similarity calculation model;
the calculating the second similarity between the sentence and the rest of the sentences according to the sentence vectors of the sentences and the sentence vectors of the rest of the sentences comprises: calculating third feature representations of the rest sentences according to the word vectors in the rest sentences; and calculating to obtain a second similarity according to the first characteristic representation and the third characteristic representation by adopting the similarity calculation model.
5. The method of generating a text summary according to claim 4, wherein the calculating the first maximum boundary relevance according to the first similarity and the second similarity comprises:
obtaining a first value according to a preset hyper-parameter and the first similarity;
determining a maximum value of the second similarity;
obtaining a second value according to the preset hyper-parameter and the maximum value;
and obtaining a first maximum boundary correlation degree according to the first value and the second value.
6. The method of generating a text summary according to claim 4, wherein the calculating sentence weights of the sentences in the text according to the sentence vectors comprises:
calculating a sum of the second maximum boundary correlations;
calculating a ratio of the first maximum boundary correlation to the sum;
and mapping the ratio by using a preset function to obtain the statement weight of the statement.
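Claims 2 through 6 describe a maximal-marginal-relevance style sentence weight: the similarity of a sentence to the whole text, discounted by its maximum similarity to the other sentences, then normalised against the sum. A sketch under assumptions not stated in the claims (mean-of-word-vector sentence features, cosine similarity, a hyper-parameter `lam`, and `exp` as the preset mapping function):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def sentence_weights(sent_vecs, lam=0.7):
    """MMR-style sentence weights: each sentence's relevance to the whole
    text minus its maximum similarity to any other sentence, mapped with
    exp and normalised.  lam and the exp mapping are assumptions."""
    text_vec = np.mean(sent_vecs, axis=0)       # whole-text feature
    mmr = []
    for i, v in enumerate(sent_vecs):
        sim_text = cosine(v, text_vec)          # first similarity
        sim_others = max(cosine(v, u)           # second similarities
                         for j, u in enumerate(sent_vecs) if j != i)
        mmr.append(lam * sim_text - (1 - lam) * sim_others)
    w = np.exp(np.array(mmr))                   # preset mapping function
    return w / w.sum()                          # ratio against the sum
```

With two near-duplicate sentences and one distinct sentence, the distinct sentence's weight comes out highest when the redundancy penalty is strong enough, matching the redundancy argument in the description.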
7. The method according to any one of claims 1 to 6, wherein said encoding the segment vector to obtain the hidden state of the text segment comprises:
coding the segment vector through a first-stage bidirectional LSTM model to obtain a hidden state of a target keyword in the text segment;
and coding the hidden state of the target keyword through a second-stage bidirectional LSTM model to obtain the hidden state of the text segment.
8. The method of generating a text excerpt as claimed in claim 7, wherein said generating a text excerpt from said lexical probability distribution comprises:
for any moment, acquiring the maximum probability of the probability distribution at the any moment;
determining the target keywords corresponding to the maximum probability as the target keywords in the text abstract;
and combining the target keywords in the text abstract according to a time sequence to obtain the text abstract.
9. A computer device, characterized in that the computer device comprises a processor for implementing the text summary generation method according to any one of claims 1 to 8 when executing a computer program stored in a memory.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a text summary generation method according to any one of claims 1 to 8.
CN202110489771.1A 2021-05-06 2021-05-06 Text abstract generation method, computer equipment and storage medium Active CN112906385B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110489771.1A CN112906385B (en) 2021-05-06 2021-05-06 Text abstract generation method, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112906385A true CN112906385A (en) 2021-06-04
CN112906385B CN112906385B (en) 2021-08-13



Also Published As

Publication number Publication date
CN112906385B (en) 2021-08-13

Similar Documents

Publication Publication Date Title
CN112906385B (en) Text abstract generation method, computer equipment and storage medium
WO2022078346A1 (en) Text intent recognition method and apparatus, electronic device, and storage medium
US20210232773A1 (en) Unified Vision and Dialogue Transformer with BERT
CN109670029A (en) Method, apparatus, computer equipment and storage medium for determining answers to questions
CN113052149B (en) Video abstract generation method and device, computer equipment and medium
CN112231485B (en) Text recommendation method and device, computer equipment and storage medium
CN112000805A (en) Text matching method, device, terminal and storage medium based on pre-training model
CN112417128B (en) Method and device for recommending talk scripts, computer equipment and storage medium
CN114007131A (en) Video monitoring method and device and related equipment
CN114021582B (en) Spoken language understanding method, device, equipment and storage medium combined with voice information
CN113435582B (en) Text processing method and related equipment based on sentence vector pre-training model
CN113096242A (en) Virtual anchor generation method and device, electronic equipment and storage medium
CN113722483A (en) Topic classification method, device, equipment and storage medium
CN113642316A (en) Chinese text error correction method and device, electronic equipment and storage medium
CN111611805A (en) Auxiliary writing method, device, medium and equipment based on image
CN113704410A (en) Emotion fluctuation detection method and device, electronic equipment and storage medium
CN112257860A (en) Model generation based on model compression
CN114077841A (en) Semantic extraction method and device based on artificial intelligence, electronic equipment and medium
CN113807973A (en) Text error correction method and device, electronic equipment and computer readable storage medium
CN113486659B (en) Text matching method, device, computer equipment and storage medium
CN113254814A (en) Network course video labeling method and device, electronic equipment and medium
CN114398902A (en) Chinese semantic extraction method based on artificial intelligence and related equipment
CN114417891B (en) Reply statement determination method and device based on rough semantics and electronic equipment
CN115510188A (en) Text keyword association method, device, equipment and storage medium
CN115658858A (en) Dialog recommendation method based on artificial intelligence and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant