WO2021159803A1 - Text summary generation method and apparatus, computer device, and readable storage medium (文本摘要生成方法、装置、计算机设备及可读存储介质) - Google Patents

Text summary generation method and apparatus, computer device, and readable storage medium

Info

Publication number
WO2021159803A1
WO2021159803A1 PCT/CN2020/131775 CN2020131775W WO2021159803A1 WO 2021159803 A1 WO2021159803 A1 WO 2021159803A1 CN 2020131775 W CN2020131775 W CN 2020131775W WO 2021159803 A1 WO2021159803 A1 WO 2021159803A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
sentence
processed
vector
word
Prior art date
Application number
PCT/CN2020/131775
Other languages
English (en)
French (fr)
Inventor
回艳菲
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Publication of WO2021159803A1 publication Critical patent/WO2021159803A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Definitions

  • This application relates to the field of artificial intelligence, and more particularly to natural language processing; it specifically relates to a text summary generation method and apparatus, a computer device, and a computer-readable storage medium.
  • Text summarization is an important research topic in natural language processing. Depending on how it is implemented, it is divided into extractive and abstractive (generative) summarization.
  • Extractive summarization is relatively simple and widely used: its principle is mainly to select important sentences or paragraphs from the text, splice them together in a certain way, and output the result.
  • Abstractive summarization re-expresses the core content and concepts of the original text in a different form, and the generated summary does not need to reuse the wording of the original text.
  • Early methods relied on graphs and hand-crafted feature engineering, or computed similarities between sentences to select the most heavily weighted sentences and splice them together according to a specific rule; current work focuses mainly on data-driven neural networks that generate summaries through encoder-decoder architectures.
  • The inventor realized that traditional methods do not make full use of the text data, which leads to unreasonable information extraction and therefore low accuracy of the extracted summary content.
  • This application provides a text summary generation method and apparatus, a computer device, and a computer-readable storage medium, which can address the technical problem of low accuracy of summary extraction in the traditional technology.
  • In a first aspect, the present application provides a text summary generation method, the method including: obtaining a text to be processed, and obtaining, based on the text to be processed, a text vector corresponding to the text to be processed; inputting the text vector into a preset Transformer model for processing to obtain a first output vector corresponding to the text to be processed; inputting the first output vector into a preset Seq2Seq model for processing to obtain a second output vector corresponding to the text to be processed; and generating, according to the second output vector, a text summary corresponding to the text to be processed.
  • In a second aspect, the present application also provides a text summary generating apparatus, including: an acquiring unit configured to obtain a text to be processed and obtain, based on the text to be processed, a text vector corresponding to the text to be processed; a first input unit configured to input the text vector into a preset Transformer model for processing to obtain a first output vector corresponding to the text to be processed; a second input unit configured to input the first output vector into a preset Seq2Seq model for processing to obtain a second output vector corresponding to the text to be processed; and a generating unit configured to generate, according to the second output vector, a text summary corresponding to the text to be processed.
  • In a third aspect, the present application also provides a computer device, which includes a memory and a processor, with a computer program stored on the memory. When the processor executes the computer program, the following steps are implemented: obtaining a text to be processed, and obtaining, based on the text to be processed, a text vector corresponding to the text to be processed; inputting the text vector into a preset Transformer model for processing to obtain a first output vector corresponding to the text to be processed; inputting the first output vector into a preset Seq2Seq model for processing to obtain a second output vector corresponding to the text to be processed; and generating, according to the second output vector, a text summary corresponding to the text to be processed.
  • In a fourth aspect, the present application also provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the processor performs the following steps: obtaining a text to be processed, and obtaining, based on the text to be processed, a text vector corresponding to the text to be processed; inputting the text vector into a preset Transformer model for processing to obtain a first output vector corresponding to the text to be processed; inputting the first output vector into a preset Seq2Seq model for processing to obtain a second output vector corresponding to the text to be processed; and generating, according to the second output vector, a text summary corresponding to the text to be processed.
  • This application provides a text summary generation method and apparatus, a computer device, and a computer-readable storage medium. A text to be processed is obtained, the text vector corresponding to it is obtained, the text vector is input into a preset Transformer model for processing to obtain a first output vector corresponding to the text to be processed, the first output vector is input into a preset Seq2Seq model for processing to obtain a second output vector corresponding to the text to be processed, and a text summary corresponding to the text to be processed is generated according to the second output vector.
  • Because the Transformer uses a multi-head attention mechanism that compensates for the shortcomings of Seq2Seq, the Transformer and Seq2Seq models complement each other fully, yielding a richer encoding vector representation. The text to be processed can thus be distilled into a text summary with coherent content and fluent sentences, which improves the accuracy of text summary generation.
  • FIG. 1 is a schematic flowchart of a method for generating a text abstract provided by an embodiment of the application
  • FIG. 2 is a schematic diagram of a model process in a method for generating a text abstract provided by an embodiment of the application;
  • FIG. 3 is a schematic diagram of a sub-process of the method for generating a text abstract provided by an embodiment of the application;
  • FIG. 4 is a schematic block diagram of a text summary generating apparatus provided by an embodiment of the application.
  • Fig. 5 is a schematic block diagram of a computer device provided by an embodiment of the application.
  • FIG. 1 is a schematic flowchart of a method for generating a text summary according to an embodiment of the application. As shown in Figure 1, the method includes the following steps S101-S104:
  • S101: The to-be-processed text is obtained, and a text vector corresponding to the to-be-processed text is generated according to the to-be-processed text.
  • Specifically, the to-be-processed text may first be segmented to obtain the words it contains, and word embedding is then performed on the words to obtain the word vectors corresponding to the to-be-processed text.
  • Word embedding (WordEmbedding) refers to converting a word into a vector representation, i.e. a numerical representation of words; in general, a word is mapped to a high-dimensional vector (the word vector) that represents it.
  • Word embedding is sometimes called "word2vec", and it can be implemented with an embedding layer (EmbeddingLayer), Word2Vec, or GloVe.
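  • As a rough illustration of the word embedding step just described (not part of the original disclosure), word vectors could be trained with an off-the-shelf tool such as gensim's Word2Vec; the toy corpus, vector size, and other hyperparameters below are placeholder assumptions.

```python
# Minimal sketch: train Word2Vec word vectors on a toy, pre-segmented corpus.
# Corpus and hyperparameters are illustrative placeholders, not values from the application.
from gensim.models import Word2Vec

segmented_corpus = [
    ["文本", "摘要", "生成", "方法"],                       # each inner list is one segmented sentence
    ["基于", "Transformer", "和", "Seq2Seq", "的", "编码"],
]

model = Word2Vec(sentences=segmented_corpus, vector_size=100, window=5, min_count=1, sg=1)
word_vector = model.wv["摘要"]    # 100-dimensional vector representing the word "摘要"
print(word_vector.shape)          # (100,)
```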
  • The text to be processed can then be split into its sentences according to the punctuation marks it contains, and the words of each sentence can be obtained through a preset word segmentation method.
  • Since the word vector corresponding to each word has already been obtained, the word vectors of all the words composing a sentence are available; sentence embedding is then applied, based on the word vectors of the words composing the sentence, to obtain the sentence vector corresponding to the sentence.
  • Sentence embedding refers to converting a sentence into a sentence vector representation.
  • A sentence vector is a numerical representation of a sentence; it can be computed as the average of word vectors, a TF-IDF weighted average of word vectors, a SIF weighted average of word vectors, or by other methods.
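  • The simplest of the sentence-embedding options listed above, the plain average of word vectors, could look like the following sketch (illustrative only; the word-vector dimension of 100 is an assumption).

```python
# Sketch: a sentence vector as the average of the word vectors of the words in the sentence.
import numpy as np

def sentence_vector(words, word_vectors, dim=100):
    """Average the word vectors of `words`; words without a vector are skipped."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```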
  • Further, the text is obtained and split into initial sentences, which form an initial sentence set; the initial sentences are then extracted to filter out the target sentences that compose the text summary.
  • Initial target sentences related to the generated summary can first be selected from the initial sentences to form an initial target sentence set, and extraction is then performed on the initial target sentences rather than on all initial sentences; because fewer sentences have to be processed, the efficiency of summary generation is improved. For example, for each text D_j that initially contains n sentences, m of them can be extracted to form the initial target sentence set, and sentences are then extracted from this set to generate the text summary, for example as shown in formula (1).
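  • The published formula (1) is only available as an image; as an illustrative formalization of the selection step described above (not a reproduction of the original formula), the step can be written in set notation as follows.

```latex
% Illustrative set notation for selecting m of the n sentences of text D_j
% as the initial target sentence set (not the published formula (1) itself).
D_j^{s} = \{\, s_j^{1}, s_j^{2}, \ldots, s_j^{m} \,\} \subseteq D_j, \qquad m \le n
```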
  • FIG. 2 is a schematic diagram of a model process in a method for generating a text summary provided by an embodiment of the application.
  • In step S102, as shown in FIG. 2, the sentence vectors S1, S2, S3, ..., Sn that make up the text vector are input into the Transformer and encoded by its Transformer Encoder layers to obtain the first output vector corresponding to the text to be processed.
  • In this way, the multi-head attention mechanism of the Transformer is fully used to compensate for the shortcomings of the Seq2Seq model, so that the vector representation obtained by encoding the text to be processed is richer and its accuracy is improved.
  • The Transformer is composed of two parts, an Encoder and a Decoder, each containing 6 blocks.
  • In the Transformer, the input representation x of a word is obtained by adding its word embedding and its position embedding.
  • The word embedding can be obtained in many ways, for example by pre-training with algorithms such as Word2Vec or GloVe, or by training it inside the Transformer.
  • The position embedding is denoted PE and has the same dimension as the word embedding; PE can be learned during training or computed with fixed formulas.
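  • A minimal PyTorch sketch of this first encoding pass (an illustration, not the application's implementation): sentence vectors plus sinusoidal position embeddings pass through a stack of 6 Transformer encoder blocks; the model dimension, number of heads, and batch size are placeholder assumptions.

```python
# Sketch of the first encoding: sentence vectors + position embeddings -> 6 encoder blocks.
import math
import torch
import torch.nn as nn

def sinusoidal_pe(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pe = torch.zeros(seq_len, d_model)
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

d_model, n_sentences = 128, 10                                   # illustrative sizes
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)     # 6 blocks, as in the description

sentence_vectors = torch.randn(1, n_sentences, d_model)          # S1..Sn for one text (batch of 1)
x = sentence_vectors + sinusoidal_pe(n_sentences, d_model)       # input = embedding + position embedding
first_output = encoder(x)                                        # "first output vector", shape (1, n, d_model)
```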
  • S103: The first output vector corresponding to the text to be processed, obtained with the preset Transformer model, is input into the preset Seq2Seq model for processing, so as to obtain the second output vector corresponding to the text to be processed.
  • Seq2Seq is a model used when the length of the output is uncertain.
  • Seq2Seq is a kind of Encoder-Decoder structure that uses two RNNs, one as the Encoder and the other as the Decoder.
  • The Encoder compresses the input sequence into a vector of specified length; this vector can be regarded as the semantics of the sequence, and the process is called encoding.
  • The simplest way to obtain the semantic vector is to use the hidden state of the last input directly; a transformation can also be applied to the last hidden state, or to all hidden states of the input sequence, to obtain the semantic vector.
  • The Decoder generates the specified sequence from the semantic vector; this process is called decoding. The simplest way is to feed the semantic vector obtained by the Encoder into the Decoder RNN as its initial state to obtain the output sequence.
  • In the process of generating a text summary in this embodiment, the Transformer encoder performs the first encoding of the input sentences to obtain the first output vector corresponding to the text to be processed, and the first output vector is then input into the preset Seq2Seq model for a second encoding to obtain the second output vector corresponding to the text to be processed.
  • The preset Seq2Seq model can use a single-layer unidirectional GRU-RNN whose input is the output of the Transformer, i.e. the first output vector.
  • The fixed vector produced by the GRU-RNN serves as the initial state of the decoder; at each time step the decoder receives the previously generated token y_{t-1} and the hidden state s_{t-1}, and the output y_t at each time step is the probability produced by a Softmax classifier.
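  • One possible reading of this second encoding and decoding step, sketched in PyTorch (illustrative only): a single-layer unidirectional GRU re-encodes the Transformer output, its final state initializes a GRU decoder, and each decoding step turns y_{t-1} and s_{t-1} into a Softmax probability over the vocabulary. All sizes and the greedy decoding loop below are assumptions.

```python
# Sketch: single-layer unidirectional GRU over the Transformer output ("second encoding"),
# then a GRU decoder whose per-step output goes through a Softmax classifier.
import torch
import torch.nn as nn

d_model, hidden, vocab = 128, 256, 5000                      # illustrative sizes

encoder_gru = nn.GRU(d_model, hidden, num_layers=1, batch_first=True)   # single layer, unidirectional
decoder_cell = nn.GRUCell(d_model, hidden)
out_proj = nn.Linear(hidden, vocab)
embed = nn.Embedding(vocab, d_model)

first_output = torch.randn(1, 10, d_model)                   # output of the Transformer encoder
_, s = encoder_gru(first_output)                             # fixed vector = decoder initial state
s = s.squeeze(0)                                             # shape (1, hidden)

y_prev = torch.zeros(1, dtype=torch.long)                    # previously generated token y_{t-1}
for t in range(20):                                          # generate up to 20 tokens (greedy, illustrative)
    s = decoder_cell(embed(y_prev), s)                       # s_t from y_{t-1} and s_{t-1}
    p_t = torch.softmax(out_proj(s), dim=-1)                 # probability of y_t from the Softmax classifier
    y_prev = p_t.argmax(dim=-1)
```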
  • S104 Generate a text summary corresponding to the to-be-processed text based on the second output vector.
  • Specifically, the second output vector corresponding to the text to be processed is obtained by processing with the preset Seq2Seq model, and the text summary corresponding to the text to be processed is generated on the basis of the second output vector.
  • The second output vector can be multi-classified to obtain its distribution probability, the second output vectors with the highest distribution probability are taken as target vectors, and the sentences corresponding to the target vectors form the summary; in this way, the text to be processed is distilled into a text summary with coherent content and fluent sentences.
  • For example, for a given Chinese text, the embodiment of the present application can process the Chinese text to be processed and generate the corresponding text summary, which improves the accuracy of the summaries generated for Chinese text.
  • In the embodiment of the present application, Transformer and Seq2Seq serve as the extraction model and the summarization model respectively, in a hybrid two-pass encoding scheme: the initial text is first encoded with the Transformer model, the first output vector corresponding to the Transformer's output is then re-encoded by the Seq2Seq model to obtain the second output vector corresponding to the text to be processed, and the text summary corresponding to the text to be processed is generated from the second output vector.
  • FIG. 3 is a schematic diagram of a sub-process of the method for generating a text abstract provided by an embodiment of the application.
  • As shown in FIG. 3, in this embodiment, the step of generating a text summary corresponding to the to-be-processed text based on the second output vector includes: S301: multi-classify the second output vector to obtain the distribution probability corresponding to the second output vector; S302: determine whether the distribution probability is greater than or equal to a preset probability threshold; S303: if the distribution probability is less than the preset probability threshold, do not use the corresponding second output vector as a target vector; S304: if the distribution probability is greater than or equal to the preset probability threshold, use the corresponding second output vector as a target vector; S305: obtain the target sentence corresponding to the target vector; S306: combine the target sentences to generate the text summary corresponding to the text to be processed.
  • Further, the step of multi-classifying the second output vector to obtain the corresponding distribution probability includes: inputting the second output vector into a preset classifier based on the Softmax function, and multi-classifying the second output vector according to the Softmax function to obtain the distribution probability corresponding to the second output vector.
  • Specifically, after the initial text has been processed by the extraction model and the summarization model to obtain the second output vector, the second output vector is multi-classified. The Softmax function can be used for this multi-classification: Softmax is used in classification to implement multi-class prediction, mapping input vectors to real numbers in (0, 1) and normalizing them so that they sum to 1, so the multi-class probabilities also sum to exactly 1. "Softmax" can be split into "soft" and "max"; "max" is the maximum: given two variables a and b, if a > b then the max is a, otherwise b.
  • With a plain max, the output classification result is only a or b, an all-or-nothing result; in practice, however, what is wanted is the probability of each class, so that the higher-scoring class is selected most of the time while the lower-scoring class still has some probability of occasionally being selected. This is the "soft" part: the final output is the probability of each class being selected. Accordingly, the second output vector is multi-classified by the preset classifier based on the Softmax function to obtain the distribution probability corresponding to the second output vector; it is then determined whether the distribution probability is greater than or equal to a preset probability threshold, and if so, the corresponding second output vector is used as a target vector, the target sentence corresponding to the target vector is obtained, and the target sentences are combined to generate the text summary corresponding to the text to be processed.
  • In the extraction model for generating the text summary, the last layer uses Softmax for multi-classification, so that the extraction model can learn the probability distribution over sentences of being included in the summary.
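  • A small sketch of steps S301-S306 (illustrative; the classifier weights, the 0.5 threshold, and the placeholder sentences are assumptions): each sentence's second output vector is scored with a Softmax classifier, sentences whose probability clears the preset threshold become target sentences, and the target sentences are spliced into the summary.

```python
# Sketch of S301-S306: Softmax-classify each sentence vector, keep sentences whose
# "in-summary" probability is at or above the threshold, and join them into the summary.
import torch
import torch.nn as nn

hidden, n_classes = 256, 2                                  # class 1 = in summary, class 0 = excluded
classifier = nn.Linear(hidden, n_classes)

second_output = torch.randn(10, hidden)                     # one vector per sentence (illustrative)
sentences = [f"句子{i}" for i in range(10)]                 # the corresponding sentences (placeholders)

probs = torch.softmax(classifier(second_output), dim=-1)[:, 1]   # distribution probability per sentence
threshold = 0.5                                             # preset probability threshold (assumed value)
target_idx = [i for i, p in enumerate(probs.tolist()) if p >= threshold]
summary = "".join(sentences[i] for i in target_idx)         # combine the target sentences
```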
  • In addition, an F1-based method is used to create extraction labels. Referring to the 1 and 0 labels attached to each sentence vector in FIG. 2, label 1 marks a sentence used to generate the text summary and label 0 marks a sentence excluded from the summary. It is assumed that each reference summary is derived from at least one sentence of the text to be processed, and the goal is to identify the most similar text sentence.
  • The sentence-level similarity score is based on the bigram overlap between sentences. In addition, whenever both words of a bigram in the overlap set are stop words, 1 is subtracted from the similarity score, so that the more important similarities are captured.
  • Because the generated text summary is only a fragment of the text to be processed, most labels are 0 (excluded from the summary), so higher classification accuracy does not necessarily translate into a highly salient summary. The F1 score, a weighted average of precision and recall, is therefore used, and an early-stopping criterion is applied when minimizing the loss: training stops if the F1 score does not increase after a certain number of training epochs.
  • The F1 score (F1Score) is a metric for measuring the accuracy of a binary classification model; it takes both the precision and the recall of the model into account and can be regarded as a harmonic mean of the two, with a maximum value of 1 and a minimum value of 0.
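  • The labeling heuristic described above might be sketched as follows (illustrative; the stop-word list is a placeholder, and the exact scoring used in the application may differ). As a reminder, F1 = 2 * precision * recall / (precision + recall).

```python
# Sketch: score a text sentence against a reference summary sentence by bigram overlap,
# subtracting 1 whenever both words of an overlapping bigram are stop words; the best-matching
# sentence gets extraction label 1 and the others get 0.
STOP_WORDS = {"的", "了", "和", "是"}          # illustrative stop-word list

def bigrams(tokens):
    return set(zip(tokens, tokens[1:]))

def similarity(sent_tokens, summary_tokens):
    overlap = bigrams(sent_tokens) & bigrams(summary_tokens)
    score = len(overlap)
    score -= sum(1 for w1, w2 in overlap if w1 in STOP_WORDS and w2 in STOP_WORDS)
    return score

def extraction_labels(doc_sentences, summary_sentence):
    """Label 1 for the text sentence most similar to the summary sentence, 0 for the rest."""
    scores = [similarity(s, summary_sentence) for s in doc_sentences]
    best = max(range(len(scores)), key=scores.__getitem__)
    return [1 if i == best else 0 for i in range(len(doc_sentences))]
```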
  • In an embodiment, the step of obtaining the text to be processed and obtaining the text vector corresponding to the text to be processed includes: obtaining the text to be processed and segmenting it into words to obtain the words included in the text to be processed; performing word embedding on the words to obtain the corresponding word vectors; cutting the text to be processed according to the punctuation marks it contains to obtain the sentences it contains; performing sentence embedding on the sentences based on the word vectors to obtain the corresponding sentence vectors; and combining all the sentence vectors to obtain the text vector corresponding to the text to be processed.
  • Further, the step of performing sentence embedding on a sentence based on the word vectors to obtain the corresponding sentence vector includes: obtaining the words contained in the sentence, and obtaining the sentence vector corresponding to the sentence according to the word vectors and the words contained in the sentence.
  • Specifically, the to-be-processed text is segmented first to obtain the words it includes. Word segmentation here refers to Chinese word segmentation, which can be performed with dictionary-based segmentation algorithms or with statistical machine learning algorithms; common algorithms include HMM, CRF, SVM, and deep learning methods, and the Stanford and HanLP segmentation tools are based on the CRF algorithm.
  • Word embedding is then performed on the words to obtain the corresponding word vectors; since word embedding maps a word to a numerical vector that describes it, the word vector of each word is obtained. Because punctuation marks are generally used to delimit sentences in the text to be processed, they can also serve to divide it into sentences: the text is cut at punctuation marks to identify and obtain the sentences it contains. Once a sentence is obtained, since a sentence is composed of words, the word segmentation already performed determines which words compose it.
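  • For illustration only (the application mentions dictionary-based and HMM/CRF/SVM approaches and the Stanford and HanLP tools; jieba is used here merely as a readily available segmenter), the punctuation-based sentence cutting and word segmentation could look like this.

```python
# Sketch: cut the text into sentences at sentence-final punctuation, then segment each sentence.
import re
import jieba

text = "文本摘要是自然语言处理领域的一项重要研究。根据实现方式不同，将其分为抽取式和生成式。"

sentences = [s for s in re.split(r"[。！？；]", text) if s]    # cut at punctuation marks
segmented = [jieba.lcut(s) for s in sentences]                 # words contained in each sentence
print(segmented[0])    # e.g. ['文本', '摘要', '是', '自然语言', '处理', '领域', '的', '一项', '重要', '研究']
```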
  • After the words composing a sentence are obtained, sentence embedding is performed on the sentence to obtain the corresponding sentence vector, for example S1, S2, S3, ..., Sn in FIG. 2; the set of sentence vectors S1, S2, S3, ..., Sn is the text vector corresponding to the text to be processed.
  • In an embodiment, the step of obtaining the sentence vector corresponding to the sentence according to the word vectors and the words contained in the sentence includes: obtaining the word vectors corresponding to the words contained in the sentence, and adding up and averaging those word vectors to obtain the sentence vector corresponding to the sentence.
  • Specifically, the word vectors corresponding to the words contained in a sentence are obtained, added together, and averaged to obtain the sentence vector corresponding to the sentence.
  • The input of the Transformer is the representation of the text to be processed, which is composed of a sequence of sentence representations; the representation of each sentence is obtained by averaging the vectors of its constituent words, and combining the vectors corresponding to all the sentences contained in the text yields the text vector.
  • In an embodiment, the step of performing sentence embedding on the sentence based on the word vectors to obtain the corresponding sentence vector includes: determining whether the sentence contains a preset word; if the sentence does not contain a preset word, using the sentence as a target sentence; and performing sentence embedding on the target sentence based on the word vectors to obtain the sentence vector corresponding to the target sentence.
  • Specifically, a thesaurus can be preset in which the words have no relevance to identifying the topic of the text.
  • For example, the thesaurus contains general-purpose words such as "上述" ("the above"), "可以" ("can"), and "参阅" ("see"), and it can be set according to the properties of texts in the specific domain.
  • After the text is cut into sentences, it is determined whether a sentence contains a preset word. If it does, the sentence is assumed to have little influence on the generated text summary; if it does not contain a preset word, the sentence is assumed to have a large influence on the generated summary and is therefore used as a target sentence. Sentence embedding is then performed on the target sentence based on the word vectors to obtain its sentence vector, so that only the target sentences undergo extraction and the text summary is generated from them. Pre-screening the sentences contained in the text in this way narrows the selection range of sentences used to generate the summary and reduces the amount of data to be processed, which improves the efficiency of summary generation; at the same time, because the summary is generated from the target sentences, the accuracy of summary generation is also improved.
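  • A minimal sketch of this pre-screening step (the preset words below are the examples given in the description; everything else is illustrative).

```python
# Sketch: drop sentences containing any preset general-purpose word; the rest are target sentences.
PRESET_WORDS = {"上述", "可以", "参阅"}       # generic words unrelated to the topic (examples from the text)

def target_sentences(segmented_sentences):
    """Keep only sentences that contain none of the preset words."""
    return [s for s in segmented_sentences if not (set(s) & PRESET_WORDS)]
```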
  • FIG. 4 is a schematic block diagram of a text summary generating apparatus provided by an embodiment of the application.
  • Corresponding to the text summary generation method described above, an embodiment of the present application also provides a text summary generating apparatus.
  • The text summary generating apparatus includes units for executing the above text summary generation method, and it may be configured in a computer device.
  • Specifically, the text summary generating apparatus 400 includes an acquiring unit 401, a first input unit 402, a second input unit 403, and a generating unit 404.
  • the obtaining unit 401 is configured to obtain a text to be processed, and obtain a text vector corresponding to the text to be processed based on the text to be processed;
  • the first input unit 402 is configured to input the to-be-processed text vector into a preset Transformer model for processing, so as to obtain the first output vector corresponding to the to-be-processed text;
  • the second input unit 403 is configured to input the first output vector to a preset Seq2Seq model for processing, so as to obtain a second output vector corresponding to the text to be processed;
  • the generating unit 404 is configured to generate a text summary corresponding to the to-be-processed text according to the second output vector.
  • the generating unit 404 includes:
  • the classification subunit is configured to perform multiple classifications on the second output vector to obtain the distribution probability corresponding to the second output vector;
  • the first judging subunit is used to judge whether the distribution probability is greater than or equal to a preset probability threshold
  • a screening subunit configured to use the second output vector corresponding to the distribution probability as a target vector if the distribution probability is greater than or equal to the preset probability threshold;
  • the first obtaining subunit is used to obtain the target sentence corresponding to the target vector
  • a generating subunit is used to combine the target sentences to obtain a text summary corresponding to the text to be processed.
  • In an embodiment, the classification subunit includes: an input subunit configured to input the second output vector into a preset classifier based on the Softmax function; and a second obtaining subunit configured to multi-classify the second output vector according to the Softmax function to obtain the distribution probability corresponding to the second output vector.
  • the acquiring unit 401 includes:
  • the third obtaining subunit is used to obtain the text to be processed, and to segment the text to be processed to obtain the words included in the text to be processed;
  • the word embedding subunit is used to perform word embedding on the word to obtain the word vector corresponding to the word;
  • a cutting subunit configured to cut the to-be-processed text according to the punctuation marks contained in the to-be-processed text to obtain sentences contained in the to-be-processed text;
  • the first sentence embedding subunit is used to embed the sentence based on the word vector to obtain the sentence vector corresponding to the sentence;
  • the combining subunit is used to combine all the sentence vectors to obtain the text vector corresponding to the text to be processed.
  • the first sentence embedding subunit includes:
  • the fourth acquisition subunit is used to acquire the words contained in the sentence
  • the fifth obtaining subunit is used to obtain the sentence vector corresponding to the sentence according to the word vector and the words contained in the sentence.
  • the fifth acquiring subunit includes:
  • the sixth acquiring subunit is used to acquire the word vectors corresponding to the words contained in the sentence;
  • the average subunit is used to add up and average the word vectors corresponding to the words contained in all the sentences to obtain the sentence vectors corresponding to the sentences.
  • the first sentence embedding subunit includes:
  • the second judging subunit is used to judge whether the sentence contains preset words
  • the seventh acquiring subunit is used to use the sentence as a target sentence if the sentence does not contain a preset word
  • the second sentence embedding subunit is used to embed the target sentence in sentence based on the word vector to obtain the sentence vector corresponding to the target sentence.
  • It should be noted that the division and connection of the units in the above text summary generating apparatus are only used for illustration. In other embodiments, the apparatus may be divided into different units as needed, or its units may adopt different connection orders and methods, so as to complete all or part of the functions of the above text summary generating apparatus.
  • the above text summary generating apparatus may be implemented in the form of a computer program, and the computer program may run on the computer device as shown in FIG. 5.
  • FIG. 5 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device 500 may be a computer device such as a desktop computer or a server, or may be a component or component in other devices.
  • the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a storage medium 503 and an internal memory 504.
  • the storage medium 503 may be non-volatile or volatile.
  • the storage medium 503 can store an operating system 5031 and a computer program 5032.
  • When the computer program 5032 is executed, the processor 502 can be caused to execute the above-mentioned method for generating a text summary.
  • the processor 502 is used to provide calculation and control capabilities to support the operation of the entire computer device 500.
  • The internal memory 504 provides an environment for running the computer program 5032 stored in the storage medium 503; when the computer program 5032 is executed by the processor 502, the processor 502 can be caused to execute the above-mentioned text summary generation method.
  • the network interface 505 is used for network communication with other devices.
  • the specific computer device 500 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.
  • the computer device may only include a memory and a processor. In such an embodiment, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 5, and will not be repeated here.
  • the processor 502 is configured to run a computer program 5032 stored in a memory, so as to implement the text abstract generation method described in the embodiment of the present application.
  • It should be understood that, in the embodiments of the present application, the processor 502 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium, or may be a volatile computer-readable storage medium.
  • the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor executes the steps of the text summary generating method described in the above embodiments.
  • The storage medium is a physical, non-transitory storage medium, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, or any other physical medium that can store a computer program.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text summary generation method and apparatus, a computer device, and a computer-readable storage medium, belonging to the technical field of natural language processing. The method obtains a text to be processed and obtains, based on the text to be processed, a text vector corresponding to the text to be processed (S101); inputs the text vector into a preset Transformer model for processing to obtain a first output vector corresponding to the text to be processed (S102); inputs the first output vector into a preset Seq2Seq model for processing to obtain a second output vector corresponding to the text to be processed (S103); and generates, according to the second output vector, a text summary corresponding to the text to be processed (S104).

Description

Text summary generation method and apparatus, computer device, and readable storage medium
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on September 2, 2020, with application number 202010912303.6 and entitled "Text summary generation method and apparatus, computer device, and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence, in particular to natural language processing, and specifically to a text summary generation method and apparatus, a computer device, and a computer-readable storage medium.
Background
Text summarization is an important research topic in natural language processing. Depending on how it is implemented, it is divided into extractive and abstractive (generative) summarization. Extractive summarization is relatively simple and widely used: it selects important sentences or paragraphs from the text, splices them together in a certain way, and outputs the result. Abstractive summarization re-expresses the core content and concepts of the original text in a different form, and the generated summary does not need to reuse the original wording. Early methods relied on graphs and hand-crafted feature engineering, or computed similarities between sentences to select the most heavily weighted sentences and splice them together according to a specific rule. Current work focuses mainly on data-driven neural networks that generate summaries through encoder-decoder architectures.
The inventor realized that traditional methods do not make full use of the text data, which leads to unreasonable information extraction and therefore low accuracy of the extracted summary content.
Summary of the Invention
This application provides a text summary generation method and apparatus, a computer device, and a computer-readable storage medium, which can address the technical problem of low accuracy of summary extraction in the traditional technology.
In a first aspect, this application provides a text summary generation method, the method comprising: obtaining a text to be processed, and obtaining, based on the text to be processed, a text vector corresponding to the text to be processed; inputting the text vector into a preset Transformer model for processing to obtain a first output vector corresponding to the text to be processed; inputting the first output vector into a preset Seq2Seq model for processing to obtain a second output vector corresponding to the text to be processed; and generating, according to the second output vector, a text summary corresponding to the text to be processed.
In a second aspect, this application further provides a text summary generating apparatus, comprising: an acquiring unit configured to obtain a text to be processed and obtain, based on the text to be processed, a text vector corresponding to the text to be processed; a first input unit configured to input the text vector into a preset Transformer model for processing to obtain a first output vector corresponding to the text to be processed; a second input unit configured to input the first output vector into a preset Seq2Seq model for processing to obtain a second output vector corresponding to the text to be processed; and a generating unit configured to generate, according to the second output vector, a text summary corresponding to the text to be processed.
In a third aspect, this application further provides a computer device comprising a memory and a processor, the memory storing a computer program; when the processor executes the computer program, the following steps are implemented: obtaining a text to be processed, and obtaining, based on the text to be processed, a text vector corresponding to the text to be processed; inputting the text vector into a preset Transformer model for processing to obtain a first output vector corresponding to the text to be processed; inputting the first output vector into a preset Seq2Seq model for processing to obtain a second output vector corresponding to the text to be processed; and generating, according to the second output vector, a text summary corresponding to the text to be processed.
In a fourth aspect, this application further provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the processor performs the following steps: obtaining a text to be processed, and obtaining, based on the text to be processed, a text vector corresponding to the text to be processed; inputting the text vector into a preset Transformer model for processing to obtain a first output vector corresponding to the text to be processed; inputting the first output vector into a preset Seq2Seq model for processing to obtain a second output vector corresponding to the text to be processed; and generating, according to the second output vector, a text summary corresponding to the text to be processed.
This application provides a text summary generation method and apparatus, a computer device, and a computer-readable storage medium. A text to be processed is obtained, the text vector corresponding to it is obtained, the text vector is input into a preset Transformer model for processing to obtain a first output vector corresponding to the text to be processed, the first output vector is input into a preset Seq2Seq model for processing to obtain a second output vector corresponding to the text to be processed, and a text summary corresponding to the text to be processed is generated according to the second output vector. Because Transformer and Seq2Seq are used as the extraction model and the summarization model respectively, and the Transformer's multi-head attention mechanism compensates for the shortcomings of Seq2Seq, the two models complement each other fully, yielding a richer encoding vector representation; the text to be processed can thus be distilled into a text summary with coherent content and fluent sentences, which improves the accuracy of text summary generation.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of this application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of the text summary generation method provided by an embodiment of this application;
FIG. 2 is a schematic diagram of the model flow in the text summary generation method provided by an embodiment of this application;
FIG. 3 is a schematic diagram of a sub-flow of the text summary generation method provided by an embodiment of this application;
FIG. 4 is a schematic block diagram of the text summary generating apparatus provided by an embodiment of this application; and
FIG. 5 is a schematic block diagram of a computer device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of the text summary generation method provided by an embodiment of this application. As shown in FIG. 1, the method includes the following steps S101-S104.
S101: Obtain a text to be processed, and obtain, based on the text to be processed, a text vector corresponding to the text to be processed.
Specifically, for a text to be processed for which a summary is to be generated, the text is obtained and the corresponding text vector is generated from it. To generate the text vector, the text may first be segmented into words, and word embedding is then applied to the words to obtain the word vectors corresponding to the text to be processed. Word embedding (WordEmbedding) converts a word into a vector representation, i.e. a numerical representation of words; a word is generally mapped to a high-dimensional vector (the word vector) that represents it. Word embedding is sometimes called "word2vec" and can be implemented with an embedding layer (EmbeddingLayer), Word2Vec, or GloVe.
After word embedding has been applied to the words contained in the text to be processed, the text can be divided into sentences according to the punctuation marks it contains, and the words of each sentence can be obtained through a preset word segmentation method. Since the word vector of each word has already been obtained, the word vectors of all the words composing a sentence are available, and sentence embedding is applied to them to obtain the sentence vector corresponding to the sentence. Because the text to be processed is composed of sentences, once the sentence vector of each sentence is obtained, the text vector corresponding to the text to be processed is obtained: every sentence corresponds to one sentence vector, and several sentences correspond to several sentence vectors. The sentence vectors can be regarded as a sequence, so that generating the summary of the text to be processed can subsequently be treated as a sequence classification problem. Sentence embedding (SentenceEmbedding) refers to converting a sentence into a sentence vector representation; a sentence vector is a numerical representation of a sentence and can be computed as the average of word vectors, a TF-IDF weighted average of word vectors, a SIF weighted average of word vectors, or by other methods.
Further, the text is obtained and split into initial sentences, which form an initial sentence set; the extracted initial sentences are then further filtered to select the target sentences that compose the text summary. Initial target sentences related to the generated summary can be selected from the initial sentences to form an initial target sentence set, and extraction is then performed on the initial target sentences rather than on all initial sentences; because fewer sentences have to be processed, the efficiency of text summary generation can be improved. For example, for each text D_j, which initially contains n sentences, m of them can be extracted to form the initial target sentence set, and sentences are then extracted from the initial target sentence set to generate the text summary, for example as shown in formula (1).
[Formula (1) is published as an image.] In formula (1), the first symbol is used to describe the m-th sentence of the j-th text, and the second symbol represents the sentence subset of the j-th text.
S102: Input the text vector into a preset Transformer model for processing to obtain a first output vector corresponding to the text to be processed.
Specifically, after the text vector corresponding to the text to be processed is obtained (the text vector being composed of sentence vectors), the text vector is input into the preset Transformer model for processing to obtain the first output vector corresponding to the text to be processed; the first output vector includes the vectors corresponding to the individual sentences. Referring to FIG. 2, FIG. 2 is a schematic diagram of the model flow in the text summary generation method provided by an embodiment of this application. As shown in FIG. 2, the sentence vectors S1, S2, S3, ..., Sn that make up the text vector are input into the Transformer and encoded by its Transformer Encoder layers to obtain the first output vector corresponding to the text to be processed. In this way, the multi-head attention mechanism of the Transformer is fully used to compensate for the shortcomings of the Seq2Seq model, so that the vector representation obtained by encoding the text to be processed is richer and its accuracy is improved.
The Transformer is composed of two parts, an Encoder and a Decoder, each containing 6 blocks. The input representation x of a word in the Transformer is obtained by adding its word embedding and its position embedding. The word embedding can be obtained in many ways, for example by pre-training with algorithms such as Word2Vec or GloVe, or by training inside the Transformer. The position embedding is denoted PE and has the same dimension as the word embedding; PE can be learned during training or computed with formulas.
The Transformer works as follows:
1) Obtain the representation vector X of each word of the input sentence; X is the sum of the word's embedding and the embedding of the word's position.
2) Pass the resulting matrix of word representation vectors (each row being the representation x of one word) into the Encoder; after the 6 Encoder blocks, the encoded information matrix of all the words of the sentence is obtained.
3) Pass the encoded information matrix C output by the Encoder to the Decoder, which translates the next word i+1 based on the words 1 to i already translated.
S103: Input the first output vector into a preset Seq2Seq model for processing to obtain a second output vector corresponding to the text to be processed.
Specifically, the first output vector corresponding to the text to be processed, obtained with the preset Transformer model, is input into the preset Seq2Seq model for processing to obtain the second output vector corresponding to the text to be processed.
The Seq2Seq model is used when the length of the output is uncertain. Seq2Seq is a kind of Encoder-Decoder structure that uses two RNNs, one as the Encoder and the other as the Decoder. The Encoder compresses the input sequence into a vector of specified length, which can be regarded as the semantics of the sequence; this process is called encoding. The simplest way to obtain the semantic vector is to use the hidden state of the last input directly; a transformation can also be applied to the last hidden state, or to all hidden states of the input sequence, to obtain the semantic vector. The Decoder generates the specified sequence from the semantic vector; this process is called decoding. The simplest way is to feed the semantic vector obtained by the Encoder into the Decoder RNN as the initial state to obtain the output sequence.
In the process of generating a text summary in this embodiment of the application, the Transformer encoder performs the first encoding of the input sentences to obtain the first output vector corresponding to the text to be processed, and the first output vector is input into the preset Seq2Seq model for a second encoding to obtain the second output vector corresponding to the text to be processed. The preset Seq2Seq model can use a single-layer unidirectional GRU-RNN whose input is the output of the Transformer, i.e. the first output vector.
The fixed vector representation produced by the GRU-RNN serves as the initial state of the decoder. At each time step the decoder receives the previously generated token y_{t-1} and the hidden state s_{t-1}, and the output y_t at each time step is the probability produced by a Softmax classifier.
In this embodiment of the application, generating coherent, grammatically correct sentences requires learning long-term dependencies. Because the Transformer's multi-head attention mechanism compensates for the shortcomings of Seq2Seq, the Transformer and Seq2Seq models complement each other fully, each contributing its strengths; the resulting vectors carry richer meaning, the text data is fully utilized for reasonable information extraction, and the sentence context of the text to be processed is fully represented. This yields a richer encoding vector representation, makes the extracted summary coherent and fluent, and improves the accuracy of text summary generation.
S104: Generate, based on the second output vector, a text summary corresponding to the text to be processed.
Specifically, the second output vector corresponding to the text to be processed is obtained by processing with the preset Seq2Seq model, and the text summary corresponding to the text to be processed is generated on the basis of the second output vector. The second output vector can be multi-classified to obtain its distribution probability, the second output vectors with the highest distribution probability are taken as target vectors, and the sentences corresponding to the target vectors form the summary, so that the text to be processed is distilled into a summary with coherent content and fluent sentences. For example, for a given Chinese text, this embodiment of the application can process the Chinese text to be processed and generate the corresponding text summary, which can improve the accuracy of the summaries generated for Chinese text.
In this embodiment of the application, Transformer and Seq2Seq are used as the extraction model and the summarization model respectively, in a hybrid two-pass encode-then-decode scheme: the initial text is first encoded with the Transformer model, the first output vector corresponding to the Transformer's output is then encoded a second time by the Seq2Seq model to obtain the second output vector corresponding to the text to be processed, and the text summary corresponding to the text to be processed is generated based on the second output vector. Because generating coherent, grammatically correct sentences requires learning long-term dependencies, and the Transformer's multi-head attention mechanism compensates for the shortcomings of Seq2Seq, practice shows that the Transformer and Seq2Seq models complement each other fully, yielding a richer encoding vector representation; the text can thus be distilled into a summary with coherent content and fluent sentences, which improves the accuracy of text summary generation.
Referring to FIG. 3, FIG. 3 is a schematic diagram of a sub-flow of the text summary generation method provided by an embodiment of this application. As shown in FIG. 3, in this embodiment, the step of generating, based on the second output vector, a text summary corresponding to the text to be processed includes:
S301: Multi-classify the second output vector to obtain a distribution probability corresponding to the second output vector;
S302: Determine whether the distribution probability is greater than or equal to a preset probability threshold;
S303: If the distribution probability is less than the preset probability threshold, do not use the second output vector corresponding to the distribution probability as a target vector;
S304: If the distribution probability is greater than or equal to the preset probability threshold, use the second output vector corresponding to the distribution probability as a target vector;
S305: Obtain the target sentence corresponding to the target vector;
S306: Combine the target sentences to generate the text summary corresponding to the text to be processed.
Further, the step of multi-classifying the second output vector to obtain the distribution probability corresponding to the second output vector includes:
inputting the second output vector into a preset classifier based on the Softmax function;
multi-classifying the second output vector according to the Softmax function to obtain the distribution probability corresponding to the second output vector.
Specifically, after the initial text has been processed by the extraction model and the summarization model, based on Transformer and Seq2Seq respectively, to obtain the second output vector, the second output vector is multi-classified. The Softmax function can be used for this multi-classification: Softmax is used in classification to implement multi-class prediction, mapping input vectors to real numbers in (0, 1) and normalizing them so that they sum to 1, so the multi-class probabilities also sum to exactly 1. "Softmax" can be split into "soft" and "max"; "max" is the maximum: given two variables a and b, if a > b the max is a, otherwise b. In a classification problem, with a plain max the output is only a or b, an all-or-nothing result; in practice, what is wanted is the probability of each class, so that the higher-scoring class is selected most of the time while the lower-scoring class still has some probability of occasionally being selected. This is where the "soft" concept comes in: the final output is the probability of each class being selected. Accordingly, the second output vector is multi-classified by a preset classifier based on the Softmax function to obtain its distribution probability; it is then determined whether the distribution probability is greater than or equal to a preset probability threshold, and if so, the second output vector corresponding to that distribution probability is used as a target vector, the target sentence corresponding to the target vector is obtained, and the target sentences are combined to generate the text summary corresponding to the text to be processed.
In this embodiment of the application, the last layer of the extraction model that generates the text summary uses Softmax for multi-classification, so that the extraction model can learn the probability distribution over sentences of being included in the summary.
Further, in this embodiment of the application, an F1-based method is also used to create extraction labels. Referring to the 1 and 0 labels attached to each sentence vector in FIG. 2, label 1 marks a sentence used to generate the text summary and label 0 marks a sentence excluded from the summary. It is assumed that each reference summary is derived from at least one sentence of the text to be processed, and the goal is to identify the most similar text sentence. The sentence-level similarity score is based on the bigram overlap between sentences; in addition, whenever both words of a bigram in the overlap set are stop words, 1 is subtracted from the similarity score, so that the more important similarities are captured.
Because the generated text summary is only a fragment of the text to be processed, most labels are 0 (excluded from the summary), so higher classification accuracy does not necessarily translate into a highly salient summary. The F1 score, a weighted average of precision and recall, is therefore considered, and an early-stopping criterion is applied when minimizing the loss: training stops if the F1 score does not increase after a certain number of training epochs. In addition, during training the labels can be balanced by forcing some random sentences to be labeled 1 and then masking their weights. The F1 score (F1Score) is a metric for measuring the accuracy of a binary classification model; it takes both the precision and the recall of the classification model into account and can be regarded as a harmonic mean of the model's precision and recall, with a maximum value of 1 and a minimum value of 0.
In an embodiment, the step of obtaining the text to be processed and obtaining, based on the text to be processed, the text vector corresponding to the text to be processed includes:
obtaining the text to be processed, and segmenting the text to be processed into words to obtain the words included in the text to be processed;
performing word embedding on the words to obtain the word vectors corresponding to the words;
cutting the text to be processed according to the punctuation marks contained in the text to be processed to obtain the sentences contained in the text to be processed;
performing sentence embedding on the sentences based on the word vectors to obtain the sentence vectors corresponding to the sentences;
combining all the sentence vectors to obtain the text vector corresponding to the text to be processed.
Further, the step of performing sentence embedding on the sentence based on the word vectors to obtain the sentence vector corresponding to the sentence includes:
obtaining the words contained in the sentence;
obtaining the sentence vector corresponding to the sentence according to the word vectors and the words contained in the sentence.
Specifically, after the text to be processed for which a summary is to be generated is obtained, it is first segmented into words to obtain the words included in it. Word segmentation, also called Chinese word segmentation, can be performed with dictionary-based segmentation algorithms or with statistical machine learning algorithms; common algorithms include HMM, CRF, SVM, and deep learning methods, and the Stanford and HanLP segmentation tools are based on the CRF algorithm.
After the words included in the text to be processed are obtained, word embedding is applied to them to obtain the corresponding word vectors; since word embedding maps a word to a numerical vector that describes it, the word vector of each word is obtained. Because punctuation marks are generally used to delimit sentences in the text to be processed, they can also be used to divide it into sentences; the text can therefore be cut at punctuation marks to identify and obtain the sentences it contains. Once a sentence is obtained, since a sentence is composed of words, the word segmentation already performed determines which words compose it. After the words composing a sentence are obtained, since their word vectors have already been obtained, sentence embedding is applied to the sentence based on the word vectors to obtain the sentence vector corresponding to the sentence, for example S1, S2, S3, ..., Sn in FIG. 2; the set of sentence vectors S1, S2, S3, ..., Sn is the text vector corresponding to the text to be processed.
In an embodiment, the step of obtaining the sentence vector corresponding to the sentence according to the word vectors and the words contained in the sentence includes:
obtaining the word vectors corresponding to the words contained in the sentence;
adding up and averaging the word vectors corresponding to the words contained in the sentence to obtain the sentence vector corresponding to the sentence.
Specifically, the word vectors corresponding to the words contained in the sentence are obtained, added together, and averaged to obtain the sentence vector corresponding to the sentence. The input of the Transformer is the representation of the text to be processed, which is composed of a sequence of sentence representations; the representation of each sentence is obtained by averaging the vectors of its constituent words, and combining the vectors corresponding to all the sentences contained in the text yields the text vector.
In an embodiment, the step of performing sentence embedding on the sentence based on the word vectors to obtain the sentence vector corresponding to the sentence includes:
determining whether the sentence contains a preset word;
if the sentence does not contain a preset word, using the sentence as a target sentence;
performing sentence embedding on the target sentence based on the word vectors to obtain the sentence vector corresponding to the target sentence.
Specifically, a thesaurus can be preset in which the words have no relevance to identifying the topic of the text; for example, the words in the thesaurus are general-purpose words such as "上述" ("the above"), "可以" ("can"), and "参阅" ("see"), and they can be set according to the properties of texts in the specific domain. After the text is cut into sentences, it is determined whether a sentence contains a preset word. If it does, the sentence is assumed by default to have little influence on the generated summary; if it does not, the sentence is assumed to have a large influence on the generated summary and is used as a target sentence. Sentence embedding is then performed on the target sentence based on the word vectors to obtain the sentence vector corresponding to the target sentence, so that only the target sentences undergo extraction and the text summary is generated on the basis of the target sentences. By pre-screening the sentences contained in the text in this way, the selection range of sentences used to generate the summary is narrowed and the amount of data to be processed is reduced, which improves the efficiency of summary generation; at the same time, because the summary is generated from the target sentences, the accuracy of summary generation is also improved.
It should be noted that the technical features contained in the different embodiments of the text summary generation method described above can be recombined as needed to obtain combined implementations, all of which fall within the protection scope claimed by this application.
Referring to FIG. 4, FIG. 4 is a schematic block diagram of the text summary generating apparatus provided by an embodiment of this application. Corresponding to the text summary generation method described above, an embodiment of this application also provides a text summary generating apparatus. As shown in FIG. 4, the apparatus includes units for performing the text summary generation method described above and can be configured in a computer device. Specifically, the text summary generating apparatus 400 includes an acquiring unit 401, a first input unit 402, a second input unit 403, and a generating unit 404.
The acquiring unit 401 is configured to obtain a text to be processed and obtain, based on the text to be processed, a text vector corresponding to the text to be processed;
the first input unit 402 is configured to input the text vector into a preset Transformer model for processing to obtain a first output vector corresponding to the text to be processed;
the second input unit 403 is configured to input the first output vector into a preset Seq2Seq model for processing to obtain a second output vector corresponding to the text to be processed;
the generating unit 404 is configured to generate, according to the second output vector, a text summary corresponding to the text to be processed.
In an embodiment, the generating unit 404 includes:
a classification subunit configured to multi-classify the second output vector to obtain the distribution probability corresponding to the second output vector;
a first judging subunit configured to determine whether the distribution probability is greater than or equal to a preset probability threshold;
a screening subunit configured to use the second output vector corresponding to the distribution probability as a target vector if the distribution probability is greater than or equal to the preset probability threshold;
a first obtaining subunit configured to obtain the target sentence corresponding to the target vector;
a generating subunit configured to combine the target sentences to obtain the text summary corresponding to the text to be processed.
In an embodiment, the classification subunit includes:
an input subunit configured to input the second output vector into a preset classifier based on the Softmax function;
a second obtaining subunit configured to multi-classify the second output vector according to the Softmax function to obtain the distribution probability corresponding to the second output vector.
In an embodiment, the acquiring unit 401 includes:
a third obtaining subunit configured to obtain the text to be processed and segment it into words to obtain the words included in the text to be processed;
a word embedding subunit configured to perform word embedding on the words to obtain the word vectors corresponding to the words;
a cutting subunit configured to cut the text to be processed according to the punctuation marks contained in it to obtain the sentences contained in the text to be processed;
a first sentence embedding subunit configured to perform sentence embedding on the sentences based on the word vectors to obtain the sentence vectors corresponding to the sentences;
a combining subunit configured to combine all the sentence vectors to obtain the text vector corresponding to the text to be processed.
In an embodiment, the first sentence embedding subunit includes:
a fourth obtaining subunit configured to obtain the words contained in the sentence;
a fifth obtaining subunit configured to obtain the sentence vector corresponding to the sentence according to the word vectors and the words contained in the sentence.
In an embodiment, the fifth obtaining subunit includes:
a sixth obtaining subunit configured to obtain the word vectors corresponding to the words contained in the sentence;
an averaging subunit configured to add up and average the word vectors corresponding to the words contained in the sentence to obtain the sentence vector corresponding to the sentence.
In an embodiment, the first sentence embedding subunit includes:
a second judging subunit configured to determine whether the sentence contains a preset word;
a seventh obtaining subunit configured to use the sentence as a target sentence if the sentence does not contain a preset word;
a second sentence embedding subunit configured to perform sentence embedding on the target sentence based on the word vectors to obtain the sentence vector corresponding to the target sentence.
It should be noted that those skilled in the art can clearly understand that, for the specific implementation of the text summary generating apparatus and its units described above, reference may be made to the corresponding descriptions in the foregoing method embodiments; for convenience and brevity of description, details are not repeated here.
Also, the division and connection of the units in the text summary generating apparatus described above are only used for illustration; in other embodiments, the apparatus may be divided into different units as needed, or its units may adopt different connection orders and methods, so as to complete all or part of the functions of the apparatus.
The text summary generating apparatus described above may be implemented in the form of a computer program that can run on a computer device as shown in FIG. 5.
Referring to FIG. 5, FIG. 5 is a schematic block diagram of a computer device provided by an embodiment of this application. The computer device 500 may be a computer device such as a desktop computer or a server, or a component or part of another device.
As shown in FIG. 5, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a storage medium 503 and an internal memory 504. The storage medium 503 may be non-volatile or volatile.
The storage medium 503 can store an operating system 5031 and a computer program 5032; when the computer program 5032 is executed, the processor 502 can be caused to execute the text summary generation method described above.
The processor 502 provides computing and control capabilities to support the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the storage medium 503; when the computer program 5032 is executed by the processor 502, the processor 502 can be caused to execute the text summary generation method described above.
The network interface 505 is used for network communication with other devices. Those skilled in the art can understand that the structure shown in FIG. 5 is only a block diagram of part of the structure related to the solution of this application and does not constitute a limitation on the computer device 500 to which the solution is applied; the specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 5 and are not repeated here.
The processor 502 is configured to run the computer program 5032 stored in the memory, so as to implement the text summary generation method described in the embodiments of this application.
It should be understood that, in the embodiments of this application, the processor 502 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on; the general-purpose processor may be a microprocessor or any conventional processor.
Those of ordinary skill in the art can understand that all or part of the processes of the methods of the above embodiments can be implemented by a computer program, which may be stored in a computer-readable storage medium. The computer program is executed by at least one processor of the computer system to implement the process steps of the above method embodiments.
Therefore, this application also provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile or a volatile computer-readable storage medium. It stores a computer program which, when executed by a processor, causes the processor to perform the steps of the text summary generation method described in the above embodiments.
The storage medium is a physical, non-transitory storage medium, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, or any other physical storage medium that can store a computer program.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of function; whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
The above are only specific implementations of this application, but the protection scope of this application is not limited thereto; any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and these modifications or replacements shall all fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (20)

  1. A text summary generation method, comprising:
    obtaining a text to be processed, and obtaining, based on the text to be processed, a text vector corresponding to the text to be processed;
    inputting the text vector into a preset Transformer model for processing, to obtain a first output vector corresponding to the text to be processed;
    inputting the first output vector into a preset Seq2Seq model for processing, to obtain a second output vector corresponding to the text to be processed; and
    generating, according to the second output vector, a text summary corresponding to the text to be processed.
  2. The text summary generation method according to claim 1, wherein the step of generating, according to the second output vector, a text summary corresponding to the text to be processed comprises:
    multi-classifying the second output vector to obtain a distribution probability corresponding to the second output vector;
    determining whether the distribution probability is greater than or equal to a preset probability threshold;
    if the distribution probability is greater than or equal to the preset probability threshold, using the second output vector corresponding to the distribution probability as a target vector;
    obtaining a target sentence corresponding to the target vector; and
    combining the target sentences to generate the text summary corresponding to the text to be processed.
  3. The text summary generation method according to claim 2, wherein the step of multi-classifying the second output vector to obtain the distribution probability corresponding to the second output vector comprises:
    inputting the second output vector into a preset classifier based on the Softmax function; and
    multi-classifying the second output vector according to the Softmax function to obtain the distribution probability corresponding to the second output vector.
  4. The text summary generation method according to claim 1, wherein the step of obtaining a text to be processed and obtaining, based on the text to be processed, a text vector corresponding to the text to be processed comprises:
    obtaining the text to be processed, and segmenting the text to be processed into words to obtain the words included in the text to be processed;
    performing word embedding on the words to obtain word vectors corresponding to the words;
    cutting the text to be processed according to the punctuation marks contained in the text to be processed, to obtain the sentences contained in the text to be processed;
    performing sentence embedding on the sentences based on the word vectors to obtain sentence vectors corresponding to the sentences; and
    combining all the sentence vectors to obtain the text vector corresponding to the text to be processed.
  5. The text summary generation method according to claim 4, wherein the step of performing sentence embedding on the sentence based on the word vectors to obtain the sentence vector corresponding to the sentence comprises:
    obtaining the words contained in the sentence; and
    obtaining the sentence vector corresponding to the sentence according to the word vectors and the words contained in the sentence.
  6. The text summary generation method according to claim 5, wherein the step of obtaining the sentence vector corresponding to the sentence according to the word vectors and the words contained in the sentence comprises:
    obtaining the word vectors corresponding to the words contained in the sentence; and
    adding up and averaging the word vectors corresponding to the words contained in the sentence, to obtain the sentence vector corresponding to the sentence.
  7. The text summary generation method according to claim 4, wherein the step of performing sentence embedding on the sentence based on the word vectors to obtain the sentence vector corresponding to the sentence comprises:
    determining whether the sentence contains a preset word;
    if the sentence does not contain a preset word, using the sentence as a target sentence; and
    performing sentence embedding on the target sentence based on the word vectors to obtain the sentence vector corresponding to the target sentence.
  8. A text summary generating apparatus, comprising:
    an acquiring unit configured to obtain a text to be processed and obtain, based on the text to be processed, a text vector corresponding to the text to be processed;
    a first input unit configured to input the text vector into a preset Transformer model for processing, to obtain a first output vector corresponding to the text to be processed;
    a second input unit configured to input the first output vector into a preset Seq2Seq model for processing, to obtain a second output vector corresponding to the text to be processed; and
    a generating unit configured to generate, according to the second output vector, a text summary corresponding to the text to be processed.
  9. A computer device, comprising a memory and a processor connected to the memory, wherein the memory is configured to store a computer program and the processor is configured to run the computer program to perform the following steps:
    obtaining a text to be processed, and obtaining, based on the text to be processed, a text vector corresponding to the text to be processed;
    inputting the text vector into a preset Transformer model for processing, to obtain a first output vector corresponding to the text to be processed;
    inputting the first output vector into a preset Seq2Seq model for processing, to obtain a second output vector corresponding to the text to be processed; and
    generating, according to the second output vector, a text summary corresponding to the text to be processed.
  10. The computer device according to claim 1, wherein the step of generating, according to the second output vector, the text summary corresponding to the text to be processed comprises:
    multi-classifying the second output vector to obtain a distribution probability corresponding to the second output vector;
    determining whether the distribution probability is greater than or equal to a preset probability threshold;
    if the distribution probability is greater than or equal to the preset probability threshold, using the second output vector corresponding to the distribution probability as a target vector;
    obtaining a target sentence corresponding to the target vector; and
    combining the target sentences to generate the text summary corresponding to the text to be processed.
  11. The computer device according to claim 10, wherein the step of multi-classifying the second output vector to obtain the distribution probability corresponding to the second output vector comprises:
    inputting the second output vector into a preset classifier based on the Softmax function; and
    multi-classifying the second output vector according to the Softmax function to obtain the distribution probability corresponding to the second output vector.
  12. The computer device according to claim 9, wherein the step of obtaining a text to be processed and obtaining, based on the text to be processed, a text vector corresponding to the text to be processed comprises:
    obtaining the text to be processed, and segmenting the text to be processed into words to obtain the words included in the text to be processed;
    performing word embedding on the words to obtain word vectors corresponding to the words;
    cutting the text to be processed according to the punctuation marks contained in the text to be processed, to obtain the sentences contained in the text to be processed;
    performing sentence embedding on the sentences based on the word vectors to obtain sentence vectors corresponding to the sentences; and
    combining all the sentence vectors to obtain the text vector corresponding to the text to be processed.
  13. The computer device according to claim 12, wherein the step of performing sentence embedding on the sentence based on the word vectors to obtain the sentence vector corresponding to the sentence comprises:
    obtaining the words contained in the sentence; and
    obtaining the sentence vector corresponding to the sentence according to the word vectors and the words contained in the sentence.
  14. The computer device according to claim 13, wherein the step of obtaining the sentence vector corresponding to the sentence according to the word vectors and the words contained in the sentence comprises:
    obtaining the word vectors corresponding to the words contained in the sentence; and
    adding up and averaging the word vectors corresponding to the words contained in the sentence, to obtain the sentence vector corresponding to the sentence.
  15. The computer device according to claim 12, wherein the step of performing sentence embedding on the sentence based on the word vectors to obtain the sentence vector corresponding to the sentence comprises:
    determining whether the sentence contains a preset word;
    if the sentence does not contain a preset word, using the sentence as a target sentence; and
    performing sentence embedding on the target sentence based on the word vectors to obtain the sentence vector corresponding to the target sentence.
  16. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:
    obtaining a text to be processed, and obtaining, based on the text to be processed, a text vector corresponding to the text to be processed;
    inputting the text vector into a preset Transformer model for processing, to obtain a first output vector corresponding to the text to be processed;
    inputting the first output vector into a preset Seq2Seq model for processing, to obtain a second output vector corresponding to the text to be processed; and
    generating, according to the second output vector, a text summary corresponding to the text to be processed.
  17. The computer-readable storage medium according to claim 16, wherein the step of generating, according to the second output vector, the text summary corresponding to the text to be processed comprises:
    multi-classifying the second output vector to obtain a distribution probability corresponding to the second output vector;
    determining whether the distribution probability is greater than or equal to a preset probability threshold;
    if the distribution probability is greater than or equal to the preset probability threshold, using the second output vector corresponding to the distribution probability as a target vector;
    obtaining a target sentence corresponding to the target vector; and
    combining the target sentences to generate the text summary corresponding to the text to be processed.
  18. The computer-readable storage medium according to claim 17, wherein the step of multi-classifying the second output vector to obtain the distribution probability corresponding to the second output vector comprises:
    inputting the second output vector into a preset classifier based on the Softmax function; and
    multi-classifying the second output vector according to the Softmax function to obtain the distribution probability corresponding to the second output vector.
  19. The computer-readable storage medium according to claim 16, wherein the step of obtaining a text to be processed and obtaining, based on the text to be processed, a text vector corresponding to the text to be processed comprises:
    obtaining the text to be processed, and segmenting the text to be processed into words to obtain the words included in the text to be processed;
    performing word embedding on the words to obtain word vectors corresponding to the words;
    cutting the text to be processed according to the punctuation marks contained in the text to be processed, to obtain the sentences contained in the text to be processed;
    performing sentence embedding on the sentences based on the word vectors to obtain sentence vectors corresponding to the sentences; and
    combining all the sentence vectors to obtain the text vector corresponding to the text to be processed.
  20. The computer-readable storage medium according to claim 19, wherein the step of performing sentence embedding on the sentence based on the word vectors to obtain the sentence vector corresponding to the sentence comprises:
    obtaining the words contained in the sentence; and
    obtaining the sentence vector corresponding to the sentence according to the word vectors and the words contained in the sentence.
PCT/CN2020/131775 2020-09-02 2020-11-26 Text summary generation method and apparatus, computer device, and readable storage medium WO2021159803A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010912303.6 2020-09-02
CN202010912303.6A CN112052329A (zh) 2020-09-02 2020-09-02 Text summary generation method and apparatus, computer device, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2021159803A1 true WO2021159803A1 (zh) 2021-08-19

Family

ID=73607224

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/131775 WO2021159803A1 (zh) 2020-09-02 2020-11-26 文本摘要生成方法、装置、计算机设备及可读存储介质

Country Status (2)

Country Link
CN (1) CN112052329A (zh)
WO (1) WO2021159803A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112732898A (zh) * 2020-12-30 2021-04-30 平安科技(深圳)有限公司 Document summary generation method and apparatus, computer device, and storage medium
CN113435183B (zh) * 2021-06-30 2023-08-29 平安科技(深圳)有限公司 Text generation method and apparatus, and storage medium
CN114444471A (zh) * 2022-03-09 2022-05-06 平安科技(深圳)有限公司 Sentence vector generation method and apparatus, computer device, and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657051A (zh) * 2018-11-30 2019-04-19 平安科技(深圳)有限公司 Text summary generation method and apparatus, computer device, and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200065346A1 (en) * 2017-07-26 2020-02-27 International Business Machines Corporation Extractive query-focused multi-document summarization
CN110597979A (zh) * 2019-06-13 2019-12-20 中山大学 A self-attention-based abstractive text summarization method
CN110929024A (zh) * 2019-12-10 2020-03-27 哈尔滨工业大学 An extractive text summary generation method based on multi-model fusion
CN111597327A (zh) * 2020-04-22 2020-08-28 哈尔滨工业大学 An unsupervised multi-document summarization method for public opinion analysis

Also Published As

Publication number Publication date
CN112052329A (zh) 2020-12-08


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20918767

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20918767

Country of ref document: EP

Kind code of ref document: A1