CN114547289A - NLP technology-based Chinese abstract automatic generation method and system - Google Patents
NLP technology-based Chinese abstract automatic generation method and system Download PDFInfo
- Publication number
- CN114547289A CN114547289A CN202210204288.9A CN202210204288A CN114547289A CN 114547289 A CN114547289 A CN 114547289A CN 202210204288 A CN202210204288 A CN 202210204288A CN 114547289 A CN114547289 A CN 114547289A
- Authority
- CN
- China
- Prior art keywords
- abstract
- text
- word
- generate
- automatically generating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to the field of automatic abstract generation, and in particular provides a method and system for automatically generating Chinese abstracts based on NLP (natural language processing) technology, comprising the following steps: S1: perform target training on the text for which an abstract is to be generated, maximizing the probability of generating each target word; S2: automatically generate an evaluation index; S3: evaluate the text using the automatically generated evaluation index; S4: extract sentences from the text with an abstract generation model to generate the abstract. The invention automatically generates abstracts through natural language processing: given one or more documents, it produces a short abstract that retains the key information of the input text and is semantically fluent, concise, and accurate. Automatic text summarization can generate abstracts quickly, accurately, and in real time, overcoming the drawbacks of manual summarization.
Description
Technical Field
The invention relates to the field of automatic abstract generation, and in particular to a method and system for automatically generating Chinese abstracts based on NLP technology.
Background
Natural Language Processing (NLP) is the discipline that studies the language problems arising in human-computer interaction. By difficulty of technical implementation, question-answering systems of this kind can be divided into three types: simple matching, fuzzy matching, and paragraph understanding. A simple-matching tutoring and answering system matches questions posed by students against answer items in an answer library through simple keyword matching, thereby answering questions or providing related tutoring automatically. A fuzzy-matching system adds synonym and antonym matching on top of the simple-matching type: even if the original keywords in a student's question find no directly matching answer in the library, a relevant answer item can still be found if words synonymous or antonymous with those keywords match. A paragraph-understanding system is the most ideal, genuinely intelligent tutoring and answering system (strictly speaking, the simple-matching and fuzzy-matching types can only be called "automatic" rather than "intelligent" answering systems). However, such a system requires paragraph-level understanding of natural language; for Chinese, this involves complex NLP technologies such as automatic word segmentation, part-of-speech analysis, syntactic analysis, and semantic analysis, making it difficult to realize. In recent years, automatic text summarization has become one of the important research directions in artificial intelligence and natural language processing.
Automatic text summarization aims to extract the key information of an original text and generate a semantically fluent, concise, and accurate summary, improving users' information-browsing efficiency. With the development of deep learning, today's automatic text summarization models are mainly built on the sequence-to-sequence framework. However, applying the sequence-to-sequence framework to automatic summarization still faces many problems, such as difficulty generating out-of-vocabulary words, inability to effectively model the connections between words, and a lack of modeling of the key-information extraction process.
Disclosure of Invention
The invention mainly aims to provide a Chinese abstract automatic generation method and system based on NLP technology, so as to solve the problems in the related technology.
In order to achieve the above object, according to one aspect of the present invention, there is provided a method and system for automatically generating a Chinese abstract based on NLP technology, comprising the following steps:
s1: performing target training on the text needing to generate the abstract, and maximizing the probability of generating each target word;
s2: automatically generating an evaluation index;
s3: evaluating the text needing to generate the abstract by adopting an automatic generation evaluation index;
s4: extracting sentences from the text by adopting an abstract generation model to generate an abstract;
Further, the target training of the text for which an abstract is to be generated specifically comprises maximizing:

$$\mathcal{L}(\theta) = \sum_{(x,y)\in D} \log P(y \mid x;\ \theta)$$

where $\mathcal{L}(\theta)$ is the objective maximizing the probability of generating each target word, D is the training data set, x is the input text, y is the target abstract, and θ is a parameter of the model.
Further, the automatically generated evaluation index is ROUGE-N, ROUGE-L, or a combination of the two.
Further, the ROUGE-N index is specifically:

$$\mathrm{ROUGE\text{-}N} = \frac{\sum_{S\in\{\mathrm{Ref}\}}\sum_{gram_n\in S}\mathrm{Count}_{match}(gram_n)}{\sum_{S\in\{\mathrm{Ref}\}}\sum_{gram_n\in S}\mathrm{Count}(gram_n)}$$

where S represents a sentence in the reference abstract, $gram_n$ represents an n-gram, $\mathrm{Count}(gram_n)$ represents the number of n-grams in S, and $\mathrm{Count}_{match}(gram_n)$ represents the number of n-grams shared by the model-generated abstract and the reference abstract.
Further, the ROUGE-L index is specifically:

$$R_{lcs} = \frac{LCS(X,Y)}{m},\qquad P_{lcs} = \frac{LCS(X,Y)}{n},\qquad F_{lcs} = \frac{(1+\beta^2)\,R_{lcs}\,P_{lcs}}{R_{lcs}+\beta^2 P_{lcs}}$$

where X is the reference abstract, m is its length, Y is the model-generated abstract, n is its length, LCS(X, Y) is the length of their longest common subsequence, and β weights recall against precision.
further, the sentence extraction of the text specifically includes representing the text content as a set composed of feature items, extracting a topic from the set according to the feature items, extracting a word from a word distribution corresponding to the extracted topic, and repeating the above process until the abstract is generated.
Further, representing the text content as a set of feature items is specifically: Doc(t₁, t₂, …, tₙ), where tₖ is a feature item; the text is represented as a vector of feature items and their corresponding weights, of the form Doc((t₁, w₁), (t₂, w₂), …, (tₙ, wₙ)), where wₖ is the weight of feature item tₖ.
Further, the extracting a topic from the set according to the feature items, and the extracting a word from the word distribution corresponding to the extracted topic specifically includes:
p(w|d)=p(w|t)×p(t|d)
where t is the extracted topic, w is the extracted word, d is the set being extracted from, and p denotes the corresponding probability distribution.
On the other hand, the system for automatically generating the Chinese abstract based on the NLP technology comprises a text input unit, an encoding unit and a decoding unit, wherein the text input unit is used for inputting a text needing to generate the abstract through a user terminal, the encoding unit is used for encoding the text needing to generate the abstract to obtain a text representation, and the decoding unit is used for decoding the text representation of the input text to generate the abstract.
Further, the coding unit is formed by stacking N identical coding layers, and the coding process of the l-th layer of the coding unit is:

$$\hat{h}_i^l = \mathrm{LayerNorm}\big(h_i^{l-1} + \mathrm{SelfAttn}(h_i^{l-1})\big),\qquad h_i^l = \mathrm{LayerNorm}\big(\hat{h}_i^l + \mathrm{FFN}(\hat{h}_i^l)\big)$$

where $h_i^{l-1}$ denotes the output of encoder layer l−1 for the i-th word $x_i$ of the input text x, the output of layer l−1 being the input of layer l; SelfAttn denotes applying a self-attention mechanism to the input, LayerNorm denotes layer normalization, FFN denotes a feed-forward neural network, and $\hat{h}_i^l$ is an intermediate result of the calculation;
the decoding unit passes a probability distribution PuocabObtaining the output word of the current step, the probability distribution Puocab=softmax(WoS+bo) Wherein W isoAnd boFor trainable parameters, S is the output of the last layer of the decoder, and softmax is a softmax function.
Compared with the prior art, the invention has the following beneficial effects: the invention automatically generates the abstract through a natural language processing technology, and refers to automatically generating a segment of abstract which retains key information in an input text and has smooth, concise and accurate semantics according to one or more documents. The automatic text summarization can generate the summarization quickly, accurately and in real time, and overcomes the defects of manual summarization.
Drawings
FIG. 1 is a schematic view of the overall process of the present invention;
FIG. 2 is an overall system block diagram of the present invention;
FIG. 3 is a schematic diagram of a portion of the modules of the present invention.
In the figure: 100. a text input unit; 200. an encoding unit; 300. and a decoding unit.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it is to be understood that the terms "upper", "lower", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used only for convenience in describing the present invention and for simplicity in description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention. It should be noted that when one component is referred to as being "connected" to another component, it can be directly connected to the other component or intervening components may also be present.
The technical scheme of the invention is further explained by the specific implementation mode in combination with the attached drawings.
An automatic Chinese abstract generation method based on NLP technology comprises the following steps:
s1: performing target training on the text needing to generate the abstract, and maximizing the probability of generating each target word;
s2: automatically generating an evaluation index;
s3: evaluating the text needing to generate the abstract by adopting an automatic generation evaluation index;
s4: extracting sentences from the text by adopting an abstract generation model to generate an abstract;
Further, the target training of the text for which an abstract is to be generated specifically comprises maximizing:

$$\mathcal{L}(\theta) = \sum_{(x,y)\in D} \log P(y \mid x;\ \theta)$$

where $\mathcal{L}(\theta)$ is the objective maximizing the probability of generating each target word, D is the training data set, x is the input text, y is the target abstract, and θ is a parameter of the model.
The model of this embodiment is trained in two steps: pre-training and fine-tuning. To better exploit pre-trained models under limited hardware conditions, the inventors use the pre-trained MASS model in place of the pre-training step, and then fine-tune it on a text summarization data set. During fine-tuning, maximum likelihood estimation is used to maximize the conditional probability of generating each target word given the model parameters θ and the input text x, which is equivalent to minimizing the negative log-likelihood between the model-generated words and the target words.
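The fine-tuning objective just described can be sketched in a few lines. This is an illustrative toy, not the embodiment's actual training code: given the model's probability for each target word under teacher forcing, the negative log-likelihood is the sum of per-word negative log-probabilities, and minimizing it maximizes the product of the per-word probabilities.

```python
import math

def sequence_nll(step_probs):
    """Negative log-likelihood of a target sequence, given the model's
    probability for each target word (teacher forcing). Minimizing this
    is equivalent to maximizing the product of per-word probabilities."""
    return -sum(math.log(p) for p in step_probs)

# Toy example: a 3-word target summary whose words the model
# predicts with probabilities 0.5, 0.8 and 0.4 (invented values).
loss = sequence_nll([0.5, 0.8, 0.4])
```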
Further, the automatically generated evaluation index is ROUGE-N, ROUGE-L, or a combination of the two.
Further, the ROUGE-N index is specifically:

$$\mathrm{ROUGE\text{-}N} = \frac{\sum_{S\in\{\mathrm{Ref}\}}\sum_{gram_n\in S}\mathrm{Count}_{match}(gram_n)}{\sum_{S\in\{\mathrm{Ref}\}}\sum_{gram_n\in S}\mathrm{Count}(gram_n)}$$

where S represents a sentence in the reference abstract, $gram_n$ represents an n-gram, $\mathrm{Count}(gram_n)$ represents the number of n-grams in S, and $\mathrm{Count}_{match}(gram_n)$ represents the number of n-grams shared by the model-generated abstract and the reference abstract. This index counts the co-occurrence recall between the n-grams of the reference abstract and those of the model-generated abstract.
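The ROUGE-N recall just described can be sketched as follows; the tokenized English example sentences are illustrative stand-ins for segmented Chinese text.

```python
from collections import Counter

def rouge_n_recall(reference, candidate, n=2):
    """ROUGE-N recall: matched n-grams / n-grams in the reference."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    matched = sum((ref & cand).values())  # clipped co-occurrence counts
    total = sum(ref.values())
    return matched / total if total else 0.0

# Invented toy sentences: 3 of the reference's 5 bigrams reappear.
ref = "the cat sat on the mat".split()
cand = "the cat lay on the mat".split()
score = rouge_n_recall(ref, cand, n=2)
```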
Further, the ROUGE-L index is specifically:

$$R_{lcs} = \frac{LCS(X,Y)}{m},\qquad P_{lcs} = \frac{LCS(X,Y)}{n},\qquad F_{lcs} = \frac{(1+\beta^2)\,R_{lcs}\,P_{lcs}}{R_{lcs}+\beta^2 P_{lcs}}$$

where X is the reference abstract, m is its length, Y is the model-generated abstract, n is its length, LCS(X, Y) is the length of their longest common subsequence, and β weights recall against precision. This index measures the quality of the model-generated abstract by the longest common subsequence between it and the reference abstract.
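A minimal sketch of ROUGE-L as described: compute the longest common subsequence by dynamic programming, then combine LCS-based recall and precision into an F-measure. The β value below is an assumption (conventionally set above 1 to favor recall), and the example sentences are invented.

```python
def lcs_length(x, y):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def rouge_l(reference, candidate, beta=1.2):
    """F_lcs combining LCS recall (vs. reference) and precision (vs. candidate)."""
    lcs = lcs_length(reference, candidate)
    r, p = lcs / len(reference), lcs / len(candidate)
    if r + p == 0:
        return 0.0
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)

# Invented toy pair: the candidate is a 5-word subsequence of the reference.
ref = "the cat sat on the mat".split()
cand = "the cat on the mat".split()
score = rouge_l(ref, cand)
```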
Further, the sentence extraction of the text specifically includes representing the text content as a set composed of feature items, extracting a topic from the set according to the feature items, extracting a word from a word distribution corresponding to the extracted topic, and repeating the above process until the abstract is generated.
Further, representing the text content as a set of feature items is specifically: Doc(t₁, t₂, …, tₙ), where tₖ is a feature item; the text is represented as a vector of feature items and their corresponding weights, of the form Doc((t₁, w₁), (t₂, w₂), …, (tₙ, wₙ)), where wₖ is the weight of feature item tₖ.
In this embodiment, the topic model extends the bag-of-words (BOW) model by introducing a "topic" as a hidden variable, abstracting the association between words and documents. The topic model maps words or phrases with the same topic to the same dimension. The basis for judging that two different words belong to the same topic is: the two words frequently occur together in the same documents, or, given a topic, both have a higher generation probability than other words. The topic model is a special probabilistic graphical model with a complete mathematical foundation, and inference based on Gibbs sampling is simple and effective. Assuming there are K topics (usually set by hand, which is a potential weakness of the model), an article is represented as a K-dimensional vector; each dimension of the vector corresponds to a topic, and its weight is the probability that the article belongs to that topic. The topic model thus computes the word distribution of each topic in the text corpus and the topic distribution of each article.
Text features are extracted from the original text; they may be characters, words, phrases, sentences, or other units, and form the nodes of a graph. Identical feature items form a single node, so the total number of nodes is the number of distinct feature items in the text; together they form the node set V. Edges are formed by relationships among the nodes in V, the simplest being co-occurrence: if two feature items appear within the same window (a sentence, a span of a fixed number of characters, a document, and so on), an edge connects their corresponding nodes. The text graph may be directed or undirected, and the set of all edges forms the edge set E. Besides the co-occurrence relationships that form a textual co-occurrence graph, a grammatical-relationship or semantic-relationship graph for the text can be constructed similarly.
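The co-occurrence graph construction just described can be sketched as follows, using a sentence as the co-occurrence window; words stand in for generic feature items, and the example sentences are invented.

```python
from itertools import combinations

def cooccurrence_graph(sentences):
    """Build an undirected text graph: each distinct feature item (here, a
    word) is one node; an edge links two items that co-occur in the same
    window (here, a sentence). Identical items collapse into one node."""
    nodes, edges = set(), set()
    for sent in sentences:
        items = set(sent.split())
        nodes.update(items)
        for a, b in combinations(items, 2):
            edges.add(frozenset((a, b)))  # undirected, deduplicated
    return nodes, edges

# Invented two-sentence corpus sharing the node "mice".
sents = ["cats chase mice", "mice eat cheese"]
V, E = cooccurrence_graph(sents)
```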
Further, the extracting a topic from the set according to the feature items, and the extracting a word from the word distribution corresponding to the extracted topic specifically includes:
p(w|d)=p(w|t)×p(t|d)
where t is the extracted topic, w is the extracted word, d is the set being extracted from, and p denotes the corresponding probability distribution.
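The factorization above can be sketched as follows. With several topics, p(w|d) marginalizes the product p(w|t)·p(t|d) over topics t, reducing to the single product in the text when there is one topic; the topic names and probabilities below are invented for illustration.

```python
def word_prob(word, doc, p_w_given_t, p_t_given_d):
    """p(w|d): sum over topics t of p(w|t) * p(t|d), matching the
    factorization in the text and marginalizing when K > 1 topics."""
    return sum(p_w_given_t[t].get(word, 0.0) * p_t_given_d[doc].get(t, 0.0)
               for t in p_w_given_t)

# Invented toy distributions: two topics, one document.
p_w_given_t = {"sports": {"ball": 0.5}, "food": {"ball": 0.1}}
p_t_given_d = {"d1": {"sports": 0.8, "food": 0.2}}
prob = word_prob("ball", "d1", p_w_given_t, p_t_given_d)
```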
On the other hand, the system for automatically generating the Chinese abstract based on the NLP technology comprises a text input unit 100, an encoding unit 200 and a decoding unit 300, wherein the text input unit 100 is used for inputting a text needing to generate the abstract through a user terminal, the encoding unit 200 is used for encoding the text needing to generate the abstract to obtain a text representation, and the decoding unit 300 is used for decoding the text representation of the input text to generate the abstract.
Further, the encoding unit 200 is formed by stacking N identical encoding layers, and the encoding process of the l-th layer of the encoding unit 200 is:

$$\hat{h}_i^l = \mathrm{LayerNorm}\big(h_i^{l-1} + \mathrm{SelfAttn}(h_i^{l-1})\big),\qquad h_i^l = \mathrm{LayerNorm}\big(\hat{h}_i^l + \mathrm{FFN}(\hat{h}_i^l)\big)$$

where $h_i^{l-1}$ denotes the output of encoder layer l−1 for the i-th word $x_i$ of the input text x, the output of layer l−1 being the input of layer l; SelfAttn denotes applying a self-attention mechanism to the input, LayerNorm denotes layer normalization, FFN denotes a feed-forward neural network, and $\hat{h}_i^l$ is an intermediate result of the calculation;
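The two-sublayer encoding process just described can be sketched per position as follows. The identity sublayers passed in at the end are placeholders that merely exercise the residual-plus-LayerNorm wiring; they are not a real self-attention or feed-forward network.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def encoder_layer(h, self_attn, ffn):
    """One encoder layer for a single position:
    h_hat = LayerNorm(h + SelfAttn(h)); out = LayerNorm(h_hat + FFN(h_hat))."""
    h_hat = layer_norm([a + b for a, b in zip(h, self_attn(h))])
    return layer_norm([a + b for a, b in zip(h_hat, ffn(h_hat))])

# Placeholder identity sublayers, just to exercise the wiring.
out = encoder_layer([1.0, 2.0, 3.0], lambda v: v, lambda v: v)
```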
the decoding unit 300 passes through a probability distribution PuocabObtaining the output word of the current step, the probability distribution Puocab=softmax(WoS+bo) Wherein W isoAnd boFor trainable parameters, S is the output of the last layer of the decoder, and softmax is a softmax function.
Spatially relative terms, such as "above … …," "above … …," "above … … surface," "above," and the like, may be used herein for ease of description to describe one device or feature's spatial relationship to another device or feature as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if a device in the figures is turned over, devices described as "above" or "on" other devices or configurations would then be oriented "below" or "under" the other devices or configurations. Thus, the exemplary term "above … …" can include both an orientation of "above … …" and "below … …". The device may be otherwise variously oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein interpreted accordingly.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged under appropriate circumstances such that, for example, embodiments of the application described herein may be implemented in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. An automatic Chinese abstract generation method based on NLP technology is characterized by comprising the following steps:
s1: performing target training on the text needing to generate the abstract, and maximizing the probability of generating each target word;
s2: automatically generating an evaluation index;
s3: evaluating the text needing to generate the abstract by adopting an automatic generation evaluation index;
s4: and adopting an abstract generation model to extract sentences of the text to generate an abstract.
2. The method for automatically generating a Chinese abstract based on NLP technology according to claim 1, wherein the target training of the text for which an abstract is to be generated specifically comprises maximizing:

$$\mathcal{L}(\theta) = \sum_{(x,y)\in D} \log P(y \mid x;\ \theta)$$

wherein $\mathcal{L}(\theta)$ is the objective maximizing the probability of generating each target word, D is the training data set, x is the input text, y is the target abstract, and θ is a parameter of the model.
3. The method for automatically generating a Chinese abstract according to claim 1, wherein the automatically generated evaluation index is ROUGE-N, ROUGE-L, or a combination of the two.
4. The method and system for automatically generating a Chinese abstract based on NLP technology according to claim 3, wherein the ROUGE-N index is specifically:

$$\mathrm{ROUGE\text{-}N} = \frac{\sum_{S\in\{\mathrm{Ref}\}}\sum_{gram_n\in S}\mathrm{Count}_{match}(gram_n)}{\sum_{S\in\{\mathrm{Ref}\}}\sum_{gram_n\in S}\mathrm{Count}(gram_n)}$$

wherein S represents a sentence in the reference abstract, $gram_n$ represents an n-gram, $\mathrm{Count}(gram_n)$ represents the number of n-grams in S, and $\mathrm{Count}_{match}(gram_n)$ represents the number of n-grams shared by the model-generated abstract and the reference abstract.
6. the method according to claim 1, wherein the sentence extraction of the text specifically includes representing the text content as a set of feature items, extracting a topic from the set according to the feature items, extracting a word from a word distribution corresponding to the extracted topic, and repeating the above process until the abstract is generated.
7. The method for automatically generating a Chinese abstract based on NLP technology according to claim 1, wherein representing the text content as a set of feature items is specifically: Doc(t₁, t₂, …, tₙ), where tₖ is a feature item; the text is represented as a vector of feature items and their corresponding weights, of the form Doc((t₁, w₁), (t₂, w₂), …, (tₙ, wₙ)), where wₖ is the weight of feature item tₖ.
8. The method for automatically generating a chinese abstract based on NLP technology according to claim 7, wherein said extracting a topic from the set according to the feature items, and extracting a word from the word distribution corresponding to the extracted topic specifically comprises:
p(w|d)=p(w|t)×p(t|d)
where t is the extracted topic, w is the extracted word, d is the set being extracted from, and p denotes the corresponding probability distribution.
9. The Chinese abstract automatic generation system based on the NLP technology comprises a text input unit (100), an encoding unit (200) and a decoding unit (300), wherein the text input unit (100) is used for inputting a text needing to be abstracted through a user terminal, the encoding unit (200) is used for encoding the text needing to be abstracted to obtain a text representation, and the decoding unit (300) is used for decoding the text representation of the input text to generate the abstract.
10. The system for automatically generating a Chinese abstract based on NLP technology according to claim 9, wherein the encoding unit (200) is formed by stacking N identical encoding layers, and the encoding process of the l-th layer of the encoding unit (200) is:

$$\hat{h}_i^l = \mathrm{LayerNorm}\big(h_i^{l-1} + \mathrm{SelfAttn}(h_i^{l-1})\big),\qquad h_i^l = \mathrm{LayerNorm}\big(\hat{h}_i^l + \mathrm{FFN}(\hat{h}_i^l)\big)$$

wherein $h_i^{l-1}$ denotes the output of encoder layer l−1 for the i-th word $x_i$ of the input text x, the output of layer l−1 being the input of layer l; SelfAttn denotes applying a self-attention mechanism to the input, LayerNorm denotes layer normalization, FFN denotes a feed-forward neural network, and $\hat{h}_i^l$ is an intermediate result of the calculation;
the decoding unit (300) passes a probability distribution PuocabObtaining the output word of the current step, the probability distribution Puocab=softmax(WoS+bo) Wherein W isoAnd boS is the output of the last layer of the decoder, and softmax is the softmax function, for trainable parameters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210204288.9A CN114547289A (en) | 2022-03-03 | 2022-03-03 | NLP technology-based Chinese abstract automatic generation method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210204288.9A CN114547289A (en) | 2022-03-03 | 2022-03-03 | NLP technology-based Chinese abstract automatic generation method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114547289A true CN114547289A (en) | 2022-05-27 |
Family
ID=81660896
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210204288.9A Pending CN114547289A (en) | 2022-03-03 | 2022-03-03 | NLP technology-based Chinese abstract automatic generation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114547289A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116541505A (en) * | 2023-07-05 | 2023-08-04 | 华东交通大学 | Dialogue abstract generation method based on self-adaptive dialogue segmentation |
-
2022
- 2022-03-03 CN CN202210204288.9A patent/CN114547289A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116541505A (en) * | 2023-07-05 | 2023-08-04 | 华东交通大学 | Dialogue abstract generation method based on self-adaptive dialogue segmentation |
CN116541505B (en) * | 2023-07-05 | 2023-09-19 | 华东交通大学 | Dialogue abstract generation method based on self-adaptive dialogue segmentation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | A text sentiment classification modeling method based on coordinated CNN‐LSTM‐attention model | |
CN108363743B (en) | Intelligent problem generation method and device and computer readable storage medium | |
CN111738003B (en) | Named entity recognition model training method, named entity recognition method and medium | |
Gallant et al. | Representing objects, relations, and sequences | |
CN110851599B (en) | Automatic scoring method for Chinese composition and teaching assistance system | |
CN110532395B (en) | Semantic embedding-based word vector improvement model establishing method | |
CN113255320A (en) | Entity relation extraction method and device based on syntax tree and graph attention machine mechanism | |
CN108733647B (en) | Word vector generation method based on Gaussian distribution | |
CN110362819A (en) | Text emotion analysis method based on convolutional neural networks | |
US20230169271A1 (en) | System and methods for neural topic modeling using topic attention networks | |
CN114925687B (en) | Chinese composition scoring method and system based on dynamic word vector characterization | |
Suyanto | Synonyms-based augmentation to improve fake news detection using bidirectional LSTM | |
CN115130538A (en) | Training method of text classification model, text processing method, equipment and medium | |
CN116049387A (en) | Short text classification method, device and medium based on graph convolution | |
US20230259708A1 (en) | System and methods for key-phrase extraction | |
CN116010553A (en) | Viewpoint retrieval system based on two-way coding and accurate matching signals | |
CN109325243B (en) | Character-level Mongolian word segmentation method based on sequence model and word segmentation system thereof | |
CN113051886B (en) | Test question duplicate checking method, device, storage medium and equipment | |
CN114547289A (en) | NLP technology-based Chinese abstract automatic generation method and system | |
CN114970557B (en) | Knowledge enhancement-based cross-language structured emotion analysis method | |
CN111008529A (en) | Chinese relation extraction method based on neural network | |
CN115146031A (en) | Short text position detection method based on deep learning and assistant features | |
CN113935308A (en) | Method and system for automatically generating text abstract facing field of geoscience | |
Fan et al. | Multi-label Chinese question classification based on word2vec | |
Cui et al. | Aspect level sentiment classification based on double attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |