Automatic generation method, device and system for topic labels
Technical Field
The invention relates to the technical field of computers, in particular to a method, a device and a system for automatically generating topic labels.
Background
The rapid development of the Internet has been accompanied by the generation of a large amount of text data every day. Every Internet user can publish and modify content as a content producer, so content on the Internet grows every day at a remarkable rate. This mass of content brings great convenience to people who want to acquire information, who can find almost any content they are looking for on the Internet; at the same time, a large amount of invalid and junk content exists among it, so while the Internet information era has fundamentally changed the way new information is acquired, it has also brought new challenges. Balancing the relationship between content generation and content acquisition has become a research focus in the field of Natural Language Processing (NLP). How to efficiently acquire useful information from large amounts of disordered, unstructured text has become an urgent problem. Among many information processing methods, text generation provides a way for users to quickly understand the content of an original text, helps people grasp the information it contains, and is a key to balancing the relationship between content generation and content acquisition.
A topic is a high-level generalization of the upper semantic content of a text: it is the central meaning obtained by condensing the text. In generating text with such high-order semantics, the word and sentence levels of the text must first be effectively represented, and the representation vectors of words and sentences must then be abstracted into higher-order feature words that form the topic.
In text analysis of social data such as microblogs, a large amount of text may not contain topic tags. To solve this problem, topics need to be produced by automatically summarizing the texts. At present, thorough research on automatic topic generation and induction is lacking; domestic research on social media analysis mainly focuses on topic keyword extraction and topic content clustering. Microblogs report the latest topics from daily life and from around the world, reflecting real-time events in our lives. Considering the huge volume of microblog content and its inherent redundancy and noise, generating topics by clustering information is not suitable for practical deployment and is limited by clustering performance. Most current topic-label generation methods organize semantic material in rule-based forms to generate the labels, and the whole pipeline suffers from error accumulation and error propagation. Such topic summarization methods have low availability, narrow adaptability, low accuracy and high cost.
Disclosure of Invention
In order to solve the technical problem, the invention provides an end-to-end topic label automatic generation method based on a Transformer model.
A topic label automatic generation method comprises the following steps:
step one: constructing a training data set and preprocessing the data;
step two: implementing a Transformer encoder that performs feature encoding with a content selection mechanism based on content segments;
step three: building a topic abstract generator model based on a Transformer decoder;
step four: training on the data, tuning and optimizing according to cross validation, and implementing the model encapsulation and interface of the device.
Further, in the first step, the method for constructing the training data set and preprocessing the data includes:
dividing the microblog topic and the microblog content, and generating the topic label from the Source text;
screening sentences with topic semantics, and generating the topic from the screened sentences;
dividing the microblog content into segments: segmenting the source text content and presenting it in segment form;
semantically encoding the source text in segment form, and adding [cls] and [eos] tags to each segment;
combining each segment with its start and end tags, and adding a [senten] tag at the beginning of the sentence for learning the semantics of the sentence, to obtain the Source data;
and constructing the training data set: processing the data with the topic as Target and feeding the data into the model for training, to obtain the initial training corpus.
Further, in the second step, the method for implementing the Transformer encoder with a content selection mechanism based on content segments includes:
encoding the microblog content with the content-selection Transformer, obtaining the microblog content vector representation and the source text sentence feature encoding vector:
source_embedding = Transformer(weibo content)
extracting the embeddings corresponding to the sentence [senten] tag and the [cls_i] tags:
T_senten = GetSenten(source_embedding)
T_cls = GetCls(source_embedding)
wherein T_senten represents the feature vector of the source text sentence output by the Transformer encoder, and T_cls represents the set of feature vectors of the content segments output by the Transformer encoder;
performing Transformer feature encoding with the content selection mechanism by computing the importance of each segment representation [cls_i] with respect to the sentence representation [senten], mainly using a bilinear-function attention mechanism over T_senten and T_cls:
R_i = T_senten W T_cls
wherein R_i represents the feature-weight score integrating the input vectors T_senten and T_cls; the semantic correlation of the two tags is learned through the weight matrix W, and normalization with a Softmax function yields α_i, i.e. the importance weight of each T_cls relative to T_senten;
extracting the 3 [cls_i] tags with the highest similarity and taking the corresponding content segment texts as the input of the generator,
wherein T_[3] denotes all the token vectors of the three selected important [cls_i] segments.
Further, in the third step, the topic abstract generator model of the Transformer decoder is implemented as follows:
encoding the topic text with a Transformer:
target_embedding = Transformer(weibo topic);
inputting the feature codes obtained in step two together with the topic codes into the Transformer abstract generator to generate the abstract:
y = Decoder(target_embedding, T_[3]).
Further, in the fourth step, the model is trained on the data and tuned according to cross validation, and the model encapsulation and interface of the device are implemented; a loss function is designed to train the model, and after the parameters are tuned and optimized, the trained model is used for packaging the device interface.
The invention also provides a technical scheme, which comprises the following steps:
An automatic generation device for topic labels comprises an information input module, a topic label automatic generation module and an information output module, wherein the information input module is used for preprocessing the content of a source text and inputting the source text; the topic label automatic generation module is used for performing abstract generation on the input source text by applying the above topic label automatic generation method based on the content selection mechanism over content segments; and the information output module outputs the automatically generated abstract through an interface program.
The invention also provides a technical scheme, which comprises the following steps:
When the server executes the summary generation process, it obtains a source text from the data input module through the above automatic topic label generation device based on the content selection mechanism over content segments, and executes the above method to obtain the final topic summary output for the source text.
Aiming at the application of generative text summarization to the generation of upper-level semantic topic labels, the invention provides an automatic topic label generation method that designs a Transformer feature encoding model with a content selection mechanism and realizes topic text generation by training a generative Transformer summary generation model. Traditional topic labels are all set by manual editing; the invention realizes automatic generation of topic labels through text summary generation technology and provides a new scenario for topic label generation.
The invention provides Transformer encoding with a content selection mechanism that extracts important source text segments and inputs them into the decoder for text generation, which not only captures effective core semantic segments but also reduces the cost of model training.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a topic tag generation model based on a content selection mechanism according to the present invention.
Detailed Description
So that the manner in which the features and aspects of the embodiments of the present invention can be understood in detail, a more particular description of the embodiments of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings.
To clearly illustrate the design concept of the present invention, the present invention will be described with reference to the following examples.
Example one
As shown in fig. 1, the automatic generation method of topic labels includes:
Step one: constructing a training data set and preprocessing the data, which comprises the following steps:
Taking a microblog topic as an example, the typical form is: "#International Import Expo# 8-day countdown! The second China International Import Expo will be held in Shanghai from November 5 to 10. Food, entertainment and daily goods: good things from around the world will converge here! CCTV News will open multi-channel live broadcasts and take you into this shopping spree! Forward to tell more people!". The portion enclosed in "#" is the topic content, so the microblog topic is "#International Import Expo#".
Firstly, the microblog topic and the microblog content are divided: the part enclosed by "#" is extracted as the topic content (Target), and the remaining part is cleaned as the microblog content (Source). The invention generates the topic label from the Source text.
In fact, it can be seen in this embodiment that the semantics of the microblog topic can be covered by some of the sentences in the source text. Secondly, the microblog content is divided into segments: the source text content is split at symbols such as periods and commas and presented in segment form.
Thirdly, the source text is semantically encoded: for the segments obtained after splitting, [cls] and [eos] tags are added before and after each segment respectively. The [cls] tag at the beginning of a segment is mainly used to learn encoding information during sentence modeling and can represent the semantics of the whole segment; the [eos] tag is mainly used to learn the semantics of the segment ending.
Fourthly, each segment is combined with its start and end tags, and a [senten] tag is added at the beginning of the sentence to learn the semantics of the entire sentence. This yields Source data with the structure {[senten], [cls_1], [x_1], [cls_2], [x_2], ..., [eos]}, where x denotes the word-vector representation within a segment.
Fifthly, the training data is constructed. The purpose of the invention is to generate topic labels, and a large number of real samples are needed in the modeling process. After collecting a large number of real topic labels and their corresponding source texts from the microblog platform, constructing the training data set comprises processing the data with the topic as Target and feeding it into the model for training, where the source text is the encoded content and the topic is the decoded content. The topic Target data has the structure {[senten], [y], [eos]}. This yields the initial training corpus.
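The preprocessing steps above can be sketched in Python. This is a minimal illustration, not the patent's implementation: the tag strings, the splitting punctuation, and all helper names (`split_weibo`, `build_source`, `build_target`) are assumptions chosen to mirror the description.

```python
import re

# Tag names follow the text; actual tokenizer vocabulary is an assumption.
SENTEN, CLS, EOS = "[senten]", "[cls]", "[eos]"

def split_weibo(post: str):
    """Separate the #topic# label (Target) from the remaining content (Source)."""
    m = re.search(r"#([^#]+)#", post)
    target = m.group(1).strip() if m else ""
    source = re.sub(r"#[^#]+#", "", post).strip()
    return source, target

def build_source(source: str):
    """Split content into segments on periods/commas and wrap each in [cls]...[eos]."""
    segments = [s.strip() for s in re.split(r"[。，,.!！?？;；]", source) if s.strip()]
    tokens = [SENTEN]
    for seg in segments:
        tokens += [CLS, seg, EOS]
    return tokens

def build_target(topic: str):
    """Target structure {[senten], [y], [eos]} from the description."""
    return [SENTEN, topic, EOS]

post = "#ImportExpo# countdown 8 days! The expo will be held in Shanghai."
src, tgt = split_weibo(post)
print(build_source(src))
print(build_target(tgt))
```

In a real pipeline the segment strings would be further tokenized into word vectors (the x in the structure above) before encoding.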
Step two: implementing the Transformer feature encoder based on the content segment selection mechanism, as follows:
After semantic encoding of the source text and the topic, a generation model needs to be established for modeling topic label generation. Drawing on the sequence-to-sequence feature encoding capability of the Transformer model, the invention provides Transformer Encoder feature encoding based on a content selection mechanism over content segments: the three content segments with the highest importance scores are selected as a compressed representation of the whole microblog content, and their vectors are input into the decoder to assist the Transformer decoder at the decoder end of the sequence model in generating the topic. The specific steps are as follows:
Firstly, as shown in fig. 2, the content-selection Transformer provided by the invention encodes the microblog content to obtain the microblog content vector representation and the feature encoding vector of the source text sentence:
source_embedding = Transformer(weibo content)
Secondly, the sentence [senten] tag and the [cls_i] tags respectively imply the overall features of the sentence and the features of each content segment. This step extracts the embeddings corresponding to the [senten] tag and the [cls_i] tags:
T_senten = GetSenten(source_embedding)
T_cls = GetCls(source_embedding)
wherein T_senten represents the feature vector of the source text sentence output by the Transformer encoder, and T_cls represents the set of feature vectors of the content segments output by the Transformer encoder.
Thirdly, Transformer feature encoding with the content selection mechanism computes the importance of each segment representation [cls_i] with respect to the sentence representation [senten], mainly using a bilinear-function attention mechanism over T_senten and T_cls:
R_i = T_senten W T_cls
wherein R_i represents the feature-weight score integrating the input vectors T_senten and T_cls, and the semantic correlation of the two tags is learned through the weight matrix W. Normalization with a Softmax function yields α_i, i.e. the importance weight of each T_cls relative to T_senten.
Fourthly, the 3 [cls_i] tags with the highest similarity are extracted, and the corresponding content segment texts are used as the input of the generator, wherein T_[3] denotes all the token vectors of the three selected important [cls_i] segments.
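The bilinear attention selection above can be sketched as follows. This is an illustrative sketch under assumptions: the use of PyTorch, the vector dimensions, and the function name `select_top_segments` are not specified by the patent.

```python
import torch

def select_top_segments(t_senten, t_cls, W, k=3):
    """
    t_senten: (d,)   sentence-level [senten] feature vector
    t_cls:    (n, d) one [cls_i] feature vector per content segment
    W:        (d, d) learned bilinear weight matrix
    Returns the indices of the k most important segments and all weights.
    """
    # Bilinear score R_i = T_cls_i W T_senten for each segment, shape (n,)
    scores = t_cls @ W @ t_senten
    # Softmax normalization gives the importance weight of each segment
    alpha = torch.softmax(scores, dim=0)
    topk = torch.topk(alpha, k=min(k, alpha.numel()))
    return topk.indices, alpha

d, n = 8, 5
t_senten = torch.randn(d)
t_cls = torch.randn(n, d)
W = torch.randn(d, d)
idx, alpha = select_top_segments(t_senten, t_cls, W)
print(idx)  # indices of the three selected segments
```

The token vectors of the segments at `idx` would then be concatenated to form T_[3], the generator's input.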
Step three: implementing the Transformer-based topic abstract generator model, as follows:
The topic abstract generator is based on the Transformer Decoder structure and has two input parts: Source data and Target data. In the generation model, the Source data is the microblog content feature encoding produced in step two by the Transformer with the content selection mechanism over topic segments, and the Target is obtained by Transformer-encoding the topic text.
Firstly, the topic text is encoded with a Transformer:
target_embedding = Transformer(weibo topic).
Secondly, the feature codes from step two and the topic codes are input into the Transformer abstract generator to generate the abstract:
y = Decoder(target_embedding, T_[3]).
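A minimal sketch of the decoding step, assuming a standard PyTorch `nn.TransformerDecoder` stands in for the patent's trained generator; all sizes and layer counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, vocab = 64, 1000
layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)
out_proj = nn.Linear(d_model, vocab)

# Encoded topic tokens (Target) and the token vectors of the selected
# top-3 segments (T_[3]) acting as the decoder's memory.
target_embedding = torch.randn(1, 6, d_model)
selected_memory = torch.randn(1, 20, d_model)

hidden = decoder(tgt=target_embedding, memory=selected_memory)
logits = out_proj(hidden)  # per-position vocabulary scores, shape (1, 6, vocab)
```

At inference time decoding would proceed autoregressively with a causal mask; the teacher-forced single pass here is only for illustration.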
Step four: training on the data, tuning and optimizing according to cross validation, and implementing the model encapsulation and interface of the device.
A loss function is designed to train the model, and after the parameters are tuned, the trained model is packaged behind the device interface.
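The training step can be sketched as a teacher-forced cross-entropy loop. The choice of cross-entropy loss and the Adam optimizer are assumptions for illustration; the `nn.Linear` stands in for the full generator of step three.

```python
import torch
import torch.nn as nn

vocab, seq_len, batch = 100, 5, 2
model = nn.Linear(16, vocab)  # stand-in for the full Transformer generator
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(batch, seq_len, 16)        # decoder hidden states
gold = torch.randint(0, vocab, (batch, seq_len))  # reference topic token ids

logits = model(features)                          # (batch, seq_len, vocab)
loss = loss_fn(logits.view(-1, vocab), gold.view(-1))
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```

In practice this step would run over the whole corpus for multiple epochs, with hyperparameters selected by the cross validation mentioned above.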
The invention first considers that topic texts are often composed of several phrases or word combinations, whose grammatical structure differs from the semantic continuity of news headlines or news topic sentences. Different semantic units, such as Chinese characters and words, are fed into the model as representation-learning objects to examine their respective generation effects. Since topic content usually comes from a fragment of the original text, i.e. part of a clause, secondary screening is performed at the clause level, and the screened clause content is input into the encoder as the context semantics of the generated text. The proposed Transformer structure with a content selection mechanism models the feature representation vectors of semantic units at the bottom layer and clause-level feature representation vectors at the upper layer; important semantic segments are obtained by computing clause-level feature weights with attention, and are then input into the encoder as context feature representation vectors. In this way important content is selected and integrated, and the cost of model training is reduced.
Aiming at the application of generative text summarization to the generation of upper-level semantic topic labels, the invention provides an automatic topic label generation method that designs a Transformer feature encoding model with a content selection mechanism and realizes topic text generation by training a generative Transformer summary generation model. Traditional topic labels are all set by manual editing; the invention realizes automatic generation of topic labels through text summary generation technology and provides a new scenario for topic label generation.
The invention provides Transformer encoding with a content selection mechanism that extracts important source text segments and inputs them into the decoder for text generation, which not only captures effective core semantic segments but also reduces the cost of model training.
Example two
An automatic generation device for topic labels comprises an information input module, a topic label automatic generation module and an information output module, wherein the information input module is used for preprocessing the content of a source text and inputting the source text; the topic label automatic generation module is used for performing abstract generation on the input source text by applying the above topic label automatic generation method based on the content selection mechanism over content segments; and the information output module outputs the automatically generated abstract through an interface program.
EXAMPLE III
When the server executes the summary generation process, it obtains a source text from the data input module through the above automatic topic label generation device based on the content selection mechanism over content segments, and executes the above method to obtain the final topic summary output for the source text.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention.