CN115017404A - Target news topic abstracting method based on compressed space sentence selection - Google Patents

Target news topic abstracting method based on compressed space sentence selection

Info

Publication number
CN115017404A
Authority
CN
China
Prior art keywords
sentence
sentences
topic
abstract
news
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210449431.0A
Other languages
Chinese (zh)
Inventor
余正涛
卢天旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202210449431.0A priority Critical patent/CN115017404A/en
Publication of CN115017404A publication Critical patent/CN115017404A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a target news topic abstracting method based on compressed space sentence selection, and belongs to the field of natural language processing. The invention includes: constructing a target news topic abstract data set; filtering information irrelevant to the topic description with a sentence importance evaluation module so as to compress the search space; encoding the screened document set and its sentences with a document set encoder module based on an improved Bert model; and then calculating the prominence features and repetition features of the sentences in the encoded document set and extracting the sentences with the highest final scores to compose the topic abstract. The method integrates the guidance of topic key descriptors and uses a sentence selection method that balances prominence features against repetition features, so that the quality of the generated topic abstract is improved on the basis of a compressed search space, and support is provided for subsequent tasks such as target supervision.

Description

Target news topic abstracting method based on compressed space sentence selection
Technical Field
The invention relates to a target news topic abstracting method based on compressed space sentence selection, and belongs to the technical field of natural language processing.
Background
When a hot case-related event occurs in society, information spreads and ferments rapidly on news websites and media, and effective target supervision becomes a key problem. Acquiring this information quickly by technical means and extracting the key content of case-topic news in time is very important for the relevant departments to monitor network trends and maintain a stable network order. Hundreds of news documents may fall under the same topic; if users search for topic-describing information directly in the news, they can only read the documents one by one at great cost in time and energy, which is very inconvenient. Extracting from the news of a topic cluster, by automatic text summarization, a topic abstract that is concise in form, accurate in coverage and aimed at the topic's key information greatly reduces users' reading time, effectively lowers the cost of information storage, and plays an important role when the relevant departments carry out supervision work on a target topic and acquire its key content.
At present there is considerable research on topic summarization in the general domain; in recent years deep learning and neural network algorithms have developed rapidly, and their application to the topic summarization task has produced many results. Cheng et al., with a data-driven summarization framework based on an encoder-extractor and a hierarchical neural network structure, extract sentence and word features separately to obtain better abstracts. Cao et al. use a recurrent-neural-network ranking model to select sentences in a document set, casting the sentence-ranking task as a hierarchical regression process that simultaneously measures the prominence of sentences and of their constituents in the parse tree. Zhang et al. propose an enhanced multi-view convolutional neural network to jointly obtain sentence features and rank the sentences. Nallapati et al. propose recurrent-neural-network-based prediction of model-generated abstracts, covering abstract features such as sentence content, saliency and repetition; it can be trained on the reference abstract alone, removing the dependence on sentence extraction labels. However, a target news topic abstract addresses news documents under the same case topic, whose titles and body text contain the key information of the target topic. A general-purpose automatic summarization method easily omits or incompletely covers this key information and may extract sentences irrelevant to the description of the target topic, so the prominence of the sentences in the abstract is low. In addition, existing methods that directly extract representative sentences from the original text with a sentence selection model are simple and practical, but on the target news topic abstract task they cannot adapt to the many case elements of the news data in a topic cluster; the generated abstract sentences are not key enough, are highly repetitive, and lack practicality. It is therefore worth screening sentence importance in combination with the key words of the target topic: with a sentence selection method that balances prominence features against repetition features, the quality of the generated topic abstract can be improved on the basis of a compressed search space.
Disclosure of Invention
The invention provides a target news topic abstracting method based on compressed space sentence selection, which integrates the guidance of topic key descriptors and uses a sentence selection method that balances prominence features against repetition features, improving the quality of the generated topic abstract on the basis of a compressed search space.
The technical scheme of the invention is as follows: a target news topic abstracting method based on compressed space sentence selection comprises the following specific steps:
step1, crawling target news with crawler technology and selecting topic-related news, from which 30 topic news clusters are selected; the news items in each cluster describe the same topic, each cluster contains 20 target news items including titles and body text, and the data set is constructed from 15343 sentences in total; performing data denoising, cleaning and preprocessing; analyzing the crawled target news so that each news item belongs to only one topic cluster, labeling the news documents under the same topic cluster to obtain the labels of the sentences in the documents, and manually compiling a reference abstract for each topic cluster;
step2, screening out, through the defined topic key description words, the sentences in each document that contain topic words and have the highest importance scores; encoding the screened document set and its sentences with an improved pre-training model; obtaining two sentence features through a prominence calculation module and a repetition feature calculation module; and balancing the prominence features and repetition features through a sentence selection model when generating the abstract, thereby obtaining an abstract containing accurate topic information.
The specific steps of Step1 are as follows:
step1.1, crawling recent news data related to the target topics from major news websites and public platforms with crawler technology, and selecting 30 topic news clusters; the news items in each cluster describe the same topic, each cluster contains 20 target news items comprising titles and body text, and 15343 sentences in total form the data set;
step1.2, performing data cleaning, denoising and preprocessing on the crawled data, including the removal of web-page tags, advertisement links and special symbols, the removal of repeated data, foreign-language data and traditional Chinese characters, and the manual calibration of the relevance of the news data to the case topics;
step1.3, obtaining the sentence labels in the data set by manual annotation, and manually compiling a reference abstract for each topic cluster;
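The cleaning pass of step1.2 can be sketched as below; the regular expressions and function names are illustrative assumptions rather than the patent's own code, and relevance calibration remains a manual step:

```python
# Hypothetical cleaning pass over crawled news records (step1.2).
import re

TAG_RE = re.compile(r"<[^>]+>")        # web-page tags
URL_RE = re.compile(r"https?://\S+")   # advertisement links
SPECIAL_RE = re.compile(r"[^\w\u3000-\u303f\uff01-\uff5e\s]")  # special symbols

def clean_text(raw: str) -> str:
    """Strip tags, links, and special symbols from one news article."""
    text = TAG_RE.sub("", raw)
    text = URL_RE.sub("", text)
    text = SPECIAL_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(articles):
    """Drop exact duplicates; traditional-character and foreign-language
    filtering would use additional tooling and is not sketched here."""
    seen, kept = set(), []
    for a in articles:
        t = clean_text(a)
        if t and t not in seen:
            seen.add(t)
            kept.append(t)
    return kept
```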
As a preferable scheme of the invention, Step2 comprises the following specific steps:
step2.1, defining the keywords of a topic cluster, extracting the sentence set containing the keywords through regular-expression matching, filtering out sentences containing irrelevant information, calculating the word frequency of the keywords in each news document to obtain the importance scores of the sentences, extracting the sentences with the highest importance scores in each document, and combining them into a new topic cluster document set;
step2.2, encoding the document set and its sentences with a document set encoder based on an improved Bert model to obtain their representations;
step2.3, measuring the prominence of the candidate sentences against the not-yet-selected sentences through a bilinear mapping function;
step2.4, calculating, in the sentence selection process, the n-gram phrase overlap between the candidate sentences and the already selected abstract sentences, calculating the similarity of their semantic representations through cosine similarity, performing normalization and discretization, converting the two features into one-hot vector representations, and concatenating and fusing them to obtain the repetition feature vectors;
and Step2.5, balancing the prominence features and repetition features of the candidate sentences through a bilinear mapping function to obtain matching vectors, inputting the matching vectors into a multilayer perceptron to obtain the final scores of the candidate sentences, and putting the sentences with high scores into the summary sentence set to obtain the topic summary.
In a preferred embodiment of the present invention, the step2.1 specifically comprises:
calculating the importance score of a sentence starts from the score of each keyword, namely its word frequency; a sentence set containing the keywords is extracted with Python regular-expression matching, sentences containing irrelevant information are filtered out, and the word frequency of each keyword in the news document is then calculated; let num(w_i) be the number of times keyword w_i occurs in a given news article and Σ_w num(w) be the total number of occurrences of all words in that article; the score of keyword w_i is then

SC(w_i) = num(w_i) / Σ_w num(w)    (1)

and the score of a sentence in a document is obtained from the scores of the keywords it contains: for the nth sentence s_mn of the mth document in the document set,

SC(s_mn) = Σ_{w_i ∈ s_mn} SC(w_i)    (2)

after the importance scores of the sentences in the document set are calculated, one round of sentence screening is performed according to these scores; because the document set encoder based on the improved Bert model has a limit on the encoding length, the sentences with the highest importance scores in each document are extracted before sentence encoding and combined into a new document set L for the subsequent encoding operations;
As a preferred embodiment of the present invention, step2.2 specifically comprises:
after the sentence importance evaluation, the newly combined document set L contains m target news sentences {l_1, l_2, …, l_m}, where l_i denotes the ith sentence in the set; in order to obtain high-quality sentence and document-set representations, the Bert pre-training model is adapted to the topic abstract task; the original Bert model encodes at the Token level rather than the sentence level, and its segment embedding part, which contains only two types of segment embedding, is designed to judge whether two sentences are related, so it cannot be applied directly to a topic abstract task whose input is many sentences; the document set encoder based on the improved Bert model therefore adds a [CLS] tag before each sentence l_i of the document set to summarize the sentence embedding information, and an [SEP] tag at its tail to mark the boundaries between different sentences. To distinguish sentences at different positions, two different interval segment embeddings E_odd and E_even are introduced: for sentence l_i, if i is odd the interval segment embedding of the sentence is E_odd, and when i is even it is E_even; with this encoding scheme each sentence obtains the fusion of three embeddings, its Token embedding E_l, its interval segment embedding E_odd/E_even and its position embedding E_p; after encoding by several Transformer encoding layers, the representation T_[CLS] output at the [CLS] tag preceding sentence l_i is taken as the representation of the corresponding sentence and denoted E'_li; each E'_li is fused with a position embedding E'_p of the document set encoder to form an input representation sequence, an embedding E_set representing the document set is added at the head of the sequence, and the combined, complete document set-sentence representation input sequence is fed into several further Transformer encoding layers, finally yielding the representation r_set of the complete document set L and the encoded representation r_li of each sentence;
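The modified input embedding can be sketched as follows; the dimensions mirror the configuration given later (768 hidden units, vocabulary 30522), while the class and argument names are our own assumptions:

```python
# Sketch of the improved input embedding: E_l + E_odd/E_even + E_p.
import torch
import torch.nn as nn

class DocSetInputEmbedding(nn.Module):
    def __init__(self, hidden=768, max_pos=512, vocab=30522):
        super().__init__()
        self.tok = nn.Embedding(vocab, hidden)    # Token embedding E_l
        self.seg = nn.Embedding(2, hidden)        # interval segments E_odd / E_even
        self.pos = nn.Embedding(max_pos, hidden)  # position embedding E_p

    def forward(self, token_ids, sentence_index):
        # token_ids: (seq_len,) ids of the [CLS] s_1 [SEP] [CLS] s_2 [SEP] ... stream
        # sentence_index: (seq_len,) 1-based index i of the sentence each token belongs to
        positions = torch.arange(token_ids.size(0), device=token_ids.device)
        seg_ids = sentence_index % 2  # 1 -> E_odd row for odd i, 0 -> E_even row for even i
        return self.tok(token_ids) + self.seg(seg_ids) + self.pos(positions)
```

The fused sequence would then pass through the Transformer layers; the [CLS] outputs, together with E_set and E'_p, feed the second-stage document-set Transformer described above.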
In a preferred embodiment of the present invention, the step2.3 specifically comprises:
the topic abstract task needs to extract representative sentences, namely sentences of high prominence, so a single-step sentence prominence calculation module is designed on the basis of the document-set and sentence encodings; let the manually written reference abstract of document set L be R; the aim is to extract from L k sentences that summarize the key information as abstract sentences; at the tth selection step, the summary sentence set generated so far is S_{t-1} = {l_1, l_2, …, l_{t-1}}; let l_j be a sentence in L that has not yet been selected; a bilinear mapping function F_pro computed over the set representation r_set output by the document set encoder and the sentence representation r_li measures the probability that the selected sentence is contained in the reference abstract R:

F_pro(l_i) = r_li W_bm r_set^T    (3)

where W_bm is the weight matrix of the bilinear mapping, which linearly transforms the two vectors r_set and r_li of different dimensions and maps them into a common space; the objective function maximizes the log-likelihood of the sentences of the training sample that are contained in the reference abstract R:

max Σ_{l_i ∈ R} log F_pro(l_i)    (4)

the bilinear mapping function F_pro serves as the prominence scoring function between the current candidate sentence l_i and the sentences l_j that have not yet been selected, and yields the attention score of each candidate sentence, namely the prominence score of the sentence;
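A sketch of the bilinear prominence scorer under the form assumed in formula (3); the initialization, shapes, and the final sigmoid (added so the scores behave as probabilities in objective (4)) are our assumptions:

```python
# Sketch of F_pro(l_i) = r_li . W_bm . r_set, one score per candidate.
import torch
import torch.nn as nn

class ProminenceScorer(nn.Module):
    def __init__(self, hidden=768):
        super().__init__()
        self.w_bm = nn.Parameter(torch.empty(hidden, hidden))
        nn.init.xavier_uniform_(self.w_bm)

    def forward(self, r_set, r_sents):
        # r_set: (hidden,) document-set vector; r_sents: (num_candidates, hidden)
        scores = r_sents @ self.w_bm @ r_set
        return torch.sigmoid(scores)  # squash to (0, 1) for the log-likelihood loss
```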
As a preferred embodiment of the present invention, step2.4 specifically includes:
after the prominence score of a candidate sentence is calculated, the repetition features of the sentence are calculated; at the tth selection step, the n-gram matching feature of this step is computed first; it represents the degree of overlap between the n-gram phrases of the candidate sentence l_i and those of the most recently selected abstract sentence l_{t-1}:

F_ngram(l_i) = |ngram_n(l_i) ∩ ngram_n(l_{t-1})| / |ngram_n(l_i)|    (5)

the more overlapping phrases, the stronger the repetition features; to calculate the repetition features accurately, the phrase overlap ratios of the unigram, bigram and trigram models are calculated separately; to mine deeper similarity between sentence representations, the maximum semantic similarity of the sentence representations, obtained by cosine similarity against the already selected sentences, is fused on top of the n-gram phrase overlap ratios:

F_sim(l_i) = max_{l ∈ S_{t-1}} cos(r_li, r_l)    (6)

to widen the numerical differences of this overlap feature computed from the cosine similarity between the candidate sentence and the selected sentences, the feature values are spread over the interval from 0 to 1 by linear normalization:

F'_sim(l_i) = (F_sim(l_i) - min_j F_sim(l_j)) / (max_j F_sim(l_j) - min_j F_sim(l_j))    (7)

the repetition feature calculation module thus computes two kinds of repetition features, which are fused into an overall repetition feature; the interval from 0 to 1 is first divided into c equal blocks, the values of the unigram, bigram and trigram phrase overlap features and of the normalized semantic similarity feature are each dispersed into the corresponding equal block between 0 and 1, each part of the features is converted into a one-hot vector representation of length c, and the parts are concatenated and fused to obtain the repetition feature vector representation of the whole module:

F_rep(l_i) = [q(F_1gram(l_i)); q(F_2gram(l_i)); q(F_3gram(l_i)); q(F'_sim(l_i))]    (8)

where q(·) denotes the one-hot vector obtained after partitioning each part of the repetition feature into the c blocks; this partitioning captures the influence of each part's repetition features, and the selected abstract sentences are expected to carry as few repetition features as possible;
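The repetition-feature computation of formulas (5)-(8) can be sketched as below; the exact overlap ratio and bin assignment are assumptions consistent with the description:

```python
# Sketch of the repetition feature module (formulas 5-8).
import numpy as np

def ngram_overlap(cand, prev, n):
    """Share of the candidate's n-grams also found in the last selected sentence."""
    grams = lambda s: {tuple(s[i:i + n]) for i in range(len(s) - n + 1)}
    g_c = grams(cand)
    return len(g_c & grams(prev)) / max(len(g_c), 1)

def max_cosine(cand_vec, summary_vecs):
    """Maximum cosine similarity to any already selected sentence (formula 6)."""
    sims = [float(np.dot(cand_vec, v) /
                  (np.linalg.norm(cand_vec) * np.linalg.norm(v) + 1e-12))
            for v in summary_vecs]
    return max(sims, default=0.0)

def minmax(values):
    """Linear normalization spreading values over [0, 1] (formula 7)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def one_hot_bin(x, c=20):
    """Discretize x in [0, 1] into one of c equal blocks as a one-hot vector."""
    v = np.zeros(c)
    v[min(int(x * c), c - 1)] = 1.0
    return v

def repetition_vectors(cands_toks, prev_toks, cand_vecs, summary_vecs, c=20):
    """F_rep for every candidate of the current step (formula 8)."""
    sims = minmax([max_cosine(v, summary_vecs) for v in cand_vecs])
    out = []
    for toks, s in zip(cands_toks, sims):
        parts = [ngram_overlap(toks, prev_toks, n) for n in (1, 2, 3)] + [s]
        out.append(np.concatenate([one_hot_bin(p, c) for p in parts]))
    return out
```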
In a preferred embodiment of the present invention, step2.5 specifically comprises:
after the prominence score and the repetition feature are obtained from the sentence prominence calculation module and the repetition feature calculation module, the two features must be balanced in the sentence selection module, so that a selected abstract sentence has a certain prominence without containing too many repetition features; in the first step of sentence selection only the sentence with the highest prominence score is extracted, as the first sentence of the abstract; a bilinear mapping of the prominence feature F_pro(l_i) and the repetition feature F_rep(l_i) balances the two features of candidate sentence l_i and produces a d-dimensional matching vector, which is input into an MLP to obtain the final score SC(l_i) of the sentence:

SC(l_i) = W_h (F_pro(l_i) W_F F_rep(l_i))    (9)

where W_F is the bilinear mapping matrix of the two features and W_h is the weight matrix of the MLP; the sentence selection module randomly selects sentences from the reference abstract R during training, so that the model learns the context information and learns to find the next prominent, non-repeated sentence; the objective function is

max Σ_t log P(l_t | S_{t-1}),  with  P(l_i | S_{t-1}) = exp(SC(l_i)) / Σ_{l_j ∈ L \ S_{t-1}} exp(SC(l_j))    (10)

the objective function states that in the tth step the probability of selecting any sentence l_i is the softmax of its score SC(l_i) over the sentences l_j remaining in L; the loss of the sentence selection module is independent of the order of sentence selection, because the given sentences form an unordered set during training and the selection target of the module is always the next prominent, non-repeated sentence; the finally obtained sentence set S_k = {l_1, l_2, …, l_k} serves as the generated topic summary;
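Putting the modules together, the greedy selection loop implied by the description looks roughly like this; f_pro, f_rep and score_fn are placeholders standing in for the trained modules of formulas (3), (8) and (9):

```python
# Sketch of the inference-time selection loop.
def select_summary(sentences, f_pro, f_rep, score_fn, k):
    """f_pro scores prominence, f_rep builds the repetition vector against
    the current summary, score_fn fuses them into the final score SC(l_i)."""
    remaining = list(sentences)
    first = max(remaining, key=f_pro)          # first step: prominence only
    summary = [first]
    remaining = [s for s in remaining if s is not first]
    while remaining and len(summary) < k:
        best = max(remaining,
                   key=lambda s: score_fn(f_pro(s), f_rep(s, summary)))
        summary.append(best)
        remaining.remove(best)
    return summary
```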
Further, the topic cluster news documents and sentences are encoded with the document set encoder based on the improved Bert model, which comprises 12 hidden layers with 12 attention heads per layer, a hidden-layer dimension of 768 and a vocabulary size of 30522; the Transformer with which the document set encoder encodes the document set has 2 layers, and the dropout of each layer is set to 0.1; the training batch size is 128, the number of training rounds is 20 and the learning rate is 2e-3; the optimizer is Adam with beta_1 = 0.9 and beta_2 = 0.999; the one-hot vector representation length c of each repetition-feature part in formula (8) is 20, the feature dimension d output by the bilinear mapping of the prominence and repetition features in formula (9) is 10, and unordered sentences randomly selected from the reference abstract are used as context information for model training.
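For reference, the training setup can be restated as a plain configuration; the values repeat the text above, while the dictionary and its key names are our own convenience:

```python
# Hyperparameters restated from the description (key names are ours).
CONFIG = {
    "encoder_hidden_layers": 12, "attention_heads": 12, "hidden_dim": 768,
    "vocab_size": 30522, "docset_transformer_layers": 2, "dropout": 0.1,
    "batch_size": 128, "epochs": 20, "learning_rate": 2e-3,
    "adam_betas": (0.9, 0.999), "one_hot_bins_c": 20, "match_dim_d": 10,
}
```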
The invention has the beneficial effects that:
(1) For the target news topic abstract, many news documents exist under the same topic cluster, the search space is large, and many sentences are irrelevant to the key information; to build a better sentence selection model, so that the generated topic abstract reflects the key information of the target topic news while reducing repeated and unimportant sentences, a target news topic abstracting method based on compressed space sentence selection is provided, and a sentence importance evaluation module is designed to filter out irrelevant information and compress the search space;
(2) the invention provides two parallel modules that calculate the prominence features and the repetition features respectively;
(3) the invention extracts, after the sentence selection model balances the two kinds of features, the sentences with the highest final scores to compose the topic abstract, solving the problems of a large search space, low prominence of the generated abstract and excessive repeated information in the target news topic abstract task.
Drawings
FIG. 1 is a flow chart of the target news topic abstracting method based on compressed space sentence selection proposed by the invention;
FIG. 2 is a model diagram of the document set encoder module based on the improved Bert model in the processing flow of the invention.
Detailed Description
Example 1: as shown in FIG. 1 and FIG. 2, a target news topic abstracting method based on compressed space sentence selection includes the following steps:
Step1, crawling target news with crawler technology and selecting topic-related news, from which 30 topic news clusters are selected; the news items in each cluster describe the same topic, each cluster contains 20 target news items comprising titles and body text, and 15343 sentences in total are used to construct the data set; performing data denoising, cleaning and preprocessing; analyzing the crawled target news so that each news item belongs to only one topic cluster, labeling the news documents under the same topic cluster to obtain the labels of the sentences in the documents, and manually compiling a reference abstract for each topic cluster.
Step1.1, crawling target key news from major news websites and public platforms in recent years with crawler technology, and selecting 17889 news items in total on more than ten case topics with high netizen attention, such as a certain rights-protection case;
Step1.2, performing data cleaning, denoising and preprocessing on the crawled data, including the removal of web-page tags, advertisement links and special symbols, the removal of repeated data, foreign-language data and traditional Chinese characters, and the manual calibration of the relevance of the news data to the case topics;
Step1.3, obtaining the sentence labels in the data set by manual annotation, and manually compiling a reference abstract for each topic cluster. The scale of the experimental data set is shown in Table 1:
Table 1 Experimental data set statistics
(table reproduced as an image in the original publication)
Step2, screening out, through the defined topic key description words, the sentences in each document that contain topic words and have the highest importance scores; encoding the screened document set and its sentences with an improved pre-training model; obtaining two sentence features through a prominence calculation module and a repetition feature calculation module; and balancing the prominence features and repetition features through the sentence selection model when generating the abstract, calculating the sentence scores, and extracting the sentences with high scores to obtain an abstract containing accurate topic information.
Step2.1, the importance score of a sentence is calculated starting from the score of each keyword, namely its word frequency; a sentence set containing the keywords is extracted with Python regular-expression matching, sentences containing irrelevant information are filtered out, and the word frequency of each keyword in the news document is then calculated; let num(w_i) be the number of times keyword w_i occurs in a given news article and Σ_w num(w) be the total number of occurrences of all words in that article; the score of keyword w_i is then

SC(w_i) = num(w_i) / Σ_w num(w)    (1)

and the score of a sentence in a document is obtained from the scores of the keywords it contains: for the nth sentence s_mn of the mth document in the document set,

SC(s_mn) = Σ_{w_i ∈ s_mn} SC(w_i)    (2)

after the importance scores of the sentences in the document set are calculated, one round of sentence screening is performed according to these scores; because the document set encoder based on the improved Bert model has a limit on the encoding length, the sentences with the highest importance scores in each document are extracted before sentence encoding and combined into a new document set L for the subsequent encoding operations;
Step2.2, after the sentence importance evaluation, the newly combined document set L contains m target news sentences {l_1, l_2, …, l_m}, where l_i denotes the ith sentence in the set; in order to obtain high-quality sentence and document-set representations, the Bert pre-training model is adapted to the topic abstract task; the original Bert model encodes at the Token level rather than the sentence level, and its segment embedding part, which contains only two types of segment embedding, is designed to judge whether two sentences are related, so it cannot be applied directly to a topic abstract task whose input is many sentences; the document set encoder module based on the improved Bert model therefore adds a [CLS] tag before each sentence l_i of the document set to summarize the sentence embedding information, and an [SEP] tag at its tail to mark the boundaries between different sentences. To distinguish sentences at different positions, two different interval segment embeddings E_odd and E_even are introduced: for sentence l_i, if i is odd the interval segment embedding of the sentence is E_odd, and when i is even it is E_even; with this encoding scheme each sentence obtains the fusion of three embeddings, its Token embedding E_l, its interval segment embedding E_odd/E_even and its position embedding E_p; after encoding by several Transformer encoding layers, the representation T_[CLS] output at the [CLS] tag preceding sentence l_i is taken as the representation of the corresponding sentence and denoted E'_li; each E'_li is fused with a position embedding E'_p of the document set encoder to form an input representation sequence, an embedding E_set representing the document set is added at the head of the sequence, and the combined, complete document set-sentence representation input sequence is fed into several further Transformer encoding layers, finally yielding the representation r_set of the complete document set L and the encoded representation r_li of each sentence;
Step2.3, the topic abstract task needs to extract representative sentences, namely sentences of high prominence, so a single-step sentence prominence calculation module is designed on the basis of the document-set and sentence encodings; let the manually written reference abstract of document set L be R; the aim is to extract from L k sentences that summarize the key information as abstract sentences; at the tth selection step, the summary sentence set generated so far is S_{t-1} = {l_1, l_2, …, l_{t-1}}; let l_j be a sentence in L that has not yet been selected; a bilinear mapping function F_pro computed over the set representation r_set output by the document set encoder and the sentence representation r_li measures the probability that the selected sentence is contained in the reference abstract R:

F_pro(l_i) = r_li W_bm r_set^T    (3)

where W_bm is the weight matrix of the bilinear mapping, which linearly transforms the two vectors r_set and r_li of different dimensions and maps them into a common space; the objective function maximizes the log-likelihood of the sentences of the training sample that are contained in the reference abstract R:

max Σ_{l_i ∈ R} log F_pro(l_i)    (4)

the bilinear mapping function F_pro serves as the prominence scoring function between the current candidate sentence l_i and the sentences l_j that have not yet been selected, and yields the attention score of each candidate sentence, namely the prominence score of the sentence;
Step2.4, after the prominence score of a candidate sentence is calculated, the repetition features of the sentence are calculated; at the tth selection step, the n-gram matching feature of this step is computed first; it represents the degree of overlap between the n-gram phrases of the candidate sentence l_i and those of the most recently selected abstract sentence l_{t-1}:

F_ngram(l_i) = |ngram_n(l_i) ∩ ngram_n(l_{t-1})| / |ngram_n(l_i)|    (5)

the more overlapping phrases, the stronger the repetition features; to calculate the repetition features accurately, the phrase overlap ratios of the unigram, bigram and trigram models are calculated separately; to mine deeper similarity between sentence representations, the maximum semantic similarity of the sentence representations, obtained by cosine similarity against the already selected sentences, is fused on top of the n-gram phrase overlap ratios:

F_sim(l_i) = max_{l ∈ S_{t-1}} cos(r_li, r_l)    (6)

to widen the numerical differences of this overlap feature computed from the cosine similarity between the candidate sentence and the selected sentences, the feature values are spread over the interval from 0 to 1 by linear normalization:

F'_sim(l_i) = (F_sim(l_i) - min_j F_sim(l_j)) / (max_j F_sim(l_j) - min_j F_sim(l_j))    (7)

the repetition feature calculation module thus computes two kinds of repetition features, which are fused into an overall repetition feature; the interval from 0 to 1 is first divided into c equal blocks, the values of the unigram, bigram and trigram phrase overlap features and of the normalized semantic similarity feature are each dispersed into the corresponding equal block between 0 and 1, each part of the features is converted into a one-hot vector representation of length c, and the parts are concatenated and fused to obtain the repetition feature vector representation of the whole module:

F_rep(l_i) = [q(F_1gram(l_i)); q(F_2gram(l_i)); q(F_3gram(l_i)); q(F'_sim(l_i))]    (8)

where q(·) denotes the one-hot vector obtained after partitioning each part of the repetition feature into the c blocks; this partitioning captures the influence of each part's repetition features, and the selected abstract sentences are expected to carry as few repetition features as possible;
Step2.5, after the prominence score and the repetition feature are obtained through the sentence prominence calculation module and the repetition feature calculation module, the two features are balanced in the sentence selection module, so that a selected abstract sentence has a certain prominence without containing too many repetition features; in the first step of sentence selection only the sentence with the highest prominence score is extracted, as the first sentence of the abstract; a bilinear mapping of the prominence feature F_pro(l_i) and the repetition feature F_rep(l_i) balances the two features of candidate sentence l_i and produces a d-dimensional matching vector, which is input into an MLP to obtain the final score SC(l_i) of the sentence:

SC(l_i) = W_h (F_pro(l_i) W_F F_rep(l_i))    (9)

where W_F is the bilinear mapping matrix of the two features and W_h is the weight matrix of the MLP; the sentence selection module randomly selects sentences from the reference abstract R during training, so that the model learns the context information and learns to find the next prominent, non-repeated sentence; the objective function is

max Σ_t log P(l_t | S_{t-1}),  with  P(l_i | S_{t-1}) = exp(SC(l_i)) / Σ_{l_j ∈ L \ S_{t-1}} exp(SC(l_j))    (10)

the objective function states that in the tth step the probability of selecting any sentence l_i is the softmax of its score SC(l_i) over the sentences l_j remaining in L; the loss of the sentence selection module is independent of the order of sentence selection, because the given sentences form an unordered set during training and the selection target of the module is always the next prominent, non-repeated sentence; the finally obtained sentence set S_k = {l_1, l_2, …, l_k} serves as the generated topic summary;
Step2.6, in the model experiment, the topic cluster news documents and sentences are encoded with the document set encoder based on the improved Bert model, which comprises 12 hidden layers with 12 attention heads per layer, a hidden-layer dimension of 768 and a vocabulary size of 30522; the Transformer with which the document set encoder encodes the document set has 2 layers, and the dropout of each layer is set to 0.1; the training batch size is 128, the number of training rounds is 20 and the learning rate is 2e-3; the optimizer is Adam with beta_1 = 0.9 and beta_2 = 0.999; the one-hot vector representation length c of each repetition-feature part in formula (8) is 20, the feature dimension d output by the bilinear mapping of the prominence and repetition features in formula (9) is 10, and unordered sentences randomly selected from the reference abstract are used as context information for model training.
To illustrate the effect of the invention, 3 groups of comparative experiments were set up. The first group verifies the improvement in topic abstract performance, the second group verifies the effectiveness of the model, and the third group verifies the influence of different abstract lengths on the effectiveness of the model.
(1) Topic summary performance enhancement verification
A comparison experiment is performed by feeding the target news topic abstract data set constructed in Step1 to each baseline model, with the following models selected as reference models: LEAD-3, the LDA topic model, TextRank, DPP, BertSum and RL-MMR. The experimental results are shown in Table 2.
TABLE 2 Performance comparison of the baseline models
(table reproduced as an image in the original publication)
As Table 2 shows, the method of the invention achieves better performance than the other reference models. The LEAD-3 algorithm performs worst when generating the target news topic abstract, because it attends only to the beginning of the document set and extracts the first three sentences by hard truncation; these opening sentences recite much irrelevant information, so the content is unrepresentative and the model performance is poor. The LDA topic model depends on statistical features and, owing to the particularity of target news, suffers from inconsistent topic importance. The TextRank algorithm, based on a graph model, has clear advantages in building the association relations between sentences in a document set, but it does not first screen sentences by importance; applied to the target field it is easily influenced by high-frequency words that are not topic keywords, so it shows clear deficiencies on the ROUGE-2 and ROUGE-L indexes. Every index of the DPP model is better than those of the preceding comparison methods: it selects representative samples through a determinantal point process and uses a capsule network to filter sentences that share few overlapping words yet have repeated semantics, so it removes repetition features well; but it lacks end-to-end representation learning, errors can accumulate, and its effect still leaves room for improvement. The RL-MMR model achieves good results compared with the other comparison models, but its soft-attention sentence ranking is imperfect next to the present model: without the guidance of topic keyword information, non-key information can also appear among the highly ranked sentences. After the sentence importance evaluation module is introduced into the model of the invention, a large number of sentences irrelevant to the key information are filtered out, and the repetition and prominence features are balanced so that the final sentence score is biased toward neither feature; the best effect is therefore obtained.
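The ROUGE-1/2/L indexes reported in Tables 2-4 can be computed with the open-source rouge-score package; since that package tokenizes on non-word characters, Chinese text is assumed here to be pre-segmented into space-separated tokens before scoring:

```python
# Sketch of the ROUGE evaluation; inputs are assumed pre-tokenized strings.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])

def evaluate(generated: str, reference: str):
    """Return the F1 value of each ROUGE variant for one topic cluster."""
    scores = scorer.score(reference, generated)
    return {name: s.fmeasure for name, s in scores.items()}
```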
(2) Model validation
To verify the effectiveness of each module of the model, the main model is ablated into three sub-models: the main model without sentence importance evaluation, without the prominence feature, and without the repetition feature. All evaluation indexes are calculated with ROUGE values, the best results are shown in bold, and the other settings are kept unchanged. The test results are shown in Table 3:
TABLE 3 Performance analysis of the ablated models
(table reproduced as an image in the original publication)
As Table 3 shows, removing the sentence importance evaluation module hurts every index the most: ROUGE-1 drops by 6.28, ROUGE-2 by 6.32 and ROUGE-L by 7.71. With the module removed, the sentence set input to the model is unfiltered and contains many words carrying no key information; the document set encoder module hard-truncates the excess sentence encodings, so the description of the topic content is not accurate enough and differs greatly from the reference abstract. Removing the prominence feature works slightly better than removing the sentence importance evaluation module: ROUGE-1 drops by 4.47, ROUGE-2 by 3.53 and ROUGE-L by 5.94. Although the model still extracts sentences containing topic keywords, without the prominence feature the information in the extracted abstract sentences is unrepresentative and cannot describe the topic content well. Removing the repetition feature reduces each index the least: ROUGE-1 drops by 3.73, ROUGE-2 by 2.42 and ROUGE-L by 3.77. The model retains the prominence feature calculation module, so the extracted sentences contain more representative information; but with the repetition feature removed, the extracted sentence set contains a large amount of repeated information, and although much of it is key information, the generated abstract is still not the best. These results further verify the effectiveness of the method's model.
(3) Verification of influence of different abstract lengths on model effectiveness
To verify the influence of the different abstract lengths generated by the model on the ROUGE indexes, namely whether the model adapts well, the following experiment is performed: abstracts of four different lengths are generated for comparison. The test results are shown in Table 4:
TABLE 4 Influence of different abstract lengths on model effectiveness
(table reproduced as an image in the original publication)
As Table 4 shows, when the generated abstract length is 50 or 100, every index of the model is at its worst and performance drops markedly, because an abstract that is too short loses a large amount of relevant information. The model approaches its best performance when the generated abstract length is 150 and performs best at length 200. The reason is that, when the data set was constructed, the average length of the manually written reference abstracts compiled for each topic cluster in the test set was about 178; the closer the generated abstract is to the reference abstract length, the more co-occurring phrases and longest common subsequences it shares with the reference abstract, and the better the model of the method performs.
The above experimental data prove that, for the problems that the document search space within a topic cluster is large and many sentences are irrelevant to the topic key information, the invention provides a target news topic abstracting method based on compressed space sentence selection. Based on the constructed target news topic abstract data set and the manually written reference abstracts, the experiments prove that the proposed topic abstract model can extract representative information and non-repeated key sentences, and that the generated abstract is of higher quality. The experiments show that the method of the invention obtains the best effect compared with several baseline models. For the target news topic abstract task, the target news topic abstracting method based on compressed space sentence selection effectively improves the performance of news topic abstracts in the target field.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to those embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims (8)

1. A target news topic abstracting method based on compressed space sentence selection, characterized in that the method comprises the following specific steps:
step1, crawling target news with crawler technology and selecting topic-related news to construct a target news topic abstract data set; performing data denoising, cleaning and preprocessing; analyzing the crawled target news so that each news item belongs to only one topic cluster, labeling the news documents under the same topic cluster to obtain the labels of the sentences in the documents, and manually compiling a reference abstract for each topic cluster;
step2, screening out, through the defined topic key description words, the sentences in each document that contain topic words and have the highest importance scores; encoding the screened document set and its sentences with an improved pre-training model; obtaining two sentence features through a prominence calculation module and a repetition feature calculation module; and balancing the prominence features and repetition features through a sentence selection model when generating the abstract, calculating the sentence scores, and extracting the sentences with high scores to obtain an abstract containing accurate topic information.
2. The method for abstracting a target news topic based on compressed space sentence selection as claimed in claim 1, wherein: the specific steps of Step1 are as follows:
step1.1, crawling news data related to the target topics from major news websites and public platforms with crawler technology to form the topic clusters;
step1.2, performing data cleaning, denoising and preprocessing on the crawled data, including the removal of web-page tags, advertisement links and special symbols, the removal of repeated data, foreign-language data and traditional Chinese characters, and the manual calibration of the relevance of the news data to the case topics;
step1.3, obtaining the sentence labels in the data set by manual annotation, and manually compiling a reference abstract for each topic cluster.
3. The method for abstracting a target news topic based on compressed space sentence selection as claimed in claim 1, wherein: the specific steps of Step2 are as follows:
step2.1, defining the keywords of a topic cluster, extracting the sentence set containing the keywords through regular-expression matching, filtering out sentences containing irrelevant information, calculating the word frequency of the keywords in each news document to obtain the importance scores of the sentences, extracting the sentences with the highest importance scores in each document, and combining them into a new topic cluster document set;
step2.2, encoding the document set and its sentences with a document set encoder based on an improved Bert model to obtain their representations;
step2.3, measuring the prominence of the candidate sentences against the not-yet-selected sentences through a bilinear mapping function;
step2.4, calculating, in the sentence selection process, the n-gram phrase overlap between the candidate sentences and the already selected abstract sentences, calculating the similarity of their semantic representations through cosine similarity, performing normalization and discretization, converting the two features into one-hot vector representations, and concatenating and fusing them to obtain the repetition feature vectors;
and Step2.5, balancing the prominence features and repetition features of the candidate sentences through a bilinear mapping function to obtain matching vectors, inputting the matching vectors into a multilayer perceptron to obtain the final scores of the candidate sentences, and putting the sentences with high scores into the summary sentence set to obtain the topic summary.
4. The method for abstracting a target news topic based on compressed space sentence selection as claimed in claim 3, wherein: the Step2.1 specifically comprises:
calculating the importance score of a sentence starts from the score of each keyword, namely its word frequency; a sentence set containing the keywords is extracted with Python regular-expression matching, sentences containing irrelevant information are filtered out, and the word frequency of each keyword in the news document is then calculated; let num(w_i) be the number of times keyword w_i occurs in a given news article and Σ_w num(w) be the total number of occurrences of all words in that article; the score of keyword w_i is then

SC(w_i) = num(w_i) / Σ_w num(w)

and the score of a sentence in a document is obtained from the scores of the keywords it contains: for the nth sentence s_mn of the mth document in the document set,

SC(s_mn) = Σ_{w_i ∈ s_mn} SC(w_i)

after the importance scores of the sentences in the document set are calculated, one round of sentence screening is performed according to these scores; because the document set encoder module based on the improved Bert model has a limit on the encoding length, the sentences with the highest importance scores in each document are extracted before sentence encoding and combined into a new document set L for the subsequent encoding operations.
5. The method for abstracting a target news topic based on compressed space sentence selection as claimed in claim 3, wherein: the Step2.2 specifically comprises:
after the sentence importance evaluation, the newly combined document set L contains m target news sentences {l_1, l_2, …, l_m}, where l_i denotes the ith sentence in the set; in order to obtain high-quality sentence and document-set representations, the Bert pre-training model is adapted to the topic abstract task; the original Bert model encodes at the Token level rather than the sentence level, and its segment embedding part, which contains only two types of segment embedding, is designed to judge whether two sentences are related, so it cannot be applied directly to a topic abstract task whose input is many sentences; the document set encoder based on the improved Bert model therefore adds a [CLS] tag before each sentence l_i of the document set to summarize the sentence embedding information, and an [SEP] tag at its tail to mark the boundaries between different sentences; to distinguish sentences at different positions, two different interval segment embeddings E_odd and E_even are introduced: for sentence l_i, if i is odd the interval segment embedding of the sentence is E_odd, and when i is even it is E_even; with this encoding scheme each sentence obtains the fusion of three embeddings, its Token embedding E_l, its interval segment embedding E_odd/E_even and its position embedding E_p; after encoding by several Transformer encoding layers, the representation T_[CLS] output at the [CLS] tag preceding sentence l_i is taken as the representation of the corresponding sentence and denoted E'_li; each E'_li is fused with a position embedding E'_p of the document set encoder to form an input representation sequence, an embedding E_set representing the document set is added at the head of the sequence, and the combined, complete document set-sentence representation input sequence is fed into several further Transformer encoding layers, finally yielding the representation r_set of the complete document set L and the encoded representation r_li of each sentence.
6. The method for abstracting a target news topic based on compressed space sentence selection as claimed in claim 3, wherein: the Step2.3 specifically comprises:
the topic abstract task needs to extract representative sentences, namely sentences of high prominence, so a single-step sentence prominence calculation module is designed on the basis of the document-set and sentence encodings; let the manually written reference abstract of document set L be R; the aim is to extract from L k sentences that summarize the key information as abstract sentences; at the tth selection step, the summary sentence set generated so far is S_{t-1} = {l_1, l_2, …, l_{t-1}}; let l_j be a sentence in L that has not yet been selected; a bilinear mapping function F_pro computed over the set representation r_set output by the document set encoder and the sentence representation r_li measures the probability that the selected sentence is contained in the reference abstract R:

F_pro(l_i) = r_li W_bm r_set^T

where W_bm is the weight matrix of the bilinear mapping, which linearly transforms the two vectors r_set and r_li of different dimensions and maps them into a common space; the objective function maximizes the log-likelihood of the sentences of the training sample that are contained in the reference abstract R:

max Σ_{l_i ∈ R} log F_pro(l_i)

the bilinear mapping function F_pro serves as the prominence scoring function between the current candidate sentence l_i and the sentences l_j that have not yet been selected, and yields the attention score of each candidate sentence, namely the prominence score of the sentence.
7. The method for abstracting a target news topic based on compressed space sentence selection as claimed in claim 3, wherein: the Step2.4 specifically comprises:
after the prominence score of a candidate sentence is calculated, the repetition features of the sentence also need to be calculated; at the tth selection step, the n-gram matching feature of this step is computed first; it represents the degree of overlap between the n-gram phrases of the candidate sentence l_i and those of the most recently selected abstract sentence l_{t-1}:

F_ngram(l_i) = |ngram_n(l_i) ∩ ngram_n(l_{t-1})| / |ngram_n(l_i)|

the more overlapping phrases, the stronger the repetition features; to calculate the repetition features accurately, the phrase overlap ratios of the unigram, bigram and trigram models are calculated separately; to mine deeper similarity between sentence representations, the maximum semantic similarity of the sentence representations, obtained by cosine similarity against the already selected sentences, is fused on top of the n-gram phrase overlap ratios:

F_sim(l_i) = max_{l ∈ S_{t-1}} cos(r_li, r_l)

to widen the numerical differences of this overlap feature computed from the cosine similarity between the candidate sentence and the selected sentences, the feature values are spread over the interval from 0 to 1 by linear normalization:

F'_sim(l_i) = (F_sim(l_i) - min_j F_sim(l_j)) / (max_j F_sim(l_j) - min_j F_sim(l_j))

the repetition feature calculation module thus computes two kinds of repetition features, which are fused into an overall repetition feature; the interval from 0 to 1 is first divided into c equal blocks, the values of the unigram, bigram and trigram phrase overlap features and of the normalized semantic similarity feature are each dispersed into the corresponding equal block between 0 and 1, each part of the features is converted into a one-hot vector representation of length c, and the parts are concatenated and fused to obtain the repetition feature vector representation of the whole module:

F_rep(l_i) = [q(F_1gram(l_i)); q(F_2gram(l_i)); q(F_3gram(l_i)); q(F'_sim(l_i))]

where q(·) denotes the one-hot vector obtained after partitioning each part of the repetition feature into the c blocks; this partitioning captures the influence of each part's repetition features, and the selected abstract sentences are expected to carry as few repetition features as possible.
8. The method for abstracting a target news topic based on compressed space sentence selection as claimed in claim 3, wherein: the Step2.5 specifically comprises:
after the prominence score and the repetition feature are obtained through the sentence prominence calculation module and the repetition feature calculation module, the two features need to be balanced in the sentence selection module, so that a selected abstract sentence has a certain prominence without containing too many repetition features; in the first step of sentence selection only the sentence with the highest prominence score is extracted, as the first sentence of the abstract; a bilinear mapping of the prominence feature F_pro(l_i) and the repetition feature F_rep(l_i) balances the two features of candidate sentence l_i and produces a d-dimensional matching vector, which is input into an MLP to obtain the final score SC(l_i) of the sentence:

SC(l_i) = W_h (F_pro(l_i) W_F F_rep(l_i))

where W_F is the bilinear mapping matrix of the two features and W_h is the weight matrix of the MLP; the sentence selection module randomly selects sentences from the reference abstract R during training, so that the model learns the context information and learns to find the next prominent, non-repeated sentence; the objective function is

max Σ_t log P(l_t | S_{t-1}),  with  P(l_i | S_{t-1}) = exp(SC(l_i)) / Σ_{l_j ∈ L \ S_{t-1}} exp(SC(l_j))

the objective function states that in the tth step the probability of selecting any sentence l_i is the softmax of its score SC(l_i) over the sentences l_j remaining in L; the loss of the sentence selection module is independent of the order of sentence selection, because the given sentences form an unordered set during training and the selection target of the module is always the next prominent, non-repeated sentence; the finally obtained sentence set S_k = {l_1, l_2, …, l_k} serves as the generated topic summary.
CN202210449431.0A 2022-04-27 2022-04-27 Target news topic abstracting method based on compressed space sentence selection Pending CN115017404A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210449431.0A CN115017404A (en) 2022-04-27 2022-04-27 Target news topic abstracting method based on compressed space sentence selection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210449431.0A CN115017404A (en) 2022-04-27 2022-04-27 Target news topic abstracting method based on compressed space sentence selection

Publications (1)

Publication Number Publication Date
CN115017404A true CN115017404A (en) 2022-09-06

Family

ID=83066510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210449431.0A Pending CN115017404A (en) 2022-04-27 2022-04-27 Target news topic abstracting method based on compressed space sentence selection

Country Status (1)

Country Link
CN (1) CN115017404A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115687628A (en) * 2022-12-30 2023-02-03 北京搜狐新媒体信息技术有限公司 News quality judging method, system, computer equipment and storage medium


Similar Documents

Publication Publication Date Title
CN112801010B (en) Visual rich document information extraction method for actual OCR scene
CN111160031A (en) Social media named entity identification method based on affix perception
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN111291188B (en) Intelligent information extraction method and system
CN111914062B (en) Long text question-answer pair generation system based on keywords
CN112101027A (en) Chinese named entity recognition method based on reading understanding
CN111814477B (en) Dispute focus discovery method and device based on dispute focus entity and terminal
CN112231472A (en) Judicial public opinion sensitive information identification method integrated with domain term dictionary
CN113408287B (en) Entity identification method and device, electronic equipment and storage medium
CN112633431A (en) Tibetan-Chinese bilingual scene character recognition method based on CRNN and CTC
CN114969304A (en) Case public opinion multi-document generation type abstract method based on element graph attention
CN113569050A (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN115390806A (en) Software design mode recommendation method based on bimodal joint modeling
CN111159405B (en) Irony detection method based on background knowledge
CN115510863A (en) Question matching task oriented data enhancement method
CN115408488A (en) Segmentation method and system for novel scene text
CN114611520A (en) Text abstract generating method
CN113961706A (en) Accurate text representation method based on neural network self-attention mechanism
CN114398900A (en) Long text semantic similarity calculation method based on RoBERTA model
CN115017404A (en) Target news topic abstracting method based on compressed space sentence selection
CN111274494B (en) Composite label recommendation method combining deep learning and collaborative filtering technology
CN112749566B (en) Semantic matching method and device for English writing assistance
CN113641788B (en) Unsupervised long and short film evaluation fine granularity viewpoint mining method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination