CN110852096B - Method for automatically generating Chinese literature reviews

Method for automatically generating Chinese literature reviews

Info

Publication number
CN110852096B
CN110852096B (application CN201910567582.4A)
Authority
CN
China
Prior art keywords
sentence
sentences
importance
formula
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910567582.4A
Other languages
Chinese (zh)
Other versions
CN110852096A (en)
Inventor
王会进
朱蔚恒
龙舜
陈俊标
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University
Priority to CN201910567582.4A
Publication of CN110852096A
Application granted
Publication of CN110852096B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data; Database structures therefor; File system structures therefor
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method for automatically generating a Chinese literature review, in the field of literature reviews, comprising the following steps: S1, data preprocessing; S2, feature extraction; S3, sentence importance scoring and topic information extraction; S4, sentence selection; and S5, sentence ordering. The proposed solution is suited to Chinese and supports generating mixed Chinese-English reviews; it can be combined with different corpora/dictionaries to generate reviews for different disciplines, automatically generating literature reviews according to the habits and requirements of each discipline and arranging sentences more reasonably and flexibly according to those requirements.

Description

Method for automatically generating Chinese literature reviews
Technical Field
The invention relates to the technical field of document reviews, in particular to a method for automatically generating a Chinese document review.
Background
A literature review is a genre distinct from the research paper: it is produced after a researcher reads the literature on a given subject in advance and then understands, organizes, digests, comprehensively analyzes, and evaluates it. A basic literature review summarizes and evaluates existing knowledge about the research topic; a high-level literature review surveys the relevant literature of a selected research interest and topic and helps researchers find suitable topics and points of innovation. Retrieving and reading the literature is an important prerequisite for writing a review. Researchers often struggle with the vast amount of existing literature in their chosen areas and must first understand it before discussing how to innovate and break through, and experts in any research area likewise face the challenge of keeping up with advances in a rapidly developing field. Today's scientific research is often highly interdisciplinary, meaning that researchers need to know many related fields in addition to their own, which places higher demands on the readability of literature reviews. The traditional way to understand the state of research in a field is through review articles in that field. Good review articles require researchers in the field to spend a great deal of time and effort writing them, and research in some fields develops day by day, so reading a limited number of review articles sometimes cannot meet researchers' needs in time, while searching for and reading related articles via an academic search engine also takes much time. Automatic review generation for academic literature can address the problems set forth above. At present, automatic review research on English academic literature abroad has already achieved certain results, whereas automatic review research on Chinese academic literature is still at the starting stage, and no such work has yet been seen;
automatic document summarization has long been an important branch of natural language processing and is now also used for the automatic generation of literature reviews. In terms of how the summary is produced, summarization methods come in two kinds: extractive and abstractive (generative). Extractive methods first use natural language processing techniques to assign importance scores to the structural units of the source documents (sentences, paragraphs, and the like), then select a number of the most important units and combine and order them to obtain a summary. Abstractive methods are generally based on deep learning models and generate new summary sentences through rephrasing, synonym substitution, sentence abbreviation, and similar techniques. Because sentences in academic literature are more rigorous than in ordinary documents, the improper use of a single word or symbol often means that what the document expresses is "off by a hair, wrong by a thousand miles". Abstractive methods cannot guarantee the grammatical correctness and semantic accuracy of generated sentences, so extractive methods are generally adopted in current research on automatic review of academic literature.
Current research on automatic review of academic literature is generally based on English data sets. In terms of the number of input documents, methods for automatic generation of academic literature reviews fall mainly into two categories: single-document review generation and multi-document review generation [3]; the input of a single-document method is one document, while the input of a multi-document method is a set of documents.
The disadvantages of prior solutions include: 1) lack of support for Chinese; 2) lack of support for cross-language literature; 3) insufficient support for the differing demands of different disciplines; and 4) no consideration of ordering the generated review's statements reasonably according to the habits of each discipline.
Therefore, it is necessary to invent a method for automatically generating a Chinese literature review.
Disclosure of Invention
In order to overcome the above defects in the prior art, embodiments of the invention provide a method for automatically generating a Chinese literature review. The proposed solution is suited to Chinese and supports generating mixed Chinese-English reviews; it can be combined with different corpora/dictionaries to generate reviews for different disciplines, automatically generating literature reviews according to the habits and requirements of each discipline and arranging sentences more reasonably and flexibly according to those requirements.
In order to achieve this purpose, the invention provides the following technical scheme: a method for automatically generating a Chinese literature review, specifically comprising the following steps:
S1, data preprocessing; the text is split into sentences and segmented into words, a professional dictionary for each discipline is constructed, and discipline-related features are extracted with the professional dictionary so that sentence importance can be evaluated more reasonably;
S2, feature extraction; the textual characteristics of academic literature are analyzed and features are extracted sentence by sentence, including sentence semantic features, non-semantic features, and discipline-related features;
S3, sentence importance scoring and topic information extraction; specifically comprising the following steps:
S3.1, the similarity between candidate sentences and the standard review is used as the measure of sentence importance, and the computed similarity together with the extracted sentence features is fed into a regression model;
S3.2, the importance of sentences is predicted with the trained regression model;
S3.3, candidate sentences are fed into an LDA topic model, and the trained LDA model is used to compute their topic distributions;
S4, sentence selection; an optimization strategy for sentence selection is designed that jointly considers sentence importance and sentence topic information, and sentences are then selected;
S5, sentence ordering; the sentences are ordered according to an ordering strategy to generate a readable review of domestic and foreign literature.
In a preferred embodiment, in step S3.1, sentences are represented as vectors, specifically: sentences are operated on in a vector space, each sentence being regarded as a combination of its word sequence; the vectors of the words in the sentence are therefore added (each component of the word vectors added separately) and the average is taken as the vector representation of the sentence:

$$s_v = \frac{1}{n} \sum_{i=1}^{n} w_i$$

where $w_i$ denotes the vector of the i-th word in the sentence, $n$ the number of words the sentence contains, and $s_v$ the vector representation of the sentence.
In a preferred embodiment, in step S3.1, the sentence importance score is measured as follows: similarity is computed between a candidate sentence and every sentence of the corresponding standard review in the given training set, and the maximum is taken as the candidate sentence's importance score:

$$\mathrm{importance\_score}(s) = \max_{st \in S^*} \mathrm{similarity}(s, st)$$

where $s$ denotes a candidate sentence from the references and $S^*$ the sentence set of the corresponding standard review text in the training set; $\mathrm{similarity}(s, st)$ is the similarity between sentences $s$ and $st$, measured by the cosine distance:

$$\mathrm{similarity}(s, st) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

where $A = (A_1, A_2, \ldots, A_n)$ is the vector of sentence $s$ and $B = (B_1, B_2, \ldots, B_n)$ the vector of sentence $st$.
In a preferred embodiment, in step S3.1, sentences in different languages are handled cross-lingually: foreign-language material is translated into Chinese by machine translation, and text similarity is then computed within the same language.
In a preferred embodiment, in step S3.2, sentence importance score prediction specifically comprises: a regression model is used to predict sentence importance scores; each sentence is a sample and its importance score is the regression target; the importance scores and features of the training-set sentences are fed into the regression model to train it, and the trained model then predicts the importance scores of the test-set sentences.
In a preferred embodiment, in step S3.2, keyword, sentence length, title, TF-IDF, part-of-speech, professional-term, and stop-word features are extracted; the regression model adopts a random forest, chosen by comparing the accuracy of several learners on the scoring task.
In a preferred embodiment, in step S3.3, topic distributions of sentences are computed as follows:
S3.3.1, corpus partitioning; latent Dirichlet allocation is used to compute sentence topic distributions; corresponding literature reviews are generated from the contents of multiple academic documents, the reference document set of each sample is used on its own as an LDA training corpus, an LDA model is trained on each sample's corpus, and that model yields the topic distributions of the reference sentences in the sample;
S3.3.2, determination of the number of topics; the similarity of topics $Z_i$ and $Z_j$ is defined as:

$$\mathrm{sim}(Z_i, Z_j) = \frac{\beta_i \cdot \beta_j}{\|\beta_i\|\,\|\beta_j\|}$$

where $\beta_i$ and $\beta_j$ are the topic vectors of $Z_i$ and $Z_j$, respectively; with the number of topics set to $m$, the average topic similarity is defined as:

$$\overline{\mathrm{sim}} = \frac{\sum_{i=1}^{m} \sum_{j=i+1}^{m} \mathrm{sim}(Z_i, Z_j)}{m(m-1)/2}$$

S3.3.3, predicting the topic distribution of sentences; sentence topic computation divides mainly into the following steps:
1) The reference document set is split into sentences and segmented into words, and the word set obtained after removing stop words serves as the training corpus;
2) The corpus is fed into an LDA model, which is trained iteratively until an optimal LDA model is obtained;
3) Sentences whose topics are needed are fed into the trained LDA model, yielding their topic distributions.
In a preferred embodiment, in step S4, the optimal sentences are selected as follows:
during sentence selection, the importance scores and topic distributions of the sentences are considered jointly; sentence selection is cast as an optimization problem, and an optimal sentence set is obtained by solving an objective function;
the first part of the objective function is:

$$\sum_{i=1}^{n} \sum_{j=1}^{m} l_i \, r_i \, d_{ij} \, x_{ij}$$

where $n$ is the number of candidate sentences, $m$ the number of topics, $l_i$ the length of candidate sentence $i$, $r_i$ its importance score, $d_{ij}$ the degree of relevance of sentence $i$ to topic $j$, and $x_{ij}$ indicates whether sentence $i$ is selected with topic $j$ as its final assignment;
the second part of the objective function is:

$$\sum_{b_i \in B} c_{b_i} \, y_i$$

where $B$ denotes the set of bigrams contained in the candidate sentences, $b_i$ a bigram in $B$, $c_{b_i}$ the number of occurrences of $b_i$, and $y_i$ indicates whether $b_i$ is included in the generated review; $c_{b_i}$ is added as the bigram weight so that more important bigrams are included;
combining the two parts gives the objective function:

$$\max \; \sum_{i=1}^{n} \sum_{j=1}^{m} l_i \, r_i \, d_{ij} \, x_{ij} + \sum_{b_i \in B} c_{b_i} \, y_i$$

subject to:

$$\text{(1)} \quad \sum_{i=1}^{n} \sum_{j=1}^{m} l_i \, x_{ij} \le L_{max}$$

$$\text{(2)} \quad \sum_{j=1}^{m} x_{ij} \le 1, \quad \forall i$$

$$\text{(3)} \quad \sum_{j=1}^{m} x_{ij} \le y_k, \quad \forall\, b_k \in B_i$$

$$\text{(4)} \quad y_k \le \sum_{s_i \in S_{b_k}} \sum_{j=1}^{m} x_{ij}, \quad \forall\, b_k \in B$$

$$x_{ij}, y_i \in \{0,1\}$$

where formula one ensures that the length of the generated review does not exceed the preset value, $L_{max}$ denoting the text length of the generated review; formula two ensures that each sentence can belong to only one topic in the generated text; formula three ensures that if sentence $s_i$ is selected then all of its bigrams are also selected, $B_i$ denoting the bigram set of candidate sentence $i$; and formula four ensures that if $b_k$ is selected then at least one sentence containing that bigram is also selected, $S_{b_k}$ denoting the set of sentences containing $b_k$;
the optimal sentence selection problem is converted into a linear programming problem, which is then solved to obtain the optimal sentence selection.
In a preferred embodiment, in step S5, sentences are ordered as follows:
for any two sentences a and b,
1) If a and b come from the same article, they are arranged in their order of appearance in the source article;
2) If a and b do not belong to the same article, they are sorted by the publication years of their source articles, the article with the earlier date coming first;
3) If the years are the same, they are ranked by the importance of the source articles, considered from three aspects: the citation count of the article the sentence belongs to, the impact factor of the publishing journal, and the author's degree of contribution in the field;
the citation count, journal impact factor, and author contribution are denoted by the indices reference, if, and contribution, respectively; the three importance indices are normalized to [0,1], and finally the indices are weighted and combined to obtain the article's ranking score:

$$\mathrm{score} = \lambda_1 \cdot \mathrm{reference} + \lambda_2 \cdot \mathrm{if} + \lambda_3 \cdot \mathrm{contribution}$$

$$\lambda_1 + \lambda_2 + \lambda_3 = 1$$

$$0 \le \lambda_1, \lambda_2, \lambda_3 \le 1$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weight parameters of reference, if, and contribution, respectively.
The technical effects and advantages of the invention are:
1. the proposed solution is suited to Chinese and supports generating mixed Chinese-English reviews; it can be combined with different corpora/dictionaries to generate reviews for different disciplines, automatically generating literature reviews according to the habits and requirements of each discipline and arranging sentences more reasonably and flexibly according to those requirements;
2. the invention can automatically and quickly generate a review of given literature, helping domestic researchers grasp the state of development of related fields quickly and in time, saving precious time.
Drawings
FIG. 1 is a flow chart of the overall scheme of the present invention.
FIG. 2 is a diagram illustrating a sentence scoring prediction process according to the present invention.
FIG. 3 is a diagram illustrating corpus partitioning during LDA model training according to the present invention.
FIG. 4 is a diagram illustrating the main process of sentence topic distribution calculation according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
According to the method for automatically generating a Chinese literature review shown in FIG. 1, the method specifically comprises the following steps:
S1, data preprocessing; the text is split into sentences and segmented into words, a professional dictionary for each discipline is constructed to improve segmentation accuracy, and discipline-related features are extracted with the professional dictionary so that sentence importance can be evaluated more reasonably;
S2, feature extraction; the textual characteristics of academic literature are analyzed and features are extracted sentence by sentence, including semantic features (e.g., similarity between sentence and title), non-semantic features (e.g., sentence length), and discipline-related features (features extracted with the discipline's professional dictionary);
S3, sentence importance scoring and topic information extraction; specifically comprising the following steps:
S3.1, the similarity between candidate sentences and the standard review is used as the measure of sentence importance, and the computed similarity together with the extracted sentence features is fed into a regression model;
S3.2, the importance of sentences is predicted with the trained regression model;
S3.3, candidate sentences are fed into an LDA topic model, and the trained LDA model is used to compute their topic distributions;
S4, sentence selection; whether sentences are selected reasonably directly determines the quality of the generated review; an optimization strategy for sentence selection is designed that jointly considers sentence importance and sentence topic information, and sentences are then selected;
S5, sentence ordering; the sentences are ordered according to an ordering strategy to generate a readable review of domestic and foreign literature.
Example 2:
calculation of importance score for sentence A
A.1 vector representation of sentences
The preprocessed academic text is a series of character strings and is not suitable for direct computation, so sentences are represented by vectors, specifically: sentences are operated on in a vector space, each sentence being regarded as a combination of its word sequence; the vectors of the words in the sentence are therefore added (each component of the word vectors added separately) and the average is taken as the vector representation of the sentence:

$$s_v = \frac{1}{n} \sum_{i=1}^{n} w_i$$

where $w_i$ denotes the vector of the i-th word in the sentence, $n$ the number of words the sentence contains, and $s_v$ the vector representation of the sentence. Word2Vec word embeddings are used: initial word vectors are first trained on a Chinese Wikipedia corpus with the open-source gensim library [36]; because a Word2Vec model trained on the Chinese Wikipedia corpus is not accurate enough in representing words from some academic fields, the crawled Chinese academic literature corpus is fed into the trained Word2Vec model for incremental training, so that Word2Vec represents academic-field words more accurately.
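As a concrete illustration of A.1, the following is a minimal Python sketch assuming gensim as the Word2Vec library described above; the toy corpora and hyperparameters are illustrative assumptions, not the patented configuration:

```python
# Minimal sketch of A.1 (assumptions: gensim, toy corpora, illustrative
# hyperparameters).
import numpy as np
from gensim.models import Word2Vec

# Initial word vectors from a tokenized Chinese Wikipedia corpus
# (each item is a list of word tokens).
wiki_corpus = [["自然", "语言", "处理"], ["文献", "综述", "生成"]]
model = Word2Vec(wiki_corpus, vector_size=100, window=5, min_count=1)

# Incremental training on crawled academic text so that domain terms
# are represented more accurately, as the description suggests.
academic_corpus = [["主题", "模型", "训练"], ["句子", "重要性", "评分"]]
model.build_vocab(academic_corpus, update=True)
model.train(academic_corpus, total_examples=len(academic_corpus),
            epochs=model.epochs)

def sentence_vector(words, model):
    """s_v = (1/n) * sum_i w_i: the mean of the sentence's word vectors."""
    vecs = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)
```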
A.2 sentence importance score metric
The importance score of a sentence is an important basis for selecting sentences when the literature review is finally generated. To evaluate the importance of a candidate sentence, similarity is computed between the candidate sentence and every sentence of the corresponding standard review in the given training set, and the maximum is taken as the candidate sentence's importance score, based on the following assumption: if the extracted sentences are highly similar to sentences in the standard literature review, a review generated from those sentences will be closer to the standard review. The sentence importance score is computed as:

$$\mathrm{importance\_score}(s) = \max_{st \in S^*} \mathrm{similarity}(s, st)$$

where $s$ denotes a candidate sentence from the references and $S^*$ the sentence set of the corresponding standard review text in the training set; $\mathrm{similarity}(s, st)$ is the similarity between sentences $s$ and $st$, measured by the cosine distance:

$$\mathrm{similarity}(s, st) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

where $A = (A_1, A_2, \ldots, A_n)$ is the vector of sentence $s$ and $B = (B_1, B_2, \ldots, B_n)$ the vector of sentence $st$.
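A small sketch of this scoring metric, assuming the sentence vectors (e.g., from the previous sketch) and the standard-review vectors are already computed:

```python
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def importance_score(candidate_vec, standard_review_vecs):
    """max over st in S* of similarity(s, st), per the formula above."""
    return max(cosine(candidate_vec, st) for st in standard_review_vecs)
```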
A.3 Cross-language processing
The main issue raised by multilingual references lies in sentence importance evaluation: machine translation is used to translate foreign-language material into Chinese, and text similarity is then computed within the same language.
A.4 sentence importance score prediction
A regression model is used to predict sentence importance scores: each sentence is a sample and its importance score is the regression target; the importance scores and features of the training-set sentences are fed into the regression model to train it, and the trained model then predicts the importance scores of the test-set sentences. Through analysis of the relevant literature, combined with the characteristics of academic text, the invention extracts a series of features, including:
1) Keyword feature: Jones et al. consider the keyword score a valid feature for text summarization, and the national standard describes keywords in academic papers as words or terms selected from reports and articles, for the purpose of document indexing, to represent the subject information of the whole text. Clearly, in academic literature, keywords can clearly and intuitively represent the topics a document discusses or expresses, and important sentences are likely to contain more keywords;
2) Sentence length: Teufel et al., in research on automatic summarization of scientific literature, created a binary sentence-length feature indicating whether the sentence length exceeds a set threshold. Here the length of the sentence is taken directly as a feature; in general, long sentences in a text carry more information than short ones;
3) Title feature: the title summarizes the content and core of the whole article; the similarity between a sentence and the title is used as the value of the title feature;
4) TF-IDF feature: term frequency (TF) and inverse document frequency (IDF) measure the importance of a word in a document; if a word appears frequently in an article but rarely in the corpus, it is considered more important in that article. The TF-IDF value of each word in the sentence is computed (ignoring stop words), and the average TF-IDF value of the sentence's words is taken as the sentence's TF-IDF value;
5) Part-of-speech feature: a literature review is a summary of the corresponding references and should be highly informative, and nouns strongly signal the information content of a sentence; the proportion and absolute number of nouns in the sentence are computed as its part-of-speech feature values;
6) Professional-term feature: every discipline has its specialized terminology; a professional term denotes a concept within, and is restricted to, the concept system of a given discipline or professional field, so professional terms mark to some extent how important a sentence is in the original text;
7) Stop-word feature: in natural language processing, stop words are generally considered to carry no actual meaning; the proportion of stop words in a sentence can thus describe, to a certain extent, the information richness of the sentence: the lower the proportion of stop words, the more useful information the sentence carries, which helps evaluate sentence importance.
The individual features are described in Table 1:
TABLE 1 Extracted features and their descriptions
[Table 1 appears as an image in the original publication.]
Several machine learning models, including linear regression (LR), support vector regression (SVR), classification and regression trees (CART), and random forest, were compared on the accuracy of the scoring task, and the random forest model, which achieved the highest accuracy, was adopted; the process is shown in FIG. 2.
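The following sketch shows how the seven features and the random forest regressor might fit together; the stop-word, keyword, and noun sets are tiny stand-ins (a real system would use the professional dictionaries and a POS tagger), and the training pair is purely illustrative:

```python
# Sketch only: simplified feature extraction plus RandomForestRegressor,
# which the text reports gave the highest accuracy on the scoring task.
from sklearn.ensemble import RandomForestRegressor

STOPWORDS = {"的", "了", "是"}      # stand-in stop-word list
KEYWORDS = {"综述", "模型"}         # stand-in per-document keywords
NOUNS = {"模型", "句子", "文献"}    # stand-in for a real POS tagger

def sentence_features(words, title_sim, mean_tfidf, term_ratio):
    """The seven features: keywords, length, title similarity, TF-IDF,
    part of speech (noun ratio), professional terms, stop words."""
    n = len(words) or 1
    return [
        sum(w in KEYWORDS for w in words),       # 1) keyword count
        len(words),                              # 2) sentence length
        title_sim,                               # 3) similarity to title
        mean_tfidf,                              # 4) mean TF-IDF of words
        sum(w in NOUNS for w in words) / n,      # 5) noun ratio
        term_ratio,                              # 6) professional-term ratio
        sum(w in STOPWORDS for w in words) / n,  # 7) stop-word ratio
    ]

# Train on (features, importance_score) pairs from the training set,
# then predict scores for test-set sentences.
X_train = [sentence_features(["模型", "的", "综述"], 0.6, 0.30, 0.2)]
y_train = [0.8]  # importance scores computed as in A.2
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)
score = reg.predict([sentence_features(["句子", "是", "文献"], 0.4, 0.2, 0.1)])
```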
Example 3:
b calculating topic distribution of sentences
B.1 corpus partitioning
Latent Dirichlet allocation is used to compute the topic distribution of sentences. A literature review typically organizes content from different topics, which may be different research topics or different aspects of one broad research topic. Corresponding literature reviews are therefore generated from the contents of multiple academic documents; each review covers different topics, and the topics of different sample reviews are independent of one another, so the reference document set of each sample is used on its own as an LDA training corpus;
as shown in fig. 3, an LDA model is trained on the corpus of each sample, and the model yields the topic distributions of the reference sentences in that sample.
B.2 determination of the number of topics
Latent Dirichlet allocation is an unsupervised learning algorithm: a topic model can be trained given only a training corpus and the number of topics. A literature review presents content from different topics, and when training the topic model, too many or too few topics degrades the quality of the generated review. Therefore, after studying the relevant literature and considering the characteristics of the experiments here, the method proposed by Cao Juan et al. is chosen to determine the optimal number of topics. It defines the similarity of topics $Z_i$ and $Z_j$ as:

$$\mathrm{sim}(Z_i, Z_j) = \frac{\beta_i \cdot \beta_j}{\|\beta_i\|\,\|\beta_j\|}$$

where $\beta_i$ and $\beta_j$ are the topic vectors of $Z_i$ and $Z_j$, respectively; with the number of topics set to $m$, the average topic similarity is defined as:

$$\overline{\mathrm{sim}} = \frac{\sum_{i=1}^{m} \sum_{j=i+1}^{m} \mathrm{sim}(Z_i, Z_j)}{m(m-1)/2}$$

When $\overline{\mathrm{sim}}$ attains its minimum, the LDA model is optimal, i.e., the number of topics set at that point is optimal; the detailed computation is given in Algorithm 1.
[Algorithm 1 appears as an image in the original publication.]
the invention adopts cosine similarity to measure the similarity between subjects, and the method comprises the following steps: after the training corpus and the number of the topics are input into an LDA topic model for training, the topic-Word distribution of each topic can be obtained, after the Word distribution of the topic is obtained, each keyword for representing the topic is converted into a corresponding Word vector by using a Word2Vec model trained in the field, then the key Word vector is multiplied by the corresponding weight coefficient and added (each component corresponding to the vector is added), and the vector obtained after the addition is used as the topic vector for representing.
B.3 predicting topic distribution of sentences
The topic distribution of candidate sentences is computed as shown in FIG. 4;
the sentence topic computation divides mainly into the following steps:
1) The reference document set is split into sentences and segmented into words, and the word set obtained after removing stop words serves as the training corpus;
2) The corpus is fed into an LDA model, which is trained iteratively until an optimal LDA model is obtained;
3) Sentences whose topics are needed are fed into the trained LDA model, yielding their topic distributions.
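Continuing the previous sketch, steps 1)-3) reduce to a few lines once the LDA model and dictionary are trained; `minimum_probability=0.0` simply makes gensim return all m topic weights:

```python
# Sketch: topic distribution (d_i1 .. d_im) of one tokenized,
# stop-word-filtered sentence, using the trained `lda` and `dictionary`
# from the previous sketch.
def sentence_topic_distribution(lda, dictionary, words, m):
    bow = dictionary.doc2bow(words)
    dist = dict(lda.get_document_topics(bow, minimum_probability=0.0))
    return [dist.get(k, 0.0) for k in range(m)]
```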
Example 4:
C. Selecting the best sentences
Sentence importance prediction and topic computation yield the importance score and the topic distribution of each sentence. Automatic review of academic literature is a multi-document summarization problem in the academic domain: when generating the review, the text content must be close to the standard review, and sentences describing the same topic must be gathered together. Therefore, during sentence selection the importance scores and topic distributions of sentences are considered jointly; following the optimization framework proposed by Yue Hu et al., sentence selection is cast as an optimization problem, and an optimal sentence set is obtained by solving an objective function;
the first part of the objective function is:

$$\sum_{i=1}^{n} \sum_{j=1}^{m} l_i \, r_i \, d_{ij} \, x_{ij}$$

where $n$ is the number of candidate sentences, $m$ the number of topics, $l_i$ the length of candidate sentence $i$, $r_i$ its importance score, $d_{ij}$ the degree of relevance of sentence $i$ to topic $j$, and $x_{ij}$ indicates whether sentence $i$ is selected with topic $j$ as its final assignment;
the present invention adds sentence length to the objective function to penalize short sentences, otherwise the objective function will tend to select more short sentences. Also the objective function should not be inclined to select very long sentences. When optimizing and selecting a sentence using an objective function, it is necessary to set the length of generating a document summary in advance. Thus, if the objective function tends to select a long sentence, then there are fewer choices, which may result in the generated document summary containing less information than is possible to make a comprehensive summary of the reference difficult. To solve this problem, a trade-off needs to be made between the number of sentences selected and the average length of the sentences, and a variable l i The addition of (b) plays a role if l is not present i In the process of solving the objective function, more phrases are selected for the objective function in order to make the final value of the objective function as large as possible.
To avoid redundancy in the review, the different sentences selected to generate it should not contain repeated information, so that the generated content is minimally redundant; the second part of the objective function is therefore:

$$\sum_{b_i \in B} c_{b_i} \, y_i$$

where $B$ denotes the set of bigrams contained in the candidate sentences, $b_i$ a bigram in $B$, $c_{b_i}$ the number of occurrences of $b_i$, and $y_i$ indicates whether $b_i$ is included in the generated review; $c_{b_i}$ is added as the bigram weight so that more important bigrams are included;
combining the two parts gives the objective function:

$$\max \; \sum_{i=1}^{n} \sum_{j=1}^{m} l_i \, r_i \, d_{ij} \, x_{ij} + \sum_{b_i \in B} c_{b_i} \, y_i$$

subject to:

$$\text{(1)} \quad \sum_{i=1}^{n} \sum_{j=1}^{m} l_i \, x_{ij} \le L_{max}$$

$$\text{(2)} \quad \sum_{j=1}^{m} x_{ij} \le 1, \quad \forall i$$

$$\text{(3)} \quad \sum_{j=1}^{m} x_{ij} \le y_k, \quad \forall\, b_k \in B_i$$

$$\text{(4)} \quad y_k \le \sum_{s_i \in S_{b_k}} \sum_{j=1}^{m} x_{ij}, \quad \forall\, b_k \in B$$

$$x_{ij}, y_i \in \{0,1\}$$

where formula one ensures that the length of the generated review does not exceed the preset value, $L_{max}$ denoting the text length of the generated review; formula two ensures that each sentence can belong to only one topic in the generated text; formula three ensures that if sentence $s_i$ is selected then all of its bigrams are also selected, $B_i$ denoting the bigram set of candidate sentence $i$; and formula four ensures that if $b_k$ is selected then at least one sentence containing that bigram is also selected, $S_{b_k}$ denoting the set of sentences containing $b_k$;
in short, the invention converts the optimal sentence selection problem into a linear programming problem, which is then solved to obtain the optimal sentence selection.
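The ILP above maps directly onto an off-the-shelf solver. The sketch below uses the open-source PuLP library (an assumption; the patent does not name a solver), and all inputs (lengths l, scores r, topic relevance d, bigram counts c, and sentence-bigram incidence) are taken as precomputed:

```python
# Sketch of the sentence-selection ILP; variable names mirror the
# formulas above. PuLP and the precomputed inputs are assumptions.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum

def select_sentences(l, r, d, c, sent_bigrams, L_max):
    """l[i]: length, r[i]: importance, d[i][j]: topic relevance,
    c[b]: bigram count, sent_bigrams[i]: bigrams of sentence i
    (assumed to be a subset of c's keys)."""
    n, m = len(l), len(d[0])
    bigrams = list(c)
    prob = LpProblem("review_sentence_selection", LpMaximize)
    x = {(i, j): LpVariable(f"x_{i}_{j}", cat="Binary")
         for i in range(n) for j in range(m)}
    y = {b: LpVariable(f"y_{k}", cat="Binary")
         for k, b in enumerate(bigrams)}

    # Objective: importance/topic term plus weighted bigram coverage.
    prob += (lpSum(l[i] * r[i] * d[i][j] * x[i, j]
                   for i in range(n) for j in range(m))
             + lpSum(c[b] * y[b] for b in bigrams))
    # (1) total selected length within the preset budget L_max
    prob += lpSum(l[i] * x[i, j]
                  for i in range(n) for j in range(m)) <= L_max
    for i in range(n):
        # (2) each sentence is assigned at most one topic
        prob += lpSum(x[i, j] for j in range(m)) <= 1
        # (3) selecting sentence i selects all of its bigrams
        for b in sent_bigrams[i]:
            prob += lpSum(x[i, j] for j in range(m)) <= y[b]
    # (4) a selected bigram must be covered by a selected sentence
    for b in bigrams:
        prob += y[b] <= lpSum(x[i, j]
                              for i in range(n) if b in sent_bigrams[i]
                              for j in range(m))
    prob.solve()
    return [i for i in range(n)
            if any(x[i, j].value() == 1 for j in range(m))]
```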
Example 5:
D. Ordering between sentences
A research review surveys the current state of research. Different disciplines vary in how they tend to present it, but content is generally organized by time, arranged from earliest to latest, and readers always want to see the most important information, i.e., the more valuable viewpoints, first. Sentences are therefore ordered as follows:
for any two sentences a and b,
1) If a and b come from the same article, they are arranged in their order of appearance in the source article;
2) If a and b do not belong to the same article, they are sorted by the publication years of their source articles, the article with the earlier date coming first;
3) If the years are the same, they are ranked by the importance of the source articles, considered from three aspects:
1. The citation count of the article the sentence belongs to. In the academic field, the more a document is cited, the more valuable it is, and the views it expresses are, naturally, highly valuable in the field. For Chinese documents, citation counts are obtained from Baidu Scholar; for English documents, from Google Scholar;
2. The impact factor of the publishing journal. The impact factor has become an internationally accepted journal evaluation index; it measures not only the usefulness and visibility of a journal but is also an important index of the journal's academic level and even the quality of its papers;
3. The author's degree of contribution in the field. The contribution, or influence, of an article's author in the academic domain can indicate to some extent the quality of the article; articles published by senior experts in a field generally carry more influence and reference value, so the author's contribution in the field is measured by counting the number of articles the author has published in the field.
Table 2. Article importance indices
Variable        Description
reference       Number of citations of the article
if              Impact factor of the publishing journal
contribution    Number of articles the author has published in the field
Table 2 gives the notation for each index; the three indices reference, if, and contribution are normalized to [0,1], and finally the indices are weighted and combined to obtain the article's ranking score, computed as shown below. Sentences from different source articles with the same publication year are ordered by article ranking score, higher scores first and lower scores later;
$$\mathrm{score} = \lambda_1 \cdot \mathrm{reference} + \lambda_2 \cdot \mathrm{if} + \lambda_3 \cdot \mathrm{contribution}$$

$$\lambda_1 + \lambda_2 + \lambda_3 = 1$$

$$0 \le \lambda_1, \lambda_2, \lambda_3 \le 1$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weight parameters of reference, if, and contribution, respectively;
the ranking algorithm is shown in algorithm 2:
[Algorithm 2 appears as an image in the original publication.]
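Since Algorithm 2 survives only as an image, the following comparator is a reconstruction from the three rules and the scoring formula above; the λ weights are illustrative, and the index values are assumed already normalized to [0,1]:

```python
# Sketch of the ordering strategy; the λ weights are assumed, not taken
# from the patent.
from functools import cmp_to_key

L1, L2, L3 = 0.5, 0.3, 0.2  # λ1 + λ2 + λ3 = 1 (illustrative values)

def article_score(reference, impact_factor, contribution):
    return L1 * reference + L2 * impact_factor + L3 * contribution

def compare(a, b):
    """a, b: dicts with article_id, position, year, score."""
    if a["article_id"] == b["article_id"]:
        return a["position"] - b["position"]         # 1) source order
    if a["year"] != b["year"]:
        return a["year"] - b["year"]                 # 2) earlier year first
    if a["score"] != b["score"]:
        return -1 if a["score"] > b["score"] else 1  # 3) higher score first
    return 0

def order_sentences(sentences):
    return sorted(sentences, key=cmp_to_key(compare))
```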
Finally: the above description covers only preferred embodiments of the invention and is not to be construed as limiting it; any modifications, equivalents, improvements, and the like that are within the spirit and principles of the invention are intended to be included in its scope of protection.

Claims (8)

1. A method for automatically generating a Chinese literature review, characterized in that the method specifically comprises the following steps:
S1, data preprocessing; the text is split into sentences and segmented into words, a professional dictionary for each discipline is constructed, and discipline-related features are extracted with the professional dictionary so that sentence importance can be evaluated more reasonably;
S2, feature extraction; the textual characteristics of academic literature are analyzed and features are extracted sentence by sentence, including sentence semantic features, non-semantic features, and discipline-related features;
S3, sentence importance scoring and topic information extraction; specifically comprising the following steps:
S3.1, the similarity between candidate sentences and the standard review is used as the measure of sentence importance, and the computed similarity together with the extracted sentence features is fed into a regression model;
S3.2, the importance of sentences is predicted with the trained regression model;
S3.3, candidate sentences are fed into an LDA topic model, and the trained LDA model is used to compute their topic distributions;
S4, sentence selection; an optimization strategy for sentence selection is designed that jointly considers sentence importance and sentence topic information, and sentences are then selected;
in step S4, the optimal sentences are selected as follows:
during sentence selection, the importance scores and topic distributions of the sentences are considered jointly; sentence selection is cast as an optimization problem, and an optimal sentence set is obtained by solving an objective function;
the first part of the objective function is:

$$\sum_{i=1}^{n} \sum_{j=1}^{m} l_i \, r_i \, d_{ij} \, x_{ij}$$

where $n$ is the number of candidate sentences, $m$ the number of topics, $l_i$ the length of candidate sentence $i$, $r_i$ its importance score, $d_{ij}$ the degree of relevance of sentence $i$ to topic $j$, and $x_{ij}$ indicates whether sentence $i$ is selected with topic $j$ as its final assignment;
the second part of the objective function is:

$$\sum_{b_i \in B} c_{b_i} \, y_i$$

where $B$ denotes the set of bigrams contained in the candidate sentences, $b_i$ a bigram in $B$, $c_{b_i}$ the number of occurrences of $b_i$, and $y_i$ indicates whether $b_i$ is included in the generated review; $c_{b_i}$ is added as the bigram weight so that more important bigrams are included;
combining the two parts gives the objective function:

$$\max \; \sum_{i=1}^{n} \sum_{j=1}^{m} l_i \, r_i \, d_{ij} \, x_{ij} + \sum_{b_i \in B} c_{b_i} \, y_i$$

subject to:

$$\text{(1)} \quad \sum_{i=1}^{n} \sum_{j=1}^{m} l_i \, x_{ij} \le L_{max}$$

$$\text{(2)} \quad \sum_{j=1}^{m} x_{ij} \le 1, \quad \forall i$$

$$\text{(3)} \quad \sum_{j=1}^{m} x_{ij} \le y_k, \quad \forall\, b_k \in B_i$$

$$\text{(4)} \quad y_k \le \sum_{s_i \in S_{b_k}} \sum_{j=1}^{m} x_{ij}, \quad \forall\, b_k \in B$$

$$x_{ij}, y_i \in \{0,1\}$$

where formula one ensures that the length of the generated review does not exceed the preset value, $L_{max}$ denoting the text length of the generated review; formula two ensures that each sentence can belong to only one topic in the generated text; formula three ensures that if sentence $s_i$ is selected then all of its bigrams are also selected, $B_i$ denoting the bigram set of candidate sentence $i$; and formula four ensures that if $b_k$ is selected then at least one sentence containing that bigram is also selected, $S_{b_k}$ denoting the set of sentences containing $b_k$;
the optimal sentence selection problem is converted into a linear programming problem, which is then solved to obtain the optimal sentence selection;
S5, sentence ordering; the sentences are ordered according to an ordering strategy to generate a review of domestic and foreign literature.
2. The method for automatically generating a Chinese literature review according to claim 1, wherein: in step S3.1, sentences are represented as vectors, specifically: sentences are operated on in a vector space, each sentence being regarded as a combination of its word sequence; the vectors of the words in the sentence are added and the average is taken as the sentence's vector representation:

$$s_v = \frac{1}{n} \sum_{i=1}^{n} w_i$$

where $w_i$ denotes the vector of the i-th word in the sentence, $n$ the number of words the sentence contains, and $s_v$ the vector representation of the sentence.
3. The method for automatically generating a Chinese literature review according to claim 2, wherein: in step S3.1, the sentence importance score is measured as follows: similarity is computed between a candidate sentence and every sentence of the corresponding standard review in the given training set, and the maximum is taken as the candidate sentence's importance score:

$$\mathrm{importance\_score}(s) = \max_{st \in S^*} \mathrm{similarity}(s, st)$$

where $s$ denotes a candidate sentence from the references and $S^*$ the sentence set of the corresponding standard review text in the training set; $\mathrm{similarity}(s, st)$ is the similarity between sentences $s$ and $st$, measured by the cosine distance:

$$\mathrm{similarity}(s, st) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

where $A = (A_1, A_2, \ldots, A_n)$ is the vector of sentence $s$ and $B = (B_1, B_2, \ldots, B_n)$ the vector of sentence $st$.
4. The method for automatically generating a Chinese literature review according to claim 3, wherein: in step S3.1, sentences in different languages are handled cross-lingually: foreign-language material is translated into Chinese by machine translation, and text similarity is then computed within the same language.
5. The method for automatically generating a Chinese literature review according to claim 4, wherein: in step S3.2, sentence importance score prediction specifically comprises: a regression model is used to predict sentence importance scores; each sentence is a sample and its importance score is the regression target; the importance scores importance_score(s) and features of the training-set sentences are fed into the regression model to train it, and the trained model then predicts the importance scores of the test-set sentences.
6. The method for automatically generating a Chinese literature review according to claim 5, wherein: in step S3.2, keyword, sentence length, title, TF-IDF, part-of-speech, professional-term, and stop-word features are extracted; the regression model adopts a random forest, chosen by comparing the accuracy of several learners on the scoring task.
7. The method for automatically generating a Chinese literature review according to claim 6, wherein: in step S3.3, topic distributions of sentences are computed as follows:
S3.3.1, corpus partitioning; latent Dirichlet allocation is used to compute sentence topic distributions; corresponding literature reviews are generated from the contents of multiple academic documents, the reference document set of each sample is used on its own as an LDA training corpus, an LDA model is trained on each sample's corpus, and that model yields the topic distributions of the reference sentences in the sample;
S3.3.2, determination of the number of topics; the similarity of topics $Z_i$ and $Z_j$ is defined as:

$$\mathrm{sim}(Z_i, Z_j) = \frac{\beta_i \cdot \beta_j}{\|\beta_i\|\,\|\beta_j\|}$$

where $\beta_i$ and $\beta_j$ are the topic vectors of $Z_i$ and $Z_j$, respectively; with the number of topics set to $m$, the average topic similarity is defined as:

$$\overline{\mathrm{sim}} = \frac{\sum_{i=1}^{m} \sum_{j=i+1}^{m} \mathrm{sim}(Z_i, Z_j)}{m(m-1)/2}$$

S3.3.3, predicting the topic distribution of sentences; sentence topic computation divides mainly into the following steps:
1) The reference document set is split into sentences and segmented into words, and the word set obtained after removing stop words serves as the training corpus;
2) The corpus is fed into an LDA model, which is trained iteratively until an optimal LDA model is obtained;
3) Sentences whose topics are needed are fed into the trained LDA model, yielding their topic distributions.
8. The method for automatically generating a Chinese literature review according to claim 7, wherein: in step S5, sentences are ordered as follows:
for any two sentences a and b,
1) If a and b come from the same article, they are arranged in their order of appearance in the source article;
2) If a and b do not belong to the same article, they are sorted by the publication years of their source articles, the article with the earlier date coming first;
3) If the years are the same, they are ranked by the importance of the source articles, considered from three aspects: the citation count of the article the sentence belongs to, the impact factor of the publishing journal, and the author's degree of contribution in the field;
the citation count, journal impact factor, and author contribution are denoted by the indices reference, if, and contribution, respectively; the indices are normalized to [0,1], and finally the indices are weighted and combined to obtain the article's ranking score:

$$\mathrm{score} = \lambda_1 \cdot \mathrm{reference} + \lambda_2 \cdot \mathrm{if} + \lambda_3 \cdot \mathrm{contribution}$$

$$\lambda_1 + \lambda_2 + \lambda_3 = 1$$

$$0 \le \lambda_1, \lambda_2, \lambda_3 \le 1$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weight parameters of reference, if, and contribution, respectively.
CN201910567582.4A 2019-06-27 2019-06-27 Method for automatically generating Chinese literature reviews Active CN110852096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910567582.4A CN110852096B (en) 2019-06-27 2019-06-27 Method for automatically generating Chinese literature reviews


Publications (2)

Publication Number Publication Date
CN110852096A CN110852096A (en) 2020-02-28
CN110852096B true CN110852096B (en) 2023-04-18

Family

ID=69595762

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910567582.4A Active CN110852096B (en) 2019-06-27 2019-06-27 Method for automatically generating Chinese literature reviews

Country Status (1)

Country Link
CN (1) CN110852096B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310478B (en) * 2020-03-18 2023-09-19 电子科技大学 Similar sentence detection method based on TF-IDF and word vector
CN111666472B (en) * 2020-06-12 2023-03-28 郑州轻工业大学 Intelligent identification method for academic chain nodes

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570171A (en) * 2016-11-03 2017-04-19 中国电子科技集团公司第二十八研究所 Semantics-based sci-tech information processing method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11205103B2 (en) * 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570171A (en) * 2016-11-03 2017-04-19 中国电子科技集团公司第二十八研究所 Semantics-based sci-tech information processing method and system

Also Published As

Publication number Publication date
CN110852096A (en) 2020-02-28

Similar Documents

Publication Publication Date Title
US9201957B2 (en) Method to build a document semantic model
EP0889417A2 (en) Text genre identification
US11893537B2 (en) Linguistic analysis of seed documents and peer groups
Efat et al. Automated Bangla text summarization by sentence scoring and ranking
CN108920455A (en) A kind of Chinese automatically generates the automatic evaluation method of text
JP2014106665A (en) Document retrieval device and document retrieval method
CN110852096B (en) Method for automatically generating Chinese literature reviews
Lin et al. A simple but effective method for Indonesian automatic text summarisation
Akther et al. Compilation, analysis and application of a comprehensive Bangla Corpus KUMono
Pettersson et al. HistSearch-Implementation and Evaluation of a Web-based Tool for Automatic Information Extraction from Historical Text.
US6973423B1 (en) Article and method of automatically determining text genre using surface features of untagged texts
JP2005196572A (en) Summary making method of multiple documents
JP4428703B2 (en) Information retrieval method and system, and computer program
JP2002278982A (en) Information extracting method and information retrieving method
Shaikh et al. An intelligent framework for e-recruitment system based on text categorization and semantic analysis
Suzen et al. LScDC-new large scientific dictionary
BAZRFKAN et al. Using machine learning methods to summarize persian texts
Erbs et al. Hierarchy identification for automatically generating table-of-contents
Matias Mendoza et al. Ground truth Spanish automatic extractive text summarization bounds
Luo et al. Extract domain terminologies for knowledge graph construction using domain feature vectors
Fujii et al. Cyclone: An encyclopedic Web search site
Yakymenko et al. Methods and means of intelligent analysis of text documents
Reiner et al. Similarities Between Human Structured Subject Indexing and Probabilistic Topic Models
Zaragoza et al. Translating Knowledge Representations with Monolingual Word Embeddings: The Case of a Thesaurus on Corporate Non-Financial Reporting
Mallek et al. Accurate Context Extraction from Unstructured Text Based on Deep Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant