CN110852096A - Method for automatically generating Chinese literature reviews - Google Patents

Method for automatically generating Chinese literature reviews

Info

Publication number
CN110852096A
CN110852096A (application CN201910567582.4A)
Authority
CN
China
Prior art keywords
sentence
sentences
importance
topic
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910567582.4A
Other languages
Chinese (zh)
Other versions
CN110852096B (en)
Inventor
王会进
朱蔚恒
龙舜
陈俊标
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University filed Critical Jinan University
Priority to CN201910567582.4A priority Critical patent/CN110852096B/en
Publication of CN110852096A publication Critical patent/CN110852096A/en
Application granted granted Critical
Publication of CN110852096B publication Critical patent/CN110852096B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method for automatically generating a Chinese literature review, relating to the field of literature reviews. The method specifically comprises the following steps: S1, data preprocessing; S2, feature extraction; S3, sentence importance scoring and topic information extraction; S4, sentence selection; and S5, sentence ordering. The solution provided by the invention is suitable for Chinese and supports generating reviews that mix Chinese and English sources; combined with different corpora/dictionaries it can generate reviews for different disciplines, automatically producing literature reviews that follow the habits and requirements of each discipline and arranging sentences more reasonably and flexibly according to those requirements.

Description

Method for automatically generating Chinese literature reviews
Technical Field
The invention relates to the technical field of literature reviews, and in particular to a method for automatically generating a Chinese literature review.
Background
A literature review is a genre distinct from the research paper: it is produced after a researcher reads the literature on a given topic and then understands, organizes, integrates, comprehensively analyzes and evaluates it. A basic literature review summarizes and evaluates the existing knowledge about a research topic to characterize its current state; a high-level literature review surveys the relevant literature in a chosen research area and helps researchers find suitable topics and points of innovation. Retrieving and reading the literature is an important prerequisite for writing a review. Researchers often struggle with the vast body of existing literature in their chosen area, which they must first understand before they can discuss how to innovate beyond it, and even experts in a research area face the challenge of keeping up with advances in a rapidly developing field. Today's scientific research is often highly interdisciplinary, which means researchers need to know many related fields in addition to their own; this puts higher demands on the readability of literature reviews. The traditional way to understand the state of research in a field is through review articles, but good review articles require researchers in that field to spend a great deal of time and effort, and research in some fields develops day by day, so reading a limited number of review articles sometimes cannot meet researchers' needs in time, while searching and reading related articles through an academic search engine also takes considerable time. Automatic review generation for academic literature can solve the problems set forth above. At present, research abroad on automatic review generation for English academic literature has already achieved certain results, whereas research on Chinese academic literature is still at the starting stage and mature results are rarely seen yet.
Automatic document summarization has long been an important branch of natural language processing and is now also used for the automatic generation of literature reviews. By generation mode, summarization methods fall into two classes: extractive and abstractive. Extractive methods first use natural language processing techniques to assign importance scores to the structural units of the documents, then select the most important units and order them to obtain the summary. Abstractive methods are generally based on deep learning models and generate new summary sentences through rephrasing, synonym substitution, sentence compression and similar techniques. Because the sentences of academic literature are more rigorous than those of ordinary documents, the improper use of a single word or symbol can make the expressed meaning drift far from the original: a miss of a millimeter becomes an error of a thousand miles. Abstractive methods cannot guarantee the grammatical correctness and semantic accuracy of the generated sentences, so research on automatic review generation for academic literature currently adopts the extractive approach in general.
Current research on automatic review generation for academic literature is generally based on English data sets. By the number of input documents, methods divide into two categories: single-document review generation and multi-document review generation [3]; the input of a single-document method is one document, while the input of a multi-document method is a set of documents.
The disadvantages of prior solutions include: 1) lack of support for Chinese; 2) lack of support for cross-language documents; 3) insufficient support for the differing requirements of different disciplines; and 4) no consideration of ordering the generated statements according to the conventions of each discipline.
Disclosure of Invention
In order to overcome the above defects in the prior art, an embodiment of the invention provides a method for automatically generating a Chinese literature review. The proposed solution is suitable for Chinese and supports generating reviews that mix Chinese and English sources; combined with different corpora/dictionaries it can generate reviews for different disciplines, automatically producing literature reviews that follow the habits and requirements of each discipline and arranging sentences more reasonably and flexibly according to those requirements.
In order to achieve this purpose, the invention provides the following technical solution: a method for automatically generating a Chinese literature review, specifically comprising the following steps:
S1, data preprocessing: segment the text into sentences and words, construct a professional dictionary for each discipline, and use the professional dictionary to extract discipline-related features so that sentence importance can be evaluated more reasonably;
S2, feature extraction: analyze the textual characteristics of academic literature and extract features sentence by sentence, the extracted features including semantic features, non-semantic features and discipline-related features;
S3, sentence importance scoring and topic information extraction, specifically comprising:
S3.1, use the similarity between candidate sentences and the standard review as the measure of sentence importance, and feed the computed similarity together with the extracted sentence features into a regression model;
S3.2, predict sentence importance with the trained regression model;
S3.3, input the candidate sentences into an LDA topic model and compute their topic distributions with the trained model;
S4, sentence selection: design an optimization strategy for sentence selection that jointly considers sentence importance and topic information, then select the sentences;
S5, sentence ordering: order the sentences according to an ordering strategy to generate a readable review of the domestic and foreign literature.
In a preferred embodiment, in step S3.1, sentences are represented as vectors, specifically: sentences are operated on in a vector space; each sentence is regarded as a combination of its word sequence, so the vectors of the words in the sentence are added together (component-wise) and the average is taken as the vector representation of the sentence:

s_v = (1/n) · Σ_{i=1..n} w_i

where w_i denotes the vector of the i-th word in the sentence, n the number of words the sentence contains, and s_v the vector representation of the sentence.
In a preferred embodiment, in step S3.1, the sentence importance score is measured as follows: the similarity between a candidate sentence and every sentence of the corresponding standard review in the given training set is computed, and the maximum is taken as the importance score of the candidate sentence:

score(s) = max_{st ∈ S*} similarity(s, st)

where s denotes a candidate sentence from the references and S* the sentence set of the corresponding standard review text in the training set; similarity(s, st) denotes the similarity between sentence s and sentence st, measured by the cosine of their vectors:

similarity(s, st) = (A · B) / (|A| · |B|) = Σ_{i=1..n} A_i·B_i / (sqrt(Σ_{i=1..n} A_i²) · sqrt(Σ_{i=1..n} B_i²))

where A = (A_1, A_2, …, A_n) is the vector of sentence s and B = (B_1, B_2, …, B_n) is the vector of sentence st.
In a preferred embodiment, in step S3.1, sentences in different languages are handled cross-lingually: foreign-language material is translated into Chinese by machine translation, and text similarity is then computed within the same language.
In a preferred embodiment, in step S3.2, sentence importance scores are predicted as follows: a regression model predicts the importance score of each sentence; each sentence is taken as a sample and its importance score as the output of the regression model; the importance scores and features of the training-set sentences are input to train the regression model, and the trained model is then used to predict the importance scores of the test-set sentences.
In a preferred embodiment, in step S3.2, keyword, sentence-length, title, TF-IDF, part-of-speech, professional-term and stop-word features are extracted; the regression model adopts a random forest model, chosen by comparing the accuracy of several learners on the scoring task.
In a preferred embodiment, in step S3.3, the topic distribution of sentences is calculated as follows:
S3.3.1, corpus partitioning: sentence topic distributions are computed with latent Dirichlet allocation; since the review of each sample is generated from the contents of several academic documents, the reference-document set of each sample is used on its own as the LDA training corpus, an LDA model is trained on each sample's corpus, and that model is used to obtain the topic distributions of the reference-document sentences of the sample;
S3.3.2, determination of the number of topics: the similarity between topics Z_i and Z_j is defined as

sim(Z_i, Z_j) = cos(β_i, β_j)

where β_i and β_j are the topic vectors of Z_i and Z_j respectively; with the number of topics set to m, the average similarity of the topics is defined as

avg_sim = (2 / (m(m-1))) · Σ_{i=1..m-1} Σ_{j=i+1..m} sim(Z_i, Z_j)

S3.3.3, predicting the topic distribution of sentences; the sentence topic computation is divided into the following steps:
1) segment the reference-document set into sentences and words, and use the word set obtained after removing stop words as the training corpus;
2) input the corpus into an LDA model and train it iteratively until an optimal LDA model is obtained;
3) input the sentences whose topics need to be calculated into the trained LDA model to obtain their topic distributions.
In a preferred embodiment, in step S4, the best sentences are selected as follows:
in the sentence-selection process, the importance scores and topic distributions of the sentences are considered jointly; sentence selection is converted into an optimization problem, and an optimal sentence set is obtained by solving the objective function;
the first part of the objective function is:

f1 = Σ_{i=1..n} Σ_{j=1..m} l_i · R_i · Q_ij · x_ij

where n denotes the number of candidate sentences, m the number of topics, l_i the length of candidate sentence i, R_i its importance score, Q_ij the degree of relevance of sentence i to topic j, and x_ij whether sentence i is selected and finally assigned topic j;
the second part of the objective function is:

f2 = Σ_{b_k ∈ B} c_k · y_k

where B denotes the set of bigrams contained in the candidate sentences, b_k a bigram in set B, c_k its number of occurrences, and y_k whether b_k is contained in the generated review; the occurrence count c_k is added as the weight of each bigram so that more important bigrams are included;
combining the two parts yields the objective function:

max Σ_{i=1..n} Σ_{j=1..m} l_i · R_i · Q_ij · x_ij + Σ_{b_k ∈ B} c_k · y_k

subject to:

(1) Σ_{i=1..n} Σ_{j=1..m} l_i · x_ij ≤ L_max
(2) Σ_{j=1..m} x_ij ≤ 1 for every sentence i
(3) Σ_{j=1..m} x_ij ≤ y_k for every sentence s_i and every bigram b_k ∈ B_i
(4) Σ_{s_i ∈ S_k} Σ_{j=1..m} x_ij ≥ y_k for every bigram b_k
x_ij, y_k ∈ {0, 1}

where formula (1) ensures that the length of the generated review text does not exceed the preset value, L_max denoting the text length of the generated review; formula (2) ensures that each sentence belongs to at most one topic when the text is generated; formula (3) ensures that if sentence s_i is selected, all of its bigrams are also selected, B_i denoting the bigram set of candidate sentence i; formula (4) ensures that a bigram is selected only if some selected sentence contains it, S_k denoting the set of sentences containing bigram b_k;
the optimal selection problem of the sentences is thus converted into a linear programming problem, which is then solved to obtain the optimal result of the sentence selection.
In a preferred embodiment, in step S5, the sentences are ordered by the following steps:
for any two sentences a and b,
1) if a and b come from the same article, they keep their order of appearance in the source article;
2) if a and b come from different articles, they are sorted by the publication year of their source articles, earlier articles first;
3) if the years are the same, they are ranked by the importance of the source articles, considered from three aspects: the citation count of the article the sentence belongs to, the impact factor of the journal it was published in, and the contribution of its author in the field;
the three importance indicators are denoted reference, if and contribution respectively, each normalized to [0, 1], and the indicators are then weighted and combined to obtain the ranking score of an article:

score = λ1·reference + λ2·if + λ3·contribution
λ1 + λ2 + λ3 = 1
0 ≤ λ1, λ2, λ3 ≤ 1

where λ1, λ2 and λ3 are the weight parameters of reference, if and contribution respectively.
The technical effects and advantages of the invention are:
1. the proposed solution is suitable for Chinese and supports generating reviews that mix Chinese and English sources; combined with different corpora/dictionaries it can generate reviews for different disciplines, automatically producing literature reviews that follow the habits and requirements of each discipline and arranging sentences more reasonably and flexibly according to those requirements;
2. the invention can automatically and quickly generate a review of given literature, helping domestic researchers grasp the development status of related fields in time and saving precious time.
Drawings
FIG. 1 is a flow chart of the overall scheme of the present invention.
FIG. 2 is a diagram illustrating a sentence scoring prediction process according to the present invention.
FIG. 3 is a diagram illustrating corpus partitioning during LDA model training according to the present invention.
FIG. 4 is a diagram illustrating the main process of sentence topic distribution calculation according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
The method for automatically generating a Chinese literature review, shown in FIG. 1, specifically comprises the following steps:
S1, data preprocessing: segment the text into sentences and words, construct a professional dictionary for each discipline to improve word-segmentation accuracy, and use the professional dictionary to extract discipline-related features so that sentence importance can be evaluated more reasonably;
S2, feature extraction: analyze the textual characteristics of academic literature and extract features sentence by sentence, the extracted features including semantic features (such as the similarity between a sentence and the title), non-semantic features (such as sentence length) and discipline-related features (features extracted with the discipline's professional dictionary);
S3, sentence importance scoring and topic information extraction, specifically comprising:
S3.1, use the similarity between candidate sentences and the standard review as the measure of sentence importance, and feed the computed similarity together with the extracted sentence features into a regression model;
S3.2, predict sentence importance with the trained regression model;
S3.3, input the candidate sentences into an LDA topic model and compute their topic distributions with the trained model;
S4, sentence selection: whether sentences are selected reasonably directly determines the quality of the generated review; design an optimization strategy for sentence selection that jointly considers sentence importance and topic information, then select the sentences;
S5, sentence ordering: order the sentences according to an ordering strategy to generate a readable review of the domestic and foreign literature.
Example 2:
A. Calculation of the importance score of a sentence
A.1 Vector representation of sentences
The preprocessed academic text is a series of character strings and is not suitable for direct computation, so each sentence is represented by a vector. Specifically, sentences are operated on in a vector space: each sentence is regarded as a combination of its word sequence, so the vectors of the words in the sentence are added together (component-wise) and the average is taken as the vector representation of the sentence:

s_v = (1/n) · Σ_{i=1..n} w_i

where w_i denotes the vector of the i-th word in the sentence, n the number of words the sentence contains, and s_v the vector representation of the sentence. Word embeddings use Word2Vec: initial word vectors are first trained on a Chinese Wikipedia corpus with the open-source gensim library [36]. Because a Word2Vec model trained only on the Chinese Wikipedia corpus does not represent terms from some academic fields accurately enough, the crawled Chinese academic-literature corpus is fed into the trained Word2Vec model for incremental training, so that Word2Vec represents academic terms more accurately.
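As an illustration, here is a minimal sketch of the sentence-vector computation, assuming a gensim Word2Vec model already trained as described above; the model path and the use of jieba for segmentation are assumptions, not specified by the patent:

```python
import numpy as np
import jieba  # Chinese word segmentation (an assumed choice of tokenizer)
from gensim.models import Word2Vec

# Hypothetical path to the incrementally trained model described above.
model = Word2Vec.load("word2vec_academic.model")

def sentence_vector(sentence: str) -> np.ndarray:
    """s_v = (1/n) * sum(w_i): average the vectors of in-vocabulary words."""
    words = [w for w in jieba.cut(sentence) if w in model.wv]
    if not words:
        return np.zeros(model.vector_size)
    return np.mean([model.wv[w] for w in words], axis=0)
```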
A.2 Sentence importance score metric
The importance score of a sentence is the main basis for selecting sentences when finally generating the review. To evaluate the importance of a candidate sentence, its similarity to every sentence of the corresponding standard review in the given training set is computed, and the maximum is taken as the importance score of the candidate sentence. This rests on the following assumption: if the extracted sentences are highly similar to the sentences of the standard literature review, a review generated from those sentences will be closer to the standard review. The sentence importance score is computed as:

score(s) = max_{st ∈ S*} similarity(s, st)

where s denotes a candidate sentence from the references and S* the sentence set of the corresponding standard review text in the training set; similarity(s, st) denotes the similarity between sentence s and sentence st, measured by the cosine of their vectors:

similarity(s, st) = (A · B) / (|A| · |B|) = Σ_{i=1..n} A_i·B_i / (sqrt(Σ_{i=1..n} A_i²) · sqrt(Σ_{i=1..n} B_i²))

where A = (A_1, A_2, …, A_n) is the vector of sentence s and B = (B_1, B_2, …, B_n) is the vector of sentence st.
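A short sketch of this scoring step, reusing the sentence_vector helper from A.1 (the helper names are illustrative):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """similarity(s, st) = (A . B) / (|A| * |B|), returning 0 for zero vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def importance_score(candidate: str, standard_sentences: list) -> float:
    """score(s) = max over st in S* of similarity(s, st)."""
    sv = sentence_vector(candidate)
    return max(cosine_similarity(sv, sentence_vector(st))
               for st in standard_sentences)
```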
A.3 Cross-language processing
The main problem posed by multi-language references lies in sentence importance evaluation: foreign-language material is translated into Chinese by machine translation, and text similarity is then computed within the same language.
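A sketch of how this step can be wired together; the patent does not name a specific machine-translation system, so the translate callable is left abstract:

```python
from typing import Callable, List

def cross_language_score(candidate: str, standard_sentences: List[str],
                         translate: Callable[[str], str],
                         is_foreign: bool) -> float:
    """Score a candidate against the Chinese standard review, translating
    first if the candidate is in a foreign language, as described above."""
    if is_foreign:
        candidate = translate(candidate)  # any MT system; not specified by the patent
    return importance_score(candidate, standard_sentences)  # from the A.2 sketch
```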
A.4 Sentence importance score prediction
A regression model is used to predict the importance scores of sentences: each sentence is taken as a sample and its importance score as the regression output; the importance scores and features of the training-set sentences are input to train the regression model, and the trained model then predicts the importance scores of the test-set sentences. Through analysis of related literature, combined with the textual characteristics of academic documents, the invention extracts a series of features, including:
1) keyword feature: Jones et al. consider the keyword score a valid feature for text summarization, and the national standard describes keywords in academic papers as words or terms selected from reports and articles, for document-indexing purposes, to represent the subject of the whole text. Clearly, in academic literature the keywords plainly and intuitively represent the topics the document discusses, and important sentences are likely to contain more keywords;
2) sentence length: Teufel et al., in their study of automatic summarization of scientific literature, created a binary sentence-length feature indicating whether the sentence length exceeds a set threshold. Here the length of the sentence is used directly as a feature; in general, long sentences in a text carry more information than short ones;
3) title feature: the title summarizes the content and core of the whole article, so the similarity between a sentence and the title is used as the value of the title feature;
4) TF-IDF feature: term frequency-inverse document frequency (TF-IDF) measures the importance of a word in a document: if a word appears frequently in an article but rarely in the corpus, the word is considered more important to that article. The TF-IDF value of every word in the sentence is computed (ignoring stop words), and the average TF-IDF value of the words is taken as the TF-IDF value of the sentence;
5) part-of-speech feature: a literature review is a summary of the corresponding references and should be highly informative, and nouns are strong indicators of a sentence's information content; the proportion and the absolute number of nouns in a sentence are computed as its part-of-speech feature values;
6) professional-term feature: professional terms are essentially nouns denoting concepts that belong to, and are restricted to, a specific field or an entire concept system of a discipline; professional terms therefore mark, to some degree, the importance of a sentence in the original text;
7) stop-word feature: in natural language processing, stop words are generally considered to carry no actual meaning, so the proportion of stop words in a sentence describes, to some extent, its information richness: the lower the proportion of stop words, the more useful information the sentence carries, which helps evaluate sentence importance. The individual features are described in Table 1:
TABLE 1 Extracted features and their descriptions

Feature            Description
keyword            number of keywords contained in the sentence
sentence length    length of the sentence
title              similarity between the sentence and the title
TF-IDF             average TF-IDF value of the words in the sentence
part of speech     proportion and number of nouns in the sentence
professional term  number of professional terms in the sentence
stop word          proportion of stop words in the sentence
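A sketch of how the Table 1 features might be assembled per sentence; jieba's part-of-speech tagger and the dictionary arguments (keywords, stop words, professional terms, a word-to-TF-IDF map) are assumptions about the concrete inputs:

```python
import jieba
import jieba.posseg as pseg

def sentence_features(sentence, title, keywords, stopwords, terms, tfidf):
    """One row of the Table 1 feature vector for a sentence.
    tfidf maps a word to its TF-IDF value; the other arguments are sets."""
    words = list(jieba.cut(sentence))
    flags = [pair.flag for pair in pseg.cut(sentence)]  # POS tags
    n = max(len(words), 1)
    content = [w for w in words if w not in stopwords]
    return {
        "keyword": sum(w in keywords for w in words),
        "length": len(sentence),
        "title_sim": cosine_similarity(sentence_vector(sentence),
                                       sentence_vector(title)),
        "tfidf": sum(tfidf.get(w, 0.0) for w in content) / max(len(content), 1),
        "noun_ratio": sum(f.startswith("n") for f in flags) / max(len(flags), 1),
        "term": sum(w in terms for w in words),
        "stopword_ratio": sum(w in stopwords for w in words) / n,
    }
```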
The accuracy of several machine-learning models on the scoring task was compared: linear regression (LR), support vector regression (SVR), classification and regression trees (CART) and random forest. The random forest model, which achieved the highest accuracy, is adopted; the prediction process is shown in FIG. 2.
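A minimal sketch of the training step with scikit-learn's random forest; the hyperparameters are illustrative, not taken from the patent:

```python
from sklearn.ensemble import RandomForestRegressor

def train_score_predictor(X_train, y_train):
    """Fit the sentence-importance regressor: X_train holds one feature row
    per sentence (Table 1), y_train the similarity-based scores from A.2."""
    reg = RandomForestRegressor(n_estimators=100, random_state=0)
    reg.fit(X_train, y_train)
    return reg

# Usage: scores = train_score_predictor(X_train, y_train).predict(X_test)
```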
Example 3:
B. Calculating the topic distribution of sentences
B.1 Corpus partitioning
Latent Dirichlet allocation is used to compute the topic distributions of sentences. A literature review typically organizes content by topic, where the topics may be different research themes or different aspects of one broad research theme. The contents of several academic documents are used to generate the corresponding review, each review covers different topics, and the topics of different sample reviews are independent of one another, so the reference-document set of each sample is used on its own as the LDA training corpus.
As shown in FIG. 3, an LDA model is trained on the corpus of each sample, and that model is used to obtain the topic distributions of the reference-document sentences of the sample.
B.2 Determination of the number of topics
Latent Dirichlet allocation is an unsupervised learning algorithm: a topic model can be trained given only a training corpus and a number of topics. A literature review presents content from different topics, and when the number of topics used for training is too large or too small, the quality of the generated review drops. The optimal number of topics is therefore determined, following related literature and the characteristics of the experiments here, with the method proposed by Cao et al., which defines the similarity between topics Z_i and Z_j as:

sim(Z_i, Z_j) = cos(β_i, β_j)

where β_i and β_j are the topic vectors of Z_i and Z_j respectively; with the number of topics set to m, the average similarity of the topics is defined as:

avg_sim = (2 / (m(m-1))) · Σ_{i=1..m-1} Σ_{j=i+1..m} sim(Z_i, Z_j)

When avg_sim reaches its minimum, the LDA model is optimal, i.e. the chosen number of topics is optimal; the detailed calculation process is shown in Algorithm 1.
Cosine similarity is adopted to measure the similarity between topics, as follows: after the training corpus and the number of topics are input into the LDA topic model for training, the topic-word distribution of each topic is obtained; each keyword representing a topic is converted into its word vector with the Word2Vec model trained for the field, each keyword vector is multiplied by its weight coefficient, the weighted vectors are added component-wise, and the resulting vector is used as the vector representation of the topic.
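A sketch of the topic-number search, assuming gensim's LdaModel and reusing the word-vector helpers from Example 2; the candidate range and topn cutoff are illustrative assumptions:

```python
from itertools import combinations
from gensim import corpora
from gensim.models import LdaModel

def topic_vector(lda, topic_id, topn=20):
    """Weighted component-wise sum of the word vectors of the topic's top
    words, weighted by their topic-word probabilities, as described above."""
    return sum(p * sentence_vector(w)
               for w, p in lda.show_topic(topic_id, topn=topn))

def average_topic_similarity(lda, m):
    """avg_sim = 2/(m*(m-1)) * sum of pairwise topic similarities."""
    vecs = [topic_vector(lda, k) for k in range(m)]
    sims = [cosine_similarity(a, b) for a, b in combinations(vecs, 2)]
    return 2.0 * sum(sims) / (m * (m - 1))

def best_topic_number(texts, candidates=range(2, 11)):
    """Roughly Algorithm 1: pick the m minimizing average inter-topic similarity."""
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    scores = {m: average_topic_similarity(
                  LdaModel(corpus=corpus, id2word=dictionary, num_topics=m), m)
              for m in candidates}
    return min(scores, key=scores.get)
```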
B.3 Predicting the topic distribution of sentences
The topic distribution calculation process for candidate sentences is shown in FIG. 4. The sentence topic computation is divided into the following steps:
1) segment the reference-document set into sentences and words, and use the word set obtained after removing stop words as the training corpus;
2) input the corpus into an LDA model and train it iteratively until an optimal LDA model is obtained;
3) input the sentences whose topics need to be calculated into the trained LDA model to obtain their topic distributions.
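A sketch of step 3, inferring a sentence's topic mixture with a trained gensim LDA model (helper names reused from the earlier sketches):

```python
import numpy as np
import jieba

def sentence_topic_distribution(lda, dictionary, sentence, num_topics):
    """Return the dense topic distribution of one sentence."""
    bow = dictionary.doc2bow(list(jieba.cut(sentence)))
    dist = np.zeros(num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        dist[topic_id] = prob
    return dist
```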
Example 4:
C. Selecting the best sentences
Sentence importance prediction and topic computation yield the importance score and the topic distribution of each sentence. Automatic review generation for academic literature is a multi-document summarization problem in the academic domain: the generated text must stay close to the standard review, and sentences describing the same topic must be gathered together. In the sentence-selection process, the importance scores and topic distributions of the sentences are therefore considered jointly; following the optimization framework proposed by Yue Hu et al., sentence selection is converted into an optimization problem, and an optimal sentence set is obtained by solving the objective function.
The first part of the objective function is:

f1 = Σ_{i=1..n} Σ_{j=1..m} l_i · R_i · Q_ij · x_ij

where n denotes the number of candidate sentences, m the number of topics, l_i the length of candidate sentence i, R_i its importance score, Q_ij the degree of relevance of sentence i to topic j, and x_ij whether sentence i is selected and finally assigned topic j.
the present invention adds sentence length to the objective function to penalize short sentences, otherwise the objective function will tend to select more short sentences. Also the objective function should not be inclined to select very long sentences. When optimizing and selecting a sentence using an objective function, it is necessary to set the length of generating a document summary in advance. Thus, if the objective function tends to select a long sentence, then there are fewer choices, which may result in the generated document summary containing less information than is possible to make a comprehensive summary of the reference difficult. To solve the problem, a trade-off needs to be made between the number of the selected sentences and the average length of the sentences, and the addition of the variable plays a role, and if not, in the process of solving the objective function, in order to make the final value of the objective function as large as possible, more short sentences are selected by the objective function.
To avoid redundancy, the sentences selected to generate the review should not contain repeated information, so that the generated review content is minimally redundant; the second part of the objective function is therefore:

f2 = Σ_{b_k ∈ B} c_k · y_k

where B denotes the set of bigrams contained in the candidate sentences, b_k a bigram in set B, c_k its number of occurrences, and y_k whether b_k is contained in the generated review; the occurrence count c_k is added as the weight of each bigram so that more important bigrams are included.
Combining the two parts yields the objective function:

max Σ_{i=1..n} Σ_{j=1..m} l_i · R_i · Q_ij · x_ij + Σ_{b_k ∈ B} c_k · y_k

subject to:

(1) Σ_{i=1..n} Σ_{j=1..m} l_i · x_ij ≤ L_max
(2) Σ_{j=1..m} x_ij ≤ 1 for every sentence i
(3) Σ_{j=1..m} x_ij ≤ y_k for every sentence s_i and every bigram b_k ∈ B_i
(4) Σ_{s_i ∈ S_k} Σ_{j=1..m} x_ij ≥ y_k for every bigram b_k
x_ij, y_k ∈ {0, 1}

where formula (1) ensures that the length of the generated review text does not exceed the preset value, L_max denoting the text length of the generated review; formula (2) ensures that each sentence belongs to at most one topic when the text is generated; formula (3) ensures that if sentence s_i is selected, all of its bigrams are also selected, B_i denoting the bigram set of candidate sentence i; formula (4) ensures that a bigram is selected only if some selected sentence contains it, S_k denoting the set of sentences containing bigram b_k.
In short, the invention converts the optimal sentence-selection problem into a linear programming problem and then solves it to obtain the optimal result of the sentence selection.
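A sketch of the linear program in the notation reconstructed above (l_i, R_i, Q_ij, c_k, L_max); PuLP is one possible solver choice, not named by the patent:

```python
import pulp

def select_sentences(l, R, Q, sent_bigrams, bigram_counts, L_max):
    """ILP sentence selection. l[i]: length; R[i]: importance score; Q[i][j]:
    sentence-topic relevance; sent_bigrams[i]: set of bigrams of sentence i;
    bigram_counts[b]: occurrence count c of bigram b."""
    n, m = len(l), len(Q[0])
    bigrams = list(bigram_counts)
    prob = pulp.LpProblem("review_generation", pulp.LpMaximize)
    x = pulp.LpVariable.dicts("x", (range(n), range(m)), cat="Binary")
    y = pulp.LpVariable.dicts("y", bigrams, cat="Binary")
    # Objective: importance/topic term plus weighted bigram-coverage term.
    prob += (pulp.lpSum(l[i] * R[i] * Q[i][j] * x[i][j]
                        for i in range(n) for j in range(m))
             + pulp.lpSum(bigram_counts[b] * y[b] for b in bigrams))
    # (1) total length of the generated review does not exceed L_max
    prob += pulp.lpSum(l[i] * x[i][j] for i in range(n) for j in range(m)) <= L_max
    for i in range(n):
        # (2) each sentence is assigned to at most one topic
        prob += pulp.lpSum(x[i][j] for j in range(m)) <= 1
        # (3) selecting sentence i selects all of its bigrams
        for b in sent_bigrams[i]:
            prob += pulp.lpSum(x[i][j] for j in range(m)) <= y[b]
    # (4) a selected bigram must be covered by some selected sentence
    for b in bigrams:
        covering = [i for i in range(n) if b in sent_bigrams[i]]
        prob += pulp.lpSum(x[i][j] for i in covering for j in range(m)) >= y[b]
    prob.solve()
    return [i for i in range(n)
            if any(pulp.value(x[i][j]) == 1 for j in range(m))]
```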
Example 5:
D. Ordering the sentences
A research review surveys the current state of research. Disciplines differ in how they prefer to present it, but content is generally organized by time, arranged from the more distant past to the present, and readers always want to see the most important information, i.e. the more valuable viewpoints, first. The sentences are therefore ordered by the following steps:
For any two sentences a and b,
1) if a and b come from the same article, they keep their order of appearance in the source article;
2) if a and b come from different articles, they are sorted by the publication year of their source articles, earlier articles first;
3) if the years are the same, they are ranked by the importance of the source articles, which is considered from three aspects:
1. The citation count of the article the sentence belongs to. In academia, the more a paper is cited, the more valuable it is, and naturally the more valuable the viewpoints it expresses are in the field. For Chinese literature, citation counts are obtained from Baidu Scholar; for English literature, from Google Scholar;
2. The impact factor of the journal in which the article was published. The impact factor has become a universal evaluation index for journals internationally; it measures not only the usefulness and visibility of a journal but also its academic level and even the quality of its papers;
3. The contribution of the author in the field. The contribution, or influence, of the article's author in the academic domain indicates to some extent the quality of the article; articles published by senior experts in a field generally carry more influence and reference value. The author's contribution in the field is therefore measured by counting the number of articles the author has published in the field.
TABLE 2 Article importance indicators

Variable      Description
reference     number of citations of the article
if            impact factor of the journal in which the article was published
contribution  number of articles published by the author in the field
Table 2 gives the notation for each indicator. The three indicators reference, if and contribution are normalized to [0, 1], and the indicators are then weighted and combined by the formula below to obtain the ranking score of an article; sentences from different source articles with the same publication year are ordered by their articles' ranking scores, higher scores first, lower scores after:
score = λ1·reference + λ2·if + λ3·contribution
λ1 + λ2 + λ3 = 1
0 ≤ λ1, λ2, λ3 ≤ 1

where λ1, λ2 and λ3 are the weight parameters of reference, if and contribution respectively;
the ranking algorithm is shown in algorithm 2:
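A minimal sketch of the ordering rules above; the field names and λ values are illustrative assumptions, not taken from the patent:

```python
from functools import cmp_to_key

def rank_score(reference, impact_factor, contribution,
               lam1=0.4, lam2=0.3, lam3=0.3):
    """score = λ1*reference + λ2*if + λ3*contribution, inputs normalized
    to [0,1]; the λ values here are placeholders."""
    return lam1 * reference + lam2 * impact_factor + lam3 * contribution

def compare(a, b):
    """Pairwise rules 1-3: each sentence dict carries its article id, its
    position in the source article, publication year and article score."""
    if a["article"] == b["article"]:
        return a["position"] - b["position"]   # rule 1: source-article order
    if a["year"] != b["year"]:
        return a["year"] - b["year"]           # rule 2: earlier year first
    higher_first = (b["score"] > a["score"]) - (b["score"] < a["score"])
    return higher_first                        # rule 3: higher article score first

# Usage: ordered = sorted(selected, key=cmp_to_key(compare))
```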
Finally, the above description covers only preferred embodiments of the invention and does not limit it; any modifications, equivalent substitutions and improvements made within the spirit and principle of the invention are intended to fall within its scope of protection.

Claims (9)

1. A method for automatically generating a Chinese literature review, characterized in that it comprises the following steps:
S1, data preprocessing: segmenting the text into sentences and words, constructing a professional dictionary for each discipline, and using the professional dictionary to extract discipline-related features so that sentence importance can be evaluated more reasonably;
S2, feature extraction: analyzing the textual characteristics of academic literature and extracting features sentence by sentence, the extracted features comprising semantic features, non-semantic features and discipline-related features;
S3, sentence importance scoring and topic information extraction, specifically comprising:
S3.1, using the similarity between candidate sentences and the standard review as the measure of sentence importance, and feeding the computed similarity together with the extracted sentence features into a regression model;
S3.2, predicting sentence importance with the trained regression model;
S3.3, inputting the candidate sentences into an LDA topic model and computing their topic distributions with the trained model;
S4, sentence selection: designing an optimization strategy for sentence selection that jointly considers sentence importance and topic information, then selecting the sentences;
S5, sentence ordering: ordering the sentences according to an ordering strategy to generate a readable review of the domestic and foreign literature.
2. The method for automatically generating a Chinese literature review according to claim 1, characterized in that: in step S3.1, sentences are represented as vectors, specifically: sentences are operated on in a vector space; each sentence is regarded as a combination of its word sequence, so the vectors of the words in the sentence are added together (component-wise) and the average is taken as the vector representation of the sentence:

s_v = (1/n) · Σ_{i=1..n} w_i

where w_i denotes the vector of the i-th word in the sentence, n the number of words the sentence contains, and s_v the vector representation of the sentence.
3. The method for automatically generating a Chinese literature review according to claim 2, characterized in that: in step S3.1, the sentence importance score is measured as follows: the similarity between a candidate sentence and every sentence of the corresponding standard review in the given training set is computed, and the maximum is taken as the importance score of the candidate sentence:

score(s) = max_{st ∈ S*} similarity(s, st)

where s denotes a candidate sentence from the references and S* the sentence set of the corresponding standard review text in the training set; similarity(s, st) denotes the similarity between sentence s and sentence st, measured by the cosine of their vectors:

similarity(s, st) = (A · B) / (|A| · |B|) = Σ_{i=1..n} A_i·B_i / (sqrt(Σ_{i=1..n} A_i²) · sqrt(Σ_{i=1..n} B_i²))

where A = (A_1, A_2, …, A_n) is the vector of sentence s and B = (B_1, B_2, …, B_n) is the vector of sentence st.
4. The method for automatically generating a Chinese literature review according to claim 3, characterized in that: in step S3.1, sentences in different languages are handled cross-lingually: foreign-language material is translated into Chinese by machine translation, and text similarity is then computed within the same language.
5. The method for automatically generating a Chinese literature review according to claim 4, characterized in that: in step S3.2, sentence importance scores are predicted as follows: a regression model predicts the importance score of each sentence; each sentence is taken as a sample and its importance score as the output of the regression model; the importance scores and features of the training-set sentences are input to train the regression model, and the trained model is used to predict the importance scores of the test-set sentences.
6. The method for automatically generating a Chinese literature review according to claim 5, characterized in that: in step S3.2, keyword, sentence-length, title, TF-IDF, part-of-speech, professional-term and stop-word features are extracted; the regression model adopts a random forest model, chosen by comparing the accuracy of several learners on the scoring task.
7. The method for automatically generating a Chinese literature review according to claim 6, characterized in that: in step S3.3, the topic distribution of sentences is calculated as follows:
S3.3.1, corpus partitioning: computing sentence topic distributions with latent Dirichlet allocation; since the review of each sample is generated from the contents of several academic documents, using the reference-document set of each sample on its own as the LDA training corpus, training an LDA model on each sample's corpus, and obtaining with that model the topic distributions of the reference-document sentences of the sample;
S3.3.2, determination of the number of topics: defining the similarity between topics Z_i and Z_j as

sim(Z_i, Z_j) = cos(β_i, β_j)

where β_i and β_j are the topic vectors of Z_i and Z_j respectively; with the number of topics set to m, defining the average similarity of the topics as

avg_sim = (2 / (m(m-1))) · Σ_{i=1..m-1} Σ_{j=i+1..m} sim(Z_i, Z_j)

S3.3.3, predicting the topic distribution of sentences, the sentence topic computation comprising the following steps:
1) segmenting the reference-document set into sentences and words, and using the word set obtained after removing stop words as the training corpus;
2) inputting the corpus into an LDA model and training it iteratively until an optimal LDA model is obtained;
3) inputting the sentences whose topics need to be calculated into the trained LDA model to obtain their topic distributions.
8. The method for automatically generating a Chinese literature review according to claim 7, characterized in that: in step S4, the best sentences are selected as follows:
in the sentence-selection process, the importance scores and topic distributions of the sentences are considered jointly; sentence selection is converted into an optimization problem, and an optimal sentence set is obtained by solving the objective function;
the first part of the objective function is:

f1 = Σ_{i=1..n} Σ_{j=1..m} l_i · R_i · Q_ij · x_ij

where n denotes the number of candidate sentences, m the number of topics, l_i the length of candidate sentence i, R_i its importance score, Q_ij the degree of relevance of sentence i to topic j, and x_ij whether sentence i is selected and finally assigned topic j;
the second part of the objective function is:

f2 = Σ_{b_k ∈ B} c_k · y_k

where B denotes the set of bigrams contained in the candidate sentences, b_k a bigram in set B, c_k its number of occurrences, and y_k whether b_k is contained in the generated review; the occurrence count c_k is added as the weight of each bigram so that more important bigrams are included;
combining the two parts yields the objective function:

max Σ_{i=1..n} Σ_{j=1..m} l_i · R_i · Q_ij · x_ij + Σ_{b_k ∈ B} c_k · y_k

subject to:

(1) Σ_{i=1..n} Σ_{j=1..m} l_i · x_ij ≤ L_max
(2) Σ_{j=1..m} x_ij ≤ 1 for every sentence i
(3) Σ_{j=1..m} x_ij ≤ y_k for every sentence s_i and every bigram b_k ∈ B_i
(4) Σ_{s_i ∈ S_k} Σ_{j=1..m} x_ij ≥ y_k for every bigram b_k
x_ij, y_k ∈ {0, 1}

where formula (1) ensures that the length of the generated review text does not exceed the preset value, L_max denoting the text length of the generated review; formula (2) ensures that each sentence belongs to at most one topic when the text is generated; formula (3) ensures that if sentence s_i is selected, all of its bigrams are also selected, B_i denoting the bigram set of candidate sentence i; formula (4) ensures that a bigram is selected only if some selected sentence contains it, S_k denoting the set of sentences containing bigram b_k;
the optimal selection problem of the sentences is thus converted into a linear programming problem, which is then solved to obtain the optimal result of the sentence selection.
9. The method for automatically generating a Chinese literature review according to claim 8, characterized in that: in step S5, the sentences are ordered by the following steps:
for any two sentences a and b,
1) if a and b come from the same article, they keep their order of appearance in the source article;
2) if a and b come from different articles, they are sorted by the publication year of their source articles, earlier articles first;
3) if the years are the same, they are ranked by the importance of the source articles, considered from three aspects: the citation count of the article the sentence belongs to, the impact factor of the journal it was published in, and the contribution of its author in the field;
the three importance indicators are denoted reference, if and contribution respectively, each normalized to [0, 1], and the indicators are then weighted and combined to obtain the ranking score of an article:

score = λ1·reference + λ2·if + λ3·contribution
λ1 + λ2 + λ3 = 1
0 ≤ λ1, λ2, λ3 ≤ 1

where λ1, λ2 and λ3 are the weight parameters of reference, if and contribution respectively.
CN201910567582.4A 2019-06-27 2019-06-27 Method for automatically generating Chinese literature reviews Active CN110852096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910567582.4A CN110852096B (en) 2019-06-27 2019-06-27 Method for automatically generating Chinese literature reviews

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910567582.4A CN110852096B (en) 2019-06-27 2019-06-27 Method for automatically generating Chinese literature reviews

Publications (2)

Publication Number Publication Date
CN110852096A true CN110852096A (en) 2020-02-28
CN110852096B CN110852096B (en) 2023-04-18

Family

ID=69595762

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910567582.4A Active CN110852096B (en) 2019-06-27 2019-06-27 Method for automatically generating Chinese literature reviews

Country Status (1)

Country Link
CN (1) CN110852096B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570171A (en) * 2016-11-03 2017-04-19 中国电子科技集团公司第二十八研究所 Semantics-based sci-tech information processing method and system
US20180165554A1 (en) * 2016-12-09 2018-06-14 The Research Foundation For The State University Of New York Semisupervised autoencoder for sentiment analysis

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310478A (en) * 2020-03-18 2020-06-19 电子科技大学 Similar sentence detection method based on TF-IDF and word vector
CN111310478B (en) * 2020-03-18 2023-09-19 电子科技大学 Similar sentence detection method based on TF-IDF and word vector
CN111666472A (en) * 2020-06-12 2020-09-15 郑州轻工业大学 Intelligent identification method for academic chain nodes
CN111666472B (en) * 2020-06-12 2023-03-28 郑州轻工业大学 Intelligent identification method for academic chain nodes
CN117708545A (en) * 2024-02-01 2024-03-15 华中师范大学 Viewpoint contribution degree evaluation method and system integrating theme extraction and cosine similarity
CN117708545B (en) * 2024-02-01 2024-04-30 华中师范大学 Viewpoint contribution degree evaluation method and system integrating theme extraction and cosine similarity

Also Published As

Publication number Publication date
CN110852096B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
US10936824B2 (en) Detecting literary elements in literature and their importance through semantic analysis and literary correlation
US9201957B2 (en) Method to build a document semantic model
EP0889417A2 (en) Text genre identification
CN110852096B (en) Method for automatically generating Chinese literature reviews
US8498983B1 (en) Assisting search with semantic context and automated search options
CN108920455A (en) A kind of Chinese automatically generates the automatic evaluation method of text
Efat et al. Automated Bangla text summarization by sentence scoring and ranking
Cvrček et al. From extra-to intratextual characteristics: Charting the space of variation in Czech through MDA
JP2014106665A (en) Document retrieval device and document retrieval method
Lin et al. A simple but effective method for Indonesian automatic text summarisation
CN114706972A (en) Unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression
Albeer et al. Automatic summarization of YouTube video transcription text using term frequency-inverse document frequency
Mykowiecka et al. Recognition of irrelevant phrases in automatically extracted lists of domain terms
Pettersson et al. HistSearch-Implementation and Evaluation of a Web-based Tool for Automatic Information Extraction from Historical Text.
US6973423B1 (en) Article and method of automatically determining text genre using surface features of untagged texts
JP4428703B2 (en) Information retrieval method and system, and computer program
Sidhu et al. Role of machine translation and word sense disambiguation in natural language processing
Shaikh et al. An intelligent framework for e-recruitment system based on text categorization and semantic analysis
JP2002278982A (en) Information extracting method and information retrieving method
Suzen et al. LScDC-new large scientific dictionary
Nacinovic Prskalo et al. Identification of Metaphorical Collocations in Different Languages–Similarities and Differences
Erbs et al. Hierarchy identification for automatically generating table-of-contents
Osochkin et al. Comparative research of index frequency-Morphological methods of automatic text summarisation
Zarrad et al. Concepts extraction based on HTML documents structure
Li et al. PolyU at TAC 2008.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant