CN110852096B - Method for automatically generating Chinese literature reviews

Method for automatically generating Chinese literature reviews

Info

Publication number
CN110852096B
CN110852096B (application CN201910567582.4A)
Authority
CN
China
Prior art keywords
sentence
sentences
importance
formula
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910567582.4A
Other languages
Chinese (zh)
Other versions
CN110852096A (en)
Inventor
王会进
朱蔚恒
龙舜
陈俊标
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University
Priority to CN201910567582.4A
Publication of CN110852096A
Application granted
Publication of CN110852096B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data; Database structures therefor; File system structures therefor
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method for automatically generating a Chinese literature review, in the field of literature reviews, comprising the following steps: S1, data preprocessing; S2, feature extraction; S3, sentence importance scoring and topic information extraction; S4, sentence selection; and S5, sentence ordering. The proposed solution is suited to Chinese and supports generating mixed Chinese-English reviews; it can be combined with different corpora/dictionaries to generate reviews for different disciplines, automatically generating literature reviews according to the habits and requirements of each discipline and arranging sentences more reasonably and flexibly according to those requirements.

Description

Method for automatically generating Chinese literature reviews
Technical Field
The invention relates to the technical field of document reviews, in particular to a method for automatically generating a Chinese document review.
Background
A literature review is a genre distinct from the research paper: it is produced after a researcher reads the literature on a given subject in advance and then understands, organizes, digests, comprehensively analyzes, and evaluates it. A basic literature review summarizes and evaluates existing knowledge about the research topic; a high-level literature review surveys the relevant literature of a selected research interest and topic and helps researchers find suitable topics and points of innovation. Retrieving and reading the literature is an important prerequisite for writing a review. Researchers often struggle with the vast amount of existing literature in their chosen areas and must first understand it before discussing how to innovate and break through, and experts in any research area likewise face the challenge of keeping up with advances in a rapidly developing field. Today's scientific research is often highly interdisciplinary, meaning that researchers need to know many related fields in addition to their own, which places higher demands on the readability of literature reviews. The traditional way to understand the state of research in a field is through review articles in that field. Good review articles require researchers in the field to spend a great deal of time and effort writing them, and research in some fields develops day by day, so reading a limited number of review articles sometimes cannot meet researchers' needs in time, while searching for and reading related articles via an academic search engine also takes much time. Automatic review generation for academic literature can address the problems set forth above. At present, automatic review research on English academic literature abroad has already achieved certain results, whereas automatic review research on Chinese academic literature is still at the starting stage, and no such work has yet been seen;
automatic document summarization has long been an important branch of natural language processing and is now also used for the automatic generation of literature reviews. In terms of how the summary is produced, summarization methods come in two kinds: extractive and abstractive (generative). Extractive methods first use natural language processing techniques to assign importance scores to the structural units of the source documents (sentences, paragraphs, and the like), then select a number of the most important units and combine and order them to obtain a summary. Abstractive methods are generally based on deep learning models and generate new summary sentences through rephrasing, synonym substitution, sentence abbreviation, and similar techniques. Because sentences in academic literature are more rigorous than in ordinary documents, the improper use of a single word or symbol often means that what the document expresses is "off by a hair, wrong by a thousand miles". Abstractive methods cannot guarantee the grammatical correctness and semantic accuracy of generated sentences, so extractive methods are generally adopted in current research on automatic review of academic literature.
Current research on automatic review of academic literature is generally based on English data sets. In terms of the number of input documents, methods for automatic generation of academic literature reviews fall mainly into two categories: single-document review generation and multi-document review generation [3]; the input of a single-document method is one document, while the input of a multi-document method is a set of documents.
The disadvantages of prior solutions include: 1) lack of support for Chinese; 2) lack of support for cross-language literature; 3) insufficient support for the differing demands of different disciplines; and 4) no consideration of ordering the generated review's statements reasonably according to the habits of each discipline.
Therefore, it is necessary to invent a method for automatically generating a Chinese literature review.
Disclosure of Invention
In order to overcome the above defects in the prior art, embodiments of the invention provide a method for automatically generating a Chinese literature review. The proposed solution is suited to Chinese and supports generating mixed Chinese-English reviews; it can be combined with different corpora/dictionaries to generate reviews for different disciplines, automatically generating literature reviews according to the habits and requirements of each discipline and arranging sentences more reasonably and flexibly according to those requirements.
In order to achieve this purpose, the invention provides the following technical scheme: a method for automatically generating a Chinese literature review, specifically comprising the following steps:
S1, data preprocessing; the text is split into sentences and segmented into words, a professional dictionary for each discipline is constructed, and discipline-related features are extracted with the professional dictionary so that sentence importance can be evaluated more reasonably;
S2, feature extraction; the textual characteristics of academic literature are analyzed and features are extracted sentence by sentence, including sentence semantic features, non-semantic features, and discipline-related features;
S3, sentence importance scoring and topic information extraction; specifically comprising the following steps:
S3.1, the similarity between candidate sentences and the standard review is used as the measure of sentence importance, and the computed similarity together with the extracted sentence features is fed into a regression model;
S3.2, the importance of sentences is predicted with the trained regression model;
S3.3, candidate sentences are fed into an LDA topic model, and the trained LDA model is used to compute their topic distributions;
S4, sentence selection; an optimization strategy for sentence selection is designed that jointly considers sentence importance and sentence topic information, and sentences are then selected;
S5, sentence ordering; the sentences are ordered according to an ordering strategy to generate a readable review of domestic and foreign literature.
In a preferred embodiment, in step S3.1, sentences are represented as vectors, specifically: sentences are operated on in a vector space, each sentence being regarded as a combination of its word sequence; the vectors of the words in the sentence are therefore added (each component of the word vectors added separately) and the average is taken as the vector representation of the sentence:

$$s_v = \frac{1}{n} \sum_{i=1}^{n} w_i$$

where $w_i$ denotes the vector of the i-th word in the sentence, $n$ the number of words the sentence contains, and $s_v$ the vector representation of the sentence.
In a preferred embodiment, in step S3.1, the sentence importance score is measured as follows: similarity is computed between a candidate sentence and every sentence of the corresponding standard review in the given training set, and the maximum is taken as the candidate sentence's importance score:

$$\mathrm{importance\_score}(s) = \max_{st \in S^*} \mathrm{similarity}(s, st)$$

where $s$ denotes a candidate sentence from the references and $S^*$ the sentence set of the corresponding standard review text in the training set; $\mathrm{similarity}(s, st)$ is the similarity between sentences $s$ and $st$, measured by the cosine distance:

$$\mathrm{similarity}(s, st) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

where $A = (A_1, A_2, \ldots, A_n)$ is the vector of sentence $s$ and $B = (B_1, B_2, \ldots, B_n)$ the vector of sentence $st$.
In a preferred embodiment, in step S3.1, sentences in different languages are handled cross-lingually: foreign-language material is translated into Chinese by machine translation, and text similarity is then computed within the same language.
In a preferred embodiment, in step S3.2, sentence importance score prediction specifically comprises: a regression model is used to predict sentence importance scores; each sentence is a sample and its importance score is the regression target; the importance scores and features of the training-set sentences are fed into the regression model to train it, and the trained model then predicts the importance scores of the test-set sentences.
In a preferred embodiment, in step S3.2, keyword, sentence length, title, TF-IDF, part-of-speech, professional-term, and stop-word features are extracted; the regression model adopts a random forest, chosen by comparing the accuracy of several learners on the scoring task.
In a preferred embodiment, in step S3.3, topic distributions of sentences are computed as follows:
S3.3.1, corpus partitioning; latent Dirichlet allocation is used to compute sentence topic distributions; corresponding literature reviews are generated from the contents of multiple academic documents, the reference document set of each sample is used on its own as an LDA training corpus, an LDA model is trained on each sample's corpus, and that model yields the topic distributions of the reference sentences in the sample;
S3.3.2, determination of the number of topics; the similarity of topics $Z_i$ and $Z_j$ is defined as:

$$\mathrm{sim}(Z_i, Z_j) = \frac{\beta_i \cdot \beta_j}{\|\beta_i\|\,\|\beta_j\|}$$

where $\beta_i$ and $\beta_j$ are the topic vectors of $Z_i$ and $Z_j$, respectively; with the number of topics set to $m$, the average topic similarity is defined as:

$$\overline{\mathrm{sim}} = \frac{\sum_{i=1}^{m} \sum_{j=i+1}^{m} \mathrm{sim}(Z_i, Z_j)}{m(m-1)/2}$$

S3.3.3, predicting the topic distribution of sentences; sentence topic computation divides mainly into the following steps:
1) The reference document set is split into sentences and segmented into words, and the word set obtained after removing stop words serves as the training corpus;
2) The corpus is fed into an LDA model, which is trained iteratively until an optimal LDA model is obtained;
3) Sentences whose topics are needed are fed into the trained LDA model, yielding their topic distributions.
In a preferred embodiment, in step S4, the optimal sentences are selected as follows:
during sentence selection, the importance scores and topic distributions of the sentences are considered jointly; sentence selection is cast as an optimization problem, and an optimal sentence set is obtained by solving an objective function;
the first part of the objective function is:

$$\sum_{i=1}^{n} \sum_{j=1}^{m} l_i \, r_i \, d_{ij} \, x_{ij}$$

where $n$ is the number of candidate sentences, $m$ the number of topics, $l_i$ the length of candidate sentence $i$, $r_i$ its importance score, $d_{ij}$ the degree of relevance of sentence $i$ to topic $j$, and $x_{ij}$ indicates whether sentence $i$ is selected with topic $j$ as its final assignment;
the second part of the objective function is:

$$\sum_{b_i \in B} c_{b_i} \, y_i$$

where $B$ denotes the set of bigrams contained in the candidate sentences, $b_i$ a bigram in $B$, $c_{b_i}$ the number of occurrences of $b_i$, and $y_i$ indicates whether $b_i$ is included in the generated review; $c_{b_i}$ is added as the bigram weight so that more important bigrams are included;
combining the two parts gives the objective function:

$$\max \; \sum_{i=1}^{n} \sum_{j=1}^{m} l_i \, r_i \, d_{ij} \, x_{ij} + \sum_{b_i \in B} c_{b_i} \, y_i$$

subject to:

$$\text{(1)} \quad \sum_{i=1}^{n} \sum_{j=1}^{m} l_i \, x_{ij} \le L_{max}$$

$$\text{(2)} \quad \sum_{j=1}^{m} x_{ij} \le 1, \quad \forall i$$

$$\text{(3)} \quad \sum_{j=1}^{m} x_{ij} \le y_k, \quad \forall\, b_k \in B_i$$

$$\text{(4)} \quad y_k \le \sum_{s_i \in S_{b_k}} \sum_{j=1}^{m} x_{ij}, \quad \forall\, b_k \in B$$

$$x_{ij}, y_i \in \{0,1\}$$

where formula one ensures that the length of the generated review does not exceed the preset value, $L_{max}$ denoting the text length of the generated review; formula two ensures that each sentence can belong to only one topic in the generated text; formula three ensures that if sentence $s_i$ is selected then all of its bigrams are also selected, $B_i$ denoting the bigram set of candidate sentence $i$; and formula four ensures that if $b_k$ is selected then at least one sentence containing that bigram is also selected, $S_{b_k}$ denoting the set of sentences containing $b_k$;
the optimal sentence selection problem is converted into a linear programming problem, which is then solved to obtain the optimal sentence selection.
In a preferred embodiment, in step S5, sentences are ordered as follows:
for any two sentences a and b,
1) If a and b come from the same article, they are arranged in their order of appearance in the source article;
2) If a and b do not belong to the same article, they are sorted by the publication years of their source articles, the article with the earlier date coming first;
3) If the years are the same, they are ranked by the importance of the source articles, considered from three aspects: the citation count of the article the sentence belongs to, the impact factor of the publishing journal, and the author's degree of contribution in the field;
the citation count, journal impact factor, and author contribution are denoted by the indices reference, if, and contribution, respectively; the three importance indices are normalized to [0,1], and finally the indices are weighted and combined to obtain the article's ranking score:

$$\mathrm{score} = \lambda_1 \cdot \mathrm{reference} + \lambda_2 \cdot \mathrm{if} + \lambda_3 \cdot \mathrm{contribution}$$

$$\lambda_1 + \lambda_2 + \lambda_3 = 1$$

$$0 \le \lambda_1, \lambda_2, \lambda_3 \le 1$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weight parameters of reference, if, and contribution, respectively.
The technical effects and advantages of the invention are:
1. the proposed solution is suited to Chinese and supports generating mixed Chinese-English reviews; it can be combined with different corpora/dictionaries to generate reviews for different disciplines, automatically generating literature reviews according to the habits and requirements of each discipline and arranging sentences more reasonably and flexibly according to those requirements;
2. the invention can automatically and quickly generate a review of given literature, helping domestic researchers grasp the state of development of related fields quickly and in time, saving precious time.
Drawings
FIG. 1 is a flow chart of the overall scheme of the present invention.
FIG. 2 is a diagram illustrating a sentence scoring prediction process according to the present invention.
FIG. 3 is a diagram illustrating corpus partitioning during LDA model training according to the present invention.
FIG. 4 is a diagram illustrating the main process of sentence topic distribution calculation according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
According to the method for automatically generating a Chinese literature review shown in FIG. 1, the method specifically comprises the following steps:
S1, data preprocessing; the text is split into sentences and segmented into words, a professional dictionary for each discipline is constructed to improve segmentation accuracy, and discipline-related features are extracted with the professional dictionary so that sentence importance can be evaluated more reasonably;
S2, feature extraction; the textual characteristics of academic literature are analyzed and features are extracted sentence by sentence, including semantic features (e.g., similarity between sentence and title), non-semantic features (e.g., sentence length), and discipline-related features (features extracted with the discipline's professional dictionary);
S3, sentence importance scoring and topic information extraction; specifically comprising the following steps:
S3.1, the similarity between candidate sentences and the standard review is used as the measure of sentence importance, and the computed similarity together with the extracted sentence features is fed into a regression model;
S3.2, the importance of sentences is predicted with the trained regression model;
S3.3, candidate sentences are fed into an LDA topic model, and the trained LDA model is used to compute their topic distributions;
S4, sentence selection; whether sentences are selected reasonably directly determines the quality of the generated review; an optimization strategy for sentence selection is designed that jointly considers sentence importance and sentence topic information, and sentences are then selected;
S5, sentence ordering; the sentences are ordered according to an ordering strategy to generate a readable review of domestic and foreign literature.
Example 2:
calculation of importance score for sentence A
A.1 vector representation of sentences
The preprocessed academic text is a series of character strings and is not suitable for direct computation, so sentences are represented by vectors, specifically: sentences are operated on in a vector space, each sentence being regarded as a combination of its word sequence; the vectors of the words in the sentence are therefore added (each component of the word vectors added separately) and the average is taken as the vector representation of the sentence:

$$s_v = \frac{1}{n} \sum_{i=1}^{n} w_i$$

where $w_i$ denotes the vector of the i-th word in the sentence, $n$ the number of words the sentence contains, and $s_v$ the vector representation of the sentence. Word2Vec word embeddings are used: initial word vectors are first trained on a Chinese Wikipedia corpus with the open-source gensim library [36]; because a Word2Vec model trained on the Chinese Wikipedia corpus is not accurate enough in representing words from some academic fields, the crawled Chinese academic literature corpus is fed into the trained Word2Vec model for incremental training, so that Word2Vec represents academic-field words more accurately.
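As a concrete illustration of A.1, the following is a minimal Python sketch assuming gensim as the Word2Vec library described above; the toy corpora and hyperparameters are illustrative assumptions, not the patented configuration:

```python
# Minimal sketch of A.1 (assumptions: gensim, toy corpora, illustrative
# hyperparameters).
import numpy as np
from gensim.models import Word2Vec

# Initial word vectors from a tokenized Chinese Wikipedia corpus
# (each item is a list of word tokens).
wiki_corpus = [["自然", "语言", "处理"], ["文献", "综述", "生成"]]
model = Word2Vec(wiki_corpus, vector_size=100, window=5, min_count=1)

# Incremental training on crawled academic text so that domain terms
# are represented more accurately, as the description suggests.
academic_corpus = [["主题", "模型", "训练"], ["句子", "重要性", "评分"]]
model.build_vocab(academic_corpus, update=True)
model.train(academic_corpus, total_examples=len(academic_corpus),
            epochs=model.epochs)

def sentence_vector(words, model):
    """s_v = (1/n) * sum_i w_i: the mean of the sentence's word vectors."""
    vecs = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)
```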
A.2 sentence importance score metric
The importance score of a sentence is an important basis for selecting sentences when the literature review is finally generated. To evaluate the importance of a candidate sentence, similarity is computed between the candidate sentence and every sentence of the corresponding standard review in the given training set, and the maximum is taken as the candidate sentence's importance score, based on the following assumption: if the extracted sentences are highly similar to sentences in the standard literature review, a review generated from those sentences will be closer to the standard review. The sentence importance score is computed as:

$$\mathrm{importance\_score}(s) = \max_{st \in S^*} \mathrm{similarity}(s, st)$$

where $s$ denotes a candidate sentence from the references and $S^*$ the sentence set of the corresponding standard review text in the training set; $\mathrm{similarity}(s, st)$ is the similarity between sentences $s$ and $st$, measured by the cosine distance:

$$\mathrm{similarity}(s, st) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

where $A = (A_1, A_2, \ldots, A_n)$ is the vector of sentence $s$ and $B = (B_1, B_2, \ldots, B_n)$ the vector of sentence $st$.
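A small sketch of this scoring metric, assuming the sentence vectors (e.g., from the previous sketch) and the standard-review vectors are already computed:

```python
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def importance_score(candidate_vec, standard_review_vecs):
    """max over st in S* of similarity(s, st), per the formula above."""
    return max(cosine(candidate_vec, st) for st in standard_review_vecs)
```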
A.3 Cross-language processing
The main issue raised by multilingual references lies in sentence importance evaluation: machine translation is used to translate foreign-language material into Chinese, and text similarity is then computed within the same language.
A.4 sentence importance score prediction
A regression model is used to predict sentence importance scores: each sentence is a sample and its importance score is the regression target; the importance scores and features of the training-set sentences are fed into the regression model to train it, and the trained model then predicts the importance scores of the test-set sentences. Through analysis of the relevant literature, combined with the characteristics of academic text, the invention extracts a series of features, including:
1) Keyword feature: Jones et al. consider the keyword score a valid feature for text summarization, and the national standard describes keywords in academic papers as words or terms selected from reports and articles, for the purpose of document indexing, to represent the subject information of the whole text. Clearly, in academic literature, keywords can clearly and intuitively represent the topics a document discusses or expresses, and important sentences are likely to contain more keywords;
2) Sentence length: Teufel et al., in research on automatic summarization of scientific literature, created a binary sentence-length feature indicating whether the sentence length exceeds a set threshold. Here the length of the sentence is taken directly as a feature; in general, long sentences in a text carry more information than short ones;
3) Title feature: the title summarizes the content and core of the whole article; the similarity between a sentence and the title is used as the value of the title feature;
4) TF-IDF feature: term frequency (TF) and inverse document frequency (IDF) measure the importance of a word in a document; if a word appears frequently in an article but rarely in the corpus, it is considered more important in that article. The TF-IDF value of each word in the sentence is computed (ignoring stop words), and the average TF-IDF value of the sentence's words is taken as the sentence's TF-IDF value;
5) Part-of-speech feature: a literature review is a summary of the corresponding references and should be highly informative, and nouns strongly signal the information content of a sentence; the proportion and absolute number of nouns in the sentence are computed as its part-of-speech feature values;
6) Professional-term feature: every discipline has its specialized terminology; a professional term denotes a concept within, and is restricted to, the concept system of a given discipline or professional field, so professional terms mark to some extent how important a sentence is in the original text;
7) Stop-word feature: in natural language processing, stop words are generally considered to carry no actual meaning; the proportion of stop words in a sentence can thus describe, to a certain extent, the information richness of the sentence: the lower the proportion of stop words, the more useful information the sentence carries, which helps evaluate sentence importance.
The individual features are described in Table 1:
TABLE 1 Extracted features and their descriptions
[Table 1 appears as an image in the original publication.]
Several machine learning models, including linear regression (LR), support vector regression (SVR), classification and regression trees (CART), and random forest, were compared on the accuracy of the scoring task, and the random forest model, which achieved the highest accuracy, was adopted; the process is shown in FIG. 2.
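The following sketch shows how the seven features and the random forest regressor might fit together; the stop-word, keyword, and noun sets are tiny stand-ins (a real system would use the professional dictionaries and a POS tagger), and the training pair is purely illustrative:

```python
# Sketch only: simplified feature extraction plus RandomForestRegressor,
# which the text reports gave the highest accuracy on the scoring task.
from sklearn.ensemble import RandomForestRegressor

STOPWORDS = {"的", "了", "是"}      # stand-in stop-word list
KEYWORDS = {"综述", "模型"}         # stand-in per-document keywords
NOUNS = {"模型", "句子", "文献"}    # stand-in for a real POS tagger

def sentence_features(words, title_sim, mean_tfidf, term_ratio):
    """The seven features: keywords, length, title similarity, TF-IDF,
    part of speech (noun ratio), professional terms, stop words."""
    n = len(words) or 1
    return [
        sum(w in KEYWORDS for w in words),       # 1) keyword count
        len(words),                              # 2) sentence length
        title_sim,                               # 3) similarity to title
        mean_tfidf,                              # 4) mean TF-IDF of words
        sum(w in NOUNS for w in words) / n,      # 5) noun ratio
        term_ratio,                              # 6) professional-term ratio
        sum(w in STOPWORDS for w in words) / n,  # 7) stop-word ratio
    ]

# Train on (features, importance_score) pairs from the training set,
# then predict scores for test-set sentences.
X_train = [sentence_features(["模型", "的", "综述"], 0.6, 0.30, 0.2)]
y_train = [0.8]  # importance scores computed as in A.2
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)
score = reg.predict([sentence_features(["句子", "是", "文献"], 0.4, 0.2, 0.1)])
```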
Example 3:
b calculating topic distribution of sentences
B.1 corpus partitioning
Latent Dirichlet allocation is used to compute the topic distribution of sentences. A literature review typically organizes content from different topics, which may be different research topics or different aspects of one broad research topic. Corresponding literature reviews are therefore generated from the contents of multiple academic documents; each review covers different topics, and the topics of different sample reviews are independent of one another, so the reference document set of each sample is used on its own as an LDA training corpus;
as shown in fig. 3, an LDA model is trained on the corpus of each sample, and the model yields the topic distributions of the reference sentences in that sample.
B.2 determination of the number of topics
Latent Dirichlet allocation is an unsupervised learning algorithm: a topic model can be trained given only a training corpus and the number of topics. A literature review presents content from different topics, and when training the topic model, too many or too few topics degrades the quality of the generated review. Therefore, after studying the relevant literature and considering the characteristics of the experiments here, the method proposed by Cao Juan et al. is chosen to determine the optimal number of topics. It defines the similarity of topics $Z_i$ and $Z_j$ as:

$$\mathrm{sim}(Z_i, Z_j) = \frac{\beta_i \cdot \beta_j}{\|\beta_i\|\,\|\beta_j\|}$$

where $\beta_i$ and $\beta_j$ are the topic vectors of $Z_i$ and $Z_j$, respectively; with the number of topics set to $m$, the average topic similarity is defined as:

$$\overline{\mathrm{sim}} = \frac{\sum_{i=1}^{m} \sum_{j=i+1}^{m} \mathrm{sim}(Z_i, Z_j)}{m(m-1)/2}$$

When $\overline{\mathrm{sim}}$ attains its minimum, the LDA model is optimal, i.e., the number of topics set at that point is optimal; the detailed computation is given in Algorithm 1.
[Algorithm 1 appears as an image in the original publication.]
the invention adopts cosine similarity to measure the similarity between subjects, and the method comprises the following steps: after the training corpus and the number of the topics are input into an LDA topic model for training, the topic-Word distribution of each topic can be obtained, after the Word distribution of the topic is obtained, each keyword for representing the topic is converted into a corresponding Word vector by using a Word2Vec model trained in the field, then the key Word vector is multiplied by the corresponding weight coefficient and added (each component corresponding to the vector is added), and the vector obtained after the addition is used as the topic vector for representing.
B.3 predicting topic distribution of sentences
The topic distribution of candidate sentences is computed as shown in FIG. 4;
the sentence topic computation divides mainly into the following steps:
1) The reference document set is split into sentences and segmented into words, and the word set obtained after removing stop words serves as the training corpus;
2) The corpus is fed into an LDA model, which is trained iteratively until an optimal LDA model is obtained;
3) Sentences whose topics are needed are fed into the trained LDA model, yielding their topic distributions.
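Continuing the previous sketch, steps 1)-3) reduce to a few lines once the LDA model and dictionary are trained; `minimum_probability=0.0` simply makes gensim return all m topic weights:

```python
# Sketch: topic distribution (d_i1 .. d_im) of one tokenized,
# stop-word-filtered sentence, using the trained `lda` and `dictionary`
# from the previous sketch.
def sentence_topic_distribution(lda, dictionary, words, m):
    bow = dictionary.doc2bow(words)
    dist = dict(lda.get_document_topics(bow, minimum_probability=0.0))
    return [dist.get(k, 0.0) for k in range(m)]
```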
Example 4:
C. Selecting the best sentences
Sentence importance prediction and topic computation yield the importance score and the topic distribution of each sentence. Automatic review of academic literature is a multi-document summarization problem in the academic domain: when generating the review, the text content must be close to the standard review, and sentences describing the same topic must be gathered together. Therefore, during sentence selection the importance scores and topic distributions of sentences are considered jointly; following the optimization framework proposed by Yue Hu et al., sentence selection is cast as an optimization problem, and an optimal sentence set is obtained by solving an objective function;
the first part of the objective function is:

$$\sum_{i=1}^{n} \sum_{j=1}^{m} l_i \, r_i \, d_{ij} \, x_{ij}$$

where $n$ is the number of candidate sentences, $m$ the number of topics, $l_i$ the length of candidate sentence $i$, $r_i$ its importance score, $d_{ij}$ the degree of relevance of sentence $i$ to topic $j$, and $x_{ij}$ indicates whether sentence $i$ is selected with topic $j$ as its final assignment;
the present invention adds sentence length to the objective function to penalize short sentences, otherwise the objective function will tend to select more short sentences. Also the objective function should not be inclined to select very long sentences. When optimizing and selecting a sentence using an objective function, it is necessary to set the length of generating a document summary in advance. Thus, if the objective function tends to select a long sentence, then there are fewer choices, which may result in the generated document summary containing less information than is possible to make a comprehensive summary of the reference difficult. To solve this problem, a trade-off needs to be made between the number of sentences selected and the average length of the sentences, and a variable l i The addition of (b) plays a role if l is not present i In the process of solving the objective function, more phrases are selected for the objective function in order to make the final value of the objective function as large as possible.
To avoid redundancy in the review, the different sentences selected to generate it should not contain repeated information, so that the generated content is minimally redundant; the second part of the objective function is therefore:

$$\sum_{b_i \in B} c_{b_i} \, y_i$$

where $B$ denotes the set of bigrams contained in the candidate sentences, $b_i$ a bigram in $B$, $c_{b_i}$ the number of occurrences of $b_i$, and $y_i$ indicates whether $b_i$ is included in the generated review; $c_{b_i}$ is added as the bigram weight so that more important bigrams are included;
combining the two parts gives the objective function:

$$\max \; \sum_{i=1}^{n} \sum_{j=1}^{m} l_i \, r_i \, d_{ij} \, x_{ij} + \sum_{b_i \in B} c_{b_i} \, y_i$$

subject to:

$$\text{(1)} \quad \sum_{i=1}^{n} \sum_{j=1}^{m} l_i \, x_{ij} \le L_{max}$$

$$\text{(2)} \quad \sum_{j=1}^{m} x_{ij} \le 1, \quad \forall i$$

$$\text{(3)} \quad \sum_{j=1}^{m} x_{ij} \le y_k, \quad \forall\, b_k \in B_i$$

$$\text{(4)} \quad y_k \le \sum_{s_i \in S_{b_k}} \sum_{j=1}^{m} x_{ij}, \quad \forall\, b_k \in B$$

$$x_{ij}, y_i \in \{0,1\}$$

where formula one ensures that the length of the generated review does not exceed the preset value, $L_{max}$ denoting the text length of the generated review; formula two ensures that each sentence can belong to only one topic in the generated text; formula three ensures that if sentence $s_i$ is selected then all of its bigrams are also selected, $B_i$ denoting the bigram set of candidate sentence $i$; and formula four ensures that if $b_k$ is selected then at least one sentence containing that bigram is also selected, $S_{b_k}$ denoting the set of sentences containing $b_k$;
in short, the invention converts the optimal sentence selection problem into a linear programming problem, which is then solved to obtain the optimal sentence selection.
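The ILP above maps directly onto an off-the-shelf solver. The sketch below uses the open-source PuLP library (an assumption; the patent does not name a solver), and all inputs (lengths l, scores r, topic relevance d, bigram counts c, and sentence-bigram incidence) are taken as precomputed:

```python
# Sketch of the sentence-selection ILP; variable names mirror the
# formulas above. PuLP and the precomputed inputs are assumptions.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum

def select_sentences(l, r, d, c, sent_bigrams, L_max):
    """l[i]: length, r[i]: importance, d[i][j]: topic relevance,
    c[b]: bigram count, sent_bigrams[i]: bigrams of sentence i
    (assumed to be a subset of c's keys)."""
    n, m = len(l), len(d[0])
    bigrams = list(c)
    prob = LpProblem("review_sentence_selection", LpMaximize)
    x = {(i, j): LpVariable(f"x_{i}_{j}", cat="Binary")
         for i in range(n) for j in range(m)}
    y = {b: LpVariable(f"y_{k}", cat="Binary")
         for k, b in enumerate(bigrams)}

    # Objective: importance/topic term plus weighted bigram coverage.
    prob += (lpSum(l[i] * r[i] * d[i][j] * x[i, j]
                   for i in range(n) for j in range(m))
             + lpSum(c[b] * y[b] for b in bigrams))
    # (1) total selected length within the preset budget L_max
    prob += lpSum(l[i] * x[i, j]
                  for i in range(n) for j in range(m)) <= L_max
    for i in range(n):
        # (2) each sentence is assigned at most one topic
        prob += lpSum(x[i, j] for j in range(m)) <= 1
        # (3) selecting sentence i selects all of its bigrams
        for b in sent_bigrams[i]:
            prob += lpSum(x[i, j] for j in range(m)) <= y[b]
    # (4) a selected bigram must be covered by a selected sentence
    for b in bigrams:
        prob += y[b] <= lpSum(x[i, j]
                              for i in range(n) if b in sent_bigrams[i]
                              for j in range(m))
    prob.solve()
    return [i for i in range(n)
            if any(x[i, j].value() == 1 for j in range(m))]
```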
Example 5:
D. Ordering between sentences
A research review surveys the current state of research. Different disciplines vary in how they tend to present it, but content is generally organized by time, arranged from earliest to latest, and readers always want to see the most important information, i.e., the more valuable viewpoints, first. Sentences are therefore ordered as follows:
for any two sentences a and b,
1) If a and b come from the same article, they are arranged in their order of appearance in the source article;
2) If a and b do not belong to the same article, they are sorted by the publication years of their source articles, the article with the earlier date coming first;
3) If the years are the same, they are ranked by the importance of the source articles, considered from three aspects:
1. The citation count of the article the sentence belongs to. In the academic field, the more a document is cited, the more valuable it is, and the views it expresses are, naturally, highly valuable in the field. For Chinese documents, citation counts are obtained from Baidu Scholar; for English documents, from Google Scholar;
2. The impact factor of the publishing journal. The impact factor has become an internationally accepted journal evaluation index; it measures not only the usefulness and visibility of a journal but is also an important index of the journal's academic level and even the quality of its papers;
3. The author's degree of contribution in the field. The contribution, or influence, of an article's author in the academic domain can indicate to some extent the quality of the article; articles published by senior experts in a field generally carry more influence and reference value, so the author's contribution in the field is measured by counting the number of articles the author has published in the field.
Table 2. Article importance indices
Variable        Description
reference       Number of citations of the article
if              Impact factor of the publishing journal
contribution    Number of articles the author has published in the field
Table 2 gives the notation for each index; the three indices reference, if, and contribution are normalized to [0,1], and finally the indices are weighted and combined to obtain the article's ranking score, computed as shown below. Sentences from different source articles with the same publication year are ordered by article ranking score, higher scores first and lower scores later;
$$\mathrm{score} = \lambda_1 \cdot \mathrm{reference} + \lambda_2 \cdot \mathrm{if} + \lambda_3 \cdot \mathrm{contribution}$$

$$\lambda_1 + \lambda_2 + \lambda_3 = 1$$

$$0 \le \lambda_1, \lambda_2, \lambda_3 \le 1$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weight parameters of reference, if, and contribution, respectively;
the ranking algorithm is shown in algorithm 2:
[Algorithm 2 appears as an image in the original publication.]
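Since Algorithm 2 survives only as an image, the following comparator is a reconstruction from the three rules and the scoring formula above; the λ weights are illustrative, and the index values are assumed already normalized to [0,1]:

```python
# Sketch of the ordering strategy; the λ weights are assumed, not taken
# from the patent.
from functools import cmp_to_key

L1, L2, L3 = 0.5, 0.3, 0.2  # λ1 + λ2 + λ3 = 1 (illustrative values)

def article_score(reference, impact_factor, contribution):
    return L1 * reference + L2 * impact_factor + L3 * contribution

def compare(a, b):
    """a, b: dicts with article_id, position, year, score."""
    if a["article_id"] == b["article_id"]:
        return a["position"] - b["position"]         # 1) source order
    if a["year"] != b["year"]:
        return a["year"] - b["year"]                 # 2) earlier year first
    if a["score"] != b["score"]:
        return -1 if a["score"] > b["score"] else 1  # 3) higher score first
    return 0

def order_sentences(sentences):
    return sorted(sentences, key=cmp_to_key(compare))
```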
Finally: the above description covers only preferred embodiments of the invention and is not to be construed as limiting it; any modifications, equivalents, improvements, and the like that are within the spirit and principles of the invention are intended to be included in its scope of protection.

Claims (8)

1. A method for automatically generating a Chinese literature review, characterized in that the method specifically comprises the following steps:
S1, data preprocessing; the text is split into sentences and segmented into words, a professional dictionary for each discipline is constructed, and discipline-related features are extracted with the professional dictionary so that sentence importance can be evaluated more reasonably;
S2, feature extraction; the textual characteristics of academic literature are analyzed and features are extracted sentence by sentence, including sentence semantic features, non-semantic features, and discipline-related features;
S3, sentence importance scoring and topic information extraction; specifically comprising the following steps:
S3.1, the similarity between candidate sentences and the standard review is used as the measure of sentence importance, and the computed similarity together with the extracted sentence features is fed into a regression model;
S3.2, the importance of sentences is predicted with the trained regression model;
S3.3, candidate sentences are fed into an LDA topic model, and the trained LDA model is used to compute their topic distributions;
S4, sentence selection; an optimization strategy for sentence selection is designed that jointly considers sentence importance and sentence topic information, and sentences are then selected;
in step S4, the optimal sentences are selected as follows:
during sentence selection, the importance scores and topic distributions of the sentences are considered jointly; sentence selection is cast as an optimization problem, and an optimal sentence set is obtained by solving an objective function;
the first part of the objective function is:

$$\sum_{i=1}^{n} \sum_{j=1}^{m} l_i \, r_i \, d_{ij} \, x_{ij}$$

where $n$ is the number of candidate sentences, $m$ the number of topics, $l_i$ the length of candidate sentence $i$, $r_i$ its importance score, $d_{ij}$ the degree of relevance of sentence $i$ to topic $j$, and $x_{ij}$ indicates whether sentence $i$ is selected with topic $j$ as its final assignment;
the second part of the objective function is:

$$\sum_{b_i \in B} c_{b_i} \, y_i$$

where $B$ denotes the set of bigrams contained in the candidate sentences, $b_i$ a bigram in $B$, $c_{b_i}$ the number of occurrences of $b_i$, and $y_i$ indicates whether $b_i$ is included in the generated review; $c_{b_i}$ is added as the bigram weight so that more important bigrams are included;
combining the two parts gives the objective function:

$$\max \; \sum_{i=1}^{n} \sum_{j=1}^{m} l_i \, r_i \, d_{ij} \, x_{ij} + \sum_{b_i \in B} c_{b_i} \, y_i$$

subject to:

$$\text{(1)} \quad \sum_{i=1}^{n} \sum_{j=1}^{m} l_i \, x_{ij} \le L_{max}$$

$$\text{(2)} \quad \sum_{j=1}^{m} x_{ij} \le 1, \quad \forall i$$

$$\text{(3)} \quad \sum_{j=1}^{m} x_{ij} \le y_k, \quad \forall\, b_k \in B_i$$

$$\text{(4)} \quad y_k \le \sum_{s_i \in S_{b_k}} \sum_{j=1}^{m} x_{ij}, \quad \forall\, b_k \in B$$

$$x_{ij}, y_i \in \{0,1\}$$

where formula one ensures that the length of the generated review does not exceed the preset value, $L_{max}$ denoting the text length of the generated review; formula two ensures that each sentence can belong to only one topic in the generated text; formula three ensures that if sentence $s_i$ is selected then all of its bigrams are also selected, $B_i$ denoting the bigram set of candidate sentence $i$; and formula four ensures that if $b_k$ is selected then at least one sentence containing that bigram is also selected, $S_{b_k}$ denoting the set of sentences containing $b_k$;
the optimal sentence selection problem is converted into a linear programming problem, which is then solved to obtain the optimal sentence selection;
S5, sentence ordering; the sentences are ordered according to an ordering strategy to generate a review of domestic and foreign literature.
2. The method for automatically generating a Chinese literature review according to claim 1, wherein: in step S3.1, sentences are represented as vectors, specifically: sentences are operated on in a vector space, each sentence being regarded as a combination of its word sequence; the vectors of the words in the sentence are added and the average is taken as the sentence's vector representation:

$$s_v = \frac{1}{n} \sum_{i=1}^{n} w_i$$

where $w_i$ denotes the vector of the i-th word in the sentence, $n$ the number of words the sentence contains, and $s_v$ the vector representation of the sentence.
3. The method for automatically generating a Chinese literature review according to claim 2, wherein: in step S3.1, the sentence importance score is measured as follows: similarity is computed between a candidate sentence and every sentence of the corresponding standard review in the given training set, and the maximum is taken as the candidate sentence's importance score:

$$\mathrm{importance\_score}(s) = \max_{st \in S^*} \mathrm{similarity}(s, st)$$

where $s$ denotes a candidate sentence from the references and $S^*$ the sentence set of the corresponding standard review text in the training set; $\mathrm{similarity}(s, st)$ is the similarity between sentences $s$ and $st$, measured by the cosine distance:

$$\mathrm{similarity}(s, st) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

where $A = (A_1, A_2, \ldots, A_n)$ is the vector of sentence $s$ and $B = (B_1, B_2, \ldots, B_n)$ the vector of sentence $st$.
4. The method for automatically generating a Chinese literature review according to claim 3, wherein: in step S3.1, sentences in different languages are handled cross-lingually: foreign-language material is translated into Chinese by machine translation, and text similarity is then computed within the same language.
5. The method for automatically generating a Chinese literature review according to claim 4, wherein: in step S3.2, sentence importance score prediction specifically comprises: a regression model is used to predict sentence importance scores; each sentence is a sample and its importance score is the regression target; the importance scores importance_score(s) and features of the training-set sentences are fed into the regression model to train it, and the trained model then predicts the importance scores of the test-set sentences.
6. The method for automatically generating a Chinese literature review according to claim 5, wherein: in step S3.2, keyword, sentence length, title, TF-IDF, part-of-speech, professional-term, and stop-word features are extracted; the regression model adopts a random forest, chosen by comparing the accuracy of several learners on the scoring task.
7. The method for automatically generating a Chinese literature review according to claim 6, wherein: in step S3.3, topic distributions of sentences are computed as follows:
S3.3.1, corpus partitioning; latent Dirichlet allocation is used to compute sentence topic distributions; corresponding literature reviews are generated from the contents of multiple academic documents, the reference document set of each sample is used on its own as an LDA training corpus, an LDA model is trained on each sample's corpus, and that model yields the topic distributions of the reference sentences in the sample;
S3.3.2, determination of the number of topics; the similarity of topics $Z_i$ and $Z_j$ is defined as:

$$\mathrm{sim}(Z_i, Z_j) = \frac{\beta_i \cdot \beta_j}{\|\beta_i\|\,\|\beta_j\|}$$

where $\beta_i$ and $\beta_j$ are the topic vectors of $Z_i$ and $Z_j$, respectively; with the number of topics set to $m$, the average topic similarity is defined as:

$$\overline{\mathrm{sim}} = \frac{\sum_{i=1}^{m} \sum_{j=i+1}^{m} \mathrm{sim}(Z_i, Z_j)}{m(m-1)/2}$$

S3.3.3, predicting the topic distribution of sentences; sentence topic computation divides mainly into the following steps:
1) The reference document set is split into sentences and segmented into words, and the word set obtained after removing stop words serves as the training corpus;
2) The corpus is fed into an LDA model, which is trained iteratively until an optimal LDA model is obtained;
3) Sentences whose topics are needed are fed into the trained LDA model, yielding their topic distributions.
8. The method for automatically generating a Chinese literature review according to claim 7, wherein: in step S5, sentences are ordered as follows:
for any two sentences a and b,
1) If a and b come from the same article, they are arranged in their order of appearance in the source article;
2) If a and b do not belong to the same article, they are sorted by the publication years of their source articles, the article with the earlier date coming first;
3) If the years are the same, they are ranked by the importance of the source articles, considered from three aspects: the citation count of the article the sentence belongs to, the impact factor of the publishing journal, and the author's degree of contribution in the field;
the citation count, journal impact factor, and author contribution are denoted by the indices reference, if, and contribution, respectively; the indices are normalized to [0,1], and finally the indices are weighted and combined to obtain the article's ranking score:

$$\mathrm{score} = \lambda_1 \cdot \mathrm{reference} + \lambda_2 \cdot \mathrm{if} + \lambda_3 \cdot \mathrm{contribution}$$

$$\lambda_1 + \lambda_2 + \lambda_3 = 1$$

$$0 \le \lambda_1, \lambda_2, \lambda_3 \le 1$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weight parameters of reference, if, and contribution, respectively.
CN201910567582.4A 2019-06-27 2019-06-27 Method for automatically generating Chinese literature reviews Active CN110852096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910567582.4A CN110852096B (en) 2019-06-27 2019-06-27 Method for automatically generating Chinese literature reviews


Publications (2)

Publication Number Publication Date
CN110852096A CN110852096A (en) 2020-02-28
CN110852096B true CN110852096B (en) 2023-04-18

Family

ID=69595762

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910567582.4A Active CN110852096B (en) 2019-06-27 2019-06-27 Method for automatically generating Chinese literature reviews

Country Status (1)

Country Link
CN (1) CN110852096B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310478B (en) * 2020-03-18 2023-09-19 电子科技大学 Similar sentence detection method based on TF-IDF and word vector
CN111666472B (en) * 2020-06-12 2023-03-28 郑州轻工业大学 Intelligent identification method for academic chain nodes

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570171A (en) * 2016-11-03 2017-04-19 中国电子科技集团公司第二十八研究所 Semantics-based sci-tech information processing method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11205103B2 (en) * 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570171A (en) * 2016-11-03 2017-04-19 中国电子科技集团公司第二十八研究所 Semantics-based sci-tech information processing method and system

Also Published As

Publication number Publication date
CN110852096A (en) 2020-02-28

Similar Documents

Publication Publication Date Title
US9201957B2 (en) Method to build a document semantic model
EP0889417A2 (en) Text genre identification
US11893537B2 (en) Linguistic analysis of seed documents and peer groups
Efat et al. Automated Bangla text summarization by sentence scoring and ranking
CN108920455A (en) A kind of Chinese automatically generates the automatic evaluation method of text
JP2014106665A (en) Document retrieval device and document retrieval method
CN110852096B (en) Method for automatically generating Chinese literature reviews
Lin et al. A simple but effective method for Indonesian automatic text summarisation
Akther et al. Compilation, analysis and application of a comprehensive Bangla Corpus KUMono
Pettersson et al. HistSearch-Implementation and Evaluation of a Web-based Tool for Automatic Information Extraction from Historical Text.
US6973423B1 (en) Article and method of automatically determining text genre using surface features of untagged texts
JP2005196572A (en) Summary making method of multiple documents
JP4428703B2 (en) Information retrieval method and system, and computer program
JP2002278982A (en) Information extracting method and information retrieving method
Shaikh et al. An intelligent framework for e-recruitment system based on text categorization and semantic analysis
Suzen et al. LScDC-new large scientific dictionary
BAZRFKAN et al. Using machine learning methods to summarize persian texts
Erbs et al. Hierarchy identification for automatically generating table-of-contents
Matias Mendoza et al. Ground truth Spanish automatic extractive text summarization bounds
Luo et al. Extract domain terminologies for knowledge graph construction using domain feature vectors
Fujii et al. Cyclone: An encyclopedic Web search site
Yakymenko et al. Methods and means of intelligent analysis of text documents
Reiner et al. Similarities Between Human Structured Subject Indexing and Probabilistic Topic Models
Zaragoza et al. Translating Knowledge Representations with Monolingual Word Embeddings: The Case of a Thesaurus on Corporate Non-Financial Reporting
Mallek et al. Accurate Context Extraction from Unstructured Text Based on Deep Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant