CN110852096A - Method for automatically generating Chinese literature reviews - Google Patents

Method for automatically generating Chinese literature reviews

Info

Publication number
CN110852096A
CN110852096A (application CN201910567582.4A)
Authority
CN
China
Prior art keywords
sentence
sentences
importance
topic
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910567582.4A
Other languages
Chinese (zh)
Other versions
CN110852096B (en)
Inventor
王会进
朱蔚恒
龙舜
陈俊标
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University filed Critical Jinan University
Priority to CN201910567582.4A priority Critical patent/CN110852096B/en
Publication of CN110852096A publication Critical patent/CN110852096A/en
Application granted granted Critical
Publication of CN110852096B publication Critical patent/CN110852096B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method for automatically generating a Chinese literature review, relating to the field of literature reviews. The method specifically comprises the following steps: S1, data preprocessing; S2, feature extraction; S3, sentence importance scoring and topic information extraction; S4, sentence selection; and S5, sentence ordering. The solution provided by the invention is suitable for Chinese and supports generating reviews that mix Chinese and English sources; combined with different corpora/dictionaries it can generate reviews for different disciplines, automatically producing literature reviews that follow the habits and requirements of each discipline and arranging sentences more reasonably and flexibly according to those requirements.

Description

Method for automatically generating Chinese literature reviews
Technical Field
The invention relates to the technical field of literature reviews, and in particular to a method for automatically generating a Chinese literature review.
Background
A literature review is a genre distinct from the research paper: it is produced after a researcher reads the literature on a given topic and then understands, organizes, integrates, comprehensively analyzes and evaluates it. A basic literature review summarizes and evaluates the existing knowledge about a research topic to characterize its current state; a high-level literature review surveys the relevant literature in a chosen research area and helps researchers find suitable topics and points of innovation. Retrieving and reading the literature is an important prerequisite for writing a review. Researchers often struggle with the vast body of existing literature in their chosen area, which they must first understand before they can discuss how to innovate beyond it, and even experts in a research area face the challenge of keeping up with advances in a rapidly developing field. Today's scientific research is often highly interdisciplinary, which means researchers need to know many related fields in addition to their own; this puts higher demands on the readability of literature reviews. The traditional way to understand the state of research in a field is through review articles, but good review articles require researchers in that field to spend a great deal of time and effort, and research in some fields develops day by day, so reading a limited number of review articles sometimes cannot meet researchers' needs in time, while searching and reading related articles through an academic search engine also takes considerable time. Automatic review generation for academic literature can solve the problems set forth above. At present, research abroad on automatic review generation for English academic literature has already achieved certain results, whereas research on Chinese academic literature is still at the starting stage and mature results are rarely seen yet.
Automatic document summarization has long been an important branch of natural language processing and is now also used for the automatic generation of literature reviews. By generation mode, summarization methods fall into two classes: extractive and abstractive. Extractive methods first use natural language processing techniques to assign importance scores to the structural units of the documents, then select the most important units and order them to obtain the summary. Abstractive methods are generally based on deep learning models and generate new summary sentences through rephrasing, synonym substitution, sentence compression and similar techniques. Because the sentences of academic literature are more rigorous than those of ordinary documents, the improper use of a single word or symbol can make the expressed meaning drift far from the original: a miss of a millimeter becomes an error of a thousand miles. Abstractive methods cannot guarantee the grammatical correctness and semantic accuracy of the generated sentences, so research on automatic review generation for academic literature currently adopts the extractive approach in general.
Current research on automatic review generation for academic literature is generally based on English data sets. By the number of input documents, methods divide into two categories: single-document review generation and multi-document review generation [3]; the input of a single-document method is one document, while the input of a multi-document method is a set of documents.
The disadvantages of prior solutions include: 1) lack of support for Chinese; 2) lack of support for cross-language documents; 3) insufficient support for the differing requirements of different disciplines; and 4) no consideration of ordering the generated statements according to the conventions of each discipline.
Disclosure of Invention
In order to overcome the above defects in the prior art, an embodiment of the invention provides a method for automatically generating a Chinese literature review. The proposed solution is suitable for Chinese and supports generating reviews that mix Chinese and English sources; combined with different corpora/dictionaries it can generate reviews for different disciplines, automatically producing literature reviews that follow the habits and requirements of each discipline and arranging sentences more reasonably and flexibly according to those requirements.
In order to achieve this purpose, the invention provides the following technical solution: a method for automatically generating a Chinese literature review, specifically comprising the following steps:
S1, data preprocessing: segment the text into sentences and words, construct a professional dictionary for each discipline, and use the professional dictionary to extract discipline-related features so that sentence importance can be evaluated more reasonably;
S2, feature extraction: analyze the textual characteristics of academic literature and extract features sentence by sentence, the extracted features including semantic features, non-semantic features and discipline-related features;
S3, sentence importance scoring and topic information extraction, specifically comprising:
S3.1, use the similarity between candidate sentences and the standard review as the measure of sentence importance, and feed the computed similarity together with the extracted sentence features into a regression model;
S3.2, predict sentence importance with the trained regression model;
S3.3, input the candidate sentences into an LDA topic model and compute their topic distributions with the trained model;
S4, sentence selection: design an optimization strategy for sentence selection that jointly considers sentence importance and topic information, then select the sentences;
S5, sentence ordering: order the sentences according to an ordering strategy to generate a readable review of the domestic and foreign literature.
In a preferred embodiment, in step S3.1, sentences are represented as vectors, specifically: sentences are operated on in a vector space; each sentence is regarded as a combination of its word sequence, so the vectors of the words in the sentence are added together (component-wise) and the average is taken as the vector representation of the sentence:

s_v = (1/n) · Σ_{i=1..n} w_i

where w_i denotes the vector of the i-th word in the sentence, n the number of words the sentence contains, and s_v the vector representation of the sentence.
In a preferred embodiment, in step S3.1, the sentence importance score is measured as follows: the similarity between a candidate sentence and every sentence of the corresponding standard review in the given training set is computed, and the maximum is taken as the importance score of the candidate sentence:

score(s) = max_{st ∈ S*} similarity(s, st)

where s denotes a candidate sentence from the references and S* the sentence set of the corresponding standard review text in the training set; similarity(s, st) denotes the similarity between sentence s and sentence st, measured by the cosine of their vectors:

similarity(s, st) = (A · B) / (|A| · |B|) = Σ_{i=1..n} A_i·B_i / (sqrt(Σ_{i=1..n} A_i²) · sqrt(Σ_{i=1..n} B_i²))

where A = (A_1, A_2, …, A_n) is the vector of sentence s and B = (B_1, B_2, …, B_n) is the vector of sentence st.
In a preferred embodiment, in step S3.1, sentences in different languages are handled cross-lingually: foreign-language material is translated into Chinese by machine translation, and text similarity is then computed within the same language.
In a preferred embodiment, in step S3.2, sentence importance scores are predicted as follows: a regression model predicts the importance score of each sentence; each sentence is taken as a sample and its importance score as the output of the regression model; the importance scores and features of the training-set sentences are input to train the regression model, and the trained model is then used to predict the importance scores of the test-set sentences.
In a preferred embodiment, in step S3.2, keyword, sentence-length, title, TF-IDF, part-of-speech, professional-term and stop-word features are extracted; the regression model adopts a random forest model, chosen by comparing the accuracy of several learners on the scoring task.
In a preferred embodiment, in step S3.3, the topic distribution of sentences is calculated as follows:
S3.3.1, corpus partitioning: sentence topic distributions are computed with latent Dirichlet allocation; since the review of each sample is generated from the contents of several academic documents, the reference-document set of each sample is used on its own as the LDA training corpus, an LDA model is trained on each sample's corpus, and that model is used to obtain the topic distributions of the reference-document sentences of the sample;
S3.3.2, determination of the number of topics: the similarity between topics Z_i and Z_j is defined as

sim(Z_i, Z_j) = cos(β_i, β_j)

where β_i and β_j are the topic vectors of Z_i and Z_j respectively; with the number of topics set to m, the average similarity of the topics is defined as

avg_sim = (2 / (m(m-1))) · Σ_{i=1..m-1} Σ_{j=i+1..m} sim(Z_i, Z_j)

S3.3.3, predicting the topic distribution of sentences; the sentence topic computation is divided into the following steps:
1) segment the reference-document set into sentences and words, and use the word set obtained after removing stop words as the training corpus;
2) input the corpus into an LDA model and train it iteratively until an optimal LDA model is obtained;
3) input the sentences whose topics need to be calculated into the trained LDA model to obtain their topic distributions.
In a preferred embodiment, in step S4, the best sentences are selected as follows:
in the sentence-selection process, the importance scores and topic distributions of the sentences are considered jointly; sentence selection is converted into an optimization problem, and an optimal sentence set is obtained by solving the objective function;
the first part of the objective function is:

f1 = Σ_{i=1..n} Σ_{j=1..m} l_i · R_i · Q_ij · x_ij

where n denotes the number of candidate sentences, m the number of topics, l_i the length of candidate sentence i, R_i its importance score, Q_ij the degree of relevance of sentence i to topic j, and x_ij whether sentence i is selected and finally assigned topic j;
the second part of the objective function is:

f2 = Σ_{b_k ∈ B} c_k · y_k

where B denotes the set of bigrams contained in the candidate sentences, b_k a bigram in set B, c_k its number of occurrences, and y_k whether b_k is contained in the generated review; the occurrence count c_k is added as the weight of each bigram so that more important bigrams are included;
combining the two parts yields the objective function:

max Σ_{i=1..n} Σ_{j=1..m} l_i · R_i · Q_ij · x_ij + Σ_{b_k ∈ B} c_k · y_k

subject to:

(1) Σ_{i=1..n} Σ_{j=1..m} l_i · x_ij ≤ L_max
(2) Σ_{j=1..m} x_ij ≤ 1 for every sentence i
(3) Σ_{j=1..m} x_ij ≤ y_k for every sentence s_i and every bigram b_k ∈ B_i
(4) Σ_{s_i ∈ S_k} Σ_{j=1..m} x_ij ≥ y_k for every bigram b_k
x_ij, y_k ∈ {0, 1}

where formula (1) ensures that the length of the generated review text does not exceed the preset value, L_max denoting the text length of the generated review; formula (2) ensures that each sentence belongs to at most one topic when the text is generated; formula (3) ensures that if sentence s_i is selected, all of its bigrams are also selected, B_i denoting the bigram set of candidate sentence i; formula (4) ensures that a bigram is selected only if some selected sentence contains it, S_k denoting the set of sentences containing bigram b_k;
the optimal selection problem of the sentences is thus converted into a linear programming problem, which is then solved to obtain the optimal result of the sentence selection.
In a preferred embodiment, in step S5, the sentences are ordered by the following steps:
for any two sentences a and b,
1) if a and b come from the same article, they keep their order of appearance in the source article;
2) if a and b come from different articles, they are sorted by the publication year of their source articles, earlier articles first;
3) if the years are the same, they are ranked by the importance of the source articles, considered from three aspects: the citation count of the article the sentence belongs to, the impact factor of the journal it was published in, and the contribution of its author in the field;
the three importance indicators are denoted reference, if and contribution respectively, each normalized to [0, 1], and the indicators are then weighted and combined to obtain the ranking score of an article:

score = λ1·reference + λ2·if + λ3·contribution
λ1 + λ2 + λ3 = 1
0 ≤ λ1, λ2, λ3 ≤ 1

where λ1, λ2 and λ3 are the weight parameters of reference, if and contribution respectively.
The technical effects and advantages of the invention are:
1. the proposed solution is suitable for Chinese and supports generating reviews that mix Chinese and English sources; combined with different corpora/dictionaries it can generate reviews for different disciplines, automatically producing literature reviews that follow the habits and requirements of each discipline and arranging sentences more reasonably and flexibly according to those requirements;
2. the invention can automatically and quickly generate a review of given literature, helping domestic researchers grasp the development status of related fields in time and saving precious time.
Drawings
FIG. 1 is a flow chart of the overall scheme of the present invention.
FIG. 2 is a diagram illustrating a sentence scoring prediction process according to the present invention.
FIG. 3 is a diagram illustrating corpus partitioning during LDA model training according to the present invention.
FIG. 4 is a diagram illustrating the main process of sentence topic distribution calculation according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
The method for automatically generating a Chinese literature review, shown in FIG. 1, specifically comprises the following steps:
S1, data preprocessing: segment the text into sentences and words, construct a professional dictionary for each discipline to improve word-segmentation accuracy, and use the professional dictionary to extract discipline-related features so that sentence importance can be evaluated more reasonably;
S2, feature extraction: analyze the textual characteristics of academic literature and extract features sentence by sentence, the extracted features including semantic features (such as the similarity between a sentence and the title), non-semantic features (such as sentence length) and discipline-related features (features extracted with the discipline's professional dictionary);
S3, sentence importance scoring and topic information extraction, specifically comprising:
S3.1, use the similarity between candidate sentences and the standard review as the measure of sentence importance, and feed the computed similarity together with the extracted sentence features into a regression model;
S3.2, predict sentence importance with the trained regression model;
S3.3, input the candidate sentences into an LDA topic model and compute their topic distributions with the trained model;
S4, sentence selection: whether sentences are selected reasonably directly determines the quality of the generated review; design an optimization strategy for sentence selection that jointly considers sentence importance and topic information, then select the sentences;
S5, sentence ordering: order the sentences according to an ordering strategy to generate a readable review of the domestic and foreign literature.
Example 2:
A. Calculation of the importance score of a sentence
A.1 Vector representation of sentences
The preprocessed academic text is a series of character strings and is not suitable for direct computation, so each sentence is represented by a vector. Specifically, sentences are operated on in a vector space: each sentence is regarded as a combination of its word sequence, so the vectors of the words in the sentence are added together (component-wise) and the average is taken as the vector representation of the sentence:

s_v = (1/n) · Σ_{i=1..n} w_i

where w_i denotes the vector of the i-th word in the sentence, n the number of words the sentence contains, and s_v the vector representation of the sentence. Word embeddings use Word2Vec: initial word vectors are first trained on a Chinese Wikipedia corpus with the open-source gensim library [36]. Because a Word2Vec model trained only on the Chinese Wikipedia corpus does not represent terms from some academic fields accurately enough, the crawled Chinese academic-literature corpus is fed into the trained Word2Vec model for incremental training, so that Word2Vec represents academic terms more accurately.
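As an illustration, here is a minimal sketch of the sentence-vector computation, assuming a gensim Word2Vec model already trained as described above; the model path and the use of jieba for segmentation are assumptions, not specified by the patent:

```python
import numpy as np
import jieba  # Chinese word segmentation (an assumed choice of tokenizer)
from gensim.models import Word2Vec

# Hypothetical path to the incrementally trained model described above.
model = Word2Vec.load("word2vec_academic.model")

def sentence_vector(sentence: str) -> np.ndarray:
    """s_v = (1/n) * sum(w_i): average the vectors of in-vocabulary words."""
    words = [w for w in jieba.cut(sentence) if w in model.wv]
    if not words:
        return np.zeros(model.vector_size)
    return np.mean([model.wv[w] for w in words], axis=0)
```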
A.2 Sentence importance score metric
The importance score of a sentence is the main basis for selecting sentences when finally generating the review. To evaluate the importance of a candidate sentence, its similarity to every sentence of the corresponding standard review in the given training set is computed, and the maximum is taken as the importance score of the candidate sentence. This rests on the following assumption: if the extracted sentences are highly similar to the sentences of the standard literature review, a review generated from those sentences will be closer to the standard review. The sentence importance score is computed as:

score(s) = max_{st ∈ S*} similarity(s, st)

where s denotes a candidate sentence from the references and S* the sentence set of the corresponding standard review text in the training set; similarity(s, st) denotes the similarity between sentence s and sentence st, measured by the cosine of their vectors:

similarity(s, st) = (A · B) / (|A| · |B|) = Σ_{i=1..n} A_i·B_i / (sqrt(Σ_{i=1..n} A_i²) · sqrt(Σ_{i=1..n} B_i²))

where A = (A_1, A_2, …, A_n) is the vector of sentence s and B = (B_1, B_2, …, B_n) is the vector of sentence st.
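A short sketch of this scoring step, reusing the sentence_vector helper from A.1 (the helper names are illustrative):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """similarity(s, st) = (A . B) / (|A| * |B|), returning 0 for zero vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def importance_score(candidate: str, standard_sentences: list) -> float:
    """score(s) = max over st in S* of similarity(s, st)."""
    sv = sentence_vector(candidate)
    return max(cosine_similarity(sv, sentence_vector(st))
               for st in standard_sentences)
```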
A.3 Cross-language processing
The main problem posed by multi-language references lies in sentence importance evaluation: foreign-language material is translated into Chinese by machine translation, and text similarity is then computed within the same language.
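A sketch of how this step can be wired together; the patent does not name a specific machine-translation system, so the translate callable is left abstract:

```python
from typing import Callable, List

def cross_language_score(candidate: str, standard_sentences: List[str],
                         translate: Callable[[str], str],
                         is_foreign: bool) -> float:
    """Score a candidate against the Chinese standard review, translating
    first if the candidate is in a foreign language, as described above."""
    if is_foreign:
        candidate = translate(candidate)  # any MT system; not specified by the patent
    return importance_score(candidate, standard_sentences)  # from the A.2 sketch
```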
A.4 Sentence importance score prediction
A regression model is used to predict the importance scores of sentences: each sentence is taken as a sample and its importance score as the regression output; the importance scores and features of the training-set sentences are input to train the regression model, and the trained model then predicts the importance scores of the test-set sentences. Through analysis of related literature, combined with the textual characteristics of academic documents, the invention extracts a series of features, including:
1) keyword feature: Jones et al. consider the keyword score a valid feature for text summarization, and the national standard describes keywords in academic papers as words or terms selected from reports and articles, for document-indexing purposes, to represent the subject of the whole text. Clearly, in academic literature the keywords plainly and intuitively represent the topics the document discusses, and important sentences are likely to contain more keywords;
2) sentence length: Teufel et al., in their study of automatic summarization of scientific literature, created a binary sentence-length feature indicating whether the sentence length exceeds a set threshold. Here the length of the sentence is used directly as a feature; in general, long sentences in a text carry more information than short ones;
3) title feature: the title summarizes the content and core of the whole article, so the similarity between a sentence and the title is used as the value of the title feature;
4) TF-IDF feature: term frequency-inverse document frequency (TF-IDF) measures the importance of a word in a document: if a word appears frequently in an article but rarely in the corpus, the word is considered more important to that article. The TF-IDF value of every word in the sentence is computed (ignoring stop words), and the average TF-IDF value of the words is taken as the TF-IDF value of the sentence;
5) part-of-speech feature: a literature review is a summary of the corresponding references and should be highly informative, and nouns are strong indicators of a sentence's information content; the proportion and the absolute number of nouns in a sentence are computed as its part-of-speech feature values;
6) professional-term feature: professional terms are essentially nouns denoting concepts that belong to, and are restricted to, a specific field or an entire concept system of a discipline; professional terms therefore mark, to some degree, the importance of a sentence in the original text;
7) stop-word feature: in natural language processing, stop words are generally considered to carry no actual meaning, so the proportion of stop words in a sentence describes, to some extent, its information richness: the lower the proportion of stop words, the more useful information the sentence carries, which helps evaluate sentence importance. The individual features are described in Table 1:
TABLE 1 Extracted features and their descriptions

Feature            Description
keyword            number of keywords contained in the sentence
sentence length    length of the sentence
title              similarity between the sentence and the title
TF-IDF             average TF-IDF value of the words in the sentence
part of speech     proportion and number of nouns in the sentence
professional term  number of professional terms in the sentence
stop word          proportion of stop words in the sentence
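A sketch of how the Table 1 features might be assembled per sentence; jieba's part-of-speech tagger and the dictionary arguments (keywords, stop words, professional terms, a word-to-TF-IDF map) are assumptions about the concrete inputs:

```python
import jieba
import jieba.posseg as pseg

def sentence_features(sentence, title, keywords, stopwords, terms, tfidf):
    """One row of the Table 1 feature vector for a sentence.
    tfidf maps a word to its TF-IDF value; the other arguments are sets."""
    words = list(jieba.cut(sentence))
    flags = [pair.flag for pair in pseg.cut(sentence)]  # POS tags
    n = max(len(words), 1)
    content = [w for w in words if w not in stopwords]
    return {
        "keyword": sum(w in keywords for w in words),
        "length": len(sentence),
        "title_sim": cosine_similarity(sentence_vector(sentence),
                                       sentence_vector(title)),
        "tfidf": sum(tfidf.get(w, 0.0) for w in content) / max(len(content), 1),
        "noun_ratio": sum(f.startswith("n") for f in flags) / max(len(flags), 1),
        "term": sum(w in terms for w in words),
        "stopword_ratio": sum(w in stopwords for w in words) / n,
    }
```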
The accuracy of several machine-learning models on the scoring task was compared: linear regression (LR), support vector regression (SVR), classification and regression trees (CART) and random forest. The random forest model, which achieved the highest accuracy, is adopted; the prediction process is shown in FIG. 2.
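A minimal sketch of the training step with scikit-learn's random forest; the hyperparameters are illustrative, not taken from the patent:

```python
from sklearn.ensemble import RandomForestRegressor

def train_score_predictor(X_train, y_train):
    """Fit the sentence-importance regressor: X_train holds one feature row
    per sentence (Table 1), y_train the similarity-based scores from A.2."""
    reg = RandomForestRegressor(n_estimators=100, random_state=0)
    reg.fit(X_train, y_train)
    return reg

# Usage: scores = train_score_predictor(X_train, y_train).predict(X_test)
```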
Example 3:
B. Calculating the topic distribution of sentences
B.1 Corpus partitioning
Latent Dirichlet allocation is used to compute the topic distributions of sentences. A literature review typically organizes content by topic, where the topics may be different research themes or different aspects of one broad research theme. The contents of several academic documents are used to generate the corresponding review, each review covers different topics, and the topics of different sample reviews are independent of one another, so the reference-document set of each sample is used on its own as the LDA training corpus.
As shown in FIG. 3, an LDA model is trained on the corpus of each sample, and that model is used to obtain the topic distributions of the reference-document sentences of the sample.
B.2 Determination of the number of topics
Latent Dirichlet allocation is an unsupervised learning algorithm: a topic model can be trained given only a training corpus and a number of topics. A literature review presents content from different topics, and when the number of topics used for training is too large or too small, the quality of the generated review drops. The optimal number of topics is therefore determined, following related literature and the characteristics of the experiments here, with the method proposed by Cao et al., which defines the similarity between topics Z_i and Z_j as:

sim(Z_i, Z_j) = cos(β_i, β_j)

where β_i and β_j are the topic vectors of Z_i and Z_j respectively; with the number of topics set to m, the average similarity of the topics is defined as:

avg_sim = (2 / (m(m-1))) · Σ_{i=1..m-1} Σ_{j=i+1..m} sim(Z_i, Z_j)

When avg_sim reaches its minimum, the LDA model is optimal, i.e. the chosen number of topics is optimal; the detailed calculation process is shown in Algorithm 1.
Cosine similarity is adopted to measure the similarity between topics, as follows: after the training corpus and the number of topics are input into the LDA topic model for training, the topic-word distribution of each topic is obtained; each keyword representing a topic is converted into its word vector with the Word2Vec model trained for the field, each keyword vector is multiplied by its weight coefficient, the weighted vectors are added component-wise, and the resulting vector is used as the vector representation of the topic.
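A sketch of the topic-number search, assuming gensim's LdaModel and reusing the word-vector helpers from Example 2; the candidate range and topn cutoff are illustrative assumptions:

```python
from itertools import combinations
from gensim import corpora
from gensim.models import LdaModel

def topic_vector(lda, topic_id, topn=20):
    """Weighted component-wise sum of the word vectors of the topic's top
    words, weighted by their topic-word probabilities, as described above."""
    return sum(p * sentence_vector(w)
               for w, p in lda.show_topic(topic_id, topn=topn))

def average_topic_similarity(lda, m):
    """avg_sim = 2/(m*(m-1)) * sum of pairwise topic similarities."""
    vecs = [topic_vector(lda, k) for k in range(m)]
    sims = [cosine_similarity(a, b) for a, b in combinations(vecs, 2)]
    return 2.0 * sum(sims) / (m * (m - 1))

def best_topic_number(texts, candidates=range(2, 11)):
    """Roughly Algorithm 1: pick the m minimizing average inter-topic similarity."""
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    scores = {m: average_topic_similarity(
                  LdaModel(corpus=corpus, id2word=dictionary, num_topics=m), m)
              for m in candidates}
    return min(scores, key=scores.get)
```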
B.3 Predicting the topic distribution of sentences
The topic distribution calculation process for candidate sentences is shown in FIG. 4. The sentence topic computation is divided into the following steps:
1) segment the reference-document set into sentences and words, and use the word set obtained after removing stop words as the training corpus;
2) input the corpus into an LDA model and train it iteratively until an optimal LDA model is obtained;
3) input the sentences whose topics need to be calculated into the trained LDA model to obtain their topic distributions.
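A sketch of step 3, inferring a sentence's topic mixture with a trained gensim LDA model (helper names reused from the earlier sketches):

```python
import numpy as np
import jieba

def sentence_topic_distribution(lda, dictionary, sentence, num_topics):
    """Return the dense topic distribution of one sentence."""
    bow = dictionary.doc2bow(list(jieba.cut(sentence)))
    dist = np.zeros(num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        dist[topic_id] = prob
    return dist
```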
Example 4:
C. Selecting the best sentences
Sentence importance prediction and topic computation yield the importance score and the topic distribution of each sentence. Automatic review generation for academic literature is a multi-document summarization problem in the academic domain: the generated text must stay close to the standard review, and sentences describing the same topic must be gathered together. In the sentence-selection process, the importance scores and topic distributions of the sentences are therefore considered jointly; following the optimization framework proposed by Yue Hu et al., sentence selection is converted into an optimization problem, and an optimal sentence set is obtained by solving the objective function.
The first part of the objective function is:

f1 = Σ_{i=1..n} Σ_{j=1..m} l_i · R_i · Q_ij · x_ij

where n denotes the number of candidate sentences, m the number of topics, l_i the length of candidate sentence i, R_i its importance score, Q_ij the degree of relevance of sentence i to topic j, and x_ij whether sentence i is selected and finally assigned topic j.
the present invention adds sentence length to the objective function to penalize short sentences, otherwise the objective function will tend to select more short sentences. Also the objective function should not be inclined to select very long sentences. When optimizing and selecting a sentence using an objective function, it is necessary to set the length of generating a document summary in advance. Thus, if the objective function tends to select a long sentence, then there are fewer choices, which may result in the generated document summary containing less information than is possible to make a comprehensive summary of the reference difficult. To solve the problem, a trade-off needs to be made between the number of the selected sentences and the average length of the sentences, and the addition of the variable plays a role, and if not, in the process of solving the objective function, in order to make the final value of the objective function as large as possible, more short sentences are selected by the objective function.
To avoid redundancy, the sentences selected to generate the review should not contain repeated information, so that the generated review content is minimally redundant; the second part of the objective function is therefore:

f2 = Σ_{b_k ∈ B} c_k · y_k

where B denotes the set of bigrams contained in the candidate sentences, b_k a bigram in set B, c_k its number of occurrences, and y_k whether b_k is contained in the generated review; the occurrence count c_k is added as the weight of each bigram so that more important bigrams are included.
Combining the two parts yields the objective function:

max Σ_{i=1..n} Σ_{j=1..m} l_i · R_i · Q_ij · x_ij + Σ_{b_k ∈ B} c_k · y_k

subject to:

(1) Σ_{i=1..n} Σ_{j=1..m} l_i · x_ij ≤ L_max
(2) Σ_{j=1..m} x_ij ≤ 1 for every sentence i
(3) Σ_{j=1..m} x_ij ≤ y_k for every sentence s_i and every bigram b_k ∈ B_i
(4) Σ_{s_i ∈ S_k} Σ_{j=1..m} x_ij ≥ y_k for every bigram b_k
x_ij, y_k ∈ {0, 1}

where formula (1) ensures that the length of the generated review text does not exceed the preset value, L_max denoting the text length of the generated review; formula (2) ensures that each sentence belongs to at most one topic when the text is generated; formula (3) ensures that if sentence s_i is selected, all of its bigrams are also selected, B_i denoting the bigram set of candidate sentence i; formula (4) ensures that a bigram is selected only if some selected sentence contains it, S_k denoting the set of sentences containing bigram b_k.
In short, the invention converts the optimal sentence-selection problem into a linear programming problem and then solves it to obtain the optimal result of the sentence selection.
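A sketch of the linear program in the notation reconstructed above (l_i, R_i, Q_ij, c_k, L_max); PuLP is one possible solver choice, not named by the patent:

```python
import pulp

def select_sentences(l, R, Q, sent_bigrams, bigram_counts, L_max):
    """ILP sentence selection. l[i]: length; R[i]: importance score; Q[i][j]:
    sentence-topic relevance; sent_bigrams[i]: set of bigrams of sentence i;
    bigram_counts[b]: occurrence count c of bigram b."""
    n, m = len(l), len(Q[0])
    bigrams = list(bigram_counts)
    prob = pulp.LpProblem("review_generation", pulp.LpMaximize)
    x = pulp.LpVariable.dicts("x", (range(n), range(m)), cat="Binary")
    y = pulp.LpVariable.dicts("y", bigrams, cat="Binary")
    # Objective: importance/topic term plus weighted bigram-coverage term.
    prob += (pulp.lpSum(l[i] * R[i] * Q[i][j] * x[i][j]
                        for i in range(n) for j in range(m))
             + pulp.lpSum(bigram_counts[b] * y[b] for b in bigrams))
    # (1) total length of the generated review does not exceed L_max
    prob += pulp.lpSum(l[i] * x[i][j] for i in range(n) for j in range(m)) <= L_max
    for i in range(n):
        # (2) each sentence is assigned to at most one topic
        prob += pulp.lpSum(x[i][j] for j in range(m)) <= 1
        # (3) selecting sentence i selects all of its bigrams
        for b in sent_bigrams[i]:
            prob += pulp.lpSum(x[i][j] for j in range(m)) <= y[b]
    # (4) a selected bigram must be covered by some selected sentence
    for b in bigrams:
        covering = [i for i in range(n) if b in sent_bigrams[i]]
        prob += pulp.lpSum(x[i][j] for i in covering for j in range(m)) >= y[b]
    prob.solve()
    return [i for i in range(n)
            if any(pulp.value(x[i][j]) == 1 for j in range(m))]
```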
Example 5:
D. Ordering the sentences
A research review surveys the current state of research. Disciplines differ in how they prefer to present it, but content is generally organized by time, arranged from the more distant past to the present, and readers always want to see the most important information, i.e. the more valuable viewpoints, first. The sentences are therefore ordered by the following steps:
For any two sentences a and b,
1) if a and b come from the same article, they keep their order of appearance in the source article;
2) if a and b come from different articles, they are sorted by the publication year of their source articles, earlier articles first;
3) if the years are the same, they are ranked by the importance of the source articles, which is considered from three aspects:
1. The citation count of the article the sentence belongs to. In academia, the more a paper is cited, the more valuable it is, and naturally the more valuable the viewpoints it expresses are in the field. For Chinese literature, citation counts are obtained from Baidu Scholar; for English literature, from Google Scholar;
2. The impact factor of the journal in which the article was published. The impact factor has become a universal evaluation index for journals internationally; it measures not only the usefulness and visibility of a journal but also its academic level and even the quality of its papers;
3. The contribution of the author in the field. The contribution, or influence, of the article's author in the academic domain indicates to some extent the quality of the article; articles published by senior experts in a field generally carry more influence and reference value. The author's contribution in the field is therefore measured by counting the number of articles the author has published in the field.
TABLE 2 Article importance indicators

Variable      Description
reference     number of citations of the article
if            impact factor of the journal in which the article was published
contribution  number of articles published by the author in the field
Table 2 gives the notation for each indicator. The three indicators reference, if and contribution are normalized to [0, 1], and the indicators are then weighted and combined by the formula below to obtain the ranking score of an article; sentences from different source articles with the same publication year are ordered by their articles' ranking scores, higher scores first, lower scores after:
score = λ1·reference + λ2·if + λ3·contribution
λ1 + λ2 + λ3 = 1
0 ≤ λ1, λ2, λ3 ≤ 1

where λ1, λ2 and λ3 are the weight parameters of reference, if and contribution respectively;
the ranking algorithm is shown in algorithm 2:
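A minimal sketch of the ordering rules above; the field names and λ values are illustrative assumptions, not taken from the patent:

```python
from functools import cmp_to_key

def rank_score(reference, impact_factor, contribution,
               lam1=0.4, lam2=0.3, lam3=0.3):
    """score = λ1*reference + λ2*if + λ3*contribution, inputs normalized
    to [0,1]; the λ values here are placeholders."""
    return lam1 * reference + lam2 * impact_factor + lam3 * contribution

def compare(a, b):
    """Pairwise rules 1-3: each sentence dict carries its article id, its
    position in the source article, publication year and article score."""
    if a["article"] == b["article"]:
        return a["position"] - b["position"]   # rule 1: source-article order
    if a["year"] != b["year"]:
        return a["year"] - b["year"]           # rule 2: earlier year first
    higher_first = (b["score"] > a["score"]) - (b["score"] < a["score"])
    return higher_first                        # rule 3: higher article score first

# Usage: ordered = sorted(selected, key=cmp_to_key(compare))
```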
Finally, the above description covers only preferred embodiments of the invention and does not limit it; any modifications, equivalent substitutions and improvements made within the spirit and principle of the invention are intended to fall within its scope of protection.

Claims (9)

1. A method for automatically generating a Chinese literature review, characterized in that it comprises the following steps:
S1, data preprocessing: segmenting the text into sentences and words, constructing a professional dictionary for each discipline, and using the professional dictionary to extract discipline-related features so that sentence importance can be evaluated more reasonably;
S2, feature extraction: analyzing the textual characteristics of academic literature and extracting features sentence by sentence, the extracted features comprising semantic features, non-semantic features and discipline-related features;
S3, sentence importance scoring and topic information extraction, specifically comprising:
S3.1, using the similarity between candidate sentences and the standard review as the measure of sentence importance, and feeding the computed similarity together with the extracted sentence features into a regression model;
S3.2, predicting sentence importance with the trained regression model;
S3.3, inputting the candidate sentences into an LDA topic model and computing their topic distributions with the trained model;
S4, sentence selection: designing an optimization strategy for sentence selection that jointly considers sentence importance and topic information, then selecting the sentences;
S5, sentence ordering: ordering the sentences according to an ordering strategy to generate a readable review of the domestic and foreign literature.
2. The method for automatically generating a Chinese literature review according to claim 1, characterized in that: in step S3.1, sentences are represented as vectors, specifically: sentences are operated on in a vector space; each sentence is regarded as a combination of its word sequence, so the vectors of the words in the sentence are added together (component-wise) and the average is taken as the vector representation of the sentence:

s_v = (1/n) · Σ_{i=1..n} w_i

where w_i denotes the vector of the i-th word in the sentence, n the number of words the sentence contains, and s_v the vector representation of the sentence.
3. The method for automatically generating a Chinese literature review according to claim 2, characterized in that: in step S3.1, the sentence importance score is measured as follows: the similarity between a candidate sentence and every sentence of the corresponding standard review in the given training set is computed, and the maximum is taken as the importance score of the candidate sentence:

score(s) = max_{st ∈ S*} similarity(s, st)

where s denotes a candidate sentence from the references and S* the sentence set of the corresponding standard review text in the training set; similarity(s, st) denotes the similarity between sentence s and sentence st, measured by the cosine of their vectors:

similarity(s, st) = (A · B) / (|A| · |B|) = Σ_{i=1..n} A_i·B_i / (sqrt(Σ_{i=1..n} A_i²) · sqrt(Σ_{i=1..n} B_i²))

where A = (A_1, A_2, …, A_n) is the vector of sentence s and B = (B_1, B_2, …, B_n) is the vector of sentence st.
4. The method for automatically generating a Chinese literature review according to claim 3, characterized in that: in step S3.1, sentences in different languages are handled cross-lingually: foreign-language material is translated into Chinese by machine translation, and text similarity is then computed within the same language.
5. The method for automatically generating a Chinese literature review according to claim 4, characterized in that: in step S3.2, sentence importance scores are predicted as follows: a regression model predicts the importance score of each sentence; each sentence is taken as a sample and its importance score as the output of the regression model; the importance scores and features of the training-set sentences are input to train the regression model, and the trained model is used to predict the importance scores of the test-set sentences.
6. The method for automatically generating a Chinese literature review according to claim 5, characterized in that: in step S3.2, keyword, sentence-length, title, TF-IDF, part-of-speech, professional-term and stop-word features are extracted; the regression model adopts a random forest model, chosen by comparing the accuracy of several learners on the scoring task.
7. The method for automatically generating a Chinese literature review according to claim 6, characterized in that: in step S3.3, the topic distribution of sentences is calculated as follows:
S3.3.1, corpus partitioning: computing sentence topic distributions with latent Dirichlet allocation; since the review of each sample is generated from the contents of several academic documents, using the reference-document set of each sample on its own as the LDA training corpus, training an LDA model on each sample's corpus, and obtaining with that model the topic distributions of the reference-document sentences of the sample;
S3.3.2, determination of the number of topics: defining the similarity between topics Z_i and Z_j as

sim(Z_i, Z_j) = cos(β_i, β_j)

where β_i and β_j are the topic vectors of Z_i and Z_j respectively; with the number of topics set to m, defining the average similarity of the topics as

avg_sim = (2 / (m(m-1))) · Σ_{i=1..m-1} Σ_{j=i+1..m} sim(Z_i, Z_j)

S3.3.3, predicting the topic distribution of sentences, the sentence topic computation comprising the following steps:
1) segmenting the reference-document set into sentences and words, and using the word set obtained after removing stop words as the training corpus;
2) inputting the corpus into an LDA model and training it iteratively until an optimal LDA model is obtained;
3) inputting the sentences whose topics need to be calculated into the trained LDA model to obtain their topic distributions.
8. The method for automatically generating a Chinese literature review according to claim 7, characterized in that: in step S4, the best sentences are selected as follows:
in the sentence-selection process, the importance scores and topic distributions of the sentences are considered jointly; sentence selection is converted into an optimization problem, and an optimal sentence set is obtained by solving the objective function;
the first part of the objective function is:

f1 = Σ_{i=1..n} Σ_{j=1..m} l_i · R_i · Q_ij · x_ij

where n denotes the number of candidate sentences, m the number of topics, l_i the length of candidate sentence i, R_i its importance score, Q_ij the degree of relevance of sentence i to topic j, and x_ij whether sentence i is selected and finally assigned topic j;
the second part of the objective function is:

f2 = Σ_{b_k ∈ B} c_k · y_k

where B denotes the set of bigrams contained in the candidate sentences, b_k a bigram in set B, c_k its number of occurrences, and y_k whether b_k is contained in the generated review; the occurrence count c_k is added as the weight of each bigram so that more important bigrams are included;
combining the two parts yields the objective function:

max Σ_{i=1..n} Σ_{j=1..m} l_i · R_i · Q_ij · x_ij + Σ_{b_k ∈ B} c_k · y_k

subject to:

(1) Σ_{i=1..n} Σ_{j=1..m} l_i · x_ij ≤ L_max
(2) Σ_{j=1..m} x_ij ≤ 1 for every sentence i
(3) Σ_{j=1..m} x_ij ≤ y_k for every sentence s_i and every bigram b_k ∈ B_i
(4) Σ_{s_i ∈ S_k} Σ_{j=1..m} x_ij ≥ y_k for every bigram b_k
x_ij, y_k ∈ {0, 1}

where formula (1) ensures that the length of the generated review text does not exceed the preset value, L_max denoting the text length of the generated review; formula (2) ensures that each sentence belongs to at most one topic when the text is generated; formula (3) ensures that if sentence s_i is selected, all of its bigrams are also selected, B_i denoting the bigram set of candidate sentence i; formula (4) ensures that a bigram is selected only if some selected sentence contains it, S_k denoting the set of sentences containing bigram b_k;
the optimal selection problem of the sentences is thus converted into a linear programming problem, which is then solved to obtain the optimal result of the sentence selection.
9. The method for automatically generating a Chinese literature review according to claim 8, characterized in that: in step S5, the sentences are ordered by the following steps:
for any two sentences a and b,
1) if a and b come from the same article, they keep their order of appearance in the source article;
2) if a and b come from different articles, they are sorted by the publication year of their source articles, earlier articles first;
3) if the years are the same, they are ranked by the importance of the source articles, considered from three aspects: the citation count of the article the sentence belongs to, the impact factor of the journal it was published in, and the contribution of its author in the field;
the three importance indicators are denoted reference, if and contribution respectively, each normalized to [0, 1], and the indicators are then weighted and combined to obtain the ranking score of an article:

score = λ1·reference + λ2·if + λ3·contribution
λ1 + λ2 + λ3 = 1
0 ≤ λ1, λ2, λ3 ≤ 1

where λ1, λ2 and λ3 are the weight parameters of reference, if and contribution respectively.
CN201910567582.4A 2019-06-27 2019-06-27 Method for automatically generating Chinese literature reviews Active CN110852096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910567582.4A CN110852096B (en) 2019-06-27 2019-06-27 Method for automatically generating Chinese literature reviews

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910567582.4A CN110852096B (en) 2019-06-27 2019-06-27 Method for automatically generating Chinese literature reviews

Publications (2)

Publication Number Publication Date
CN110852096A true CN110852096A (en) 2020-02-28
CN110852096B CN110852096B (en) 2023-04-18

Family

ID=69595762

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910567582.4A Active CN110852096B (en) 2019-06-27 2019-06-27 Method for automatically generating Chinese literature reviews

Country Status (1)

Country Link
CN (1) CN110852096B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570171A (en) * 2016-11-03 2017-04-19 中国电子科技集团公司第二十八研究所 Semantics-based sci-tech information processing method and system
US20180165554A1 (en) * 2016-12-09 2018-06-14 The Research Foundation For The State University Of New York Semisupervised autoencoder for sentiment analysis

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310478A (en) * 2020-03-18 2020-06-19 电子科技大学 Similar sentence detection method based on TF-IDF and word vector
CN111310478B (en) * 2020-03-18 2023-09-19 电子科技大学 Similar sentence detection method based on TF-IDF and word vector
CN111666472A (en) * 2020-06-12 2020-09-15 郑州轻工业大学 Intelligent identification method for academic chain nodes
CN111666472B (en) * 2020-06-12 2023-03-28 郑州轻工业大学 Intelligent identification method for academic chain nodes
CN117708545A (en) * 2024-02-01 2024-03-15 华中师范大学 Viewpoint contribution degree evaluation method and system integrating theme extraction and cosine similarity
CN117708545B (en) * 2024-02-01 2024-04-30 华中师范大学 Viewpoint contribution degree evaluation method and system integrating theme extraction and cosine similarity

Also Published As

Publication number Publication date
CN110852096B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
US10936824B2 (en) Detecting literary elements in literature and their importance through semantic analysis and literary correlation
US9201957B2 (en) Method to build a document semantic model
EP0889417A2 (en) Text genre identification
CN110852096B (en) Method for automatically generating Chinese literature reviews
US8498983B1 (en) Assisting search with semantic context and automated search options
CN108920455A (en) A kind of Chinese automatically generates the automatic evaluation method of text
Efat et al. Automated Bangla text summarization by sentence scoring and ranking
Cvrček et al. From extra-to intratextual characteristics: Charting the space of variation in Czech through MDA
JP2014106665A (en) Document retrieval device and document retrieval method
Lin et al. A simple but effective method for Indonesian automatic text summarisation
CN114706972A (en) Unsupervised scientific and technical information abstract automatic generation method based on multi-sentence compression
Albeer et al. Automatic summarization of YouTube video transcription text using term frequency-inverse document frequency
Mykowiecka et al. Recognition of irrelevant phrases in automatically extracted lists of domain terms
Pettersson et al. HistSearch-Implementation and Evaluation of a Web-based Tool for Automatic Information Extraction from Historical Text.
US6973423B1 (en) Article and method of automatically determining text genre using surface features of untagged texts
JP4428703B2 (en) Information retrieval method and system, and computer program
Sidhu et al. Role of machine translation and word sense disambiguation in natural language processing
Shaikh et al. An intelligent framework for e-recruitment system based on text categorization and semantic analysis
JP2002278982A (en) Information extracting method and information retrieving method
Suzen et al. LScDC-new large scientific dictionary
Nacinovic Prskalo et al. Identification of Metaphorical Collocations in Different Languages–Similarities and Differences
Erbs et al. Hierarchy identification for automatically generating table-of-contents
Osochkin et al. Comparative research of index frequency-Morphological methods of automatic text summarisation
Zarrad et al. Concepts extraction based on HTML documents structure
Li et al. PolyU at TAC 2008.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant