CN109815328A - Abstract generation method and device - Google Patents

Abstract generation method and device

Info

Publication number
CN109815328A
Authority
CN
China
Prior art keywords
abstract
sentence
collection
diversity factor
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811626213.XA
Other languages
Chinese (zh)
Other versions
CN109815328B (en)
Inventor
董超
崔朝辉
赵立军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp
Priority to CN201811626213.XA
Publication of CN109815328A
Application granted
Publication of CN109815328B
Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an abstract generation method and device. The method comprises: obtaining multiple abstract sets, where each abstract set includes the abstract sentences of the desired text in its corresponding time segment and any two abstract sets correspond to different time segments; obtaining, according to the abstract sentences included in the abstract sets, the difference degree of a first abstract set relative to a second abstract set, which gives the abstract difference degree corresponding to the first abstract set; selecting, based on each obtained abstract difference degree, the abstract sets that meet a first screening condition from the multiple abstract sets, thereby rejecting abstract sets that largely repeat or are redundant with other abstract sets; and combining the abstract sentences included in the selected abstract sets to generate an abstract. This reduces repetitive and redundant abstract content while preserving abstract coverage, and helps readers accurately grasp the central idea of the desired text and the development of events.

Description

Abstract generation method and device
Technical field
This application relates to the field of natural language processing, and in particular to an abstract generation method and device.
Background
Automatic abstracting technology takes a given original long text, automatically extracts its keywords (or key sentences), and then organizes the extracted keywords (or key sentences) into a short text by certain rules or means, in order to summarize the central idea of the original long text. However, in today's age of information explosion, people encounter massive amounts of text every day; for example, large volumes of news text are produced daily by different media and channels. In this context, traditional abstract generation for a single long text is often of little help in grasping key information promptly and accurately, and has little room to be useful. What is needed is a technique that generates an abstract from multiple long texts.
Multi-text abstract (multi-document summarization, MDS) technology emerged in response: it takes multiple long texts as input and automatically generates an abstract text of a specified length as required. Multi-text abstracting can be divided into static automatic text abstracting, which has no time order, and dynamic automatic text abstracting, which is time-ordered.
Take news abstracts as an example. The occurrence and follow-up of a news topic do not fall within a single time segment; in the early stage of an event, new developments may appear every day. What people often really need is therefore an abstract of the same topic or event organized in time order according to how it developed, i.e., dynamic automatic text abstracting. However, traditional dynamic multi-text abstracting first produces a static text abstract for the data in each time segment and then simply splices the resulting static abstracts together in chronological order, which produces a great deal of repetitive and redundant abstract content.
Summary of the invention
In view of this, the embodiments of the present application provide an abstract generation method and device that can solve the problem in the prior art that extracted abstracts contain repetitive and redundant abstract content.
The abstract generation method provided by the first aspect of the embodiments of the present application comprises:
Obtaining multiple abstract sets; each abstract set includes the abstract sentences of the desired text in its corresponding time segment, and any two abstract sets correspond to different time segments;
Obtaining, according to the abstract sentences included in the abstract sets, the difference degree of a first abstract set relative to a second abstract set, to obtain the abstract difference degree corresponding to the first abstract set; the first abstract set and the second abstract set are any two of the multiple abstract sets;
Selecting, based on each obtained abstract difference degree, the abstract sets that meet a first screening condition from the multiple abstract sets;
Combining the abstract sentences included in the selected abstract sets to generate an abstract.
Optionally, determining the difference degree of the first abstract set relative to the second abstract set according to the abstract sentences included in the abstract sets, to obtain the abstract difference degree corresponding to the first abstract set, specifically comprises:
Obtaining the difference degree of each abstract sentence in the first abstract set relative to the second abstract set, based on the recurrence state, in the second abstract set, of the characters or character strings in that abstract sentence;
Aggregating the difference degrees of the abstract sentences in the first abstract set relative to the second abstract set, to obtain the abstract difference degree corresponding to the first abstract set.
Optionally, obtaining the difference degree of each abstract sentence in the first abstract set relative to the second abstract set, based on the recurrence state, in the second abstract set, of the characters or character strings in that abstract sentence, specifically comprises:
Extracting multiple character strings from a target abstract sentence according to a preset rule; the target abstract sentence is any abstract sentence in the first abstract set;
Counting the number of those character strings that do not recur in the second abstract set, to obtain a statistical value;
Obtaining the difference degree of the target abstract sentence relative to the second abstract set according to the statistical value and the number of the multiple character strings.
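The three steps above (extract character strings, count those that do not recur, relate the count to the total) can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the preset rule is assumed to be "all contiguous character bigrams", the difference degree is assumed to be the plain ratio of the statistical value to the number of extracted strings, and the names `char_ngrams` and `sentence_difference` are hypothetical.

```python
from typing import List


def char_ngrams(sentence: str, n: int = 2) -> List[str]:
    """Assumed preset rule: every contiguous run of n non-space characters."""
    chars = [c for c in sentence if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def sentence_difference(target: str, second_set: List[str], n: int = 2) -> float:
    """Difference degree of one target sentence relative to the second abstract set."""
    strings = char_ngrams(target, n)
    if not strings:
        return 0.0  # assumption: a sentence too short to yield strings scores 0
    seen = set()
    for sentence in second_set:
        seen.update(char_ngrams(sentence, n))
    # Statistical value: extracted strings that do not recur in the second set.
    statistical_value = sum(1 for s in strings if s not in seen)
    # Assumed combination: ratio of the statistical value to the string count.
    return statistical_value / len(strings)
```

With this ratio, a value of 0 means every extracted string recurs in the second abstract set, and a value of 1 means none does.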
Optionally, obtaining multiple abstract sets specifically comprises:
Obtaining the sentence segmentation result of the desired text in a first time segment, to obtain a first sentence set; the first time segment is the time segment corresponding to the first abstract set;
Extracting topic words from the first sentence set using a topic model, to obtain a first topic word set;
Obtaining the article relevance corresponding to each sentence in the first sentence set, according to the word frequency of sentence co-occurrence words in the desired text in the first time segment and the word frequency of topic co-occurrence words in the desired text in the first time segment; a sentence co-occurrence word is a character or character string that appears in two sentences of the first sentence set at the same time, a topic co-occurrence word is a character or character string that appears both in the first sentence set and in the topic words, and the article relevance represents the likelihood that a sentence reflects the central idea of the desired text in the first time segment;
Selecting, according to the article relevance, the sentences that meet a second screening condition from the first sentence set, to obtain the first abstract set.
Optionally, obtaining the article relevance corresponding to each sentence in the first sentence set, according to the word frequency of sentence co-occurrence words in the desired text in the first time segment and the word frequency of topic co-occurrence words in the desired text in the first time segment, specifically comprises:
Obtaining the article relevance iteratively using an optimization algorithm; for the i-th iteration:
Obtaining the article relevance of a target sentence according to: the word frequency, in the desired text in the first time segment, of the sentence co-occurrence words shared by the target sentence and each sentence in a sentence subset; the article relevance of each sentence in the sentence subset; the word frequency, in the desired text in the first time segment, of the topic co-occurrence words shared by the target sentence and each topic word in the first topic word set; and the article relevance of each topic word in the first topic word set;
Wherein the target sentence is any sentence in the first sentence set; the sentence subset includes the remaining sentences in the first sentence set other than the target sentence; and the article relevance of a topic word is obtained according to the word frequency of that topic word in the desired text in the first time segment and the article relevance of the sentences in the first sentence set that contain that topic word.
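One way to picture the iteration described above is a mutual-reinforcement loop between sentences and topic words. The sketch below is an assumption-heavy illustration: the text does not name the optimization algorithm or give the update formula, so a normalized, power-iteration-style update is swapped in here. Sentences are assumed to be pre-tokenized into words, a topic co-occurrence word is simplified to "the topic word itself appears in the sentence", and `article_relevance`, `normalize`, `freq` and `iterations` are hypothetical names.

```python
from typing import Dict, List


def normalize(values: List[float]) -> List[float]:
    """Scale scores to sum to 1; leave an all-zero vector unchanged."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else values


def article_relevance(sentences: List[List[str]], topic_words: List[str],
                      freq: Dict[str, int], iterations: int = 20) -> List[float]:
    n, m = len(sentences), len(topic_words)
    tokens = [set(s) for s in sentences]
    # Sentence-sentence weight: total frequency of the words two sentences share.
    w_ss = [[sum(freq.get(w, 0) for w in tokens[i] & tokens[j]) if i != j else 0
             for j in range(n)] for i in range(n)]
    # Sentence-topic weight: frequency of the topic word if the sentence contains it.
    w_st = [[freq.get(t, 0) if t in tokens[i] else 0 for t in topic_words]
            for i in range(n)]
    r_s = [1.0 / n] * n if n else []
    r_t = [1.0 / m] * m if m else []
    for _ in range(iterations):
        # A sentence gains relevance from the other sentences and the topic words.
        new_s = [sum(w_ss[i][j] * r_s[j] for j in range(n))
                 + sum(w_st[i][k] * r_t[k] for k in range(m)) for i in range(n)]
        # A topic word gains relevance from the sentences that contain it.
        new_t = [sum(w_st[i][k] * r_s[i] for i in range(n)) for k in range(m)]
        r_s, r_t = normalize(new_s), normalize(new_t)
    return r_s
```

Under these assumptions, a sentence that shares frequent words with other relevant sentences and contains relevant topic words ends up with a higher score, while a sentence that shares nothing with the rest of the set drifts toward zero.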
Optionally, selecting, based on each obtained abstract difference degree, the abstract sets that meet the first screening condition from the multiple abstract sets specifically comprises:
Aggregating the article relevance of the abstract sentences in the first abstract set, to obtain the article relevance corresponding to the first abstract set;
Selecting, based on each obtained abstract difference degree and each article relevance, the abstract sets that meet the first screening condition from the multiple abstract sets.
Optionally, after obtaining the multiple abstract sets, the method further comprises:
Obtaining the novelty of a target abstract sentence based on the recurrence state, in the abstract sentences of a first abstract subset, of the characters or character strings in the target abstract sentence; the target abstract sentence is any abstract sentence in the first abstract set, and the first abstract subset includes the remaining abstract sentences in the first abstract set other than the target abstract sentence;
Aggregating the novelty of the abstract sentences in the first abstract set, to obtain the novelty of the first abstract set;
In that case, selecting, based on each obtained abstract difference degree, the abstract sets that meet the first screening condition from the multiple abstract sets specifically comprises:
Selecting, based on each obtained abstract difference degree and the novelty of each abstract set, the abstract sets that meet the first screening condition from the multiple abstract sets.
Optionally, after obtaining the novelty of the abstract sentences in the first abstract set, the method further comprises:
Rejecting, based on the novelty of each abstract sentence in the first abstract set, the abstract sentences in the first abstract set that do not meet a third screening condition.
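The novelty steps above can be illustrated with a short sketch. It rests on assumptions that the text leaves open: character strings are taken to be character bigrams, the novelty of a sentence is the share of its bigrams that do not recur in the other sentences of the same set, the set-level novelty is the mean, and the third screening condition is a minimum-novelty threshold. All function names are hypothetical.

```python
from statistics import mean
from typing import List


def bigrams(sentence: str) -> List[str]:
    return [sentence[i:i + 2] for i in range(len(sentence) - 1)]


def novelty(index: int, abstract_set: List[str]) -> float:
    """Share of the target sentence's bigrams absent from the rest of its own set."""
    others = abstract_set[:index] + abstract_set[index + 1:]
    strings = bigrams(abstract_set[index])
    if not strings:
        return 0.0
    fresh = sum(1 for s in strings if not any(s in other for other in others))
    return fresh / len(strings)


def set_novelty(abstract_set: List[str]) -> float:
    """Assumed aggregation: mean novelty over the sentences of the set."""
    scores = [novelty(i, abstract_set) for i in range(len(abstract_set))]
    return mean(scores) if scores else 0.0


def reject_low_novelty(abstract_set: List[str], threshold: float) -> List[str]:
    """Assumed third screening condition: keep sentences with novelty >= threshold."""
    return [s for i, s in enumerate(abstract_set)
            if novelty(i, abstract_set) >= threshold]
```

Note that with this definition two identical sentences both score 0 and would both be rejected, so a practical implementation would likely need a rule for keeping one of them.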
The abstract generation device provided by the second aspect of the embodiments of the present application comprises: a first obtaining unit, a second obtaining unit, a screening unit and a combining unit;
The first obtaining unit is configured to obtain multiple abstract sets; each abstract set includes the abstract sentences of the desired text in its corresponding time segment, and any two abstract sets correspond to different time segments;
The second obtaining unit is configured to obtain, according to the abstract sentences included in the abstract sets, the difference degree of a first abstract set relative to a second abstract set, to obtain the abstract difference degree corresponding to the first abstract set; the first abstract set and the second abstract set are any two of the multiple abstract sets;
The screening unit is configured to select, based on each obtained abstract difference degree, the abstract sets that meet the first screening condition from the multiple abstract sets;
The combining unit is configured to combine the abstract sentences included in the selected abstract sets to generate an abstract.
The third aspect of the embodiments of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any one of the abstract generation methods provided by the first aspect above.
The fourth aspect of the embodiments of the present application further provides an abstract generation apparatus, comprising a processor and a memory;
The memory is configured to store program code and to transmit the program code to the processor;
The processor is configured to execute, according to the instructions in the program code, any one of the abstract generation methods provided by the first aspect above.
Compared with the prior art, the present application has at least the following advantages:
In the embodiments of the present application, the abstract sets of the desired text in multiple different time segments are first obtained, giving multiple abstract sets. Then, according to the abstract sentences included in the abstract sets, the difference degree between any two of the multiple abstract sets is obtained, giving the abstract difference degree corresponding to each abstract set, which reflects how much each abstract set repeats the other abstract sets. Next, based on the abstract difference degree corresponding to each abstract set, the abstract sets that meet the first screening condition are selected from the multiple abstract sets. Because the gradual and changing nature of the abstract topic across time segments is taken into account, the repetition between the abstract sets of different time segments is assessed, and abstract sets that largely repeat or are redundant with other abstract sets are rejected before the abstract sentences in the selected abstract sets are combined into an abstract. This reduces repetitive and redundant abstract content while preserving abstract coverage, and helps readers accurately grasp the central idea of the desired text and the development of events.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of an abstract generation method provided by an embodiment of the present application;
Fig. 2 is a flow diagram of another abstract generation method provided by an embodiment of the present application;
Fig. 3 is a flow diagram of another abstract generation method provided by an embodiment of the present application;
Fig. 4 is a flow diagram of another abstract generation method provided by an embodiment of the present application;
Fig. 5 is a flow diagram of another abstract generation method provided by an embodiment of the present application;
Fig. 6 is a structural diagram of an abstract generation device provided by an embodiment of the present application.
Detailed description
To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.
It should be understood that, in the present application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following (items)" or a similar expression refers to any combination of those items, including any combination of single items or plural items. For example, at least one (item) of a, b or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may each be single or multiple.
In today's age of information explosion, people encounter massive amounts of text every day; for example, large volumes of news text are produced daily by different media and channels. Multi-text abstracting emerged so that readers can quickly and effectively obtain the central idea of articles and the development thread of a topic. However, when existing multi-text abstracting techniques produce an abstract meant to show how an event developed, they merely splice together, in chronological order, the abstract sentences extracted from the texts in different time segments. Yet events typically develop in such a way that content described in the desired text of a previous time segment may be repeated in the desired text of the current time segment; for example, the news on a given day may briefly recap the cause of the previous day's news event on the same topic. Because existing multi-text abstracting techniques do not consider the gradual and changing nature of the abstract topic across time segments, the abstracts of texts in different time segments contain much repetitive and redundant abstract content, and readers cannot grasp the development thread of the abstract topic promptly and effectively.
To this end, the embodiments of the present application provide an abstract generation method and device that take into account the gradual and changing nature of the abstract topic across time segments: the repetition between the abstract sets of two different time segments is assessed, abstract sets that largely repeat or are redundant with other abstract sets are rejected, and the abstract sentences in the remaining abstract sets are combined into an abstract. This reduces repetitive and redundant abstract content while preserving coverage of the abstract topic, and helps readers obtain the development thread of the topic effectively.
Based on the above idea, and to make the above objects, features and advantages of the present application clearer and easier to understand, specific embodiments of the present application are described in detail below with reference to the drawings.
Refer to Fig. 1, which is a flow diagram of an abstract generation method provided by an embodiment of the present application.
The abstract generation method provided by this embodiment of the present application comprises:
S101: Obtain multiple abstract sets.
Each abstract set includes the abstract sentences of the desired text in its corresponding time segment, and any two abstract sets correspond to different time segments.
It should be noted that time segments roughly divide the development stages of the abstract topic, and in a specific implementation they can be set as needed; for example, a time segment can be a few hours, one day, one week or one month, and the embodiments of the present application place no limit on this. In one example, the time segment can be set according to how fast the abstract topic develops. For example, when the abstract topic changes quickly (such as the proceedings of a conference), the time segment can be set shorter, such as a few hours or one day; when the abstract topic changes slowly (such as the development of train technology), the time segment can be set longer, such as one year.
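As a small illustration of how texts might be assigned to time segments of a configurable length, the following sketch buckets texts by a timestamp. It assumes each text carries a single `datetime` time attribute and that segments are equal-length windows counted from a chosen start time; the function name `group_by_segment` and its parameters are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def group_by_segment(texts: List[Tuple[datetime, str]], start: datetime,
                     length: timedelta) -> Dict[int, List[str]]:
    """Map each text to the index of the equal-length time segment it falls in."""
    segments = defaultdict(list)
    for timestamp, text in texts:
        # Floor division of two timedeltas yields the integer segment index.
        segments[(timestamp - start) // length].append(text)
    return dict(sorted(segments.items()))


if __name__ == "__main__":
    texts = [
        (datetime(2018, 12, 1, 9), "text a"),
        (datetime(2018, 12, 1, 20), "text b"),
        (datetime(2018, 12, 2, 8), "text c"),
    ]
    # One-day segments for a slowly changing topic, 12-hour segments for a fast one.
    print(group_by_segment(texts, datetime(2018, 12, 1), timedelta(days=1)))
    print(group_by_segment(texts, datetime(2018, 12, 1), timedelta(hours=12)))
```

Changing only `length` regroups the same texts, which mirrors the idea of choosing a shorter segment for fast-moving topics and a longer one for slow-moving topics.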
In the embodiments of the present application, the desired text is text related to the abstract topic and can be retrieved from a database (such as an Internet database) or a knowledge base. It should be noted that the embodiments of the present application do not limit the language of the desired text, which can be Chinese, English, Japanese, and so on. Each desired text has a corresponding time attribute, such as its writing time, its publication time, or the time at which the recorded content occurred; this time attribute determines which time segment the desired text belongs to. In a specific implementation, a search engine component can first search the database or knowledge base, according to the query word (i.e., the abstract topic) carried in a received query request, for texts related to the query word (for example, texts containing the query word), to obtain at least one desired text.
It can be understood that each abstract set includes the abstract sentences of the desired text in the corresponding time segment, and that combining the abstract sentences in the abstract sets yields an abstract that expresses the central idea or development thread of the abstract topic. How the abstract sentences of the desired text are obtained is described in detail later and is not elaborated here.
S102: According to the abstract sentences included in the abstract sets, determine the difference degree of the first abstract set relative to the second abstract set, to obtain the abstract difference degree corresponding to the first abstract set.
It can be understood that the first abstract set and the second abstract set are any two of the multiple abstract sets. In a specific implementation, the abstract difference degree can be obtained for each abstract set in order to reduce repeated or redundant content in the abstract.
In the embodiments of the present application, the abstract difference degree reflects how much two abstract sets differ in the content they express. Since the content expressed by each abstract set is carried by the abstract sentences it includes, the difference degree of the first abstract set relative to the second abstract set can be determined from the abstract sentences in the two sets. If the content expressed by the first abstract set differs greatly from that expressed by the second abstract set, i.e., the abstract sentences of the two sets share little or no repeated or redundant content, the abstract difference degree of the first abstract set is large; if the content expressed by the first abstract set differs little from that expressed by the second abstract set, i.e., the abstract sentences of the two sets share much repeated or redundant content, the abstract difference degree of the first abstract set is small.
It should be noted that content described in the desired text of a previous time segment may be repeated in the desired text of the current time segment; for example, the news on a given day may briefly recap the cause of the previous day's news event on the same topic. As a result, the abstract set of the previous time segment and the abstract set of the current time segment may contain repeated or redundant content. Therefore, to reduce the repetition and/or redundancy of the generated abstract, in some possible implementations of the embodiments of the present application, the start time of the time segment corresponding to the first abstract set can be later than the start time of the time segment corresponding to the second abstract set. As an example, the first abstract set can correspond to the current time segment (such as the current day) and the second abstract set to the previous time segment (such as the previous day).
S103: Based on each obtained abstract difference degree, select the abstract sets that meet the first screening condition from the multiple abstract sets.
In the embodiments of the present application, the abstract difference degree reflects how much two abstract sets differ in the content they express. The abstract difference degree of each abstract set therefore indicates how much the content expressed by that abstract set differs from the content expressed by the other abstract sets. On this basis, the abstract sets with larger abstract difference degrees (i.e., those meeting the first screening condition) can be selected from the multiple abstract sets, so that repeated or redundant content among the abstract sets is rejected and the combination yields an abstract with low repetition and redundancy that concisely summarizes the central idea and development thread of the abstract topic and is easy for readers to understand.
In practical applications, the first screening condition can be set as needed, for example that the abstract difference degree must be greater than a certain threshold; no limit is placed on it here.
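A threshold-based first screening condition can be sketched as below. The sketch makes several assumptions beyond the text: each abstract set is compared only with the set of the immediately preceding time segment, the earliest set is always kept because it has no predecessor, and the difference degree is a simple word-level stand-in (`word_difference`) rather than the character-string measure detailed later in the description. All names and the default threshold are hypothetical.

```python
from typing import List


def word_difference(first: List[str], second: List[str]) -> float:
    """Stand-in difference degree: share of the first set's words absent from the second."""
    first_words = {w for s in first for w in s.lower().split()}
    second_words = {w for s in second for w in s.lower().split()}
    if not first_words:
        return 0.0
    return len(first_words - second_words) / len(first_words)


def choose_sets(abstract_sets: List[List[str]], threshold: float = 0.5) -> List[int]:
    """Return the indices of the abstract sets (in time order) that pass the screening."""
    kept = []
    for i, current in enumerate(abstract_sets):
        if i == 0:
            kept.append(i)  # assumption: the earliest set has nothing to repeat
        elif word_difference(current, abstract_sets[i - 1]) > threshold:
            kept.append(i)
    return kept


if __name__ == "__main__":
    sets = [
        ["fire breaks out downtown"],
        ["fire breaks out downtown again"],
        ["rescue teams arrive"],
    ]
    print(choose_sets(sets))
```

In the example, the second set mostly repeats the first and falls below the threshold, while the third set introduces new content and is kept.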
S104: Combine the abstract sentences included in the selected abstract sets to generate an abstract.
In the embodiments of the present application, the content of a selected abstract set differs considerably from that of the other abstract sets (such as the abstract set of its previous time segment), with little repeated or redundant content. The abstract generated by combining the abstract sentences in the selected abstract sets therefore also contains correspondingly little repeated or redundant content, which reduces the repetition and redundancy of the abstract and helps readers accurately grasp the central idea of the desired text and the development of events from the generated abstract.
In practical applications, the selected abstract sets can be combined directly in the chronological order of their corresponding time segments, and the abstract sentences within an abstract set can be combined in the order in which they appear in the desired text; identical or similar abstract sentences within the same abstract set can be merged to reduce repeated content. The embodiments of the present application do not specifically limit how abstract sentences are combined, and the possibilities are not enumerated here.
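One possible combination strategy along these lines is sketched below. It assumes each abstract sentence is stored with its position in the desired text, that segments are identified by an integer index in time order, and that only exactly identical sentences within the same set are merged (merging merely similar sentences is left out). The function name `combine` and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def combine(selected: Dict[int, List[Tuple[int, str]]]) -> str:
    """Join the sentences of the selected abstract sets into one abstract.

    `selected` maps a segment index to (position in text, sentence) pairs.
    """
    parts: List[str] = []
    for segment in sorted(selected):            # chronological order of segments
        ordered = sorted(selected[segment])     # order of appearance in the text
        # dict.fromkeys drops exact repeats within the set and keeps first-seen order.
        parts.extend(dict.fromkeys(text for _, text in ordered))
    return " ".join(parts)


if __name__ == "__main__":
    selected = {
        2: [(0, "Rescue teams arrived.")],
        1: [(5, "Roads were closed."), (1, "A fire broke out."), (7, "Roads were closed.")],
    }
    print(combine(selected))
```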
In the embodiments of the present application, the abstract sets of the desired text in multiple different time segments are first obtained, giving multiple abstract sets. Then, according to the abstract sentences included in the abstract sets, the difference degree between any two of the multiple abstract sets is obtained, giving the abstract difference degree corresponding to each abstract set, which reflects how much each abstract set repeats the other abstract sets. Next, based on the abstract difference degree corresponding to each abstract set, the abstract sets that meet the first screening condition are selected from the multiple abstract sets. Because the gradual and changing nature of the abstract topic across time segments is taken into account, the repetition between the abstract sets of different time segments is assessed, and abstract sets that largely repeat or are redundant with other abstract sets are rejected before the abstract sentences in the selected abstract sets are combined into an abstract. This reduces repetitive and redundant abstract content while preserving abstract coverage, and helps readers accurately grasp the central idea of the desired text and the development of events.
The above describes how abstract sets that meet the first screening condition are selected from the multiple abstract sets according to their abstract difference degrees and combined to generate an abstract. The following takes the first abstract set (i.e., any one of the multiple abstract sets) as an example to explain how the abstract difference degree of an abstract set is obtained; the abstract difference degrees of the other abstract sets are obtained in a similar way and are not described again.
Refer to Fig. 2, which is a flow diagram of another abstract generation method provided by an embodiment of the present application.
In some possible implementations of the embodiments of the present application, step S102 can specifically include:
S201: Based on the recurrence state, in the second abstract set, of the characters or character strings in each abstract sentence of the first abstract set, obtain the difference degree of each abstract sentence in the first abstract set relative to the second abstract set.
In the embodiments of the present application, the recurrence state is simply whether something appears or does not appear. When a character or character string (such as one word or several consecutive words) in an abstract sentence of the first abstract set appears in an abstract sentence of the second abstract set, its recurrence state is "yes"; when it does not appear in any abstract sentence of the second abstract set, its recurrence state is "no".
It can be understood that the difference degree reflects how much the two differ in content. When the characters or character strings in an abstract sentence of the first abstract set appear in the abstract sentences of the second abstract set, that abstract sentence shares repeated or redundant content with the second abstract set and its difference degree is small; conversely, when they do not appear in the abstract sentences of the second abstract set, the abstract sentence shares no such repeated or redundant content with the second abstract set and its difference degree is large. Therefore, the difference degree of each abstract sentence in the first abstract set relative to the second abstract set can be obtained from the recurrence state, in the second abstract set, of the characters or character strings in the abstract sentences of the first abstract set.
In some possible implementations of the embodiments of the present application, as shown in Fig. 3, step S201 may specifically include:
S2011: extract multiple character strings from the target abstract sentence according to a preset rule.
It can be understood that the target abstract sentence is any abstract sentence in the first abstract collection. In practical applications, the preset rule can be set specifically to determine the length of the extracted character strings and the extraction method.
As an example, to make the diversity factor statistics accurate and comprehensive, the preset rule may be an N-gram rule; that is, character strings are extracted from the target abstract sentence using the N-gram rule.
N-gram: in natural language processing (NLP), a sequence of N items in a given piece of text. An item may be a letter, a word, and so on. When N=1 it is called a unigram; when N=2, a bigram; when N=3, a trigram; and so on. Taking the trigram as an example, starting from the first character of the target abstract sentence, the 1st to 3rd characters are taken as one character string, the 2nd to 4th characters as another, the 3rd to 5th characters as another, and so on, until the (n-2)-th to n-th characters are taken, where n is the number of characters in the target abstract sentence. This yields the multiple character strings extracted from the target abstract sentence.
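The sliding-window extraction described above can be sketched in a few lines of Python. The function name `char_ngrams` and the choice of characters (rather than segmented words) as items are illustrative assumptions, not something the text prescribes.

```python
def char_ngrams(sentence: str, n: int = 3) -> list[str]:
    """Return every run of n consecutive characters in `sentence`, in order.

    A sentence shorter than n characters yields an empty list.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]


# Trigrams of a six-character string: windows 1-3, 2-4, 3-5 and 4-6.
print(char_ngrams("abcdef", 3))  # ['abc', 'bcd', 'cde', 'def']
```

For a sentence of n characters this produces n-2 trigrams, matching the last window (n-2)-th to n-th described above.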
S2012: count the number of character strings, among the multiple character strings of the target abstract sentence, that are not reproduced in the second abstract collection, to obtain a statistical value.
In practical applications, the intersection of the character strings of the target abstract sentence and the character strings of the abstract sentences in the second abstract collection may be computed first; the number of character strings of the target abstract sentence that do not belong to this intersection is then counted to obtain the statistical value.
It can be understood that the more character strings of the target abstract sentence are not reproduced in the second abstract collection, the larger the diversity factor between the target abstract sentence and the second abstract collection.
S2013: according to the statistical value and the number of character strings of the target abstract sentence, obtain the diversity factor of the target abstract sentence with respect to the second abstract collection.
In the embodiments of the present application, the statistical value is positively correlated with the diversity factor of the target abstract sentence with respect to the second abstract collection. The larger the statistical value, the fewer character strings of the target abstract sentence are reproduced in the second abstract collection, and the larger the diversity factor between the target abstract sentence and the second abstract collection. Conversely, the smaller the statistical value, the more character strings of the target abstract sentence are reproduced in the second abstract collection, and the smaller the diversity factor.
As an example, the diversity factor of the target abstract sentence with respect to the second abstract collection may be obtained as the ratio of the number of character strings of the target abstract sentence that are not reproduced in the second abstract collection to the total number of character strings of the target abstract sentence.
Then, the diversity factor con of the target abstract sentence with respect to the second abstract collection can be obtained according to the following formula (1):
In the formula, $s_{i,j}$ is the j-th abstract sentence in the abstract collection corresponding to the i-th time slice, $DS_{i-1}$ is the abstract collection corresponding to the (i-1)-th time slice, n-gram is a character or character string, and $\lVert \cdot \rVert$ denotes taking the count.
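The ratio described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's reference implementation: the function names are hypothetical, character trigrams are used as the character strings, and distinct n-grams (sets) are counted rather than occurrences.

```python
def ngram_set(text: str, n: int = 3) -> set[str]:
    """Distinct character n-grams of `text` (assumed extraction rule)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def sentence_diversity(sentence: str, other_collection: list[str], n: int = 3) -> float:
    """Share of the sentence's n-grams that never appear in `other_collection`."""
    grams = ngram_set(sentence, n)
    if not grams:
        return 0.0  # assumption: a sentence too short to yield n-grams scores 0
    seen: set[str] = set()
    for other in other_collection:
        seen |= ngram_set(other, n)
    # grams - seen holds the n-grams that are not reproduced (the statistical value)
    return len(grams - seen) / len(grams)


print(sentence_diversity("abcd", ["abcx"]))  # 0.5: 'bcd' is new, 'abc' is reproduced
```

A value of 0 means every character string of the sentence is reproduced in the other collection; a value of 1 means none is.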
S202: combine the diversity factors of each abstract sentence in the first abstract collection with respect to the second abstract collection to obtain the abstract diversity factor corresponding to the first abstract collection.
It can be understood that the diversity factor of the first abstract collection with respect to the second abstract collection (that is, the abstract diversity factor) depends on how the character strings in the abstract sentences of the first abstract collection are reproduced in the second abstract collection. Therefore, combining the diversity factors of each abstract sentence in the first abstract collection with respect to the second abstract collection gives the abstract diversity factor corresponding to the first abstract collection.
As an example, the abstract diversity factor C corresponding to the first abstract collection can be obtained according to the following formula (2):
In the formula, $DS_i$ is the abstract collection corresponding to the i-th time slice, $s_{i,a}$ is the a-th abstract sentence in the abstract collection corresponding to the i-th time slice, and N is the total number of abstract sentences contained in $DS_i$.
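One form consistent with the symbol definitions given for formulas (1) and (2) is sketched below. It is a reconstruction under assumptions rather than the published expressions: formula (1) follows the ratio described in the example for step S2013, and formula (2) assumes that "combining" means averaging over the N abstract sentences, which the text does not state explicitly. The `\tag` command assumes the `amsmath` package.

```latex
\[
\mathrm{con}(s_{i,j}, DS_{i-1}) =
  \frac{\lVert \{\, \text{n-gram} \in s_{i,j} : \text{n-gram} \notin DS_{i-1} \,\} \rVert}
       {\lVert \{\, \text{n-gram} \in s_{i,j} \,\} \rVert}
\tag{1}
\]
\[
C(DS_i) = \frac{1}{N} \sum_{a=1}^{N} \mathrm{con}(s_{i,a}, DS_{i-1})
\tag{2}
\]
```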
The above describes how abstract collections meeting the first screening condition are selected from the multiple abstract collections according to their abstract diversity factors and combined to generate the abstract, and how the abstract diversity factor is obtained. Below, taking the first abstract collection (that is, any one of the multiple abstract collections) as an example, it is described how an abstract collection is obtained. The other abstract collections are obtained in a similar way and are not described again.
Refer to Fig. 4, which is a flow diagram of another abstract generating method provided by the embodiments of the present application.
In some possible implementations of the embodiments of the present application, step S101 may specifically include:
S401: obtain the sentence-splitting result of the expectation text in the first time slice to obtain a first sentence set.
It can be understood that the first time slice is the time slice corresponding to the first abstract collection. After sentence splitting is performed on the expectation text in the first time slice, the first sentence set is obtained.
In practical applications, sentence splitting may be performed according to the punctuation marks in the expectation text. That is, punctuation marks such as the full stop, comma, exclamation mark, question mark, semicolon, and colon are used as separators, and each expectation text is split into sentences that each express an independent meaning (the sentence-splitting result), giving the first sentence set.
It should be noted that a sentence containing too few characters conveys correspondingly little information and cannot effectively and accurately summarize the central idea of the expectation text. Therefore, in some possible implementations of the embodiments of the present application, sentences whose character count is below a character-count threshold may be deleted from the first sentence set obtained by splitting. The threshold can be set according to actual needs, for example 10, and is not limited here.
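The punctuation-based splitting and the short-sentence filter can be sketched as follows. The separator list, the function name `split_sentences`, and the default threshold of 10 characters are assumptions chosen to mirror the description; treating the ASCII full stop as a separator would also split decimal numbers, so a production rule would likely need refinement.

```python
import re

# Full-width and ASCII forms of the full stop, comma, exclamation mark,
# question mark, semicolon, and colon (assumed separator set).
_SEPARATORS = r"[。，！？；：.,!?;:]"


def split_sentences(text: str, min_chars: int = 10) -> list[str]:
    """Split `text` on punctuation and drop fragments shorter than `min_chars`."""
    fragments = (part.strip() for part in re.split(_SEPARATORS, text))
    return [frag for frag in fragments if len(frag) >= min_chars]


print(split_sentences("short. this one is long enough, tiny!"))
# ['this one is long enough']
```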
In some possible implementations, the first sentence set may also be obtained by splitting the expectation text in the first time slice into paragraphs. The principle is similar to sentence splitting and is not repeated here.
S402: using a topic model, extract topic words from the first sentence set to obtain the first topic word set corresponding to the first time slice.
A topic word (descriptor) is a term of a controlled vocabulary used in indexing and retrieval to express the subject of a text, and is characterized by being conceptual and standardized. In the embodiments of the present application, topic words may be extracted from the first sentence set using any topic extraction method, for example the Latent Dirichlet Allocation (LDA) topic model. The extracted topic words constitute the first topic word set corresponding to the first time slice.
It should be noted that, taking the LDA topic model as an example, a typical topic model requires the number of topics to be set before clustering to obtain the topic words. In practical applications, however, the number of topics in each time slice is not fixed, so the LDA topic model is prone to error. Therefore, in some possible implementations of the embodiments of the present application, a Hierarchical Dirichlet Process (HDP) may be used to generate the topic word set corresponding to each time slice. HDP is a nonparametric model and a variant of the LDA topic model. Its advantage is that the number of topics need not be preset, which safeguards the accuracy of topic word extraction.
When HDP is used to generate the topic word set corresponding to each time slice, word segmentation may first be performed on the first sentence set, and the stop words in the segmentation result filtered out. Stop words are characters or words that are automatically filtered out before or after natural language data (or text) is processed in information retrieval, in order to save storage space and improve search efficiency. They may include modal particles, adverbs, prepositions, conjunctions, and so on, and can be set manually according to actual needs. Then topic modeling is performed with HDP to form a topic model, and the segmentation result with stop words removed is processed by the resulting topic model to obtain the first topic word set.
In some possible implementations, to improve the accuracy and processing efficiency of the abstract, no more than a certain number of topic words may be chosen from the extracted topic words in descending order of weight to obtain the first topic word set. The number of topic words chosen is not limited here.
S403: according to the word frequency of the sentence co-occurrence words in the expectation text in the first time slice and the word frequency of the topic co-occurrence words in the expectation text in the first time slice, obtain the article degree of correlation of each sentence in the first sentence set.
In the embodiments of the present application, a sentence co-occurrence word is a character or character string that appears in both of any two sentences of the first sentence set, and a topic co-occurrence word is a character or character string that appears in both a sentence of the first sentence set and a topic word. The article degree of correlation of a sentence represents the likelihood that the sentence reflects the central idea of the expectation text in the first time slice. The abstract sentences of the expectation text in the first time slice can be determined according to the article degree of correlation.
It can be understood that the characters or character strings in step S403 may be extracted directly or obtained according to a certain rule (such as the N-gram rule), which is not repeated here. The more often a character or character string occurs in the expectation text in the first time slice, the more likely it is that a sentence containing it expresses the central idea of the expectation text.
Therefore, according to the word frequency, in the expectation text in the first time slice, of the sentence co-occurrence words contained in a target sentence of the first sentence set (that is, any sentence in the first sentence set), the degree of correlation between the target sentence and the other sentences in the first sentence set can be obtained, which reflects the ability of the target sentence to summarize the content expressed by the other sentences in the first sentence set. According to the word frequency, in the expectation text in the first time slice, of the topic co-occurrence words contained in the target sentence, the degree of correlation between the target sentence and the topic words in the first topic word set can be obtained, which reflects the ability of the target sentence to express the central idea of the expectation text in the first time slice.
It should be noted that when there is no sentence co-occurrence word between the target sentence and another sentence in the sentence subset, the degree of correlation between the target sentence and that sentence may be set to 0; when the target sentence does not contain a topic word of the first topic word set, the degree of correlation between the target sentence and that topic word may be set to 0.
Then, by combining the ability of the target sentence to summarize the meaning expressed by the other sentences in the first sentence set with its ability to express the central idea of the expectation text in the first time slice, the likelihood that the target sentence reflects the central idea of the expectation text in the first time slice, that is, its article degree of correlation, is obtained.
In some possible implementations of the embodiments of the present application, the article degree of correlation may be obtained by iterating an optimization algorithm, according to the word frequency of the sentence co-occurrence words in the expectation text in the first time slice and the word frequency of the topic co-occurrence words in the expectation text in the first time slice. For example, with the goal of maximizing the overall quality of the abstract (which may include coverage, repetition, redundancy, and so on), a genetic algorithm may be used for optimization. As the name suggests, a genetic algorithm is an adaptive global optimization search that simulates the heredity and evolution of organisms in the natural environment. Drawing on the principles of genetics, it uses mechanisms such as natural selection, inheritance, and mutation to filter out individuals with higher fitness.
Then, for the i-th iteration of the optimization algorithm, step S403 may specifically include:
According to the word frequency, in the expectation text in the first time slice, of the sentence co-occurrence words of the target sentence and each sentence in the sentence subset; the article degree of correlation of each sentence in the sentence subset; the word frequency, in the expectation text in the first time slice, of the topic co-occurrence words of the target sentence and each topic word in the first topic word set; and the article degree of correlation of each topic word in the first topic word set, obtain the article degree of correlation of the target sentence.
In the embodiments of the present application, the target sentence is any sentence in the first sentence set, and the sentence subset includes the remaining sentences of the first sentence set other than the target sentence. Like the article degree of correlation of a sentence, the article degree of correlation of a topic word represents the likelihood that it reflects the central idea of the expectation text in the first time slice. It may be obtained from the word frequency of the topic word in the expectation text in the first time slice and the article degrees of correlation of the sentences in the first sentence set that contain the topic word.
In one example, the article degree of correlation of the target sentence can be obtained according to the following formula (3):
In the formula, $s_{\alpha,\beta}$ is the β-th sentence in the sentence set corresponding to the α-th time slice; $s_{\alpha,\gamma}$ is the γ-th sentence in the sentence subset corresponding to the α-th time slice; $rel(s_{\alpha,\beta})$ is the article degree of correlation of sentence $s_{\alpha,\beta}$; $rel(s_{\alpha,\gamma})$ is the article degree of correlation of sentence $s_{\alpha,\gamma}$; $t_{\alpha,k}$ is the k-th topic word in the topic word set corresponding to the α-th time slice; $rel(t_{\alpha,k})$ is the article degree of correlation of topic word $t_{\alpha,k}$; $W_{ST}(s_{\alpha,\beta}, t_{\alpha,k})$ is obtained from the word frequency, in the expectation text in the first time slice, of the topic co-occurrence words of sentence $s_{\alpha,\beta}$ and topic word $t_{\alpha,k}$; $W_{SS}(s_{\alpha,\beta}, s_{\alpha,\gamma})$ is obtained from the word frequency, in the expectation text in the first time slice, of the sentence co-occurrence words of sentence $s_{\alpha,\beta}$ and sentence $s_{\alpha,\gamma}$; M is the number of topic words in the topic word set corresponding to the α-th time slice; and P is the number of sentences in the sentence subset.
The article degree of correlation $rel(t_{\alpha,k})$ of topic word $t_{\alpha,k}$ can be obtained according to the following formula (4):
In the formula, $s_{\alpha,x}$ is the x-th sentence in the sentence set corresponding to the α-th time slice; $rel(s_{\alpha,x})$ is the article degree of correlation of sentence $s_{\alpha,x}$; $W_{TS}(t_{\alpha,k}, s_{\alpha,x})$ is obtained from the word frequency, in the expectation text in the first time slice, of the topic co-occurrence words of topic word $t_{\alpha,k}$ and sentence $s_{\alpha,x}$; $W_{TS}(t_{\alpha,k}, s_{\alpha,x})$ is equal to $W_{ST}(s_{\alpha,x}, t_{\alpha,k})$; and Q is the number of sentences in the sentence set corresponding to the α-th time slice.
As an example, $W_{ST}(s_{\alpha,\beta}, t_{\alpha,k})$ can be obtained according to the following formula (5):
In the formula, $w_t(s_{\alpha,\beta}, t_{\alpha,k})$ is obtained from the word frequency, in the expectation text in the first time slice, of the t-th topic co-occurrence word of sentence $s_{\alpha,\beta}$ and topic word $t_{\alpha,k}$, and may be the TF-IDF weight of that topic co-occurrence word.
It should be noted that when there is no topic co-occurrence word between sentence $s_{\alpha,\beta}$ and topic word $t_{\alpha,k}$ (that is, sentence $s_{\alpha,\beta}$ does not contain topic word $t_{\alpha,k}$), $W_{ST}(s_{\alpha,\beta}, t_{\alpha,k})$ may be set to 0.
As an example, $W_{SS}(s_{\alpha,\beta}, s_{\alpha,\gamma})$ can be obtained according to the following formula (6):
In the formula, $w_t(s_{\alpha,\beta}, s_{\alpha,\gamma})$ is obtained from the word frequency, in the expectation text in the first time slice, of the t-th sentence co-occurrence word of sentence $s_{\alpha,\beta}$ and sentence $s_{\alpha,\gamma}$, and may be the TF-IDF weight of that sentence co-occurrence word.
It should be noted that when there is no sentence co-occurrence word between sentence $s_{\alpha,\beta}$ and sentence $s_{\alpha,\gamma}$ (that is, the two sentences share no identical character string), $W_{SS}(s_{\alpha,\beta}, s_{\alpha,\gamma})$ may be set to 0.
TF-IDF, i.e., term frequency-inverse document frequency, is a common weighting technique in information retrieval and data mining, used to assess how important a word is to a document set or to one document in a corpus. The importance of a word increases in proportion to the number of times it occurs in the document, but decreases in inverse proportion to the frequency with which it occurs in the corpus.
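A minimal TF-IDF computation is sketched below to make the two opposing effects concrete. Several TF-IDF variants exist; this one assumes relative term frequency and an unsmoothed logarithmic inverse document frequency, and the function name `tf_idf` is illustrative. The text does not specify which variant is used.

```python
import math
from collections import Counter


def tf_idf(term: str, document: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF weight of `term` in `document`, one token list out of `corpus`."""
    if not document:
        return 0.0
    tf = Counter(document)[term] / len(document)       # grows with in-document count
    df = sum(1 for doc in corpus if term in doc)       # documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)             # shrinks as df approaches corpus size


corpus = [["a", "b"], ["a", "c"], ["a", "b", "b"]]
print(tf_idf("a", corpus[0], corpus))  # 0.0: "a" occurs in every document
print(tf_idf("c", corpus[1], corpus))  # positive: "c" occurs in one document only
```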
In practical applications, normalization may be applied to the article degree of correlation of each sentence in the first sentence set, and to the article degree of correlation of each topic word in the first topic word set.
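The mutual dependence between sentence scores and topic word scores can be illustrated with a simple fixed-point iteration. This sketch is an assumption-laden stand-in: formulas (3) and (4) are not reproduced in this text, so the update below (an unweighted sum of the sentence term and the topic term, followed by sum-to-one normalization) is only one plausible reading, and it uses plain repeated substitution instead of the genetic algorithm mentioned earlier. All names are hypothetical.

```python
def normalize(values: list[float]) -> list[float]:
    """Scale values to sum to 1; leave an all-zero list unchanged."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(values)


def mutual_reinforcement(w_ss, w_st, iterations: int = 20):
    """Iteratively score sentences and topic words.

    w_ss[b][g]: sentence-sentence co-occurrence weight (0 when nothing is shared).
    w_st[b][k]: sentence-topic co-occurrence weight (W_TS is its transpose).
    Returns (sentence_scores, topic_scores).
    """
    n_sent = len(w_ss)
    if n_sent == 0:
        return [], []
    n_topic = len(w_st[0])
    rel_s = [1.0 / n_sent] * n_sent
    rel_t = [1.0 / n_topic] * n_topic if n_topic else []
    for _ in range(iterations):
        # Topic word score: weighted by the sentences that contain it.
        new_t = normalize([
            sum(w_st[x][k] * rel_s[x] for x in range(n_sent))
            for k in range(n_topic)
        ])
        # Sentence score: other sentences (the sentence subset) plus topic words.
        new_s = normalize([
            sum(w_ss[b][g] * rel_s[g] for g in range(n_sent) if g != b)
            + sum(w_st[b][k] * rel_t[k] for k in range(n_topic))
            for b in range(n_sent)
        ])
        rel_s, rel_t = new_s, new_t
    return rel_s, rel_t
```

With two sentences that share content and a topic word, and a third isolated sentence, the isolated sentence's score drops to zero while the other two split the mass equally.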
S404: according to the article degree of correlation of the sentences, choose the sentences meeting the second screening condition from the first sentence set to obtain the first abstract collection.
It can be understood that the article degree of correlation of a sentence represents the likelihood that the sentence reflects the central idea of the expectation text in the first time slice. By using the article degree of correlation to select the sentences in the first sentence set that meet the second screening condition, the abstract sentences of the expectation text in the first time slice (that is, the first abstract collection) are obtained.
In practical applications, the sentences in the first sentence set may be sorted in descending order of article degree of correlation, and the first J sentences chosen to constitute the first abstract collection. J may be set according to the actual situation, and possible values are not listed one by one.
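Choosing the first J sentences by article degree of correlation amounts to a sort followed by a slice. The sketch below assumes the scores are already available as a list parallel to the sentences; the function name is illustrative.

```python
def select_top_sentences(sentences: list[str], relevance: list[float], j: int) -> list[str]:
    """Return the j sentences with the highest relevance, best first."""
    ranked = sorted(zip(sentences, relevance), key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in ranked[:j]]


print(select_top_sentences(["a", "b", "c"], [0.2, 0.9, 0.5], 2))  # ['b', 'c']
```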
It should also be noted that the above describes how an abstract collection is obtained. Since the article degree of correlation of each abstract sentence is obtained when the abstract collection is obtained, in some possible implementations of the embodiments of the present application, the abstract collections may also be screened according to the article degree of correlation of their abstract sentences, in order to improve the accuracy of the generated abstract and reject abstract sentences that cannot accurately express the central idea of the expectation text. Then, step S103 may specifically include:
Combine the article degrees of correlation of each abstract sentence in the first abstract collection to obtain the article degree of correlation corresponding to the first abstract collection; based on each obtained abstract diversity factor and each article degree of correlation, choose the abstract collections meeting the first screening condition from the multiple abstract collections.
As an example, the article degree of correlation corresponding to an abstract collection can be obtained according to the following formula (6):
In the formula, $R(DS_i)$ is the article degree of correlation of the abstract collection $DS_i$ corresponding to the i-th time slice, $rel(s_{i,y})$ is the article degree of correlation of the y-th abstract sentence in abstract collection $DS_i$, and V is the number of abstract sentences in abstract collection $DS_i$.
In a specific implementation, the abstract diversity factor and the article degree of correlation may be combined to judge whether each abstract collection meets the first screening condition, for example by judging whether the sum of the abstract diversity factor and the article degree of correlation of an abstract collection is greater than a certain threshold. In addition, corresponding weights may be set for the abstract diversity factor and the article degree of correlation of an abstract collection according to the actual situation.
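The threshold test with optional weights can be sketched as follows. The function name, the default weights of 1.0, and the strict "greater than" comparison are assumptions; the text leaves the exact screening condition open.

```python
def screen_collections(collections, diversity, relevance,
                       threshold: float, w_div: float = 1.0, w_rel: float = 1.0):
    """Keep collections whose weighted diversity + relevance exceeds `threshold`.

    `diversity` and `relevance` are lists parallel to `collections`.
    """
    kept = []
    for collection, div, rel in zip(collections, diversity, relevance):
        if w_div * div + w_rel * rel > threshold:
            kept.append(collection)
    return kept


# Only "A" scores above 0.8 (0.9 + 0.5); "B" and "C" fall short.
print(screen_collections(["A", "B", "C"], [0.9, 0.1, 0.6], [0.5, 0.2, 0.1], 0.8))
```

Raising `w_div` favors collections that differ from their neighbors in time; raising `w_rel` favors collections whose sentences track the central idea.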
The above describes in detail how the abstract collections, their abstract diversity factors, and their article degrees of correlation are obtained, so that abstract collections that repeat or are redundant with other abstract collections, and abstract collections that cannot accurately express the central idea of the expectation text, can be removed from the multiple abstract collections. In practical applications, however, the abstract sentences within a single abstract collection may themselves contain repeated or redundant content. For this reason, in some possible implementations of the embodiments of the present application, it may also be judged whether the abstract sentences within a single abstract collection are repeated or redundant, further reducing the repetition and redundancy of the generated abstract.
Specifically, refer to Fig. 5, which is a flow diagram of another abstract generating method provided by the embodiments of the present application.
On the basis of the abstract generating method provided above, taking the method shown in Fig. 1 as an example, in some possible implementations, after the multiple abstract collections are obtained (that is, after step S101), the method may further include:
S501: based on the reproduction state, in each abstract sentence of the first abstract subset, of the characters or character strings in the target abstract sentence, obtain the novelty degree of the target abstract sentence.
It can be understood that the target abstract sentence is any abstract sentence in the first abstract collection, and the first abstract subset includes the remaining abstract sentences of the first abstract collection other than the target abstract sentence.
In the embodiments of the present application, the reproduction state refers to whether the character strings in the target abstract sentence appear in the other sentences of the first abstract subset. In practical applications, the intersection of the characters or character strings of the target abstract sentence and those of the abstract sentences in the first abstract subset may be computed first; the novelty degree of the target abstract sentence is then obtained from the number of characters or character strings of the target abstract sentence that do not belong to this intersection.
The novelty degree reflects how much new content the target abstract sentence introduces relative to the other abstract sentences in the first abstract collection. The more characters or character strings of the target abstract sentence are not reproduced in the first abstract subset, the larger the novelty degree of the target abstract sentence. When the characters or character strings in the target abstract sentence appear in other sentences of the first abstract subset (that is, the target abstract sentence shares identical character strings with other sentences in the first abstract collection), the two sentences contain the same information and may have repeated or redundant content; the target abstract sentence introduces little new content and its novelty degree is low. Conversely, when the characters or character strings in the target abstract sentence do not appear in other sentences of the first abstract subset (that is, the target abstract sentence shares no identical character string with other sentences in the first abstract collection), the two sentences do not contain the same information; the target abstract sentence introduces more new content and its novelty degree is high. Therefore, the novelty degree of the target abstract sentence can be obtained from the reproduction state of its characters or character strings in each abstract sentence of the first abstract subset.
It should be noted that the characters or character strings described here may be extracted arbitrarily from the target abstract sentence, or extracted according to a certain rule (such as N-gram), which is not repeated here.
As an example, the novelty degree of the target abstract sentence may be obtained from the sum, over the abstract sentences in the first abstract subset, of the ratio of the number of character strings of the target abstract sentence that are not reproduced in that abstract sentence to the number of character strings of the target abstract sentence.
Then, the novelty degree of the target abstract sentence can be obtained according to the following formula (7):
In the formula, $s_{i,a}$ is the a-th abstract sentence in the abstract collection corresponding to the i-th time slice, $s_{i,b}$ is the b-th abstract sentence in the abstract subset corresponding to the i-th time slice, $nov(s_{i,a}, s_{i,b})$ is the novelty degree of abstract sentence $s_{i,a}$ with respect to abstract sentence $s_{i,b}$, H is the number of abstract sentences in the abstract subset corresponding to the i-th time slice, n-gram is a character or character string, and $\lVert \cdot \rVert$ denotes taking the count.
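The per-sentence novelty computation can be sketched as below. Dividing the sum of pairwise ratios by H (the size of the abstract subset) is an assumption suggested by the definition of H; the text itself only mentions the sum. The function names, the use of distinct character trigrams, and the value returned for a collection with a single sentence are likewise illustrative choices.

```python
def ngram_set(text: str, n: int = 3) -> set[str]:
    """Distinct character n-grams of `text` (assumed extraction rule)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def pairwise_novelty(a: str, b: str, n: int = 3) -> float:
    """Share of a's n-grams that do not appear in b."""
    grams = ngram_set(a, n)
    if not grams:
        return 0.0
    return len(grams - ngram_set(b, n)) / len(grams)


def sentence_novelty(index: int, collection: list[str], n: int = 3) -> float:
    """Average pairwise novelty of collection[index] against all other sentences."""
    others = [s for k, s in enumerate(collection) if k != index]
    if not others:
        return 1.0  # assumption: a lone sentence has nothing to repeat
    target = collection[index]
    return sum(pairwise_novelty(target, s, n) for s in others) / len(others)


print(sentence_novelty(0, ["abcd", "abcx", "wxyz"]))  # 0.75
```

In the example, "abcd" shares one of its two trigrams with "abcx" (ratio 0.5) and none with "wxyz" (ratio 1.0), giving an average of 0.75.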
In some possible designs, the abstract sentences in an abstract collection may be screened according to their novelty degree: abstract sentences whose novelty degree is below a set threshold are rejected, excluding repeated or redundant content from the abstract collection. Then, after step S501, the method may further include: based on the novelty degree of each abstract sentence in the first abstract collection, rejecting from the first abstract collection the abstract sentences that do not meet the third screening condition. In practical applications, the third screening condition can be set as appropriate, which is not repeated here.
S502: combine the novelty degrees of each abstract sentence in the first abstract collection to obtain the novelty degree of the first abstract collection.
It can be understood that the novelty degree of the first abstract collection depends on the novelty degrees of the abstract sentences it contains. Therefore, combining the novelty degrees of each abstract sentence in the first abstract collection gives the novelty degree of the first abstract collection.
As an example, the novelty degree of the first abstract collection can be obtained according to the following formula (8):
In the formula, $DS_i$ is the abstract collection corresponding to the i-th time slice, $s_{i,c}$ is the c-th abstract sentence in the abstract collection corresponding to the i-th time slice, and N is the total number of abstract sentences contained in $DS_i$.
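One form consistent with the symbol definitions given for formulas (7) and (8) is sketched below. It is a reconstruction under assumptions rather than the published expressions: the pairwise novelty is taken to be the ratio described in the example for step S501, and both the per-sentence and per-collection values are assumed to be averages (division by H and by N), which the text does not state explicitly. The symbol $Nov(DS_i)$ for the collection-level novelty degree is introduced here for illustration, and `\tag` assumes the `amsmath` package.

```latex
\[
\mathrm{nov}(s_{i,a}, s_{i,b}) =
  \frac{\lVert \{\, \text{n-gram} \in s_{i,a} : \text{n-gram} \notin s_{i,b} \,\} \rVert}
       {\lVert \{\, \text{n-gram} \in s_{i,a} \,\} \rVert},
\qquad
\mathrm{nov}(s_{i,a}) = \frac{1}{H} \sum_{b=1}^{H} \mathrm{nov}(s_{i,a}, s_{i,b})
\tag{7}
\]
\[
\mathrm{Nov}(DS_i) = \frac{1}{N} \sum_{c=1}^{N} \mathrm{nov}(s_{i,c})
\tag{8}
\]
```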
Then, to reduce the repetition and redundancy of the generated abstract, step S103 may specifically include:
Based on each obtained abstract diversity factor and the novelty degree of each abstract collection, choose the abstract collections meeting the first screening condition from the multiple abstract collections.
It can be understood that, based on the abstract diversity factor, abstract collections that repeat or are redundant with other abstract collections can be removed from the multiple abstract collections; based on the novelty degree, abstract collections whose own content is repeated or redundant can be removed, further reducing the repetition and redundancy of the generated abstract.
In practical applications, the first screening condition can be set according to specific needs. For example, the sum of the abstract diversity factor and the novelty degree may be required to be greater than a certain threshold, which is not limited here. In addition, corresponding weights may be set for the abstract diversity factor and the novelty degree respectively before the judgment is made.
In some possible implementations, to reduce the repetition and redundancy of the generated abstract, step S103 may also specifically include:
Based on each obtained abstract diversity factor and article degree of correlation and the novelty degree of each abstract collection, choose the abstract collections meeting the first screening condition from the multiple abstract collections.
It can be understood that, based on the abstract diversity factor, abstract collections that repeat or are redundant with other abstract collections can be removed from the multiple abstract collections; based on the article degree of correlation, abstract collections that cannot accurately express the central idea of the expectation text can be removed; and based on the novelty degree, abstract collections whose own content is repeated or redundant can be removed. The selected abstract collections can then describe the central idea of the expectation text accurately and concisely.
Based on the abstract generation method provided by the above embodiments, an embodiment of the present application further provides an abstract generation device.
Refer to Fig. 6, which is a schematic structural diagram of an abstract generation device provided by an embodiment of the present application.
The abstract generation device provided by the embodiment of the present application includes: a first obtaining unit 100, a second obtaining unit 200, a screening unit 300 and a combining unit 400;
The first obtaining unit 100 is configured to obtain multiple abstract sets; each abstract set includes abstract sentences of the expectation text in the corresponding time slice, and any two abstract sets correspond to different time slices;
The second obtaining unit 200 is configured to obtain, according to the abstract sentences included in the abstract sets, the diversity factor of a first abstract set with respect to a second abstract set, so as to obtain the abstract diversity factor corresponding to the first abstract set; the first abstract set and the second abstract set are any two of the multiple abstract sets;
The screening unit 300 is configured to select, based on each obtained abstract diversity factor, the abstract sets that meet the first screening condition from the multiple abstract sets;
The combining unit 400 is configured to combine the abstract sentences included in the selected abstract sets to generate the abstract.
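The cooperation of the four units can be sketched as a small class that chains four injected callables. This is only an architectural illustration: the class name, the field names, the type alias and the stub units in the usage part are hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumed representation: time slice index -> abstract sentences of that slice.
AbstractSets = Dict[int, List[str]]


@dataclass
class AbstractGenerationDevice:
    obtain_sets: Callable[[str], AbstractSets]                     # first obtaining unit 100
    obtain_diversity: Callable[[AbstractSets], Dict[int, float]]   # second obtaining unit 200
    screen: Callable[[AbstractSets, Dict[int, float]], List[int]]  # screening unit 300
    combine: Callable[[AbstractSets, List[int]], str]              # combining unit 400

    def generate(self, source: str) -> str:
        """Run the four units in the order described for the device."""
        sets = self.obtain_sets(source)
        diversity = self.obtain_diversity(sets)
        kept = self.screen(sets, diversity)
        return self.combine(sets, kept)


# Stub units, used only to show how the pieces connect.
device = AbstractGenerationDevice(
    obtain_sets=lambda source: {0: ["a."], 1: ["b."]},
    obtain_diversity=lambda sets: {0: 0.9, 1: 0.1},
    screen=lambda sets, div: [i for i in sorted(sets) if div[i] > 0.5],
    combine=lambda sets, kept: " ".join(s for i in kept for s in sets[i]),
)
print(device.generate("ignored by the stub units"))
```

Passing the units in as callables keeps each one replaceable, which mirrors the way the optional subunits described below refine individual units without changing the overall flow.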
In some possible implementations of the embodiments of the present application, the second obtaining unit 200 may specifically include: a diversity factor obtaining subunit and a synthesis subunit;
The diversity factor obtaining subunit is configured to obtain the diversity factor of each abstract sentence in the first abstract set with respect to the second abstract set, based on whether the characters or character strings in that abstract sentence reappear in the second abstract set;
The synthesis subunit is configured to aggregate the diversity factors of the abstract sentences in the first abstract set with respect to the second abstract set, so as to obtain the abstract diversity factor corresponding to the first abstract set.
In some possible implementations of the embodiments of the present application, the diversity factor obtaining subunit may specifically include: an extraction subunit, a counting subunit and an obtaining subunit;
The extraction subunit is configured to extract multiple character strings from a target abstract sentence according to a preset rule; the target abstract sentence is any abstract sentence in the first abstract set;
The counting subunit is configured to count, among the multiple character strings, the number of character strings that do not reappear in the second abstract set, so as to obtain a statistical value;
The obtaining subunit is configured to obtain the diversity factor of the target abstract sentence with respect to the second abstract set according to the statistical value and the number of the multiple character strings.
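The extraction, counting and obtaining steps can be sketched as follows. The sketch assumes that the preset rule is character n-gram extraction, that the diversity factor of a sentence is the ratio of the statistical value to the number of extracted strings, and that the synthesis step is a plain average; none of these choices is fixed by the text above, and the function names are hypothetical.

```python
def extract_ngrams(sentence, n=2):
    """Extract character n-grams, ignoring whitespace (assumed preset rule)."""
    chars = [ch for ch in sentence if not ch.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def sentence_diversity(target_sentence, second_set, n=2):
    """Share of the target sentence's strings that do not reappear in the second set."""
    grams = extract_ngrams(target_sentence, n)
    if not grams:
        return 0.0
    seen = set()
    for sentence in second_set:
        seen.update(extract_ngrams(sentence, n))
    statistical_value = sum(1 for g in grams if g not in seen)
    return statistical_value / len(grams)


def set_diversity(first_set, second_set, n=2):
    """Average the per-sentence values (the aggregation rule is an assumption)."""
    if not first_set:
        return 0.0
    return sum(sentence_diversity(s, second_set, n) for s in first_set) / len(first_set)


print(sentence_diversity("abcd", ["abxx"]))       # "bc" and "cd" do not reappear
print(set_diversity(["abcd", "wxyz"], ["abcd"]))  # one repeated, one new sentence
```

Note that the value is directional: set_diversity(A, B) measures how much of A is absent from B, which matches the wording "diversity factor of the first abstract set with respect to the second abstract set".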
In some possible implementations of the embodiments of the present application, the first obtaining unit 100 may specifically include: a sentence-splitting obtaining subunit, a topic obtaining subunit, a degree-of-correlation obtaining subunit and an abstract sentence selecting subunit;
The sentence-splitting obtaining subunit is configured to obtain the sentence-splitting result of the expectation text in a first time slice, so as to obtain a first sentence set; the first time slice is the time slice corresponding to the first abstract set;
The topic obtaining subunit is configured to extract topic words from the first sentence set by using a topic model, so as to obtain a first topic word set;
The degree-of-correlation obtaining subunit is configured to obtain the article degree of correlation corresponding to each sentence in the first sentence set, according to the word frequency of sentence co-occurrence words in the expectation text in the first time slice and the word frequency of topic co-occurrence words in the expectation text in the first time slice; a sentence co-occurrence word is a character or character string that appears simultaneously in any two sentences of the first sentence set, a topic co-occurrence word is a character or character string that appears simultaneously in the first sentence set and in the topic words, and the article degree of correlation represents the likelihood of reflecting the central idea of the expectation text in the first time slice;
The abstract sentence selecting subunit is configured to select, according to the article degree of correlation, the sentences that meet a second screening condition from the first sentence set, so as to obtain the first abstract set.
Optionally, the degree-of-correlation obtaining subunit may be specifically configured to:
Obtain the article degree of correlation iteratively by using an optimization algorithm, where for the i-th iteration:
The article degree of correlation of a target sentence is obtained according to: the word frequency, in the expectation text in the first time slice, of the sentence co-occurrence words of the target sentence and each sentence in a sentence subset; the article degree of correlation of each sentence in the sentence subset; the word frequency, in the expectation text in the first time slice, of the topic co-occurrence words of the target sentence and each topic word in the first topic word set; and the article degree of correlation of each topic word in the first topic word set;
Wherein the target sentence is any abstract sentence in the first sentence set, the sentence subset includes the remaining abstract sentences in the first sentence set other than the target sentence, and the article degree of correlation of a topic word is obtained according to the word frequency of the topic word in the expectation text in the first time slice and the article degree of correlation of the sentences in the first sentence set that include the topic word.
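The iteration described above can be illustrated with a mutual-reinforcement sketch in which sentence scores and topic word scores update each other. This is not the optimization algorithm of the embodiment: the update rule, the normalization, the fixed iteration count, the pre-tokenized input and the function name are all assumptions made for illustration.

```python
from collections import Counter


def iterate_relevance(sentences, topic_words, iterations=20):
    """Return one relevance score per sentence.

    sentences: list of token lists for one time slice (tokenization is assumed).
    topic_words: tokens assumed to come from a topic model.
    """
    n = len(sentences)
    if n == 0:
        return []
    freq = Counter(tok for sent in sentences for tok in sent)  # word frequency in the slice
    token_sets = [set(sent) for sent in sentences]
    sent_rel = [1.0 / n] * n

    for _ in range(iterations):
        # Topic word score: its frequency times the scores of the sentences containing it.
        topic_rel = {
            w: freq[w] * sum(sent_rel[j] for j in range(n) if w in token_sets[j])
            for w in topic_words
        }
        total = sum(topic_rel.values()) or 1.0
        topic_rel = {w: v / total for w, v in topic_rel.items()}

        new_rel = []
        for i in range(n):
            score = 0.0
            for j in range(n):
                if j == i:
                    continue
                shared = token_sets[i] & token_sets[j]  # sentence co-occurrence words
                score += sum(freq[w] for w in shared) * sent_rel[j]
            # Topic co-occurrence words: topic words that appear in sentence i.
            score += sum(freq[w] * topic_rel[w] for w in topic_words if w in token_sets[i])
            new_rel.append(score)
        total = sum(new_rel) or 1.0
        sent_rel = [v / total for v in new_rel]
    return sent_rel


sentences = [
    ["flood", "hits", "city"],
    ["city", "flood", "rescue"],
    ["stock", "market", "rises"],
]
print(iterate_relevance(sentences, ["flood", "city"]))
```

In this toy input the third sentence shares no word with the others or with the topic words, so its score drops to zero, while the two flood sentences split the remaining weight equally.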
In some possible implementations of the embodiments of the present application, the screening unit 300 may be specifically configured to:
Aggregate the article degrees of correlation of the abstract sentences in the first abstract set to obtain the article degree of correlation corresponding to the first abstract set; and select, based on each obtained abstract diversity factor and each article degree of correlation, the abstract sets that meet the first screening condition from the multiple abstract sets.
In some possible implementations of the embodiments of the present application, the device may further include: a third obtaining unit and a synthesis unit;
The third obtaining unit is configured to obtain the novelty degree of a target abstract sentence based on whether the characters or character strings in the target abstract sentence reappear in each abstract sentence of a first abstract subset; the target abstract sentence is any abstract sentence in the first abstract set, and the first abstract subset includes the remaining abstract sentences in the first abstract set other than the target abstract sentence;
The synthesis unit is configured to aggregate the novelty degrees of the abstract sentences in the first abstract set to obtain the novelty degree of the first abstract set;
Then, the screening unit 300 may be specifically configured to:
Select, based on each obtained abstract diversity factor and the novelty degree of each abstract set, the abstract sets that meet the first screening condition from the multiple abstract sets.
In some possible implementations of the embodiments of the present application, the screening unit 300 may be further configured to remove, based on the novelty degree of each abstract sentence in the first abstract set, the abstract sentences that do not meet a third screening condition from the first abstract set.
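The novelty degree and the sentence-level rejection can be sketched in the same spirit. The sketch assumes character bigrams as the compared strings, an average as the set-level aggregation, and a minimum-novelty threshold as the third screening condition; these choices and the function names are hypothetical.

```python
def char_ngrams(sentence, n=2):
    """Character n-grams of a sentence, ignoring whitespace (assumed string rule)."""
    chars = [ch for ch in sentence if not ch.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def sentence_novelty(index, abstract_set, n=2):
    """Share of the sentence's strings that do not reappear in the rest of its own set."""
    grams = char_ngrams(abstract_set[index], n)
    if not grams:
        return 0.0
    seen = set()
    for j, other in enumerate(abstract_set):
        if j != index:
            seen.update(char_ngrams(other, n))
    return sum(1 for g in grams if g not in seen) / len(grams)


def set_novelty(abstract_set, n=2):
    """Average sentence novelty (the aggregation rule is an assumption)."""
    if not abstract_set:
        return 0.0
    total = sum(sentence_novelty(i, abstract_set, n) for i in range(len(abstract_set)))
    return total / len(abstract_set)


def drop_low_novelty(abstract_set, min_novelty=0.3, n=2):
    """Remove sentences below the assumed threshold (third screening condition)."""
    return [
        s for i, s in enumerate(abstract_set)
        if sentence_novelty(i, abstract_set, n) >= min_novelty
    ]


example = ["abcd", "abcd", "wxyz"]
print(set_novelty(example))
print(drop_low_novelty(example))
```

One consequence of scoring every sentence against all the others is that two identical sentences both score zero and are both removed; an implementation that wants to keep one copy would need a greedy pass instead, which the text above does not specify.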
In the embodiments of the present application, the abstract sets of the expectation text in multiple different time slices are first obtained, yielding multiple abstract sets. Then, according to the abstract sentences included in the abstract sets, the diversity factor between any two of the multiple abstract sets is obtained, yielding the abstract diversity factor corresponding to each abstract set, which reflects how much each abstract set repeats the other abstract sets. Next, based on the abstract diversity factor corresponding to each abstract set, the abstract sets that meet the first screening condition are selected from the multiple abstract sets. This takes into account the gradual evolution and variability of the abstract topic across different time slices: the repetition between the abstract sets of different time slices is assessed, the abstract sets that are highly repetitive or redundant with respect to other abstract sets are removed, and the abstract sentences in the selected abstract sets are combined to generate the abstract. Repetitive and redundant abstract content can thus be reduced while abstract coverage is maintained, which helps readers accurately grasp the central idea of the expectation text and the development of events.
Based on the abstract generation method and device provided by the above embodiments, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, any one of the abstract generation methods provided by the above embodiments is implemented.
Based on the abstract generation method and device provided by the above embodiments, an embodiment of the present application further provides abstract generation equipment, including: a processor and a memory; the memory is configured to store program code and transmit the program code to the processor; the processor is configured to execute, according to instructions in the program code, any one of the abstract generation methods provided by the above embodiments.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the system or device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
It should also be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The above are only preferred embodiments of the present application and do not limit the application in any form. Although the application has been disclosed above by way of preferred embodiments, they are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present application, use the methods and technical content disclosed above to make many possible changes and modifications to the technical solution, or modify it into equivalent embodiments with equivalent changes. Therefore, any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the present application, without departing from the content of its technical solution, still falls within the protection scope of the technical solution of the present application.

Claims (10)

1. An abstract generation method, characterized in that the method comprises:
obtaining multiple abstract sets; each abstract set includes abstract sentences of the expectation text in the corresponding time slice, and any two of the abstract sets correspond to different time slices;
obtaining, according to the abstract sentences included in the abstract sets, the diversity factor of a first abstract set with respect to a second abstract set, so as to obtain the abstract diversity factor corresponding to the first abstract set; the first abstract set and the second abstract set are any two of the multiple abstract sets;
selecting, based on each obtained abstract diversity factor, the abstract sets that meet a first screening condition from the multiple abstract sets;
combining the abstract sentences included in the selected abstract sets to generate an abstract.
2. The method according to claim 1, characterized in that obtaining, according to the abstract sentences included in the abstract sets, the diversity factor of the first abstract set with respect to the second abstract set, so as to obtain the abstract diversity factor corresponding to the first abstract set, specifically comprises:
obtaining the diversity factor of each abstract sentence in the first abstract set with respect to the second abstract set, based on whether the characters or character strings in that abstract sentence reappear in the second abstract set;
aggregating the diversity factors of the abstract sentences in the first abstract set with respect to the second abstract set, so as to obtain the abstract diversity factor corresponding to the first abstract set.
3. The method according to claim 2, characterized in that obtaining the diversity factor of each abstract sentence in the first abstract set with respect to the second abstract set, based on whether the characters or character strings in that abstract sentence reappear in the second abstract set, specifically comprises:
extracting multiple character strings from a target abstract sentence according to a preset rule; the target abstract sentence is any abstract sentence in the first abstract set;
counting, among the multiple character strings, the number of character strings that do not reappear in the second abstract set, so as to obtain a statistical value;
obtaining the diversity factor of the target abstract sentence with respect to the second abstract set according to the statistical value and the number of the multiple character strings.
4. The method according to any one of claims 1 to 3, characterized in that obtaining the multiple abstract sets specifically comprises:
obtaining the sentence-splitting result of the expectation text in a first time slice, so as to obtain a first sentence set; the first time slice is the time slice corresponding to the first abstract set;
extracting topic words from the first sentence set by using a topic model, so as to obtain a first topic word set;
obtaining the article degree of correlation corresponding to each sentence in the first sentence set, according to the word frequency of sentence co-occurrence words in the expectation text in the first time slice and the word frequency of topic co-occurrence words in the expectation text in the first time slice; a sentence co-occurrence word is a character or character string that appears simultaneously in any two sentences of the first sentence set, a topic co-occurrence word is a character or character string that appears simultaneously in the first sentence set and in the topic words, and the article degree of correlation represents the likelihood of reflecting the central idea of the expectation text in the first time slice;
selecting, according to the article degree of correlation, the sentences that meet a second screening condition from the first sentence set, so as to obtain the first abstract set.
5. The method according to claim 4, characterized in that obtaining the article degree of correlation corresponding to each sentence in the first sentence set, according to the word frequency of the sentence co-occurrence words in the expectation text in the first time slice and the word frequency of the topic co-occurrence words in the expectation text in the first time slice, specifically comprises:
obtaining the article degree of correlation iteratively by using an optimization algorithm, where for the i-th iteration:
the article degree of correlation of a target sentence is obtained according to: the word frequency, in the expectation text in the first time slice, of the sentence co-occurrence words of the target sentence and each sentence in a sentence subset; the article degree of correlation of each sentence in the sentence subset; the word frequency, in the expectation text in the first time slice, of the topic co-occurrence words of the target sentence and each topic word in the first topic word set; and the article degree of correlation of each topic word in the first topic word set;
wherein the target sentence is any abstract sentence in the first sentence set, the sentence subset includes the remaining abstract sentences in the first sentence set other than the target sentence, and the article degree of correlation of a topic word is obtained according to the word frequency of the topic word in the expectation text in the first time slice and the article degree of correlation of the sentences in the first sentence set that include the topic word.
6. The method according to claim 4, characterized in that selecting, based on each obtained abstract diversity factor, the abstract sets that meet the first screening condition from the multiple abstract sets specifically comprises:
aggregating the article degrees of correlation of the abstract sentences in the first abstract set to obtain the article degree of correlation corresponding to the first abstract set;
selecting, based on each obtained abstract diversity factor and each article degree of correlation, the abstract sets that meet the first screening condition from the multiple abstract sets.
7. The method according to any one of claims 1 to 3, characterized in that, after obtaining the multiple abstract sets, the method further comprises:
obtaining the novelty degree of a target abstract sentence based on whether the characters or character strings in the target abstract sentence reappear in each abstract sentence of a first abstract subset; the target abstract sentence is any abstract sentence in the first abstract set, and the first abstract subset includes the remaining abstract sentences in the first abstract set other than the target abstract sentence;
aggregating the novelty degrees of the abstract sentences in the first abstract set to obtain the novelty degree of the first abstract set;
then, selecting, based on each obtained abstract diversity factor, the abstract sets that meet the first screening condition from the multiple abstract sets specifically comprises:
selecting, based on each obtained abstract diversity factor and the novelty degree of each abstract set, the abstract sets that meet the first screening condition from the multiple abstract sets.
8. An abstract generation device, characterized in that the device comprises: a first obtaining unit, a second obtaining unit, a screening unit and a combining unit;
the first obtaining unit is configured to obtain multiple abstract sets; each abstract set includes abstract sentences of the expectation text in the corresponding time slice, and any two of the abstract sets correspond to different time slices;
the second obtaining unit is configured to obtain, according to the abstract sentences included in the abstract sets, the diversity factor of a first abstract set with respect to a second abstract set, so as to obtain the abstract diversity factor corresponding to the first abstract set; the first abstract set and the second abstract set are any two of the multiple abstract sets;
the screening unit is configured to select, based on each obtained abstract diversity factor, the abstract sets that meet a first screening condition from the multiple abstract sets;
the combining unit is configured to combine the abstract sentences included in the selected abstract sets to generate an abstract.
9. A computer-readable storage medium, characterized in that a computer program is stored thereon, and when the computer program is executed by a processor, the abstract generation method according to any one of claims 1 to 7 is implemented.
10. Abstract generation equipment, characterized by comprising: a processor and a memory;
the memory is configured to store program code and transmit the program code to the processor;
the processor is configured to execute, according to instructions in the program code, the abstract generation method according to any one of claims 1 to 7.
CN201811626213.XA 2018-12-28 2018-12-28 Abstract generation method and device Active CN109815328B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811626213.XA CN109815328B (en) 2018-12-28 2018-12-28 Abstract generation method and device

Publications (2)

Publication Number Publication Date
CN109815328A true CN109815328A (en) 2019-05-28
CN109815328B CN109815328B (en) 2021-05-25

Family

ID=66602705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811626213.XA Active CN109815328B (en) 2018-12-28 2018-12-28 Abstract generation method and device

Country Status (1)

Country Link
CN (1) CN109815328B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101446940A (en) * 2007-11-27 2009-06-03 北京大学 Method and device of automatically generating a summary for document set
CN101231634A (en) * 2007-12-29 2008-07-30 中国科学院计算技术研究所 Autoabstract method for multi-document
CN102043851A (en) * 2010-12-22 2011-05-04 四川大学 Multiple-document automatic abstracting method based on frequent itemset
CN106874469A (en) * 2017-02-16 2017-06-20 北京大学 A kind of news roundup generation method and system
CN107526841A (en) * 2017-09-19 2017-12-29 中央民族大学 A kind of Tibetan language text summarization generation method based on Web

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
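Li Hui (李辉): "Research and Application of Timeline-Based Event Organization and Summarization Technology" (《基于时间线的事件组织与摘要技术的研究与应用》), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) *
Sheng Yadong (盛雅东): "Geographic Location Query System Based on Google_Map" (《基于Google_Map的地理位置查询系统》), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) *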

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111813925A (en) * 2020-07-14 2020-10-23 混沌时代(北京)教育科技有限公司 Semantic-based unsupervised automatic summarization method and system
CN112328783A (en) * 2020-11-24 2021-02-05 腾讯科技(深圳)有限公司 Abstract determining method and related device
CN114722194A (en) * 2022-03-15 2022-07-08 电子科技大学 Automatic construction method of emergency time sequence based on abstract generation algorithm
CN114722194B (en) * 2022-03-15 2023-05-09 电子科技大学 Automatic construction method for emergency time sequence based on abstract generation algorithm

Also Published As

Publication number Publication date
CN109815328B (en) 2021-05-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant