CN109815328A - Abstract generation method and device
- Publication number: CN109815328A
- Application number: CN201811626213.XA
- Authority: CN (China)
- Prior art keywords: abstract, sentence, collection, diversity factor, text
- Legal status: Granted (as listed by Google Patents; the status is an assumption and is not a legal conclusion)
- Landscapes: Information Retrieval, Db Structures And Fs Structures Therefor; Document Processing Apparatus; Machine Translation
Abstract
The invention discloses an abstract generation method and device. The method comprises: obtaining multiple abstract sets, where each abstract set includes the abstract sentences of the expected text in a corresponding time segment and any two abstract sets correspond to different time segments; obtaining, according to the abstract sentences included in the abstract sets, the diversity factor of a first abstract set relative to a second abstract set, thereby obtaining the abstract diversity factor corresponding to the first abstract set; based on each obtained abstract diversity factor, selecting from the multiple abstract sets the abstract sets that satisfy a first screening condition, so as to remove abstract sets that largely repeat or are redundant with other abstract sets; and combining the abstract sentences included in the selected abstract sets to generate an abstract. This reduces repetitive and redundant abstract content while preserving abstract coverage, and helps readers accurately grasp the central idea of the expected text and the development of the event.
Description
Technical field
The present application relates to the field of natural language processing, and in particular to an abstract generation method and device.
Background technique
Automatic summarization technology takes a given long original text, automatically extracts its keywords (or key sentences), and then organizes the extracted keywords (or key sentences) by certain rules or means into a short text that summarizes the central idea of the original long text. In today's information explosion, however, people encounter massive amounts of text every day; for example, large volumes of news text are produced daily by different media and channels. In this context, traditional summarization of a single long text does little to help people grasp key information promptly and accurately, and is of limited use. A technique for generating abstracts from multiple long texts is therefore needed.
Multi-document summarization (MDS) emerged to meet this need: it takes multiple long texts as input and automatically generates an abstract of a specified length as required. Multi-document summarization can be divided into static text summarization, which has no time order, and dynamic text summarization, which is time-ordered.
Take news briefs as an example. The occurrence and subsequent development of a news topic do not fall within a single time segment; in the early stage, new developments may appear every day. What people often really need is therefore a time-ordered abstract of the same topic or event that follows its course of development, that is, a dynamic text abstract. Conventional dynamic multi-document summarization, however, first produces a static abstract of the data in each time segment and then simply concatenates the static abstracts in chronological order, which yields a great deal of repetitive and redundant abstract content.
Summary of the invention
In view of this, the embodiments of the present application provide an abstract generation method and device that address the repetitive and redundant content found in abstracts extracted by the prior art.
The abstract generation method provided in the first aspect of the embodiments of the present application comprises:
Obtaining multiple abstract sets, where each abstract set includes the abstract sentences of the expected text in a corresponding time segment, and any two abstract sets correspond to different time segments;
Obtaining, according to the abstract sentences included in the abstract sets, the diversity factor of a first abstract set relative to a second abstract set, thereby obtaining the abstract diversity factor corresponding to the first abstract set, where the first abstract set and the second abstract set are any two of the multiple abstract sets;
Based on each obtained abstract diversity factor, selecting from the multiple abstract sets the abstract sets that satisfy a first screening condition;
Combining the abstract sentences included in the selected abstract sets to generate an abstract.
Optionally, determining the diversity factor of the first abstract set relative to the second abstract set according to the abstract sentences included in the abstract sets, and obtaining the abstract diversity factor corresponding to the first abstract set, specifically comprises:
Based on the recurrence state, in the second abstract set, of the characters or character strings in each abstract sentence of the first abstract set, obtaining the diversity factor of each abstract sentence of the first abstract set relative to the second abstract set;
Aggregating the diversity factors of the abstract sentences of the first abstract set relative to the second abstract set to obtain the abstract diversity factor corresponding to the first abstract set.
Optionally, obtaining the diversity factor of each abstract sentence of the first abstract set relative to the second abstract set, based on the recurrence state in the second abstract set of the characters or character strings in each abstract sentence of the first abstract set, specifically comprises:
Extracting multiple character strings from a target abstract sentence according to a preset rule, where the target abstract sentence is any abstract sentence in the first abstract set;
Counting how many of the multiple character strings do not recur in the second abstract set, to obtain a statistical value;
Obtaining the diversity factor of the target abstract sentence relative to the second abstract set according to the statistical value and the number of the multiple character strings.
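The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes the preset rule is character trigram extraction, takes the diversity factor of a sentence to be the fraction of its character strings that do not recur in the second abstract set, and aggregates the sentence-level values by averaging. The function names `char_ngrams`, `sentence_diversity` and `set_diversity` are hypothetical.

```python
def char_ngrams(sentence, n=3):
    """Extract all overlapping character n-grams (assumed preset rule)."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]


def sentence_diversity(target, second_set, n=3):
    """Fraction of the target sentence's n-grams that are absent from the second set."""
    strings = char_ngrams(target, n)
    if not strings:
        return 0.0  # sentence shorter than n: nothing to compare (assumption)
    seen = {g for s in second_set for g in char_ngrams(s, n)}
    not_recurring = sum(1 for g in strings if g not in seen)  # the statistical value
    return not_recurring / len(strings)


def set_diversity(first_set, second_set, n=3):
    """Aggregate the sentence-level values by their mean (assumed aggregation)."""
    if not first_set:
        return 0.0
    return sum(sentence_diversity(s, second_set, n) for s in first_set) / len(first_set)
```

A ratio keeps the value between 0 and 1 regardless of sentence length, which makes a single threshold usable across abstract sets; other ways of combining the statistical value with the string count are equally compatible with the text.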
Optionally, obtaining multiple abstract sets specifically comprises:
Obtaining the sentence-splitting result of the expected text in a first time segment to obtain a first sentence set, where the first time segment is the time segment corresponding to the first abstract set;
Extracting topic words from the first sentence set using a topic model to obtain a first topic word set;
Obtaining the article relevance degree of each sentence in the first sentence set according to the word frequency of sentence co-occurrence words in the expected text in the first time segment and the word frequency of topic co-occurrence words in the expected text in the first time segment; a sentence co-occurrence word is a character or character string that appears simultaneously in any two sentences of the first sentence set, a topic co-occurrence word is a character or character string that appears both in the first sentence set and in the topic words, and the article relevance degree represents the likelihood of reflecting the central idea of the expected text in the first time segment;
According to the article relevance degree, selecting from the first sentence set the sentences that satisfy a second screening condition to obtain the first abstract set.
Optionally, obtaining the article relevance degree of each sentence in the first sentence set, according to the word frequency of the sentence co-occurrence words in the expected text in the first time segment and the word frequency of the topic co-occurrence words in the expected text in the first time segment, specifically comprises:
Obtaining the article relevance degree iteratively using an optimization algorithm, where, for the i-th iteration:
The article relevance degree of a target sentence is obtained according to: the word frequency, in the expected text in the first time segment, of the sentence co-occurrence words of the target sentence and each sentence in a sentence subset; the article relevance degree of each sentence in the sentence subset; the word frequency, in the expected text in the first time segment, of the topic co-occurrence words of the target sentence and each topic word in the first topic word set; and the article relevance degree of each topic word in the first topic word set;
Here, the target sentence is any one abstract sentence in the first sentence set; the sentence subset includes the remaining abstract sentences in the first sentence set other than the target sentence; and the article relevance degree of a topic word is obtained according to the word frequency of the topic word in the expected text in the first time segment and the article relevance degree of the sentences in the first sentence set that contain the topic word.
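The text above names the inputs of each iteration but gives no update formula, so the following Python sketch fills that gap with assumptions only: sentences are split on whitespace, word frequency is counted over the sentences themselves, the weight between two sentences is the summed frequency of their co-occurrence words, and scores are rescaled by their maximum after each pass. The function name `article_relevance` and every formula in it are hypothetical illustrations of a mutual-reinforcement update, not the optimization algorithm of the application.

```python
from collections import Counter


def article_relevance(sentences, topic_words, iterations=20):
    """Iteratively score sentences; every formula here is an illustrative assumption."""
    tokens = [set(s.split()) for s in sentences]
    # Word frequency over the text of the time segment (here: the sentences themselves).
    freq = Counter(w for s in sentences for w in s.split())
    topics = [t for t in topic_words if freq[t] > 0]

    def co_weight(i, j):
        # Summed frequency of the words shared by sentences i and j.
        return sum(freq[w] for w in tokens[i] & tokens[j])

    sent_score = [1.0] * len(sentences)
    for _ in range(iterations):
        # Topic word relevance: its frequency times the relevance of sentences containing it.
        topic_score = {
            t: freq[t] * sum(sent_score[i] for i in range(len(sentences)) if t in tokens[i])
            for t in topics
        }
        new_score = []
        for i in range(len(sentences)):
            from_sentences = sum(
                co_weight(i, j) * sent_score[j] for j in range(len(sentences)) if j != i
            )
            from_topics = sum(freq[t] * topic_score[t] for t in topics if t in tokens[i])
            new_score.append(from_sentences + from_topics)
        top = max(new_score, default=0.0)
        # Rescale so that scores stay bounded across iterations.
        sent_score = [x / top for x in new_score] if top > 0 else new_score
    return sent_score
```

With this update, a sentence that shares frequent words with other high-scoring sentences and with topic words rises toward 1, while an isolated sentence falls toward 0.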
Optionally, selecting from the multiple abstract sets the abstract sets that satisfy the first screening condition, based on each obtained abstract diversity factor, specifically comprises:
Aggregating the article relevance degrees of the abstract sentences in the first abstract set to obtain the article relevance degree corresponding to the first abstract set;
Based on each obtained abstract diversity factor and each article relevance degree, selecting from the multiple abstract sets the abstract sets that satisfy the first screening condition.
Optionally, after obtaining the multiple abstract sets, the method further comprises:
Based on the recurrence state, in each abstract sentence of a first abstract subset, of the characters or character strings in a target abstract sentence, obtaining the novelty degree of the target abstract sentence, where the target abstract sentence is any abstract sentence in the first abstract set and the first abstract subset includes the remaining abstract sentences in the first abstract set other than the target abstract sentence;
Aggregating the novelty degrees of the abstract sentences in the first abstract set to obtain the novelty degree of the first abstract set;
In that case, selecting from the multiple abstract sets the abstract sets that satisfy the first screening condition, based on each obtained abstract diversity factor, specifically comprises:
Based on each obtained abstract diversity factor and the novelty degree of each abstract set, selecting from the multiple abstract sets the abstract sets that satisfy the first screening condition.
Optionally, after obtaining the novelty degrees of the abstract sentences in the first abstract set, the method further comprises:
Based on the novelty degree of each abstract sentence in the first abstract set, removing from the first abstract set the abstract sentences that do not satisfy a third screening condition.
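The novelty computation and the removal step can be sketched as below. The sketch assumes that the novelty degree of a sentence is the fraction of its character trigrams that do not recur in the other sentences of the same abstract set, and that the third screening condition is a simple threshold; both choices, the threshold value, and the names `trigrams`, `novelty_degrees` and `drop_low_novelty` are hypothetical.

```python
def trigrams(sentence):
    """Distinct character trigrams of a sentence (assumed preset rule)."""
    return {sentence[i:i + 3] for i in range(len(sentence) - 2)}


def novelty_degrees(abstract_set):
    """Novelty of each sentence relative to the other sentences of its own set."""
    degrees = []
    for idx, target in enumerate(abstract_set):
        own = trigrams(target)
        others = set()
        for j, sentence in enumerate(abstract_set):
            if j != idx:  # first abstract subset: every sentence except the target
                others |= trigrams(sentence)
        degrees.append(len(own - others) / len(own) if own else 0.0)
    return degrees


def drop_low_novelty(abstract_set, threshold=0.5):
    """Keep the sentences whose novelty meets the assumed third screening condition."""
    degrees = novelty_degrees(abstract_set)
    return [s for s, d in zip(abstract_set, degrees) if d >= threshold]
```

One caveat of this naive rule: two identical sentences both receive a novelty of 0 and are both removed, so a practical implementation would likely need to retain one copy.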
The abstract generation device provided in the second aspect of the embodiments of the present application comprises a first obtaining unit, a second obtaining unit, a screening unit and a combining unit;
The first obtaining unit is configured to obtain multiple abstract sets, where each abstract set includes the abstract sentences of the expected text in a corresponding time segment, and any two abstract sets correspond to different time segments;
The second obtaining unit is configured to obtain, according to the abstract sentences included in the abstract sets, the diversity factor of a first abstract set relative to a second abstract set, thereby obtaining the abstract diversity factor corresponding to the first abstract set, where the first abstract set and the second abstract set are any two of the multiple abstract sets;
The screening unit is configured to select from the multiple abstract sets, based on each obtained abstract diversity factor, the abstract sets that satisfy a first screening condition;
The combining unit is configured to combine the abstract sentences included in the selected abstract sets to generate an abstract.
The third aspect of the embodiments of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any one of the abstract generation methods provided in the first aspect.
The fourth aspect of the embodiments of the present application further provides an abstract generation apparatus comprising a processor and a memory;
The memory is configured to store program code and transmit the program code to the processor;
The processor is configured to execute, according to the instructions in the program code, any one of the abstract generation methods provided in the first aspect.
Compared with the prior art, the present application has at least the following advantages:
In the embodiments of the present application, the abstract sets of the expected text in multiple different time segments are first obtained, yielding multiple abstract sets. Then, according to the abstract sentences included in the abstract sets, the diversity factor between any two of the abstract sets is obtained, giving the abstract diversity factor corresponding to each abstract set, which reflects how much each abstract set repeats the other abstract sets. Next, based on the abstract diversity factor of each abstract set, the abstract sets that satisfy the first screening condition are selected from the multiple abstract sets. Because the gradual and changing nature of the abstract topic across time segments is taken into account and the repetition between the abstract sets of different time segments is assessed, abstract sets that largely repeat or are redundant with other abstract sets are removed, and the abstract sentences in the selected abstract sets are combined into an abstract. This reduces repetitive and redundant abstract content while preserving abstract coverage, and helps readers accurately grasp the central idea of the expected text and the development of the event.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of an abstract generation method provided by an embodiment of the present application;
Fig. 2 is a flow diagram of another abstract generation method provided by an embodiment of the present application;
Fig. 3 is a flow diagram of another abstract generation method provided by an embodiment of the present application;
Fig. 4 is a flow diagram of another abstract generation method provided by an embodiment of the present application;
Fig. 5 is a flow diagram of another abstract generation method provided by an embodiment of the present application;
Fig. 6 is a structural diagram of an abstract generation device provided by an embodiment of the present application.
Detailed description of the embodiments
To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.
It should be understood that, in the present application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following (items)" or a similar expression refers to any combination of those items, including any combination of single items or plural items. For example, at least one of a, b or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may each be single or multiple.
In today's information explosion, people encounter massive amounts of text every day; for example, large volumes of news text are produced daily by different media and channels. Multi-document summarization emerged so that the central idea of articles and the development thread of a topic can be obtained quickly and effectively. However, when producing an abstract that represents the development thread of an event, existing multi-document summarization techniques simply concatenate, in chronological order, the abstract sentences extracted from the texts in different time segments.
In the general course of an event, however, content described in the expected text of a previous time segment may be repeated in the expected text of the current time segment; for example, today's news may briefly recap yesterday's news event on the same topic. Because existing multi-document summarization techniques do not consider the gradual and changing nature of the abstract topic across time segments, the abstracts of the texts in different time segments contain much repetitive and redundant content, and readers cannot grasp the development thread of the abstract topic promptly and effectively.
To this end, the embodiments of the present application provide an abstract generation method and device that take into account the gradual and changing nature of the abstract topic across time segments, assess the repetition between the abstract sets corresponding to two different time segments, remove abstract sets that largely repeat or are redundant with other abstract sets, and combine the abstract sentences in the remaining abstract sets into an abstract. This reduces repetitive and redundant abstract content while preserving coverage of the abstract topic, and helps readers effectively follow the development thread of the topic.
Based on the above idea, and to make the above objects, features and advantages of the present application clearer, specific embodiments of the present application are described in detail below with reference to the drawings.
Refer to Fig. 1, which is a flow diagram of an abstract generation method provided by an embodiment of the present application.
The abstract generation method provided by this embodiment comprises:
S101: Obtain multiple abstract sets.
Each abstract set includes the abstract sentences of the expected text in a corresponding time segment, and any two abstract sets correspond to different time segments.
It should be noted that time segments give a coarse division of the development stages of the abstract topic and can be set as needed in a specific implementation; a time segment may be, for example, a few hours, one day, one week or one month, and the embodiments of the present application do not limit this. In one example, the time segment can be set according to how fast the abstract topic develops. For example, when the abstract topic develops quickly (such as the proceedings of a conference), the time segment can be set shorter, such as a few hours or one day; when the abstract topic develops slowly (such as the development of train technology), the time segment can be set longer, such as one year.
In the embodiments of the present application, the expected text is text related to the abstract topic and can be retrieved from a database (such as an Internet database) or a knowledge base. It should be noted that the embodiments do not limit the language of the expected text, which may be Chinese, English, Japanese and so on. Each expected text has a corresponding time attribute, such as its writing time, its publication time or the time at which the recorded content occurred, and the time attribute determines which time segment the expected text belongs to. In a specific implementation, a search engine component can first search the database or knowledge base, according to the query word (that is, the abstract topic) carried in a received query request, for texts related to the query word (for example, texts containing it), obtaining at least one expected text.
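As a small illustration of how a time attribute can place each expected text in a time segment, the following Python sketch groups texts by calendar day. It assumes one-day time segments and uses the publication time as the time attribute; the function name `group_by_day` and the sample texts are made up for the example.

```python
from collections import defaultdict
from datetime import datetime


def group_by_day(texts):
    """Group (published_at, text) pairs into one-day time segments, oldest first."""
    segments = defaultdict(list)
    for published_at, text in texts:
        segments[published_at.date()].append(text)
    return dict(sorted(segments.items()))


# Made-up sample data: three texts published over two days.
texts = [
    (datetime(2018, 12, 2, 9, 0), "Second-day report"),
    (datetime(2018, 12, 1, 8, 30), "First-day report"),
    (datetime(2018, 12, 1, 20, 0), "First-day follow-up"),
]
segments = group_by_day(texts)
```

A longer or shorter segment length only changes the grouping key, for example the ISO week or the hour instead of the date.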
It can be understood that each abstract set includes the abstract sentences of the expected text in the corresponding time segment, and the abstract sentences in an abstract set are combined to obtain an abstract that represents the central idea or development thread of the abstract topic. How the abstract sentences of the expected text are obtained is described in detail later and is not elaborated here.
S102: According to the abstract sentences included in the abstract sets, determine the diversity factor of the first abstract set relative to the second abstract set, obtaining the abstract diversity factor corresponding to the first abstract set.
It can be understood that the first abstract set and the second abstract set are any two of the multiple abstract sets. In a specific implementation, to reduce repeated or redundant content in the abstract, the corresponding abstract diversity factor can be obtained for each abstract set.
In the embodiments of the present application, the abstract diversity factor reflects how two abstract sets differ in the content they express. Because the content expressed by each abstract set is conveyed by the abstract sentences it contains, the diversity factor of the first abstract set relative to the second abstract set can be determined from the abstract sentences included in the abstract sets. If the content expressed by the first abstract set differs greatly from that expressed by the second abstract set, that is, the abstract sentences of the two sets share little or no repeated or redundant content, the abstract diversity factor of the first abstract set is large; if the content expressed by the first abstract set differs little from that expressed by the second abstract set, that is, the abstract sentences of the two sets share much repeated or redundant content, the abstract diversity factor of the first abstract set is small.
It should be noted that content described in the expected text of a previous time segment may be repeated in the expected text of the current time segment; for example, today's news may briefly recap yesterday's news event on the same topic, so that the abstract set of the previous time segment and the abstract set of the current time segment contain repeated or redundant content. Therefore, to reduce repetition and/or redundancy in the generated abstract, in some possible implementations of the embodiments of the present application the start time of the time segment corresponding to the first abstract set may be later than the start time of the time segment corresponding to the second abstract set. As an example, the first abstract set may correspond to the current time segment (such as today) and the second abstract set to the previous time segment (such as yesterday).
S103: Based on each obtained abstract diversity factor, select from the multiple abstract sets the abstract sets that satisfy the first screening condition.
In the embodiments of the present application, the abstract diversity factor reflects how two abstract sets differ in the content they express, so the abstract diversity factor of each abstract set indicates how the content expressed by that abstract set differs from the content expressed by the other abstract sets. On this basis, the abstract sets with a larger abstract diversity factor (that is, those satisfying the first screening condition) can be selected from the multiple abstract sets, removing repeated or redundant content among the abstract sets. The combined abstract then has low repetition and redundancy, concisely summarizes the central idea and development thread of the abstract topic, and is easy for readers to understand.
In practical applications, the first screening condition can be set as needed, for example that the abstract diversity factor must be greater than a certain threshold; this is not limited here.
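A threshold-based first screening condition could look like the following Python sketch. It assumes that the abstract diversity factor of each abstract set has already been computed, that a set with nothing to compare against (for example the earliest time segment) is always kept, and that the threshold is 0.5; the function name `select_abstract_sets`, the threshold and the sample values are all made up for illustration.

```python
def select_abstract_sets(diversity_by_segment, threshold=0.5):
    """Return the segment labels whose abstract diversity factor exceeds the threshold.

    diversity_by_segment maps a segment label to the diversity factor of its
    abstract set; None marks a set with nothing to compare against.
    """
    selected = []
    for segment, diversity in diversity_by_segment.items():
        if diversity is None or diversity > threshold:
            selected.append(segment)
    return selected


diversity = {"day 1": None, "day 2": 0.2, "day 3": 0.8}  # made-up values
kept = select_abstract_sets(diversity)
```

Here the abstract set of "day 2" is dropped because it mostly repeats earlier content, while "day 1" and "day 3" are kept.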
S104: Combine the abstract sentences included in the selected abstract sets to generate an abstract.
In the embodiments of the present application, the selected abstract sets differ substantially in content from the other abstract sets (for example, the abstract set of the previous time segment) and contain little repeated or redundant content. The abstract generated by combining the abstract sentences in the selected abstract sets therefore also contains little repeated or redundant content, which reduces the repetition and redundancy of the abstract and helps readers accurately grasp the central idea of the expected text and the development of the event from the generated abstract.
In practical applications, the selected abstract sets can be combined directly in the chronological order of their corresponding time segments, the abstract sentences within an abstract set can be combined in the order in which they appear in the expected text, and identical or similar abstract sentences in the same abstract set can be merged to reduce repeated content. The embodiments of the present application do not specifically limit how the abstract sentences are combined, and the possibilities are not enumerated here.
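One possible combination strategy is sketched below in Python. It assumes that each selected abstract set is keyed by a sortable time-segment value, that the sentences of each set are already listed in their order of appearance, and that only exactly identical sentences are merged (merging merely similar sentences is left out). The function name `combine_abstract_sets` is hypothetical.

```python
def combine_abstract_sets(selected_sets):
    """Join abstract sentences segment by segment, skipping exact repeats within a set.

    selected_sets maps a sortable time-segment key to the list of abstract
    sentences of that segment, already in their order of appearance.
    """
    lines = []
    for segment in sorted(selected_sets):
        seen = set()
        for sentence in selected_sets[segment]:
            if sentence not in seen:  # merge identical sentences of the same set
                seen.add(sentence)
                lines.append(sentence)
    return " ".join(lines)
```

Sorting by the segment key yields the chronological order described above without requiring the caller to pre-sort the selected sets.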
In the embodiments of the present application, the abstract sets of the expected text in multiple different time segments are first obtained, yielding multiple abstract sets. Then, according to the abstract sentences included in the abstract sets, the diversity factor between any two of the abstract sets is obtained, giving the abstract diversity factor corresponding to each abstract set, which reflects how much each abstract set repeats the other abstract sets. Next, based on the abstract diversity factor of each abstract set, the abstract sets that satisfy the first screening condition are selected from the multiple abstract sets. Because the gradual and changing nature of the abstract topic across time segments is taken into account and the repetition between the abstract sets of different time segments is assessed, abstract sets that largely repeat or are redundant with other abstract sets are removed, and the abstract sentences in the selected abstract sets are combined into an abstract. This reduces repetitive and redundant abstract content while preserving abstract coverage, and helps readers accurately grasp the central idea of the expected text and the development of the event.
The above describes how abstract sets satisfying the first screening condition are selected from the multiple abstract sets according to their abstract diversity factors and combined into an abstract. The following takes the first abstract set (that is, any one of the multiple abstract sets) as an example to explain how the abstract diversity factor of an abstract set is obtained; the abstract diversity factors of the other abstract sets are obtained in a similar way, which is not repeated.
Refer to Fig. 2, which is a flow diagram of another abstract generation method provided by an embodiment of the present application.
In some possible implementations of the embodiments of the present application, step S102 may specifically include:
S201: Based on the recurrence state, in the second abstract set, of the characters or character strings in each abstract sentence of the first abstract set, obtain the diversity factor of each abstract sentence of the first abstract set relative to the second abstract set.
In the embodiments of the present application, the recurrence state specifically means present or absent. When a character or character string (such as one word or several consecutive words) in an abstract sentence of the first abstract set appears in an abstract sentence of the second abstract set, the recurrence state is yes; when a character or character string in an abstract sentence of the first abstract set does not appear in any abstract sentence of the second abstract set, the recurrence state is no.
It can be understood that the diversity factor reflects how the two differ in content. When a character or character string in an abstract sentence of the first abstract set appears in an abstract sentence of the second abstract set, the abstract sentence and the second abstract set share repeated or redundant content, and the diversity factor is smaller; conversely, when a character or character string in an abstract sentence of the first abstract set does not appear in any abstract sentence of the second abstract set, the abstract sentence and the second abstract set do not share that repeated or redundant content, and the diversity factor is larger. Therefore, the diversity factor of each abstract sentence of the first abstract set relative to the second abstract set can be obtained from the recurrence state, in the second abstract set, of the characters or character strings in the abstract sentences of the first abstract set.
In some possible implementations of the embodiments of the present application, as shown in Fig. 3, step S201 may specifically include:
S2011: Extract multiple character strings from the target abstract sentence according to a preset rule.
It can be understood that the target abstract sentence is any one abstract sentence in the first abstract collection. In practical applications, the preset rule can be set as needed to determine the length of the extracted character strings and the extraction method.
As an example, to ensure that the diversity factor statistics are accurate and comprehensive, the preset rule may be an N-gram rule, i.e., character strings are extracted from the target abstract sentence using the N-gram rule.
N-gram: in natural language processing (NLP), a sequence of N items in a given piece of text. An item may be a letter, a word, and so on. When N=1 it is called a unigram; when N=2, a bigram; when N=3, a trigram; and so on. Taking the trigram as an example, starting from the first character of the target abstract sentence, characters 1 to 3 are taken as one character string, characters 2 to 4 as the next, characters 3 to 5 as the next, and so on, until characters n-2 to n are taken, where n is the number of characters in the target abstract sentence. This yields the multiple character strings extracted from the target abstract sentence.
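The sliding-window extraction described for step S2011 can be sketched as follows. This is a minimal illustration under the assumption that items are single characters; the function name `char_ngrams` is hypothetical and does not come from the application.

```python
def char_ngrams(sentence: str, n: int = 3) -> list[str]:
    """Return every run of n consecutive characters, sliding one character at a time.

    Hypothetical helper: items are assumed to be single characters.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # A sentence shorter than n characters yields no n-gram at all.
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]


# Trigrams of a six-character sentence: characters 1-3, 2-4, 3-5 and 4-6.
print(char_ngrams("abcdef", 3))
```

A word-level variant would apply the same window to a list of tokens instead of a string.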
S2012: Count the number of character strings, among the multiple character strings corresponding to the target abstract sentence, that do not reappear in the second abstract collection, to obtain a statistical value.
In practical applications, the intersection of the character strings of the target abstract sentence and the character strings included in the abstract sentences of the second abstract collection can be determined first. The number of character strings of the target abstract sentence that do not belong to this intersection is then counted to obtain the statistical value.
It can be understood that the more character strings of the target abstract sentence do not reappear in the second abstract collection, the greater the diversity factor between the target abstract sentence and the second abstract collection.
S2013: According to the statistical value and the number of character strings corresponding to the target abstract sentence, obtain the diversity factor of the target abstract sentence with respect to the second abstract collection.
In the embodiments of the present application, the statistical value is positively correlated with the diversity factor of the target abstract sentence with respect to the second abstract collection. The larger the statistical value, the fewer character strings of the target abstract sentence reappear in the second abstract collection, and the greater the diversity factor between the target abstract sentence and the second abstract collection. Conversely, the smaller the statistical value, the more character strings of the target abstract sentence reappear in the second abstract collection, and the smaller the diversity factor.
As an example, the diversity factor of the target abstract sentence with respect to the second abstract collection can be obtained as the ratio of the number of character strings of the target abstract sentence that do not reappear in the second abstract collection to the total number of character strings corresponding to the target abstract sentence.
The diversity factor con of the target abstract sentence with respect to the second abstract collection can then be obtained according to the following formula (1):
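The formula itself is not reproduced in this text. Based on the ratio described in the preceding paragraph and the symbols defined immediately below, a form consistent with the description would be the following; it is a reconstruction under those assumptions, and the notation n-gram(·) for "the set of character strings extracted from" is introduced here only for illustration.

```latex
\mathrm{con}(s_{i,j}, DS_{i-1}) =
  \frac{\bigl\| \{\, g \mid g \in \text{n-gram}(s_{i,j}),\ g \notin \text{n-gram}(DS_{i-1}) \,\} \bigr\|}
       {\bigl\| \text{n-gram}(s_{i,j}) \bigr\|}
```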
In the formula, s_{i,j} is the j-th abstract sentence in the abstract collection corresponding to the i-th time slice, DS_{i-1} is the abstract collection corresponding to the (i-1)-th time slice, n-gram is a character or character string, and ||·|| denotes taking the count.
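Steps S2012 and S2013 can be sketched together as below. This is a minimal sketch, assuming character trigrams and assuming that distinct character strings are counted (a set rather than a multiset); the function names are hypothetical.

```python
def char_ngrams(sentence: str, n: int = 3) -> set[str]:
    """Distinct character n-grams of a sentence (assumption: duplicates count once)."""
    return {sentence[i:i + n] for i in range(len(sentence) - n + 1)}


def sentence_diversity(target: str, other_collection: list[str], n: int = 3) -> float:
    """Share of the target sentence's n-grams that do not reappear in the other collection."""
    target_grams = char_ngrams(target, n)
    if not target_grams:
        return 0.0  # assumption: a sentence too short to yield n-grams gets 0
    seen: set[str] = set()
    for sentence in other_collection:
        seen |= char_ngrams(sentence, n)
    # Statistical value of S2012: n-grams outside the intersection with the other collection.
    not_reappearing = target_grams - seen
    # Ratio of S2013.
    return len(not_reappearing) / len(target_grams)


# "bcd" reappears, "abc", "cde" and "def" do not, so 3 of 4 trigrams are new.
print(sentence_diversity("abcdef", ["xbcdy"]))
```

A value near 1 means the sentence shares almost nothing with the other collection; a value near 0 means it is almost entirely repeated there.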
S202: Combine the diversity factors of the abstract sentences of the first abstract collection with respect to the second abstract collection to obtain the abstract diversity factor corresponding to the first abstract collection.
It can be understood that the diversity factor of the first abstract collection with respect to the second abstract collection (i.e., the abstract diversity factor) depends on how the character strings in the abstract sentences of the first abstract collection reappear in the second abstract collection. Therefore, combining the diversity factors of the individual abstract sentences of the first abstract collection with respect to the second abstract collection yields the abstract diversity factor corresponding to the first abstract collection.
As an example, the abstract diversity factor C corresponding to the first abstract collection can be obtained according to the following formula (2):
In the formula, DS_i is the abstract collection corresponding to the i-th time slice, s_{i,a} is the a-th abstract sentence in the abstract collection corresponding to the i-th time slice, and N is the total number of abstract sentences included in DS_i.
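Step S202 can be illustrated with the short sketch below. Formula (2) is not reproduced in this text, so the aggregation is an assumption: because N (the number of abstract sentences) is defined alongside the formula, the sketch takes the arithmetic mean of the per-sentence diversity factors. The function name is hypothetical.

```python
def collection_diversity(sentence_scores: list[float]) -> float:
    """Combine per-sentence diversity factors into one abstract diversity factor.

    Assumption: the combination is a plain average over the N sentences.
    """
    if not sentence_scores:
        return 0.0  # assumption: an empty collection contributes no diversity
    return sum(sentence_scores) / len(sentence_scores)


# Two abstract sentences with diversity factors 0.75 and 0.25.
print(collection_diversity([0.75, 0.25]))
```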
The foregoing describes how abstract collections that meet the first screening condition are selected from the multiple abstract collections according to their abstract diversity factors and combined to generate the abstract, and how the abstract diversity factor is obtained. The following takes the first abstract collection (i.e., any one of the multiple abstract collections) as an example to describe how an abstract collection is obtained. The other abstract collections are obtained in a similar way and are not described again.
Referring to Fig. 4, the figure is a flow diagram of another abstract generating method provided by an embodiment of the present application.
In some possible implementations of the embodiments of the present application, step S101 may specifically include:
S401: Obtain the sentence segmentation result of the expectation text in the first time slice to obtain a first sentence set.
It can be understood that the first time slice is the time slice corresponding to the first abstract collection. After sentence segmentation is performed on the expectation text in the first time slice, the first sentence set is obtained.
In practical applications, sentence segmentation can be performed according to the punctuation marks in the expectation text. That is, punctuation marks such as the full stop, comma, exclamation mark, question mark, semicolon and colon are used as separators, and each expectation text is split into sentences that each express an independent meaning (i.e., the sentence segmentation result), yielding the first sentence set.
It should be noted that a sentence containing too few characters conveys correspondingly little information and cannot effectively and accurately summarize the central idea of the expectation text. Therefore, in some possible implementations of the embodiments of the present application, sentences whose number of characters is less than a character-count threshold can be deleted from the first sentence set obtained by splitting. The threshold can be set according to actual needs, for example 10, and is not limited here.
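The segmentation and short-sentence filtering of step S401 can be sketched as follows. This is a minimal sketch: the separator set (ASCII plus full-width punctuation) and the function name are assumptions, and splitting on every full stop would also break decimal numbers and abbreviations, which a real implementation would need to handle.

```python
import re

# Assumption: both ASCII and full-width forms of the listed punctuation marks are separators.
_SEPARATORS = r"[.,!?;:。，！？；：]"


def split_sentences(text: str, min_chars: int = 10) -> list[str]:
    """Split text at punctuation marks and drop pieces shorter than min_chars characters."""
    pieces = (piece.strip() for piece in re.split(_SEPARATORS, text))
    return [piece for piece in pieces if len(piece) >= min_chars]


text = "The storm reached the coast on Monday. Ok! Thousands of residents were evacuated, officials said."
for sentence in split_sentences(text, min_chars=10):
    print(sentence)
```

Here the two-character piece "Ok" and the empty piece after the final full stop fall below the threshold and are discarded.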
In some possible implementations, the expectation text in the first time slice may instead be split into paragraphs to obtain the first sentence set. The principle is similar to that of sentence segmentation and is not described again here.
S402: Using a topic model, extract topic words from the first sentence set to obtain the first topic word set corresponding to the first time slice.
A topic word (descriptor) is a normalized, concept-bearing term of an artificial language that is used in indexing and retrieval to express the subject of a text. In the embodiments of the present application, any topic extraction method can be used to extract topic words from the first sentence set, for example a Latent Dirichlet Allocation (LDA) topic model. The extracted topic words constitute the first topic word set corresponding to the first time slice.
It should be noted that a typical topic model, taking the LDA topic model as an example, requires the number of topics to be set before clustering is performed to obtain the topic words. In practical applications, however, the number of topics in each time slice is not fixed, so an LDA topic model is prone to error. Therefore, in some possible implementations of the embodiments of the present application, a Hierarchical Dirichlet Process (HDP) can be used to generate the topic word set corresponding to each time slice. HDP is a nonparametric model and a variant of the LDA topic model. Its advantage is that the number of topics does not need to be preset, which safeguards the accuracy of topic word extraction.
When HDP is used to generate the topic word set corresponding to each time slice, word segmentation can first be performed on the first sentence set, and the stop words in the word segmentation result are filtered out. Stop words are characters or words that are automatically filtered out before or after natural language data (or text) is processed in information retrieval, in order to save storage space and improve search efficiency. They may include modal particles, adverbs, prepositions, conjunctions and the like, and can be set manually according to actual needs. Topic modeling is then performed with HDP to form a topic model, and the word segmentation result from which the stop words have been filtered is processed with the resulting topic model to obtain the first topic word set.
In some possible implementations, to improve the accuracy of the abstract and the processing efficiency, no more than a certain number of topic words can be chosen from the extracted topic words in descending order of weight to obtain the first topic word set. The number of topic words chosen is not limited here.
S403: According to the word frequency of the sentence co-occurrence words in the expectation text in the first time slice and the word frequency of the topic co-occurrence words in the expectation text in the first time slice, obtain the article degree of correlation corresponding to each sentence in the first sentence set.
In the embodiments of the present application, a sentence co-occurrence word is a character or character string that appears in any two sentences of the first sentence set at the same time, and a topic co-occurrence word is a character or character string that appears both in a sentence of the first sentence set and in a topic word. The article degree of correlation of a sentence represents the likelihood that the sentence reflects the central idea of the expectation text in the first time slice. The abstract sentences of the expectation text in the first time slice can be determined according to the article degree of correlation.
It can be understood that the characters or character strings referred to in step S403 may be extracted directly or obtained according to a certain rule (such as the N-gram rule), which is not described again here. The more often a character or character string occurs in the expectation text in the first time slice, the more likely a sentence that includes it expresses the central idea of the expectation text.
Therefore, according to the word frequency, in the expectation text in the first time slice, of the sentence co-occurrence words included in a target sentence of the first sentence set (i.e., any one sentence in the first sentence set), the degree of correlation between the target sentence and the other sentences in the first sentence set can be obtained; it reflects the ability of the target sentence to summarize the content expressed by the other sentences in the first sentence set. According to the word frequency, in the expectation text in the first time slice, of the topic co-occurrence words included in the target sentence, the degree of correlation between the target sentence and the topic words in the first topic word set can be obtained; it reflects the ability of the target sentence to express the central idea of the expectation text in the first time slice.
It should be noted that when no sentence co-occurrence word exists between the target sentence and another sentence in the sentence subset, the degree of correlation between the target sentence and that sentence can be set to 0. When the target sentence does not include a topic word of the first topic word set, the degree of correlation between the target sentence and that topic word can be set to 0.
Combining the ability of the target sentence to summarize the meaning expressed by the other sentences in the first sentence set with its ability to express the central idea of the expectation text in the first time slice then gives the likelihood that the target sentence reflects the central idea of the expectation text in the first time slice, i.e., its article degree of correlation.
In some possible implementations of the embodiments of the present application, the article degree of correlation can be obtained iteratively with an optimization algorithm, according to the word frequency of the sentence co-occurrence words and the word frequency of the topic co-occurrence words in the expectation text in the first time slice. For example, a genetic algorithm can be used for the optimization, with the goal of maximizing the overall quality of the abstract (which may include coverage, repetitiveness, redundancy and so on). As the name suggests, a genetic algorithm is an adaptive global optimization search algorithm that simulates the heredity and evolution of organisms in a natural environment. Following the principle of natural selection, it uses mechanisms borrowed from genetics, such as inheritance and mutation, to screen out individuals with higher fitness.
Then, for the i-th iteration of the optimization algorithm, step S403 may specifically include:
Obtain the article degree of correlation of the target sentence according to: the word frequency, in the expectation text in the first time slice, of the sentence co-occurrence words of the target sentence and each sentence in the sentence subset; the article degree of correlation of each sentence in the sentence subset; the word frequency, in the expectation text in the first time slice, of the topic co-occurrence words of the target sentence and each topic word in the first topic word set; and the article degree of correlation of each topic word in the first topic word set.
In the embodiments of the present application, the target sentence is any one sentence in the first sentence set, and the sentence subset includes the remaining sentences of the first sentence set other than the target sentence. Like the article degree of correlation of a sentence, the article degree of correlation of a topic word represents the likelihood that the topic word reflects the central idea of the expectation text in the first time slice. It can be obtained from the word frequency of the topic word in the expectation text in the first time slice and the article degrees of correlation of the sentences in the first sentence set that include the topic word.
In one example, the article degree of correlation of the target sentence can be obtained according to the following formula (3):
In the formula, s_{α,β} is the β-th sentence in the sentence set corresponding to the α-th time slice; s_{α,γ} is the γ-th sentence in the sentence subset corresponding to the α-th time slice; rel(s_{α,β}) is the article degree of correlation of sentence s_{α,β}; rel(s_{α,γ}) is the article degree of correlation of sentence s_{α,γ}; t_{α,k} is the k-th topic word in the topic word set corresponding to the α-th time slice; rel(t_{α,k}) is the article degree of correlation of topic word t_{α,k}; W_ST(s_{α,β}, t_{α,k}) is obtained according to the word frequency, in the expectation text in the first time slice, of the topic co-occurrence words of sentence s_{α,β} and topic word t_{α,k}; W_SS(s_{α,β}, s_{α,γ}) is obtained according to the word frequency, in the expectation text in the first time slice, of the sentence co-occurrence words of sentence s_{α,β} and sentence s_{α,γ}; M is the number of topic words in the topic word set corresponding to the α-th time slice; and P is the number of sentences in the sentence subset.
The article degree of correlation rel(t_{α,k}) of topic word t_{α,k} can be obtained according to the following formula (4):
In the formula, s_{α,x} is the x-th sentence in the sentence set corresponding to the α-th time slice; rel(s_{α,x}) is the article degree of correlation of sentence s_{α,x}; W_TS(t_{α,k}, s_{α,x}) is obtained according to the word frequency, in the expectation text in the first time slice, of the topic co-occurrence words of topic word t_{α,k} and sentence s_{α,x}; W_TS(t_{α,k}, s_{α,x}) is equal to W_ST(s_{α,x}, t_{α,k}); and Q is the number of sentences in the sentence set corresponding to the α-th time slice.
As an example, W_ST(s_{α,β}, t_{α,k}) can be obtained according to the following formula (5):
In the formula, w_t(s_{α,β}, t_{α,k}) is obtained according to the word frequency, in the expectation text in the first time slice, of the t-th topic co-occurrence word of sentence s_{α,β} and topic word t_{α,k}, and may be the TF-IDF weight of that topic co-occurrence word.
It should be noted that when no topic co-occurrence word exists between sentence s_{α,β} and topic word t_{α,k} (i.e., sentence s_{α,β} does not include topic word t_{α,k}), W_ST(s_{α,β}, t_{α,k}) can be set to 0.
As an example, W_SS(s_{α,β}, s_{α,γ}) can be obtained according to the following formula (6):
In the formula, w_t(s_{α,β}, s_{α,γ}) is obtained according to the word frequency, in the expectation text in the first time slice, of the t-th sentence co-occurrence word of sentence s_{α,β} and sentence s_{α,γ}, and may be the TF-IDF weight of that sentence co-occurrence word.
It should be noted that when no sentence co-occurrence word exists between sentence s_{α,β} and sentence s_{α,γ} (i.e., the two sentences share no identical character string), W_SS(s_{α,β}, s_{α,γ}) can be set to 0.
TF-IDF, i.e., term frequency-inverse document frequency, is a common weighting technique in information retrieval and data mining. It is used to assess how important a word is to a file set or to one file in a corpus. The importance of a word increases in proportion to the number of times it appears in the file, but decreases in inverse proportion to the frequency with which it appears in the corpus.
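The weighting idea can be made concrete with the short sketch below. Several TF-IDF variants exist; the one used here (relative term frequency multiplied by the unsmoothed logarithm of the inverse document frequency) is an assumption, as are the function name and the toy corpus.

```python
import math
from collections import Counter


def tf_idf(term: str, doc_tokens: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF of a term for one tokenized document within a tokenized corpus."""
    if not doc_tokens:
        return 0.0
    # Term frequency: share of the document's tokens equal to the term.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Document frequency: number of corpus documents that contain the term.
    df = sum(1 for tokens in corpus if term in tokens)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


corpus = [["flood", "rain"], ["rain", "wind"], ["rain"]]
print(tf_idf("flood", corpus[0], corpus))  # appears in one document only, so it scores above 0
print(tf_idf("rain", corpus[0], corpus))   # appears in every document, so log(1) gives 0
```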
In practical applications, the article degree of correlation of each sentence in the first sentence set can be normalized, and the article degree of correlation of each topic word in the first topic word set can be normalized as well.
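The mutual dependence between sentence scores and topic word scores, together with the normalization just mentioned, can be sketched as an iterative update. Formulas (3) and (4) are not reproduced in this text, so the update rule below is an assumption: each score is a weighted sum of the scores it depends on, followed by normalization to sum 1. It is a plain fixed-point iteration, not the genetic algorithm mentioned above, and all names are hypothetical.

```python
def normalize(values: list[float]) -> list[float]:
    """Scale values so they sum to 1 (left unchanged if they sum to 0)."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(values)


def iterate_relevance(w_ss: list[list[float]], w_st: list[list[float]], iterations: int = 20):
    """w_ss[b][g]: weight between sentences b and g; w_st[b][k]: weight between sentence b and topic word k."""
    n_sent = len(w_ss)
    if n_sent == 0:
        return [], []
    n_topic = len(w_st[0])
    rel_s = [1.0 / n_sent] * n_sent
    rel_t = [1.0 / n_topic] * n_topic if n_topic else []
    for _ in range(iterations):
        new_s = []
        for b in range(n_sent):
            # Contribution of the topic words that co-occur with sentence b.
            from_topics = sum(w_st[b][k] * rel_t[k] for k in range(n_topic))
            # Contribution of the other sentences (the sentence subset).
            from_sentences = sum(w_ss[b][g] * rel_s[g] for g in range(n_sent) if g != b)
            new_s.append(from_topics + from_sentences)
        # Topic word score from the sentences that contain it (W_TS equals W_ST).
        new_t = [sum(w_st[x][k] * rel_s[x] for x in range(n_sent)) for k in range(n_topic)]
        rel_s, rel_t = normalize(new_s), normalize(new_t)
    return rel_s, rel_t


# Sentences 0 and 1 share co-occurrence words and both contain the single topic word;
# sentence 2 shares nothing, so its weights (and its score) are 0.
w_ss = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
w_st = [[1.0], [1.0], [0.0]]
print(iterate_relevance(w_ss, w_st))
```

Zero weights encode the cases described above in which no co-occurrence word exists.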
S404: According to the article degrees of correlation of the sentences, choose the sentences that meet the second screening condition from the first sentence set to obtain the first abstract collection.
It can be understood that the article degree of correlation of a sentence represents the likelihood that the sentence reflects the central idea of the expectation text in the first time slice. By using the article degrees of correlation to screen the first sentence set for the sentences that meet the second screening condition, the abstract sentences of the expectation text in the first time slice (i.e., the first abstract collection) can be obtained.
In practical applications, the sentences in the first sentence set can be sorted in descending order of article degree of correlation, and the first J sentences are chosen to constitute the first abstract collection. J can be set according to the actual situation and is not enumerated here.
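The top-J selection described for step S404 can be sketched as follows; the function name and the sample scores are illustrative assumptions.

```python
def select_top_sentences(sentences: list[str], relevance: list[float], j: int) -> list[str]:
    """Return the j sentences with the highest article degree of correlation, best first."""
    ranked = sorted(zip(sentences, relevance), key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in ranked[:j]]


sentences = ["sentence a", "sentence b", "sentence c"]
relevance = [0.2, 0.7, 0.1]
print(select_top_sentences(sentences, relevance, j=2))
```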
It should also be noted that the foregoing describes how an abstract collection is obtained. Because the article degree of correlation of each abstract sentence is obtained while the abstract collection is being obtained, in some possible implementations of the embodiments of the present application the abstract collections can also be screened according to the article degrees of correlation of their abstract sentences, in order to improve the accuracy of the generated abstract and to reject abstract sentences that cannot accurately express the central idea of the expectation text. In that case, step S103 may specifically include:
Combine the article degrees of correlation of the abstract sentences in the first abstract collection to obtain the article degree of correlation corresponding to the first abstract collection; based on each obtained abstract diversity factor and each article degree of correlation, choose from the multiple abstract collections the abstract collections that meet the first screening condition.
As an example, the article degree of correlation corresponding to an abstract collection can be obtained according to the following formula (6):
In the formula, R(DS_i) is the article degree of correlation of the abstract collection DS_i corresponding to the i-th time slice, rel(s_{i,y}) is the article degree of correlation of the y-th abstract sentence in abstract collection DS_i, and V is the number of abstract sentences in abstract collection DS_i.
In a specific implementation, the abstract diversity factor and the article degree of correlation can be combined to judge whether each abstract collection meets the first screening condition, for example by judging whether the sum of the abstract diversity factor and the article degree of correlation corresponding to the abstract collection is greater than a certain threshold. In addition, corresponding weights can be set for the abstract diversity factor and the article degree of correlation of an abstract collection according to the actual situation.
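The weighted-sum form of the first screening condition can be sketched as below. The threshold, the weights, the collection labels and the function name are all illustrative assumptions; the application leaves these choices open.

```python
def select_collections(
    scores: dict[str, tuple[float, float]],
    threshold: float,
    w_diversity: float = 1.0,
    w_relevance: float = 1.0,
) -> list[str]:
    """Keep the abstract collections whose weighted score sum exceeds the threshold.

    scores maps a collection label to (abstract diversity factor, article degree of correlation).
    """
    return [
        label
        for label, (diversity, relevance) in scores.items()
        if w_diversity * diversity + w_relevance * relevance > threshold
    ]


scores = {"day1": (0.9, 0.6), "day2": (0.2, 0.5), "day3": (0.7, 0.1)}
print(select_collections(scores, threshold=1.0))
```

Setting one weight to 0 reduces the check to a threshold on the other score alone.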
The foregoing describes in detail how an abstract collection, its abstract diversity factor and its article degree of correlation are obtained. With these, abstract collections that repeat or are redundant with other abstract collections, and abstract collections that cannot accurately express the central idea of the expectation text, can be removed from the multiple abstract collections. In practical applications, however, the abstract sentences within a single abstract collection may themselves contain repeated or redundant content. For this reason, in some possible implementations of the embodiments of the present application, it can also be judged whether the abstract sentences within a single abstract collection are repeated or redundant, which further reduces the repetitiveness and redundancy of the generated abstract.
Specifically, referring to Fig. 5, the figure is a flow diagram of another abstract generating method provided by an embodiment of the present application.
On the basis of the abstract generating methods provided above, and taking the method shown in Fig. 1 as an example, in some possible implementations the method may further include, after the multiple abstract collections are obtained (i.e., after step S101):
S501: Based on the reproduction state of the characters or character strings of the target abstract sentence in each abstract sentence of the first abstract subset, obtain the novelty degree of the target abstract sentence.
It can be understood that the target abstract sentence is any one abstract sentence in the first abstract collection, and the first abstract subset includes the remaining abstract sentences of the first abstract collection other than the target abstract sentence.
In the embodiments of the present application, the reproduction state indicates whether the character strings of the target abstract sentence appear in the other sentences of the first abstract subset. In practical applications, the intersection of the characters or character strings of the target abstract sentence and the characters or character strings included in the abstract sentences of the first abstract subset can be determined first. The novelty degree of the target abstract sentence is then obtained according to the number of characters or character strings of the target abstract sentence that do not belong to this intersection.
The novelty degree reflects how much new content the target abstract sentence introduces relative to the other abstract sentences of the first abstract collection. The more characters or character strings of the target abstract sentence do not reappear in the first abstract subset, the greater the novelty degree of the target abstract sentence. When a character or character string of the target abstract sentence appears in another sentence of the first abstract subset (i.e., the target abstract sentence and another sentence of the first abstract collection share an identical character string), the two sentences include the same information and may contain repeated or redundant content; the target abstract sentence introduces little new content and its novelty degree is lower. Conversely, when the characters or character strings of the target abstract sentence do not appear in the other sentences of the first abstract subset (i.e., the target abstract sentence shares no identical character string with the other sentences of the first abstract collection), the two sentences do not include the same information; the target abstract sentence introduces more new content and its novelty degree is higher. Therefore, the novelty degree of the target abstract sentence can be obtained according to the reproduction state of its characters or character strings in each abstract sentence of the first abstract subset.
It should be noted that the characters or character strings described here may be extracted arbitrarily from the target abstract sentence or extracted according to a certain rule (such as N-gram), which is not described again.
As an example, for each abstract sentence in the first abstract subset, the ratio of the number of character strings of the target abstract sentence that do not reappear in that abstract sentence to the total number of character strings corresponding to the target abstract sentence can be computed, and the novelty degree of the target abstract sentence is obtained from the sum of these ratios.
The novelty degree of the target abstract sentence can then be obtained according to the following formula (7):
In the formula, s_{i,a} is the a-th abstract sentence in the abstract collection corresponding to the i-th time slice; s_{i,b} is the b-th abstract sentence in the abstract subset corresponding to the i-th time slice; nov(s_{i,a}, s_{i,b}) is the novelty degree of abstract sentence s_{i,a} with respect to abstract sentence s_{i,b}; H is the number of abstract sentences in the abstract subset corresponding to the i-th time slice; n-gram is a character or character string; and ||·|| denotes taking the count.
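The novelty computation of step S501 can be sketched as below. Formula (7) is not reproduced in this text, so the sketch follows the verbal description: a pairwise ratio per sentence of the abstract subset, then the sum of those ratios. Whether the original additionally divides by H is not recoverable here, so the plain sum is an assumption, as are the use of distinct character trigrams and the function names.

```python
def char_ngrams(sentence: str, n: int = 3) -> set[str]:
    """Distinct character n-grams of a sentence (assumption: duplicates count once)."""
    return {sentence[i:i + n] for i in range(len(sentence) - n + 1)}


def pair_novelty(a: str, b: str, n: int = 3) -> float:
    """Share of sentence a's n-grams that do not reappear in sentence b."""
    grams_a = char_ngrams(a, n)
    if not grams_a:
        return 0.0  # assumption: a sentence too short to yield n-grams gets 0
    return len(grams_a - char_ngrams(b, n)) / len(grams_a)


def sentence_novelty(target_index: int, collection: list[str], n: int = 3) -> float:
    """Sum of pairwise novelty of the target sentence against every other sentence."""
    target = collection[target_index]
    return sum(
        pair_novelty(target, other, n)
        for index, other in enumerate(collection)
        if index != target_index  # the abstract subset excludes the target itself
    )


collection = ["abcdef", "abcxyz", "uvwxyz"]
# Against "abcxyz" only "abc" reappears (3/4 new); against "uvwxyz" nothing reappears (4/4 new).
print(sentence_novelty(0, collection))
```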
In some possible designs, the abstract sentences in an abstract collection can be screened according to their novelty degrees: abstract sentences whose novelty degree is lower than a set threshold are rejected, which removes repeated or redundant content from the abstract collection. In that case, after step S501 the method may further include: based on the novelty degree of each abstract sentence in the first abstract collection, rejecting from the first abstract collection the abstract sentences that do not meet the third screening condition. In practical applications, the third screening condition can be set as appropriate, which is not described again here.
S502: Combine the novelty degrees of the abstract sentences in the first abstract collection to obtain the novelty degree of the first abstract collection.
It can be understood that the novelty degree of the first abstract collection is related to the novelty degrees of the abstract sentences that the first abstract collection includes. Therefore, combining the novelty degrees of the abstract sentences in the first abstract collection yields the novelty degree of the first abstract collection.
As an example, the novelty degree of the first abstract collection can be obtained according to the following formula (8):
In the formula, DS_i is the abstract collection corresponding to the i-th time slice, s_{i,c} is the c-th abstract sentence in the abstract collection corresponding to the i-th time slice, and N is the total number of abstract sentences included in DS_i.
Then, to reduce the repetitiveness and redundancy of the generated abstract, step S103 may specifically include:
Based on each obtained abstract diversity factor and the novelty degree of each abstract collection, choose from the multiple abstract collections the abstract collections that meet the first screening condition.
It can be understood that abstract collections that repeat or are redundant with other abstract collections can be removed from the multiple abstract collections based on the abstract diversity factor, and abstract collections whose own content is repeated or redundant can be removed based on the novelty degree, which further reduces the repetitiveness and redundancy of the generated abstract.
In practical applications, the first screening condition can be set according to specific needs, for example requiring the sum of the abstract diversity factor and the novelty degree to be greater than a certain threshold, and is not limited here. In addition, corresponding weights can be set for the abstract diversity factor and the novelty degree before the judgment is made.
In some possible implementations, to reduce the repetitiveness and redundancy of the generated abstract, step S103 may also specifically include:
Based on each obtained abstract diversity factor and article degree of correlation and the novelty degree of each abstract collection, choose from the multiple abstract collections the abstract collections that meet the first screening condition.
It can be understood that abstract collections that repeat or are redundant with other abstract collections can be removed based on the abstract diversity factor, abstract collections that cannot accurately express the central idea of the expectation text can be removed based on the article degree of correlation, and abstract collections whose own content is repeated or redundant can be removed based on the novelty degree. The abstract collections chosen in this way can describe the central idea of the expectation text accurately and concisely.
Based on the summary generation method provided by the above embodiments, an embodiment of the present application further provides a summary generation device.
Referring to Fig. 6, which is a schematic structural diagram of a summary generation device provided by an embodiment of the present application.
The summary generation device provided by the embodiment of the present application includes: a first obtaining unit 100, a second obtaining unit 200, a screening unit 300 and a combination unit 400;
The first obtaining unit 100 is configured to obtain multiple summary sets; each summary set includes the summary sentences of the expected text in a corresponding time slice, and any two summary sets correspond to different time slices;
The second obtaining unit 200 is configured to obtain, according to the summary sentences included in the summary sets, the difference degree of a first summary set relative to a second summary set, thereby obtaining the summary difference degree corresponding to the first summary set; the first summary set and the second summary set are any two of the multiple summary sets;
The screening unit 300 is configured to select, from the multiple summary sets and based on each obtained summary difference degree, the summary sets that meet the first screening condition;
The combination unit 400 is configured to combine the summary sentences included in the selected summary sets to generate a summary.
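The cooperation of the four units can be sketched as a small pipeline. This is a sketch under stated assumptions rather than the device itself: the names `generate_summary` and `sentence_overlap_diversity` are hypothetical, the difference degree is passed in as a function, and the decision to compare each summary set only against the sets already kept (in time order) is an assumption made here so that one copy of a repeated set survives.

```python
from typing import Callable, Dict, List

# Output of the first obtaining unit: time-slice label -> summary sentences.
SummarySets = Dict[str, List[str]]


def generate_summary(summary_sets: SummarySets,
                     diversity_fn: Callable[[List[str], List[str]], float],
                     threshold: float = 0.5) -> List[str]:
    """Screen summary sets by difference degree, then combine the survivors."""
    kept: List[str] = []
    for name, sentences in summary_sets.items():
        # Second obtaining unit: difference degree of this set relative to
        # each set already kept; the minimum is the most pessimistic value.
        degree = min((diversity_fn(sentences, summary_sets[k]) for k in kept),
                     default=1.0)
        # Screening unit: assumed first screening condition.
        if degree > threshold:
            kept.append(name)
    # Combination unit: concatenate the sentences of the selected sets.
    return [s for name in kept for s in summary_sets[name]]


def sentence_overlap_diversity(first: List[str], second: List[str]) -> float:
    """Toy difference degree: share of sentences in `first` absent from `second`."""
    if not first:
        return 0.0
    return sum(s not in second for s in first) / len(first)


sets = {
    "day1": ["A storm hit the coast.", "Flights were cancelled."],
    "day2": ["A storm hit the coast.", "Flights were cancelled."],
    "day3": ["Power was restored.", "Flights resumed."],
}
summary = generate_summary(sets, sentence_overlap_diversity)
```

In this example the `day2` set repeats `day1` and is rejected, while `day3` reports new developments and is kept.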
In some possible implementations of the embodiment of the present application, the second obtaining unit 200 may specifically include: a difference degree obtaining subunit and a synthesis subunit;
The difference degree obtaining subunit is configured to obtain the difference degree of each summary sentence in the first summary set relative to the second summary set, based on whether the characters or character strings of each summary sentence in the first summary set reappear in the second summary set;
The synthesis subunit is configured to aggregate the difference degrees of the summary sentences in the first summary set relative to the second summary set, to obtain the summary difference degree corresponding to the first summary set.
In some possible implementations of the embodiment of the present application, the difference degree obtaining subunit may specifically include: an extraction subunit, a counting subunit and an obtaining subunit;
The extraction subunit is configured to extract multiple character strings from a target summary sentence according to a preset rule; the target summary sentence is any summary sentence in the first summary set;
The counting subunit is configured to count, among the multiple character strings, the number of character strings that do not reappear in the second summary set, to obtain a statistical value;
The obtaining subunit is configured to obtain the difference degree of the target summary sentence relative to the second summary set according to the statistical value and the number of the multiple character strings.
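The extraction, counting and obtaining steps above can be illustrated with a short sketch. Several choices here are assumptions and are not stated in this section: character bigrams stand in for the preset extraction rule, the difference degree is taken as the ratio of the statistical value to the number of extracted strings, and the per-sentence values are aggregated by a plain average. The function names are hypothetical.

```python
def char_ngrams(sentence, n=2):
    """Extract the set of character n-grams of a sentence (whitespace removed)."""
    text = "".join(sentence.split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def sentence_difference_degree(target_sentence, second_set, n=2):
    """Share of the target sentence's n-grams that do not reappear in second_set."""
    strings = char_ngrams(target_sentence, n)
    if not strings:
        return 0.0
    seen = set()
    for sentence in second_set:
        seen |= char_ngrams(sentence, n)
    statistical_value = len(strings - seen)  # strings that do not reappear
    return statistical_value / len(strings)


def summary_difference_degree(first_set, second_set, n=2):
    """Aggregate per-sentence difference degrees (assumed here: the mean)."""
    if not first_set:
        return 0.0
    total = sum(sentence_difference_degree(s, second_set, n) for s in first_set)
    return total / len(first_set)
```

Note that the measure is directional: the difference degree of the first set relative to the second is generally not equal to that of the second relative to the first.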
In some possible implementations of the embodiment of the present application, the first obtaining unit 100 may specifically include: a sentence segmentation obtaining subunit, a topic obtaining subunit, a relevance obtaining subunit and a summary sentence selection subunit;
The sentence segmentation obtaining subunit is configured to obtain the sentence segmentation result of the expected text in a first time slice, to obtain a first sentence set; the first time slice is the time slice corresponding to the first summary set;
The topic obtaining subunit is configured to extract topic words from the first sentence set by using a topic model, to obtain a first topic word set;
The relevance obtaining subunit is configured to obtain the article relevance corresponding to each sentence in the first sentence set, according to the word frequency of sentence co-occurrence words in the expected text in the first time slice and the word frequency of topic co-occurrence words in the expected text in the first time slice; a sentence co-occurrence word is a character or character string that appears in both of any two sentences in the first sentence set, a topic co-occurrence word is a character or character string that appears both in a sentence of the first sentence set and in a topic word, and the article relevance represents the likelihood that a sentence reflects the central idea of the expected text in the first time slice;
The summary sentence selection subunit is configured to select, from the first sentence set and according to the article relevance, the sentences that meet a second screening condition, to obtain the first summary set.
Optionally, the relevance obtaining subunit may be specifically configured to:
Obtain the article relevance iteratively by using an optimization algorithm, where for the i-th iteration:
The article relevance of a target sentence is obtained according to the word frequency, in the expected text in the first time slice, of the sentence co-occurrence words of the target sentence and each sentence in a sentence subset; the article relevance of each sentence in the sentence subset; the word frequency, in the expected text in the first time slice, of the topic co-occurrence words of the target sentence and each topic word in the first topic word set; and the article relevance of each topic word in the first topic word set;
Here, the target sentence is any sentence in the first sentence set, the sentence subset includes the remaining sentences in the first sentence set other than the target sentence, and the article relevance of a topic word is obtained according to the word frequency of the topic word in the expected text in the first time slice and the article relevance of the sentences in the first sentence set that contain the topic word.
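The iteration described above can be illustrated with a mutual-reinforcement sketch. This section does not give the update formula, so everything concrete below is an assumption: whitespace-separated tokens stand in for characters or character strings, a topic co-occurrence word is simplified to the topic word itself, the contributions are combined by an unweighted sum, scores are renormalized each round, and a fixed number of iterations replaces a convergence test. The function name `article_relevance` is hypothetical.

```python
from collections import Counter


def article_relevance(sentences, topic_words, iterations=20):
    """Iteratively score sentences; returns one relevance value per sentence."""
    tokens = [s.lower().split() for s in sentences]
    tf = Counter(w for toks in tokens for w in toks)  # word frequency in the slice
    vocab = [set(toks) for toks in tokens]
    n = len(sentences)
    if n == 0:
        return []
    rel = [1.0 / n] * n  # uniform start
    for _ in range(iterations):
        # Topic-word relevance: word frequency times the relevance of the
        # sentences that contain the topic word, normalized to sum to 1.
        topic_rel = {k: tf[k] * sum(rel[i] for i in range(n) if k in vocab[i])
                     for k in topic_words}
        topic_total = sum(topic_rel.values()) or 1.0
        topic_rel = {k: v / topic_total for k, v in topic_rel.items()}
        new_rel = []
        for i in range(n):
            # Contribution of the other sentences, weighted by the word
            # frequency of the sentence co-occurrence words.
            from_sentences = sum(
                sum(tf[w] for w in vocab[i] & vocab[j]) * rel[j]
                for j in range(n) if j != i)
            # Contribution of the topic words that occur in this sentence.
            from_topics = sum(tf[k] * topic_rel[k]
                              for k in topic_rel if k in vocab[i])
            new_rel.append(from_sentences + from_topics)
        total = sum(new_rel)
        if total == 0:
            break  # no co-occurrence at all; keep the previous scores
        rel = [v / total for v in new_rel]
    return rel


rel = article_relevance(
    ["storm hits coast", "storm closes airport", "cat sleeps"], ["storm"])
```

In this toy input the first two sentences share the topic word and reinforce each other, while the third shares nothing with them and its relevance falls to zero.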
In some possible implementations of the embodiment of the present application, the screening unit 300 may be specifically configured to:
Aggregate the article relevance of each summary sentence in the first summary set, to obtain the article relevance corresponding to the first summary set; and select, from the multiple summary sets and based on each obtained summary difference degree and each article relevance, the summary sets that meet the first screening condition.
In some possible implementations of the embodiment of the present application, the device may further include: a third obtaining unit and a synthesis unit;
The third obtaining unit is configured to obtain the novelty degree of a target summary sentence, based on whether the characters or character strings of the target summary sentence reappear in each summary sentence of a first summary subset; the target summary sentence is any summary sentence in the first summary set, and the first summary subset includes the remaining summary sentences in the first summary set other than the target summary sentence;
The synthesis unit is configured to aggregate the novelty degrees of the summary sentences in the first summary set, to obtain the novelty degree of the first summary set;
In this case, the screening unit 300 may be specifically configured to:
Select, from the multiple summary sets and based on each obtained summary difference degree and the novelty degree of each summary set, the summary sets that meet the first screening condition.
In some possible implementations of the embodiment of the present application, the screening unit 300 may be further configured to remove, from the first summary set and based on the novelty degree of each summary sentence in the first summary set, the summary sentences that do not meet a third screening condition.
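The novelty degree and the removal of low-novelty sentences can be sketched in the same spirit. The following is an illustration under assumptions, not the disclosed implementation: character bigrams are used as the character strings, the set-level novelty degree is the mean of the sentence-level values, the third screening condition is taken to be a minimum novelty `min_novelty`, and sentences are removed one at a time with the scores recomputed after each removal so that one copy of a repeated sentence is retained. All names are hypothetical.

```python
def char_ngrams(sentence, n=2):
    """Extract the set of character n-grams of a sentence (whitespace removed)."""
    text = "".join(sentence.split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def sentence_novelty(target_index, summary_set, n=2):
    """Share of the target sentence's n-grams absent from the other sentences."""
    target = char_ngrams(summary_set[target_index], n)
    if not target:
        return 0.0
    others = set()
    for i, sentence in enumerate(summary_set):
        if i != target_index:
            others |= char_ngrams(sentence, n)
    return len(target - others) / len(target)


def set_novelty(summary_set, n=2):
    """Novelty degree of a summary set (assumed here: mean over its sentences)."""
    if not summary_set:
        return 0.0
    return sum(sentence_novelty(i, summary_set, n)
               for i in range(len(summary_set))) / len(summary_set)


def prune_by_novelty(summary_set, min_novelty=0.3, n=2):
    """Drop the least novel sentence repeatedly until all meet min_novelty."""
    kept = list(summary_set)
    while len(kept) > 1:
        scores = [sentence_novelty(i, kept, n) for i in range(len(kept))]
        worst = min(range(len(kept)), key=scores.__getitem__)
        if scores[worst] >= min_novelty:
            break
        del kept[worst]
    return kept
```

Recomputing after each removal is a design choice of this sketch: if both copies of a duplicated sentence were scored once and filtered together, both would be dropped and the content would be lost.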
In the embodiment of the present application, the summary sets of the expected text in multiple different time slices are first obtained, yielding multiple summary sets. Then, according to the summary sentences included in the summary sets, the difference degree between any two of the multiple summary sets is obtained, yielding the summary difference degree corresponding to each summary set, which reflects how much each summary set repeats the other summary sets. Next, based on the summary difference degree corresponding to each summary set, the summary sets that meet the first screening condition are selected from the multiple summary sets. This takes into account the gradual evolution and variability of the summary topic across different time slices: the repetition between the summary sets corresponding to different time slices is assessed, and summary sets that largely repeat, or are redundant with respect to, other summary sets are removed. The summary sentences in the selected summary sets are then combined to generate the summary. In this way, repetitive and redundant summary content can be reduced while summary coverage is maintained, which helps the reader accurately grasp the central idea of the expected text and the development of events.
Based on the summary generation method and device provided by the above embodiments, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, any one of the summary generation methods provided by the above embodiments is implemented.
Based on the summary generation method and device provided by the above embodiments, an embodiment of the present application further provides a summary generation apparatus, including: a processor and a memory; the memory is configured to store program code and to transmit the program code to the processor; the processor is configured to execute, according to the instructions in the program code, any one of the summary generation methods provided by the above embodiments.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the system or device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
It should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements that are inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The above are only preferred embodiments of the present application and do not limit the present application in any form. Although the present application has been disclosed above by way of preferred embodiments, these are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present application, use the methods and technical content disclosed above to make many possible changes and modifications to the technical solution of the present application, or modify it into equivalent embodiments. Therefore, any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the present application, without departing from the content of the technical solution of the present application, still falls within the protection scope of the technical solution of the present application.
Claims (10)
1. A summary generation method, characterized in that the method comprises:
Obtaining multiple summary sets; each summary set includes the summary sentences of the expected text in a corresponding time slice, and any two of the summary sets correspond to different time slices;
Obtaining, according to the summary sentences included in the summary sets, the difference degree of a first summary set relative to a second summary set, to obtain the summary difference degree corresponding to the first summary set; the first summary set and the second summary set are any two of the multiple summary sets;
Selecting, from the multiple summary sets and based on each obtained summary difference degree, the summary sets that meet a first screening condition;
Combining the summary sentences included in the selected summary sets to generate a summary.
2. The method according to claim 1, characterized in that obtaining, according to the summary sentences included in the summary sets, the difference degree of the first summary set relative to the second summary set, to obtain the summary difference degree corresponding to the first summary set, specifically comprises:
Obtaining the difference degree of each summary sentence in the first summary set relative to the second summary set, based on whether the characters or character strings of each summary sentence in the first summary set reappear in the second summary set;
Aggregating the difference degrees of the summary sentences in the first summary set relative to the second summary set, to obtain the summary difference degree corresponding to the first summary set.
3. The method according to claim 2, characterized in that obtaining the difference degree of each summary sentence in the first summary set relative to the second summary set, based on whether the characters or character strings of each summary sentence in the first summary set reappear in the second summary set, specifically comprises:
Extracting multiple character strings from a target summary sentence according to a preset rule; the target summary sentence is any summary sentence in the first summary set;
Counting, among the multiple character strings, the number of character strings that do not reappear in the second summary set, to obtain a statistical value;
Obtaining the difference degree of the target summary sentence relative to the second summary set according to the statistical value and the number of the multiple character strings.
4. The method according to any one of claims 1 to 3, characterized in that obtaining multiple summary sets specifically comprises:
Obtaining the sentence segmentation result of the expected text in a first time slice, to obtain a first sentence set; the first time slice is the time slice corresponding to the first summary set;
Extracting topic words from the first sentence set by using a topic model, to obtain a first topic word set;
Obtaining the article relevance corresponding to each sentence in the first sentence set, according to the word frequency of sentence co-occurrence words in the expected text in the first time slice and the word frequency of topic co-occurrence words in the expected text in the first time slice; a sentence co-occurrence word is a character or character string that appears in both of any two sentences in the first sentence set, a topic co-occurrence word is a character or character string that appears both in a sentence of the first sentence set and in a topic word, and the article relevance represents the likelihood of reflecting the central idea of the expected text in the first time slice;
Selecting, from the first sentence set and according to the article relevance, the sentences that meet a second screening condition, to obtain the first summary set.
5. The method according to claim 4, characterized in that obtaining the article relevance corresponding to each sentence in the first sentence set, according to the word frequency of the sentence co-occurrence words in the expected text in the first time slice and the word frequency of the topic co-occurrence words in the expected text in the first time slice, specifically comprises:
Obtaining the article relevance iteratively by using an optimization algorithm, where for the i-th iteration:
The article relevance of a target sentence is obtained according to the word frequency, in the expected text in the first time slice, of the sentence co-occurrence words of the target sentence and each sentence in a sentence subset; the article relevance of each sentence in the sentence subset; the word frequency, in the expected text in the first time slice, of the topic co-occurrence words of the target sentence and each topic word in the first topic word set; and the article relevance of each topic word in the first topic word set;
Wherein the target sentence is any sentence in the first sentence set, the sentence subset includes the remaining sentences in the first sentence set other than the target sentence, and the article relevance of a topic word is obtained according to the word frequency of the topic word in the expected text in the first time slice and the article relevance of the sentences in the first sentence set that contain the topic word.
6. The method according to claim 4, characterized in that selecting, from the multiple summary sets and based on each obtained summary difference degree, the summary sets that meet the first screening condition specifically comprises:
Aggregating the article relevance of each summary sentence in the first summary set, to obtain the article relevance corresponding to the first summary set;
Selecting, from the multiple summary sets and based on each obtained summary difference degree and each article relevance, the summary sets that meet the first screening condition.
7. The method according to any one of claims 1 to 3, characterized in that, after obtaining multiple summary sets, the method further comprises:
Obtaining the novelty degree of a target summary sentence, based on whether the characters or character strings of the target summary sentence reappear in each summary sentence of a first summary subset; the target summary sentence is any summary sentence in the first summary set, and the first summary subset includes the remaining summary sentences in the first summary set other than the target summary sentence;
Aggregating the novelty degrees of the summary sentences in the first summary set, to obtain the novelty degree of the first summary set;
In this case, selecting, from the multiple summary sets and based on each obtained summary difference degree, the summary sets that meet the first screening condition specifically comprises:
Selecting, from the multiple summary sets and based on each obtained summary difference degree and the novelty degree of each summary set, the summary sets that meet the first screening condition.
8. A summary generation device, characterized in that the device comprises: a first obtaining unit, a second obtaining unit, a screening unit and a combination unit;
The first obtaining unit is configured to obtain multiple summary sets; each summary set includes the summary sentences of the expected text in a corresponding time slice, and any two of the summary sets correspond to different time slices;
The second obtaining unit is configured to obtain, according to the summary sentences included in the summary sets, the difference degree of a first summary set relative to a second summary set, to obtain the summary difference degree corresponding to the first summary set; the first summary set and the second summary set are any two of the multiple summary sets;
The screening unit is configured to select, from the multiple summary sets and based on each obtained summary difference degree, the summary sets that meet a first screening condition;
The combination unit is configured to combine the summary sentences included in the selected summary sets to generate a summary.
9. A computer-readable storage medium, characterized in that a computer program is stored thereon, and when the computer program is executed by a processor, the summary generation method according to any one of claims 1 to 7 is implemented.
10. A summary generation apparatus, characterized by comprising: a processor and a memory;
The memory is configured to store program code and to transmit the program code to the processor;
The processor is configured to execute, according to the instructions in the program code, the summary generation method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811626213.XA CN109815328B (en) | 2018-12-28 | 2018-12-28 | Abstract generation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109815328A true CN109815328A (en) | 2019-05-28 |
CN109815328B CN109815328B (en) | 2021-05-25 |
Family
ID=66602705
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811626213.XA Active CN109815328B (en) | 2018-12-28 | 2018-12-28 | Abstract generation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109815328B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN109815328B (en) | 2021-05-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||