CN105653704B - Autoabstract generation method and device - Google Patents
- Publication number: CN105653704B (application CN201511026171.2A)
- Authority
- CN
- China
- Prior art keywords
- weight
- sentence
- word
- iteration
- final
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
Abstract
The present invention relates to an automatic abstract generation method and device, including: modeling a text to generate a sentence network, the sentence network including edge weights; calculating a first weight of each sentence in the text according to the edge weights and the initial weight of each sentence; calculating a word weight of each word according to the frequency of the word in each sentence and the first weights of the sentences containing it; calculating a second weight of each sentence according to the frequency of each word in the sentence and the word weights; calculating a final weight of each sentence from the first weight and the second weight; and, if the difference between the final weights after the N-th iteration and after the (N-1)-th iteration is less than a preset threshold and the difference between the word weights after the N-th iteration and after the (N-1)-th iteration is less than the preset threshold, generating the automatic abstract. The method solves the problem of poor-quality abstracts for single long texts, alleviates information overload, and improves abstract quality.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to an automatic abstract generation method and device.
Background technology
Automatic abstracting is the technology of using a computer to automatically, comprehensively, and accurately extract from an original text the sentences that reflect its central content, generating a concise and coherent short piece. An abstract should be general, objective, comprehensible, and readable. Automatic abstracting has long been one of the important and challenging research topics in natural language processing. Especially with the explosive growth of Web texts and user-generated content, automatic abstracting fully displays its potential advantage of alleviating information overload, and has found practical application in numerous areas, such as news recommendation, machine translation, literature search, intelligence analysis, and public opinion monitoring.
A text can be regarded as a set of sentences, and automatic abstracting attempts to select the most important sentences to form an extractive abstract; in essence, it is therefore a sentence-ranking problem. Each sentence is represented with a bag-of-words model as the set of its words and their frequencies. Common automatic abstracting techniques mainly start from sentences or from words: one class ranks the importance of sentences by linearly combining statistical features of words or sentences, such as position, length, and frequency; another class builds a sentence relation network and iteratively computes sentence weights with a graph-based ranking algorithm. Existing methods consider only the features of sentences or only the features of words, and fail to consider the interaction between the two.
Invention content
The present invention aims to solve the problem that, when generating an abstract, the prior art considers only the features of sentences or only the features of words and fails to consider the interaction between the two. The present invention obtains sentence weights through word-sentence collaborative, iteratively enhanced computation and automatically generates an abstract, effectively improving the quality of long-text abstracts.
In a first aspect, an embodiment of the present invention provides an automatic abstract generation method, the method including:

modeling a text to generate a sentence network, the sentence network including edge weights;

calculating a first weight of each sentence in the text according to the edge weights and the initial weight of each sentence in the text;

calculating a word weight of each word according to the frequency of the word in each sentence and the first weights of the sentences containing the word;

calculating a second weight of each sentence according to the frequency of each word in the sentence and the word weight of each word;

calculating a final weight of each sentence according to the first weight and the second weight;

comparing whether the difference between the final weights after the N-th iteration and after the (N-1)-th iteration is less than a preset threshold, and whether the difference between the word weights after the N-th iteration and after the (N-1)-th iteration is less than the preset threshold;

if the final-weight difference is not less than the preset threshold or the word-weight difference is not less than the preset threshold, taking the final weights after the N-th iteration as the initial weights of the (N+1)-th iteration; and

if the final-weight difference is less than the preset threshold and the word-weight difference is less than the preset threshold, generating the automatic abstract.
Preferably, the calculating of the first weight of each sentence in the text according to the edge weights and the initial weights specifically includes:

calculating the first weight of each sentence in the text using the formula

W(Sj) = (1 − d) + d · Σ_{Si∈Link(Sj)} [ wij / Σ_{Sk∈Link(Si)} wik ] · W(Si)

where W(Sj) is the first weight of the j-th sentence, W(Si) is the first weight of the i-th sentence, d is a damping coefficient, Link(Sj) is the set of sentences connected to sentence Sj, and wij is the edge weight between sentence Si and sentence Sj.
Preferably, the calculating of the word weight of a word according to the frequency of the word in each sentence and the first weights of the sentences containing it specifically includes:

calculating the word weight of the word using the formula

WS(Wi) = Σj nji·W(Sj) / Σj nji

where WS(Wi) is the word weight of the i-th word, W(Sj) is the first weight of the j-th sentence, and nji is the frequency of the i-th word in the j-th sentence.
Preferably, the calculating of the second weight of each sentence according to the frequency of each word in the sentence and the word weights specifically includes:

calculating the second weight of each sentence using the formula

WW(Sj) = Σi nji·WS(Wi) / Σi nji

where WW(Sj) is the second weight of the j-th sentence, WS(Wi) is the word weight of the i-th word, and nji is the frequency of the i-th word in the j-th sentence.
Preferably, the calculating of the final weight of each sentence according to the first weight and the second weight specifically includes:

calculating the final weight of each sentence using the formula W′(Sj) = α·W(Sj) + (1 − α)·WW(Sj)

where W′(Sj) is the final weight of the j-th sentence, W(Sj) is the first weight of the j-th sentence, WW(Sj) is the second weight of the j-th sentence, and α is a regulating factor, α ∈ [0, 1].
In a second aspect, an embodiment of the present invention provides an automatic abstract generating device, the device including:

a generation unit, configured to model a text and generate a sentence network, the sentence network including edge weights;

a computing unit, configured to calculate a first weight of each sentence in the text according to the edge weights and the initial weight of each sentence;

the computing unit being further configured to calculate a word weight of each word according to the frequency of the word in each sentence and the first weights of the sentences containing the word;

the computing unit being further configured to calculate a second weight of each sentence according to the frequency of each word in the sentence and the word weight of each word;

the computing unit being further configured to calculate a final weight of each sentence according to the first weight and the second weight; and

a comparing unit, configured to compare whether the difference between the final weights after the N-th iteration and after the (N-1)-th iteration is less than a preset threshold and whether the difference between the word weights after the N-th iteration and after the (N-1)-th iteration is less than the preset threshold;

wherein, if the final-weight difference is not less than the preset threshold or the word-weight difference is not less than the preset threshold, the final weights after the N-th iteration are taken as the initial weights of the (N+1)-th iteration; and if the final-weight difference is less than the preset threshold and the word-weight difference is less than the preset threshold, an automatic abstract is generated.
Preferably, the computing unit is specifically configured to calculate the first weight of each sentence in the text using the formula

W(Sj) = (1 − d) + d · Σ_{Si∈Link(Sj)} [ wij / Σ_{Sk∈Link(Si)} wik ] · W(Si)

where W(Sj) is the first weight of the j-th sentence, W(Si) is the first weight of the i-th sentence, d is a damping coefficient, Link(Sj) is the set of sentences connected to sentence Sj, and wij is the edge weight between sentence Si and sentence Sj.
Preferably, the computing unit is specifically configured to calculate the word weight of a word using the formula WS(Wi) = Σj nji·W(Sj) / Σj nji, where WS(Wi) is the word weight of the i-th word, W(Sj) is the first weight of the j-th sentence, and nji is the frequency of the i-th word in the j-th sentence.
Preferably, the computing unit is specifically configured to calculate the second weight of each sentence using the formula WW(Sj) = Σi nji·WS(Wi) / Σi nji, where WW(Sj) is the second weight of the j-th sentence, WS(Wi) is the word weight of the i-th word, and nji is the frequency of the i-th word in the j-th sentence.
Preferably, the computing unit is specifically configured to calculate the final weight of each sentence using the formula W′(Sj) = α·W(Sj) + (1 − α)·WW(Sj), where W′(Sj) is the final weight of the j-th sentence, W(Sj) is the first weight of the j-th sentence, WW(Sj) is the second weight of the j-th sentence, and α is a regulating factor, α ∈ [0, 1].
By modeling a text, the present invention generates a sentence network that includes edge weights; calculates the first weight of each sentence in the text according to the edge weights and the initial weight of each sentence; calculates the word weight of each word according to the frequency of the word in each sentence and the first weights of the sentences containing it; calculates the second weight of each sentence according to the frequency of each word in the sentence and the word weights; calculates the final weight of each sentence from the first and second weights; compares whether the final-weight difference between the N-th and (N-1)-th iterations is less than a preset threshold and whether the word-weight difference between the N-th and (N-1)-th iterations is less than the preset threshold; if either difference is not less than the preset threshold, takes the final weights after the N-th iteration as the initial weights of the (N+1)-th iteration; and if both differences are less than the preset threshold, generates an automatic abstract. By considering the mutual influence between words and sentences in the text and incorporating the influence of words on sentence ranking scores into the sentence relation network, the invention solves the problem of poor-quality abstracts for single long texts, alleviates information overload, and improves abstract quality.
Description of the drawings
Fig. 1 is a flowchart of the automatic abstract generation method provided by Embodiment 1 of the present invention;
Fig. 2 is another flowchart of the automatic abstract generation method provided by Embodiment 1 of the present invention;
Fig. 3 is a structural schematic diagram of the automatic abstract generating device provided by Embodiment 2 of the present invention.
Specific implementation mode
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.

The technical scheme of the present invention is described in further detail below by means of the drawings and embodiments.
Fig. 1 is a flowchart of the automatic abstract generation method provided by Embodiment 1 of the present invention. As shown in Fig. 1, the present embodiment includes the following steps:
S110: model the text to generate a sentence network, the sentence network including edge weights.
Specifically, with sentences as vertices and similarity as edges, the text is modeled as a sentence network G = (S, E, w), where S is the set of sentence nodes, E is the set of edges, and w is the edge weight, i.e. the similarity between sentences. The edge weight w may be calculated by methods such as the Jaccard similarity coefficient, cosine similarity, or the BM25 algorithm; the present invention does not limit this.
Optionally, before S110 the method further includes: preprocessing the text, the preprocessing specifically being word segmentation, stop-word removal, and sentence splitting according to punctuation marks such as commas, full stops, and colons.
S120: calculate the first weight of each sentence in the text according to the edge weights and the initial weight of each sentence in the text.

Optionally, this specifically includes calculating the first weight of each sentence using the formula

W(Sj) = (1 − d) + d · Σ_{Si∈Link(Sj)} [ wij / Σ_{Sk∈Link(Si)} wik ] · W(Si)

where W(Sj) is the first weight of the j-th sentence, W(Si) is the first weight of the i-th sentence, d is a damping coefficient, Link(Sj) is the set of sentences connected to sentence Sj, and wij is the edge weight between sentence Si and sentence Sj.

Specifically, the first weights of the sentences are computed iteratively. In the first iteration, each sentence in the sentence network is randomly assigned an initial weight, and the first weight W(Sj) of the j-th sentence is then calculated according to the above formula.
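Under the PageRank-style formula reconstructed above (damping d = 0.85 as in the embodiment), one round of first-weight updates can be sketched as follows; the function name `first_weights` and the toy edge matrix are illustrative assumptions, not part of the original disclosure:

```python
# Illustrative sketch of one S120 iteration: each sentence receives
# (1 - d) plus d times the weight flowing in from its neighbours,
# each neighbour's contribution normalized by its total edge weight.

def first_weights(w, weights, d=0.85):
    """One update of sentence first weights over edge-weight matrix w."""
    n = len(w)
    out_sums = [sum(w[i]) for i in range(n)]  # Σ_k w_ik for each sentence Si
    new = []
    for j in range(n):
        s = 0.0
        for i in range(n):
            if i != j and w[i][j] > 0 and out_sums[i] > 0:
                s += w[i][j] / out_sums[i] * weights[i]
        new.append((1 - d) + d * s)
    return new

# Toy 3-sentence network, all initial weights 1.0:
rank = first_weights([[0, 0.5, 0], [0.5, 0, 0.25], [0, 0.25, 0]],
                     [1.0, 1.0, 1.0])
print(round(rank[1], 2))  # 1.85 (the middle sentence links to both others)
```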
S130: calculate the word weight of each word according to the frequency of the word in each sentence and the first weights of the sentences containing it.

Optionally, this specifically includes calculating the word weight of the word using the formula

WS(Wi) = Σj nji·W(Sj) / Σj nji

where WS(Wi) is the word weight of the i-th word, W(Sj) is the first weight of the j-th sentence, and nji is the frequency of the i-th word in the j-th sentence.

Specifically, the word weight of a word is calculated from the frequencies with which the word occurs in sentences and the first weights of those sentences: for each word, taking sentences as units, the number of times the word occurs in each sentence is multiplied by the first weight of that sentence; the products over all sentences are summed and divided by the total number of times the word occurs in the text, giving the weight of the word.
S140: calculate the second weight of each sentence according to the frequency of each word in the sentence and the word weight of each word.

Optionally, this specifically includes calculating the second weight of each sentence using the formula

WW(Sj) = Σi nji·WS(Wi) / Σi nji

where WW(Sj) is the second weight of the j-th sentence, WS(Wi) is the word weight of the i-th word, and nji is the frequency of the i-th word in the j-th sentence.
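The symmetric word-to-sentence step of S140 can be sketched in the same way; the toy frequencies and weights here are illustrative assumptions, not figures from the embodiment:

```python
# Illustrative sketch of S140: WW(Sj) = Σi nji·WS(Wi) / Σi nji.
# freqs maps word -> count of that word in sentence j.

def second_weight(freqs, word_weights):
    """Word-weight-weighted average over the words occurring in the sentence."""
    total = sum(freqs.values())
    return sum(n * word_weights[i] for i, n in freqs.items()) / total

# A toy sentence containing word "a" twice and word "b" once:
print(round(second_weight({"a": 2, "b": 1}, {"a": 1.2, "b": 0.9}), 2))  # 1.1
```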
S150: calculate the final weight of each sentence according to the first weight and the second weight.

Optionally, this specifically includes calculating the final weight of each sentence using the formula W′(Sj) = α·W(Sj) + (1 − α)·WW(Sj), where W′(Sj) is the final weight of the j-th sentence, W(Sj) is the first weight of the j-th sentence, WW(Sj) is the second weight of the j-th sentence, and α is a regulating factor, α ∈ [0, 1].
S160: compare whether the difference between the final weights after the N-th iteration and after the (N-1)-th iteration is less than a preset threshold, and whether the difference between the word weights after the N-th iteration and after the (N-1)-th iteration is less than the preset threshold.

S170: if the final-weight difference is not less than the preset threshold or the word-weight difference is not less than the preset threshold, take the final weights after the N-th iteration as the initial weights of the (N+1)-th iteration.

S180: if the final-weight difference is less than the preset threshold and the word-weight difference is less than the preset threshold, generate the automatic abstract.

Specifically, if the weights have not yet converged, the final weights after the N-th iteration are taken as the initial weights of the (N+1)-th iteration and S120 to S160 are repeated until the convergence condition is met. For example, if after a certain round of iteration the difference between the final weight of every sentence and its final weight after the previous iteration is less than a preset threshold ε, and the difference between every word weight and its value after the previous iteration is less than ε, it is judged that the convergence state has been reached and iteration terminates. After the final weights of all sentences are obtained, the sentences are ranked by final weight. Under a given abstract constraint, such as a word-count limit on the abstract, k sentences are selected; these top-k sentences are extracted in the order in which they appear in the original text to generate the abstract.
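The convergence test and abstract extraction of S160 to S180 can be sketched as follows, with ε playing the role of the preset threshold; `converged`, `summarize`, and the toy ranks are illustrative assumptions, not part of the original disclosure:

```python
# Illustrative sketch of S160/S180: stop when every weight has moved by
# less than the preset threshold, then emit the top-k sentences in the
# order in which they appear in the original text.

def converged(old, new, eps):
    """True when every weight changed by less than the preset threshold eps."""
    return all(abs(a - b) < eps for a, b in zip(old, new))

def summarize(rank, sentences, k):
    """Pick the k highest-ranked sentences, emitted in original order."""
    top = sorted(range(len(sentences)), key=lambda j: -rank[j])[:k]
    return [sentences[j] for j in sorted(top)]

rank = [0.4, 1.8, 0.7, 1.2]
print(summarize(rank, ["s0", "s1", "s2", "s3"], 2))  # ['s1', 's3']
```

In the full loop, both the sentence final weights and the word weights would be checked with `converged` after every round, matching the dual condition of S160.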
With the automatic abstract generation method provided by this embodiment, a text is modeled to generate a sentence network that includes edge weights; the first weight of each sentence in the text is calculated according to the edge weights and the initial weight of each sentence; the word weight of each word is calculated according to the frequency of the word in each sentence and the first weights of the sentences containing it; the second weight of each sentence is calculated according to the frequency of each word in the sentence and the word weights; the final weight of each sentence is calculated from the first and second weights; whether the final-weight difference between the N-th and (N-1)-th iterations is less than a preset threshold and whether the word-weight difference between the N-th and (N-1)-th iterations is less than the preset threshold are compared; if either difference is not less than the preset threshold, the final weights after the N-th iteration are taken as the initial weights of the (N+1)-th iteration; and if both differences are less than the preset threshold, an automatic abstract is generated. By considering the mutual influence between words and sentences in the text and incorporating the influence of words on sentence ranking scores into the sentence relation network, the method solves the problem of poor-quality abstracts for single long texts, alleviates information overload, and improves abstract quality.
In a specific embodiment, the automatic abstract generation method is described in detail, where the number before each sentence is the sentence number. As shown in Fig. 2, which is another flowchart of the automatic abstract generation method provided by Embodiment 1 of the present invention.
"(1) In 2002, (2) China's large enterprise groups as a whole showed a benign development pattern of continuously expanding scale, continuously rising competitiveness, and steadily improving economic benefits; (3) compared with the previous year, (4) their number decreased by 38, (5) operating income grew by 17.5%, (6) total assets grew by 11.3%, (7) and profit grew by 30.2%. (8) A batch of large and super-large enterprise groups developed rapidly, (9) becoming the backbone spurring economic growth. (10) The latest statistics on China's enterprise groups released by the State Statistics Bureau on the 11th show that (11) in 2002, (12) the various reforms of China's enterprise groups were further deepened; (13) the relevant departments in various regions, through clean-up, (14) deregistered a batch of non-standard groups (15) while newly establishing 52 enterprise groups of all kinds, including China Aviation Group, China Civil Aviation Information Group, China Aviation Oil Group, China Aviation Equipment Import and Export Group Company, and China Netcom, (16) reducing the total number of enterprise groups, (17) but making development more orderly (18) and further strengthening overall strength. (19) The survey data show that (20) in 2002 the enterprise groups of China's centrally managed enterprises, the national pilot enterprise groups, the enterprise groups of key state-owned enterprises, the enterprise groups approved by units at the provincial and ministerial level, (21) and all other consortiums with annual operating income and year-end total assets of 500 million yuan or more totaled 2,627, (22) with operating income of 7.712 trillion yuan, (23) year-end total assets of 14.2538 trillion yuan, (24) and total profit of 417.9 billion yuan. (25) The statistics show that (26) the development of enterprise groups in China's western region has clearly accelerated. (27) In 2002, (28) the number of enterprise groups in the western area decreased from 337 the previous year to 325, (29) but their operating income and year-end total assets grew by 17.5% and 13.5% respectively over the previous year, (30) with the growth of operating income level with the average growth rate (31) and the growth of year-end total assets 2.2 percentage points above the average growth rate. (32) It is worth noting that, (33) while the total number of enterprise groups decreased, (34) the number of super-large enterprise groups with operating income and year-end total assets of 5 billion yuan or more grew to 214, (35) an increase of 35 over the previous year. (36) Super-large enterprise groups account for only 8.1% of groups by number, (37) yet account for nearly 70% of operating income and year-end total assets (38) and more than 75% of realized profit, (39) and their growth is even more prominent: (40) operating income and total assets grew by 22.7% and 15.7% respectively over the previous year, (41) 14.4 and 14.7 percentage points faster than other large enterprise groups. (42) According to the 2002 list of the world's 500 largest enterprises selected by the U.S. magazine Fortune, (43) 12 large Chinese enterprises are on the list."
The above text is first preprocessed; the preprocessing may be word segmentation, stop-word removal, and sentence splitting according to punctuation marks such as commas, full stops, and colons. The effect after implementation is as follows:
" 2002, China's large enterprise group, which integrally shows scale and constantly expands competitiveness, constantly promoted economic benefit day
It turns for the better and turns benign development pattern, last year is compared, and number reduces 38, and operating income increases by 17.5%, and assets, which amount to, to be increased
11.3%, profit increases by 30.2%.It is swift and violent to criticize the development of group of large-scale super-sized enterprises, becomes backbone of spurring economic growth.
The publication China corporation group's recent statistics information of State Statistics Bureau 11 days shows 2002, China corporation group's items reform into
One step is deepened, and batch group lack of standardization is nullified in the cleaning of various regions department, and it includes in China Civil Aviation information group of China Aviation group to create
52 enterprise groups of various kinds, including China National Aviation Fuel Group, China Aviation Supplies Import and Export Group, and China Netcom, so that the total number of enterprise groups fell, development became more orderly, and overall strength was further enhanced. Survey data show that in 2002, China's enterprise groups (enterprise groups among centrally administered enterprises, national pilot enterprise groups, enterprise groups assessed among key state-owned enterprises, enterprise groups of provincial- and ministerial-level units, and all other consortiums whose annual operating income and year-end total assets both reached 500 million yuan or more) totaled 2,627, with operating income of 7,712 billion yuan, year-end total assets of 14,253.8 billion yuan, and total profit of 417.9 billion yuan. Statistics show that the development of enterprise groups in China's western region has clearly accelerated. In 2002, the number of enterprise groups in the western region fell from 337 in the previous year to 325; operating income and year-end total assets grew by 17.5% and 13.5% respectively over the previous year, with the operating income growth rate level with the average and the year-end total assets growth rate 2.2 percentage points above the average. It is worth noting that, although the total number of enterprise groups fell, the number of super-large enterprise groups whose operating income and year-end total assets both reached 5 billion yuan or more grew to 214, an increase of 35 over the previous year. Super-large enterprise groups accounted for only 8.1% of all groups, yet their operating income and year-end total assets each accounted for nearly 70% of the totals and their profit for more than 70%; their growth was particularly prominent, with operating income and total assets growing by 22.7% and 15.7% respectively over the previous year, 14.4 and 14.7 percentage points faster than large enterprise groups. The 2002 ranking of the world's 500 largest enterprises selected by the U.S. magazine Fortune shows that 12 large Chinese enterprises made the list."
The following processing is applied to the pretreated text:
S210: build the sentence network and calculate the first weight of each sentence.
To calculate the first weights, the text is modeled as a sentence network whose edge weights are the similarities between sentences; in this embodiment, the Jaccard coefficient is used to calculate the edge weights.
In the first iteration, each sentence in the sentence network is assigned a random initial weight; in every later round of iteration, the initial weight of each sentence is the weight obtained in the previous round. The first weight of each sentence is calculated according to the formula in S120, with the damping coefficient d set to 0.85.
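As an illustration of this step, the following Python sketch runs the update under the assumption that the S120 formula is the standard weighted TextRank update (consistent with the variable definitions given in claim 2). The names `jaccard` and `first_weights` and the uniform initial weights are illustrative only; the patent assigns random initial values.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard coefficient between two sentences given as word collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def first_weights(sentences, d=0.85, iters=1, init=None):
    """Run `iters` rounds of a weighted TextRank-style update over the
    sentence network; edge weights are Jaccard similarities (S210)."""
    n = len(sentences)
    w = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        w[i][j] = w[j][i] = jaccard(sentences[i], sentences[j])
    out_sum = [sum(row) for row in w]     # total edge weight incident to Si
    W = list(init) if init is not None else [1.0] * n  # patent: random init
    for _ in range(iters):
        W = [(1 - d) + d * sum(w[i][j] / out_sum[i] * W[i]
                               for i in range(n)
                               if w[i][j] > 0 and out_sum[i] > 0)
             for j in range(n)]
    return W
```

A sentence with no overlap with any other sentence keeps the baseline weight 1 - d = 0.15, while connected sentences exchange weight through their normalized edge weights.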
The sentences in text 1 are numbered, and the first weight of each numbered sentence after the first round of iteration is as follows:
S220: perform the sentence-to-word enhancement calculation.
All distinct words in the text are numbered, and the weight of each word is calculated according to the formula in S130. Take the word "reduce" as an example: it appears exactly once in each of the sentences numbered 4, 16, 28, and 33, so its word weight is (1*0.986+1*1.266+1*1.045+1*1.11)/4=1.102. After sentence-to-word enhancement, the word weight of "reduce" is therefore 1.102. Because the number of words is large, only the weights of some representative words in a given round are listed below:
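The S220 computation can be sketched as below. The name `word_weights` is an illustrative assumption, and the formula (a frequency-weighted mean of the first weights of the sentences containing the word) is inferred from the "reduce" example above and the variable definitions in claim 3.

```python
from collections import Counter

def word_weights(sentences, sent_weights):
    """Word weight WS(Wi): frequency-weighted mean of the first weights
    W(Sj) of the sentences the word occurs in (S220)."""
    num = Counter()   # sum over j of nji * W(Sj)
    den = Counter()   # sum over j of nji
    for sent, ws in zip(sentences, sent_weights):
        for word, freq in Counter(sent).items():
            num[word] += freq * ws
            den[word] += freq
    return {w: num[w] / den[w] for w in num}
```

Applied to the "reduce" example (one occurrence in each of four sentences weighted 0.986, 1.266, 1.045, and 1.11), this yields 1.102 after rounding, matching the walkthrough.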
S230: perform the word-to-sentence enhancement calculation.
The second weight of each sentence is calculated according to the formula in S140. Take the sentence numbered 12 as an example: it contains the 7 words "China", "enterprise", "group", "items", "reform", "further", and "in-depth", whose word weights are 1.18, 1.154, 1.137, 1.663, 1.663, 0.934, and 1.663 respectively. The second weight of the sentence is calculated as (1*1.18+1*1.154+1*1.137+1*1.663+1*1.663+1*0.934+1*1.663)/7=1.11, so after word-to-sentence enhancement the second weight of the sentence numbered 12 is 1.11. The second weights finally obtained for all sentences are as follows:
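A minimal sketch of the S230 computation, mirroring the word-weight step in the opposite direction. The name `second_weight` is an illustrative assumption; the formula (a frequency-weighted mean of the word weights of the words in the sentence) follows the variable definitions in claim 4.

```python
from collections import Counter

def second_weight(sentence, wweights):
    """Second weight WW(Sj): frequency-weighted mean of the word weights
    WS(Wi) of the words occurring in the sentence (S230)."""
    counts = Counter(sentence)                  # nji for each word in Sj
    total = sum(counts.values())
    return sum(f * wweights[w] for w, f in counts.items()) / total
```

For a sentence whose words each occur once, this reduces to the plain average of the word weights, as in the sentence-12 example above.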
S240: perform the sentence-to-sentence enhancement calculation.
The first weight and the second weight of each sentence in the current round of iteration are linearly combined according to the formula in S150, with the regulating factor α set to 0.5. Take the sentence numbered 12 as an example: its first weight is 1.142 and its second weight is 1.11, so its final weight is 0.5*1.142+0.5*1.11=1.126. The final weight of the sentence numbered 12 in this round of iteration is therefore 1.126.
The final weights of all sentences in the text after this round of iteration are as follows:
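The linear combination of S240 is a one-liner; `final_weight` is an illustrative name for the formula W'(Sj) = α·W(Sj) + (1-α)·WW(Sj) given in claim 5.

```python
def final_weight(w1, w2, alpha=0.5):
    """Final weight W'(Sj) = alpha*W(Sj) + (1-alpha)*WW(Sj) (S240)."""
    return alpha * w1 + (1 - alpha) * w2
```

With α = 0.5 this reproduces the sentence-12 example: 0.5*1.142 + 0.5*1.11 = 1.126.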
S250: repeat steps S210-S240 and generate the automatic abstract.
The convergence threshold ε is set to 0.0001. After 41 iterations, the difference between each sentence's final weight and its final weight after the 40th iteration, and the difference between each word's weight and its weight after the 40th iteration, are both less than 0.0001, so the algorithm terminates. The final weights of the sentences after 41 iterations are listed below:
The sentences are ranked by final weight. Assuming the abstract is limited to 150 characters, the sentences numbered 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, and 16 are selected in order of final weight to generate the abstract. Arranged in their order in the original text, the generated abstract is as follows:
The algorithm of the present invention extracts 11 sentences in total; the abstract formed from these sentences is fluent, covers the main content of the text, and is of high quality.
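The stopping test and sentence selection of S250 can be sketched as follows. The helper names and the greedy length-limited selection are assumptions consistent with the 150-character example above, not the patent's literal procedure.

```python
def converged(prev, curr, eps=1e-4):
    """S250 stopping test: every sentence (or word) weight changed
    by less than eps between consecutive iterations."""
    return all(abs(a - b) < eps for a, b in zip(prev, curr))

def select_sentences(weights, lengths, limit=150):
    """Rank sentences by final weight, greedily take sentences that still
    fit within the character limit, then restore original text order."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    chosen, used = [], 0
    for i in order:
        if used + lengths[i] <= limit:
            chosen.append(i)
            used += lengths[i]
    return sorted(chosen)   # abstract keeps the sentences' original order
```

The same `converged` test is applied separately to the sentence final weights and to the word weights; iteration stops only when both pass, as in the 41-iteration run described above.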
Correspondingly, the present invention provides an automatic abstract generating apparatus; Fig. 3 is a structural diagram of the automatic abstract generating apparatus provided by Embodiment 2 of the present invention. The apparatus includes: a generation unit 310, a computing unit 320, and a comparing unit 330.
The generation unit 310 is configured to model a text and generate a sentence network, the sentence network including edge weights;
the computing unit 320 is configured to calculate the first weight of each sentence in the text according to the edge weights and the initial weight of each sentence in the text;
the computing unit 320 is further configured to calculate the word weight of each word according to the frequency of the word in each sentence and the first weights of the sentences in which the word occurs;
the computing unit 320 is further configured to calculate the second weight of each sentence according to the frequency of each word in the sentence and the word weight of each word;
the computing unit 320 is further configured to calculate the final weight of each sentence according to the first weight and the second weight;
the comparing unit 330 is configured to compare whether the difference between the final weights after the Nth iteration and after the (N-1)th iteration is less than a preset threshold, and whether the difference between the word weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold;
if the difference between the final weights after the Nth iteration and after the (N-1)th iteration is not less than the preset threshold and the difference between the word weights after the Nth iteration and after the (N-1)th iteration is not less than the preset threshold, the final weights after the Nth iteration are taken as the initial weights of the (N+1)th iteration;
if the difference between the final weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold and the difference between the word weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold, an automatic abstract is generated.
Preferably, the computing unit 320 is specifically configured to calculate the first weight of each sentence in the text using the formula
W(Sj) = (1-d) + d·Σ_{Si∈Link(Sj)} [wij / Σ_{Sk∈Link(Si)} wik]·W(Si);
wherein W(Sj) is the first weight of the j-th sentence, W(Si) is the first weight of the i-th sentence, d is the damping coefficient, Link(Sj) is the set of sentences connected to sentence Sj, and wij is the edge weight between sentence Si and sentence Sj.
Preferably, the computing unit 320 is specifically configured to calculate the word weight of each word using the formula
WS(Wi) = [Σ_{j=1}^{m} nji·W(Sj)] / [Σ_{j=1}^{m} nji];
wherein WS(Wi) is the word weight of the i-th word, W(Sj) is the first weight of the j-th sentence, nji is the frequency of the i-th word in the j-th sentence, and m is the number of sentences in the text.
Preferably, the computing unit 320 is specifically configured to calculate the second weight of each sentence using the formula
WW(Sj) = [Σ_{i=1}^{n} nji·WS(Wi)] / [Σ_{i=1}^{n} nji];
wherein WW(Sj) is the second weight of the j-th sentence, WS(Wi) is the word weight of the i-th word, nji is the frequency of the i-th word in the j-th sentence, and n is the number of distinct words in the text.
Preferably, the computing unit 320 is specifically configured to calculate the final weight of each sentence using the formula W'(Sj) = α·W(Sj) + (1-α)·WW(Sj);
wherein W'(Sj) is the final weight of the j-th sentence, W(Sj) is the first weight of the j-th sentence, WW(Sj) is the second weight of the j-th sentence, and α is a regulating factor, α ∈ [0,1].
With the automatic abstract generating apparatus provided by this embodiment, the generation unit models the text and generates a sentence network that includes edge weights; the computing unit calculates the first weight of each sentence in the text according to the edge weights and the initial weight of each sentence, calculates the word weight of each word according to the frequency of the word in each sentence and the first weights of the sentences containing the word, calculates the second weight of each sentence according to the frequency of each word in the sentence and the word weight of each word, and calculates the final weight of each sentence according to the first weight and the second weight. The comparing unit compares whether the difference between the final weights after the Nth iteration and after the (N-1)th iteration is less than a preset threshold and whether the difference between the word weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold; if neither difference is less than the preset threshold, the final weights after the Nth iteration are taken as the initial weights of the (N+1)th iteration; if both differences are less than the preset threshold, an automatic abstract is generated. By considering the mutual influence between words and sentences in the text and incorporating the influence of words on sentence scores into the sentence relation network, the apparatus solves the problem of poor abstract quality for single long texts, alleviates information overload, and improves abstract quality.
Those skilled in the art will further appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium well known in the technical field.
The specific embodiments described above further explain the purpose, technical solution, and beneficial effects of the present invention in detail. It should be understood that the foregoing is merely a specific embodiment of the present invention and is not intended to limit the protection scope of the present invention; any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (10)
1. An automatic abstract generation method, characterized in that the method comprises:
modeling a text to generate a sentence network, the sentence network including edge weights;
calculating a first weight of each sentence in the text according to the edge weights and an initial weight of each sentence in the text;
calculating a word weight of each word according to the frequency of the word in each sentence and the first weights of the sentences in which the word occurs;
calculating a second weight of each sentence according to the frequency of each word in the sentence and the word weight of each word;
calculating a final weight of each sentence according to the first weight and the second weight;
comparing whether the difference between the final weights after the Nth iteration and after the (N-1)th iteration is less than a preset threshold, and whether the difference between the word weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold;
if the difference between the final weights after the Nth iteration and after the (N-1)th iteration is not less than the preset threshold and the difference between the word weights after the Nth iteration and after the (N-1)th iteration is not less than the preset threshold, taking the final weights after the Nth iteration as the initial weights of the (N+1)th iteration;
if the difference between the final weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold and the difference between the word weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold, generating an automatic abstract.
2. The method according to claim 1, characterized in that calculating the first weight of each sentence in the text according to the edge weights and the initial weight of each sentence in the text specifically comprises:
calculating the first weight of each sentence in the text using the formula
W(Sj) = (1-d) + d·Σ_{Si∈Link(Sj)} [wij / Σ_{Sk∈Link(Si)} wik]·W(Si);
wherein W(Sj) is the first weight of the j-th sentence, W(Si) is the first weight of the i-th sentence, d is the damping coefficient, Link(Sj) is the set of sentences connected to sentence Sj, and wij is the edge weight between sentence Si and sentence Sj.
3. The method according to claim 1, characterized in that calculating the word weight of each word according to the frequency of the word in each sentence and the first weights of the sentences in which the word occurs specifically comprises:
calculating the word weight of the word using the formula
WS(Wi) = [Σ_{j=1}^{m} nji·W(Sj)] / [Σ_{j=1}^{m} nji];
wherein WS(Wi) is the word weight of the i-th word, W(Sj) is the first weight of the j-th sentence, nji is the frequency of the i-th word in the j-th sentence, and m is the number of sentences in the text.
4. The method according to claim 1, characterized in that calculating the second weight of each sentence according to the frequency of each word in the sentence and the word weight of each word specifically comprises:
calculating the second weight of each sentence using the formula
WW(Sj) = [Σ_{i=1}^{n} nji·WS(Wi)] / [Σ_{i=1}^{n} nji];
wherein WW(Sj) is the second weight of the j-th sentence, WS(Wi) is the word weight of the i-th word, nji is the frequency of the i-th word in the j-th sentence, and n is the number of distinct words in the text.
5. The method according to claim 1, characterized in that calculating the final weight of each sentence according to the first weight and the second weight specifically comprises:
calculating the final weight of each sentence using the formula W'(Sj) = α·W(Sj) + (1-α)·WW(Sj);
wherein W'(Sj) is the final weight of the j-th sentence, W(Sj) is the first weight of the j-th sentence, WW(Sj) is the second weight of the j-th sentence, and α is a regulating factor, α ∈ [0,1].
6. An automatic abstract generating apparatus, characterized in that the apparatus comprises:
a generation unit, configured to model a text and generate a sentence network, the sentence network including edge weights;
a computing unit, configured to calculate a first weight of each sentence in the text according to the edge weights and an initial weight of each sentence in the text;
the computing unit being further configured to calculate a word weight of each word according to the frequency of the word in each sentence and the first weights of the sentences in which the word occurs;
the computing unit being further configured to calculate a second weight of each sentence according to the frequency of each word in the sentence and the word weight of each word;
the computing unit being further configured to calculate a final weight of each sentence according to the first weight and the second weight;
a comparing unit, configured to compare whether the difference between the final weights after the Nth iteration and after the (N-1)th iteration is less than a preset threshold, and whether the difference between the word weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold;
wherein, if the difference between the final weights after the Nth iteration and after the (N-1)th iteration is not less than the preset threshold and the difference between the word weights after the Nth iteration and after the (N-1)th iteration is not less than the preset threshold, the final weights after the Nth iteration are taken as the initial weights of the (N+1)th iteration;
and if the difference between the final weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold and the difference between the word weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold, an automatic abstract is generated.
7. The apparatus according to claim 6, characterized in that the computing unit is specifically configured to calculate the first weight of each sentence in the text using the formula
W(Sj) = (1-d) + d·Σ_{Si∈Link(Sj)} [wij / Σ_{Sk∈Link(Si)} wik]·W(Si);
wherein W(Sj) is the first weight of the j-th sentence, W(Si) is the first weight of the i-th sentence, d is the damping coefficient, Link(Sj) is the set of sentences connected to sentence Sj, and wij is the edge weight between sentence Si and sentence Sj.
8. The apparatus according to claim 6, characterized in that the computing unit is specifically configured to calculate the word weight of each word using the formula
WS(Wi) = [Σ_{j=1}^{m} nji·W(Sj)] / [Σ_{j=1}^{m} nji];
wherein WS(Wi) is the word weight of the i-th word, W(Sj) is the first weight of the j-th sentence, nji is the frequency of the i-th word in the j-th sentence, and m is the number of sentences in the text.
9. The apparatus according to claim 6, characterized in that the computing unit is specifically configured to calculate the second weight of each sentence using the formula
WW(Sj) = [Σ_{i=1}^{n} nji·WS(Wi)] / [Σ_{i=1}^{n} nji];
wherein WW(Sj) is the second weight of the j-th sentence, WS(Wi) is the word weight of the i-th word, nji is the frequency of the i-th word in the j-th sentence, and n is the number of distinct words in the text.
10. The apparatus according to claim 6, characterized in that the computing unit is specifically configured to calculate the final weight of each sentence using the formula W'(Sj) = α·W(Sj) + (1-α)·WW(Sj);
wherein W'(Sj) is the final weight of the j-th sentence, W(Sj) is the first weight of the j-th sentence, WW(Sj) is the second weight of the j-th sentence, and α is a regulating factor, α ∈ [0,1].
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511026171.2A CN105653704B (en) | 2015-12-31 | 2015-12-31 | Autoabstract generation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105653704A CN105653704A (en) | 2016-06-08 |
CN105653704B true CN105653704B (en) | 2018-10-12 |
Family
ID=56491043
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201511026171.2A Expired - Fee Related CN105653704B (en) | 2015-12-31 | 2015-12-31 | Autoabstract generation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105653704B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109284357B (en) * | 2018-08-29 | 2022-07-19 | 腾讯科技(深圳)有限公司 | Man-machine conversation method, device, electronic equipment and computer readable medium |
CN110287280B (en) * | 2019-06-24 | 2023-09-29 | 腾讯科技(深圳)有限公司 | Method and device for analyzing words in article, storage medium and electronic equipment |
CN110750976A (en) * | 2019-09-26 | 2020-02-04 | 平安科技(深圳)有限公司 | Language model construction method, system, computer device and readable storage medium |
CN117891933A (en) * | 2023-12-11 | 2024-04-16 | 北京万物可知技术有限公司 | Book abstract generation system based on large model |
CN117648917B (en) * | 2024-01-30 | 2024-03-29 | 北京点聚信息技术有限公司 | Layout file comparison method and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101008941A (en) * | 2007-01-10 | 2007-08-01 | 复旦大学 | Successive principal axes filter method of multi-document automatic summarization |
CN103699525A (en) * | 2014-01-03 | 2014-04-02 | 江苏金智教育信息技术有限公司 | Method and device for automatically generating abstract on basis of multi-dimensional characteristics of text |
CN104834735A (en) * | 2015-05-18 | 2015-08-12 | 大连理工大学 | Automatic document summarization extraction method based on term vectors |
Non-Patent Citations (1)
Title |
---|
A Survey of Automatic Summarization of Socialized Short Texts; Liu Dexi et al.; Journal of Chinese Computer Systems; Dec. 31, 2013; Vol. 34, No. 12; pp. 2764-2771 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20181012 |