CN105653704B - Automatic summary generation method and device - Google Patents


Info

Publication number
CN105653704B
CN105653704B (application CN201511026171.2A, published as CN105653704A)
Authority
CN
China
Prior art keywords
weight
sentence
word
iteration
final
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201511026171.2A
Other languages
Chinese (zh)
Other versions
CN105653704A (en)
Inventor
张璐
陈晨
伍之昂
曹杰
方昌健
卜湛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Finance and Economics
Original Assignee
Nanjing University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Finance and Economics filed Critical Nanjing University of Finance and Economics
Priority to CN201511026171.2A
Publication of CN105653704A
Application granted
Publication of CN105653704B
Status: Expired - Fee Related
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users


Abstract

The present invention relates to an automatic summary generation method and device, comprising: modeling a text to generate a sentence network, the sentence network including edge weights; computing the first weight of each sentence in the text according to the edge weights and the initial weight of each sentence; computing each word's word weight according to its frequency in each sentence and the first weights of the sentences containing it; computing the second weight of each sentence according to the frequency of each word in the sentence and each word's weight; computing the final weight of each sentence according to the first and second weights; and, if the difference between the final weights after the Nth and (N-1)th iterations is less than a preset threshold and the difference between the word weights after the Nth and (N-1)th iterations is less than the preset threshold, generating the summary. The invention solves the problem of poor-quality summaries for single long texts, alleviates information overload, and improves summary quality.

Description

Automatic summary generation method and device
Technical field
The present invention relates to the field of natural language processing, and in particular to an automatic summary generation method and device.
Background art
Automatic summarization is the technique of using a computer to automatically extract from an original text the sentences that comprehensively and accurately reflect its central content, generating a brief and coherent short piece. A summary should be general, objective, comprehensible, and readable. Automatic summarization has long been one of the important and challenging research topics in natural language processing; in particular, with the explosive growth of Web texts of all kinds and of user-generated content, its potential to alleviate information overload has been fully demonstrated, and it has found practical application in many areas, such as news recommendation, machine translation, literature retrieval, intelligence analysis, and public-opinion monitoring.
A text can be regarded as a set of sentences, and automatic summarization attempts to select the most important sentences to form an extractive summary; in essence, it is therefore a sentence-ranking problem. With the bag-of-words model, each sentence is represented as the set of its words and their frequencies. Common summarization techniques approach the problem mainly from the two sides of sentences and words: one class ranks the importance of sentences by linearly combining statistical features of words or sentences, such as position, length, and frequency; another class builds a sentence-association network and iteratively computes sentence weights with a graph-model ranking algorithm. Existing methods consider only the features of sentences or only the features of words, and fail to consider the interaction between the two.
Summary of the invention
The object of the present invention is to address the fact that, in the prior art, summary generation considers only the features of sentences or only the features of words and fails to consider the interaction between the two. The present invention uses word-sentence collaboration, computing sentence weights by iterative reinforcement, to automatically generate a summary, effectively improving the quality of long-text summaries.
In a first aspect, an embodiment of the present invention provides an automatic summary generation method, the method comprising:
modeling a text to generate a sentence network, the sentence network including edge weights;
computing a first weight for each sentence in the text according to the edge weights and the initial weight of each sentence in the text;
computing the word weight of each word according to the word's frequency in each sentence and the first weights of the sentences in which the word occurs;
computing a second weight for each sentence according to the frequency of each word in the sentence and each word's word weight;
computing the final weight of each sentence according to the first weight and the second weight;
comparing whether the difference between the final weights after the Nth iteration and after the (N-1)th iteration is less than a preset threshold, and whether the difference between the word weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold;
if the final-weight difference after the Nth and (N-1)th iterations is not less than the preset threshold, or the word-weight difference after the Nth and (N-1)th iterations is not less than the preset threshold, taking the final weights after the Nth iteration as the initial weights for the (N+1)th iteration;
if the final-weight difference after the Nth and (N-1)th iterations is less than the preset threshold and the word-weight difference after the Nth and (N-1)th iterations is less than the preset threshold, generating the summary.
Preferably, computing the first weight of each sentence in the text according to the edge weights and the initial weight of each sentence specifically comprises:
computing the first weight of each sentence in the text using the formula
W(S_j) = (1 − d) + d · Σ_{S_i ∈ Link(S_j)} [ w_ij / Σ_{S_k ∈ Link(S_i)} w_ik ] · W(S_i)
where W(S_j) is the first weight of the j-th sentence, W(S_i) is the first weight of the i-th sentence, d is the damping coefficient, Link(S_j) is the set of sentences connected to sentence S_j, and w_ij is the edge weight between sentence S_i and sentence S_j.
Preferably, computing the word weight of a word according to its frequency in each sentence and the first weights of the corresponding sentences specifically comprises:
computing the word weight of the word using the formula
WS(W_i) = Σ_j n_ji · W(S_j) / Σ_j n_ji
where WS(W_i) is the word weight of the i-th word, W(S_j) is the first weight of the j-th sentence, and n_ji is the frequency of the i-th word in the j-th sentence.
Preferably, computing the second weight of each sentence according to the frequency of each word in the sentence and each word's word weight specifically comprises:
computing the second weight of each sentence using the formula
WW(S_j) = Σ_i n_ji · WS(W_i) / Σ_i n_ji
where WW(S_j) is the second weight of the j-th sentence, WS(W_i) is the word weight of the i-th word, and n_ji is the frequency of the i-th word in the j-th sentence.
Preferably, computing the final weight of each sentence according to the first weight and the second weight specifically comprises:
computing the final weight of each sentence using the formula W′(S_j) = α·W(S_j) + (1 − α)·WW(S_j)
where W′(S_j) is the final weight of the j-th sentence, W(S_j) is the first weight of the j-th sentence, WW(S_j) is the second weight of the j-th sentence, and α is the regulatory factor, α ∈ [0, 1].
In a second aspect, an embodiment of the present invention provides an automatic summary generation device, the device comprising:
a generation unit, configured to model a text and generate a sentence network, the sentence network including edge weights;
a computing unit, configured to compute the first weight of each sentence in the text according to the edge weights and the initial weight of each sentence in the text;
the computing unit being further configured to compute the word weight of each word according to the word's frequency in each sentence and the first weights of the sentences in which the word occurs;
the computing unit being further configured to compute the second weight of each sentence according to the frequency of each word in the sentence and each word's word weight;
the computing unit being further configured to compute the final weight of each sentence according to the first weight and the second weight;
a comparing unit, configured to compare whether the difference between the final weights after the Nth and (N-1)th iterations is less than a preset threshold, and whether the difference between the word weights after the Nth and (N-1)th iterations is less than the preset threshold;
wherein, if the final-weight difference after the Nth and (N-1)th iterations is not less than the preset threshold, or the word-weight difference after the Nth and (N-1)th iterations is not less than the preset threshold, the final weights after the Nth iteration are taken as the initial weights for the (N+1)th iteration; and
if the final-weight difference after the Nth and (N-1)th iterations is less than the preset threshold and the word-weight difference after the Nth and (N-1)th iterations is less than the preset threshold, the summary is generated.
Preferably, the computing unit is specifically configured to compute the first weight of each sentence in the text using the formula
W(S_j) = (1 − d) + d · Σ_{S_i ∈ Link(S_j)} [ w_ij / Σ_{S_k ∈ Link(S_i)} w_ik ] · W(S_i)
where W(S_j) is the first weight of the j-th sentence, W(S_i) is the first weight of the i-th sentence, d is the damping coefficient, Link(S_j) is the set of sentences connected to sentence S_j, and w_ij is the edge weight between sentence S_i and sentence S_j.
Preferably, the computing unit is specifically configured to compute the word weight of each word using the formula
WS(W_i) = Σ_j n_ji · W(S_j) / Σ_j n_ji
where WS(W_i) is the word weight of the i-th word, W(S_j) is the first weight of the j-th sentence, and n_ji is the frequency of the i-th word in the j-th sentence.
Preferably, the computing unit is specifically configured to compute the second weight of each sentence using the formula
WW(S_j) = Σ_i n_ji · WS(W_i) / Σ_i n_ji
where WW(S_j) is the second weight of the j-th sentence, WS(W_i) is the word weight of the i-th word, and n_ji is the frequency of the i-th word in the j-th sentence.
Preferably, the computing unit is specifically configured to compute the final weight of each sentence using the formula W′(S_j) = α·W(S_j) + (1 − α)·WW(S_j)
where W′(S_j) is the final weight of the j-th sentence, W(S_j) is the first weight of the j-th sentence, WW(S_j) is the second weight of the j-th sentence, and α is the regulatory factor, α ∈ [0, 1].
By modeling a text, the present invention generates a sentence network that includes edge weights; computes the first weight of each sentence in the text from the edge weights and the sentences' initial weights; computes each word's word weight from its frequency in each sentence and the first weights of the sentences containing it; computes the second weight of each sentence from the frequency of each word in the sentence and each word's weight; and combines the first and second weights into each sentence's final weight. If the final-weight difference after the Nth and (N-1)th iterations is not less than a preset threshold, or the word-weight difference after the Nth and (N-1)th iterations is not less than the threshold, the final weights after the Nth iteration are taken as the initial weights for the (N+1)th iteration; if both differences are less than the threshold, the summary is generated. By considering the mutual influence between words and sentences in a text and incorporating the words' influence on sentence ranking scores into the sentence-association network, the invention solves the problem of poor-quality summaries for single long texts, alleviates information overload, and improves summary quality.
Brief description of the drawings
Fig. 1 is a flowchart of the automatic summary generation method provided by Embodiment 1 of the present invention;
Fig. 2 is another flowchart of the automatic summary generation method provided by Embodiment 1 of the present invention;
Fig. 3 is a schematic structural diagram of the automatic summary generation device provided by Embodiment 2 of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. Evidently, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.
Fig. 1 is a flowchart of the automatic summary generation method provided by Embodiment 1 of the present invention. As shown in Fig. 1, this embodiment comprises the following steps:
S110: model the text to generate a sentence network, the sentence network including edge weights.
Here, with sentences as vertices and similarity as edges, the text is modeled as a sentence network G = (S, E, w), where S is the set of sentence nodes, E is the set of edges, and w gives the edge weights, i.e., the similarity between sentences. The edge weight w may be computed with methods such as the Jaccard similarity coefficient, cosine similarity, or the BM25 algorithm; the present invention does not limit this choice.
Optionally, before S110 the method further includes preprocessing the text, specifically word segmentation, stop-word removal, and splitting into sentences on punctuation marks such as commas, full stops, and colons.
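As an illustrative sketch only (Python code is not part of the patent), the sentence-network construction of S110 with Jaccard edge weights can be written as follows; the tokenized sentences at the bottom are invented example data:

```python
def jaccard(a, b):
    """Jaccard similarity coefficient |A & B| / |A | B| between two
    token sequences, treated as sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def build_sentence_network(sentences):
    """Edge-weight matrix w[i][j] of the sentence network G = (S, E, w);
    each sentence is the list of tokens left after preprocessing."""
    n = len(sentences)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = jaccard(sentences[i], sentences[j])
    return w

sents = [["scale", "expand"], ["scale", "grow"], ["profit", "grow"]]
w = build_sentence_network(sents)
```

As the description notes, cosine similarity or BM25 could equally be substituted for the Jaccard coefficient.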
S120: compute the first weight of each sentence in the text according to the edge weights and the initial weight of each sentence in the text.
Optionally, computing the first weight of each sentence in the text according to the edge weights and the initial weight of each sentence specifically comprises:
computing the first weight of each sentence in the text using the formula
W(S_j) = (1 − d) + d · Σ_{S_i ∈ Link(S_j)} [ w_ij / Σ_{S_k ∈ Link(S_i)} w_ik ] · W(S_i)
where W(S_j) is the first weight of the j-th sentence, W(S_i) is the first weight of the i-th sentence, d is the damping coefficient, Link(S_j) is the set of sentences connected to sentence S_j, and w_ij is the edge weight between sentence S_i and sentence S_j.
Specifically, the first weights of the sentences are computed iteratively. In the first iteration, each sentence in the sentence network is randomly assigned an initial weight, and the first weight W(S_j) of the j-th sentence is then computed according to the above formula.
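One round of this update can be sketched as follows; `first_weights` is a hypothetical name, and the network is assumed to be given as a symmetric edge-weight matrix with a zero diagonal:

```python
def first_weights(w, prev, d=0.85):
    """One iteration of the update under S120:
    W(Sj) = (1 - d) + d * sum over Si in Link(Sj) of
            (w_ij / sum_k w_ik) * W(Si).
    `prev` holds the weights from the previous round (random initial
    values in the first round); `w` is a symmetric edge-weight matrix
    with a zero diagonal, so sum(w[i]) is Si's total outgoing weight."""
    n = len(prev)
    out = [sum(w[i]) for i in range(n)]
    new = []
    for j in range(n):
        acc = sum(w[i][j] / out[i] * prev[i]
                  for i in range(n)
                  if i != j and w[i][j] > 0 and out[i] > 0)
        new.append((1 - d) + d * acc)
    return new

print(first_weights([[0.0, 1.0], [1.0, 0.0]], [1.0, 1.0]))  # -> [1.0, 1.0]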
S130: compute the word weight of each word according to the word's frequency in each sentence and the first weights of the sentences in which it occurs.
Optionally, computing the word weight of a word according to its frequency in each sentence and the first weights of the corresponding sentences specifically comprises:
computing the word weight of the word using the formula
WS(W_i) = Σ_j n_ji · W(S_j) / Σ_j n_ji
where WS(W_i) is the word weight of the i-th word, W(S_j) is the first weight of the j-th sentence, and n_ji is the frequency of the i-th word in the j-th sentence.
Specifically, a word's weight is computed from its frequency in each sentence and the first weight of that sentence: for each word, taking sentences as the unit, the number of times the word occurs in a sentence is multiplied by that sentence's first weight; the word-frequency-times-first-weight products obtained for all sentences are summed and divided by the total number of times the word occurs in the text, giving the word's weight.
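A minimal sketch of this computation, under the assumption that word frequencies are supplied as per-sentence count lists (the function name and data layout are illustrative):

```python
def word_weights(counts, first):
    """WS(Wi) = sum_j n_ji * W(Sj) / sum_j n_ji.
    counts[word] is a list with the word's frequency in each sentence;
    first[j] is the first weight of the j-th sentence."""
    ws = {}
    for word, n in counts.items():
        total = sum(n)
        ws[word] = sum(c * f for c, f in zip(n, first)) / total if total else 0.0
    return ws

# illustrative numbers: a word occurring once each in four sentences
# whose first weights are 0.986, 1.266, 1.045 and 1.11
print(word_weights({"reduce": [1, 1, 1, 1]}, [0.986, 1.266, 1.045, 1.11]))
```

With these numbers the result is (0.986 + 1.266 + 1.045 + 1.11) / 4 = 1.10175, matching the worked example given later in the embodiment (1.102 after rounding).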
S140: compute the second weight of each sentence according to the frequency of each word in the sentence and each word's word weight.
Optionally, computing the second weight of each sentence according to the frequency of each word in the sentence and each word's word weight specifically comprises:
computing the second weight of each sentence using the formula
WW(S_j) = Σ_i n_ji · WS(W_i) / Σ_i n_ji
where WW(S_j) is the second weight of the j-th sentence, WS(W_i) is the word weight of the i-th word, and n_ji is the frequency of the i-th word in the j-th sentence.
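This step mirrors the previous one with the roles of words and sentences swapped; a minimal sketch under the same assumed data layout:

```python
def second_weights(counts, ws, n_sent):
    """WW(Sj) = sum_i n_ji * WS(Wi) / sum_i n_ji: the frequency-weighted
    mean of the word weights of the words occurring in sentence j.
    counts[word][j] is the word's frequency in sentence j."""
    ww = []
    for j in range(n_sent):
        num = sum(n[j] * ws[word] for word, n in counts.items())
        den = sum(n[j] for n in counts.values())
        ww.append(num / den if den else 0.0)
    return ww

print(second_weights({"a": [1], "b": [1]}, {"a": 1.0, "b": 2.0}, 1))  # -> [1.5]
```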
S150: compute the final weight of each sentence according to the first weight and the second weight.
Optionally, computing the final weight of each sentence according to the first weight and the second weight specifically comprises:
computing the final weight of each sentence using the formula W′(S_j) = α·W(S_j) + (1 − α)·WW(S_j)
where W′(S_j) is the final weight of the j-th sentence, W(S_j) is the first weight of the j-th sentence, WW(S_j) is the second weight of the j-th sentence, and α is the regulatory factor, α ∈ [0, 1].
S160: compare whether the difference between the final weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold, and whether the difference between the word weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold.
S170: if the final-weight difference after the Nth and (N-1)th iterations is not less than the preset threshold, or the word-weight difference after the Nth and (N-1)th iterations is not less than the preset threshold, take the final weights after the Nth iteration as the initial weights for the (N+1)th iteration.
S180: if the final-weight difference after the Nth and (N-1)th iterations is less than the preset threshold and the word-weight difference after the Nth and (N-1)th iterations is less than the preset threshold, generate the summary.
Specifically, if the final-weight difference after the Nth and (N-1)th iterations is not less than the preset threshold, or the word-weight difference after the Nth and (N-1)th iterations is not less than the preset threshold, the final weights after the Nth iteration are taken as the initial weights for the (N+1)th iteration, and S120 to S160 are repeated until the convergence condition is met. For example, if after a given round of iteration the difference between each sentence's final weight and its final weight after the previous iteration is less than the preset threshold ε, and the difference between each word's weight and its weight after the previous iteration is less than ε, convergence is judged to have been reached and the iteration terminates. Once the final weights of all sentences are obtained, the sentences are ranked by final weight. Under the given summary constraints, such as a limit on the summary's word count, k sentences are selected; these top-k sentences are arranged in the order in which they appear in the original text, and the summary is generated.
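Steps S120 through S180 can be combined into one illustrative end-to-end sketch; the function, the fixed initial weights (the patent assigns them randomly), and the three-sentence example are all assumptions made for demonstration:

```python
def jaccard(a, b):
    """Jaccard similarity between two token lists, treated as sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def summarize(sent_tokens, k, alpha=0.5, d=0.85, eps=1e-4, max_iter=200):
    """Iterate the word-sentence mutual reinforcement of S120-S150 until
    both the sentence final weights and the word weights move by less
    than eps (the S160 test), then return the indices of the top-k
    sentences in their original order (S180)."""
    n = len(sent_tokens)
    w = [[jaccard(sent_tokens[i], sent_tokens[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    vocab = sorted({t for s in sent_tokens for t in s})
    cnt = {t: [s.count(t) for s in sent_tokens] for t in vocab}
    final = [1.0] * n          # initial weights (random in the patent)
    ws = {t: 0.0 for t in vocab}
    for _ in range(max_iter):
        # S120: first weights from the previous round's final weights
        out = [sum(row) for row in w]
        first = [(1 - d) + d * sum(w[i][j] / out[i] * final[i]
                                   for i in range(n)
                                   if i != j and w[i][j] > 0 and out[i] > 0)
                 for j in range(n)]
        # S130: word weights from the first weights
        new_ws = {t: sum(c * f for c, f in zip(cnt[t], first)) / sum(cnt[t])
                  for t in vocab}
        # S140 + S150: second weights, then the linear combination
        new_final = []
        for j in range(n):
            den = sum(cnt[t][j] for t in vocab)
            ww = sum(cnt[t][j] * new_ws[t] for t in vocab) / den
            new_final.append(alpha * first[j] + (1 - alpha) * ww)
        # S160: converged only when BOTH weight sets are stable
        done = (max(abs(a - b) for a, b in zip(new_final, final)) < eps and
                max(abs(new_ws[t] - ws[t]) for t in vocab) < eps)
        final, ws = new_final, new_ws
        if done:
            break
    ranked = sorted(range(n), key=lambda j: -final[j])[:k]
    return sorted(ranked)      # original text order

sents = [["scale", "expand"], ["scale", "grow"], ["profit", "grow"]]
print(summarize(sents, 1))  # -> [1]
```

In this tiny example the middle sentence is connected to both others, so it accumulates the highest final weight and is chosen for a one-sentence summary.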
With the automatic summary generation method provided by this embodiment, a text is modeled to generate a sentence network that includes edge weights; the first weight of each sentence in the text is computed from the edge weights and the sentences' initial weights; each word's word weight is computed from its frequency in each sentence and the first weights of the sentences containing it; the second weight of each sentence is computed from the frequency of each word in the sentence and each word's weight; and each sentence's final weight is computed from the first and second weights. If the final-weight difference after the Nth and (N-1)th iterations is not less than the preset threshold, or the word-weight difference after the Nth and (N-1)th iterations is not less than the preset threshold, the final weights after the Nth iteration are taken as the initial weights for the (N+1)th iteration; if both differences are less than the threshold, the summary is generated. By considering the mutual influence between words and sentences in a text and incorporating the words' influence on sentence ranking scores into the sentence-association network, the method solves the problem of poor-quality summaries for single long texts, alleviates information overload, and improves summary quality.
In a specific embodiment, the automatic summary generation method is described in detail; the number before each sentence is that sentence's number. As shown in Fig. 2, Fig. 2 is another flowchart of the automatic summary generation method provided by Embodiment 1 of the present invention.
"(1) In 2002, (2) China's large enterprise groups as a whole showed a benign pattern of development, with scale continually expanding, competitiveness continually rising, and economic benefits improving day by day; (3) compared with the previous year, (4) their number fell by 38, (5) operating income grew 17.5%, (6) total assets grew 11.3%, and (7) profits grew 30.2%. (8) A batch of large and super-large enterprise groups developed rapidly, (9) becoming the backbone spurring economic growth. (10) The latest statistics on China's enterprise groups released by the National Bureau of Statistics on the 11th show that (11) in 2002, (12) the various reforms of China's enterprise groups deepened further; (13) through clean-up, the relevant departments in various regions (14) deregistered a batch of non-standard groups, (15) while newly establishing 52 enterprise groups of all kinds, including China Aviation Group, China Civil Aviation Information Group, China Aviation Oil Group, China Aviation Equipment Import and Export Group Company, and China Netcom; (16) this reduced the total number of enterprise groups, (17) but development became more orderly, (18) and overall strength was further enhanced.
(19) The survey data show that (20) in 2002 the enterprise groups among China's centrally managed enterprises, the national pilot enterprise groups, the enterprise groups among key state-owned enterprises, (21) the enterprise groups approved by provincial- and ministerial-level units, and all other consortiums with annual operating income and year-end total assets of 500 million yuan or more totaled 2,627, (22) with operating income of 7,712 billion yuan, (23) year-end total assets of 14,253.8 billion yuan, (24) and total profits of 417.9 billion yuan. (25) The statistics show that (26) the development of enterprise groups in China's western region has accelerated markedly. (27) In 2002, (28) the number of enterprise groups in the western region fell from 337 the previous year to 325, (29) but their operating income and year-end total assets grew 17.5% and 13.5% respectively over the previous year; (30) the growth rate of operating income was level with the average growth rate, (31) while the growth of year-end total assets exceeded the average by 2.2 percentage points. (32) It is worth noting that, (33) while the total number of enterprise groups fell, (34) the number of super-large enterprise groups with operating income and year-end total assets of 5 billion yuan or more grew to 214, (35) an increase of 35 over the previous year. (36) Super-large enterprise groups account for only 8.1% of the total number, (37) yet their shares of operating income and year-end total assets approach 70%, (38) and their share of realized profits exceeds 75%; (39) their growth is even more prominent, (40) with operating income and total assets growing 22.7% and 15.7% respectively over the previous year, (41) 14.4 and 14.7 percentage points faster than other large enterprise groups. (42) The 2002 ranking of the world's 500 largest enterprises compiled by the U.S. magazine Fortune shows that (43) 12 large Chinese enterprises appear on the list."
The above text is first preprocessed; the preprocessing may consist of word segmentation, stop-word removal, and splitting into sentences on punctuation marks such as commas, full stops, and colons. The result after preprocessing is as follows:
" 2002, China's large enterprise group, which integrally shows scale and constantly expands competitiveness, constantly promoted economic benefit day It turns for the better and turns benign development pattern, last year is compared, and number reduces 38, and operating income increases by 17.5%, and assets, which amount to, to be increased 11.3%, profit increases by 30.2%.It is swift and violent to criticize the development of group of large-scale super-sized enterprises, becomes backbone of spurring economic growth. The publication China corporation group's recent statistics information of State Statistics Bureau 11 days shows 2002, China corporation group's items reform into One step is deepened, and batch group lack of standardization is nullified in the cleaning of various regions department, and it includes in China Civil Aviation information group of China Aviation group to create All kinds of enterprise groups 52 including China Netcom of China Aviation equipment inlet and outlet group company of aviation fuels group of state, Enterprise group's sum is set to reduce, development is more orderly, and overall strength further strengthens.Survey data is shown, in China in 2002 Enterprise group's unit at the provincial and ministerial level examines enterprise in key State-owned enterprises of Nation Experimental Enterprise Groups of enterprise group in the management enterprise of centre Group, annual revenue year end assets amount to equal 500,000,000 yuan or more all kinds of consortiums and amount to 2627, operating income 77120 Hundred million yuan, year end assets amount to 14,253,800,000,000 yuan, generate profit 417,900,000,000 yuan of total value.Statistics shows that China's western region enterprise collects Group's development speed is obviously accelerated.2002, west area enterprise group quantity last year 337 reduced 325, operating income year end Assets amount to last year growth 17.5%13.5%, operating income speedup average speed of growth 
respectively and maintain an equal level, and year end assets, which amount to, to be increased Speed is higher than 2.2 percentage point of average speed of growth.It merits attention, in the case of whole enterprise group's sums are reduced, at the end of operating income Assets amount to equal 5,000,000,000 yuan or more groups of super-sized enterprises and develop 214, and last year increases by 35.Group of super-sized enterprises quantity ratio Weight only 8.1%, operating income year end assets amount to proportion close to 7 one-tenth, generate profit half more than 7 one-tenth, growth rate shows more For protrusion, operating income assets amount to last year growth 22.7%15.7%, speedup respectively and are higher than Large-scale enterprises group 14.414.7 Percentage point.U.S.'s Fortune Magazine is chosen 500, the world maximum enterprise ranking list in 2002 and is shown, on 12 lists of China's large enterprise It is famous.”
The following processing is applied to the preprocessed text:
S210: build the sentence network and compute the sentences' first weights.
To compute the first weights, the text is modeled as a sentence network whose edge weights are the similarities between sentences; in this embodiment, the Jaccard coefficient is chosen to compute the edge weights.
In the first iteration, each sentence in the sentence network is randomly assigned an initial weight; in every later round, the initial weights are the weights produced by the previous round of iteration. The first weight of each sentence is computed according to the formula under S120, with the damping coefficient d set to 0.85.
The sentence numbers in the text and the first weight of each numbered sentence in the first round of iteration are as follows:
S220 carries out the enhancing of sentence word and calculates.
All distinct words in the text are numbered, and the weight of each word is computed according to the formula under S130. Take the word "reduction" as an example: it appears exactly once in each of the sentences numbered 4, 16, 28 and 33, so its word weight is (1*0.986+1*1.266+1*1.045+1*1.11)/4=1.102; after sentence-to-word reinforcement, the word weight of "reduction" is therefore 1.102. Since the number of words is large, only the weights of a representative subset of words in one round are shown below:
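The worked example for "reduction" can be reproduced by taking the occurrence-weighted average of the first weights of the sentences containing the word, which is what the arithmetic above computes. A small Python sketch, with illustrative names:

```python
def word_weight(occurrences, first):
    """WS(W_i) = sum_j n_ji * W(S_j) / sum_j n_ji:
    the occurrence-weighted average of the first weights of the
    sentences that contain the word. `occurrences` maps sentence
    number -> count of the word in that sentence."""
    total = sum(occurrences.values())
    if total == 0:
        return 0.0
    return sum(n * first[j] for j, n in occurrences.items()) / total

# The word "reduction": once each in sentences 4, 16, 28 and 33.
first = {4: 0.986, 16: 1.266, 28: 1.045, 33: 1.11}
reduction = word_weight({4: 1, 16: 1, 28: 1, 33: 1}, first)
```

Evaluating the sketch with the patent's numbers gives (0.986+1.266+1.045+1.11)/4 ≈ 1.102, matching the worked example.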
S230: perform the word-to-sentence reinforcement calculation.
The second weight of each sentence is computed according to the formula under S140. Take the sentence numbered 12 as an example: it contains the 7 words "China", "enterprise", "group", "items", "reform", "further" and "in-depth", whose word weights are 1.18, 1.154, 1.137, 1.663, 1.663, 0.934 and 1.663 respectively, so the second weight of the sentence is (1*1.18+1*1.154+1*1.137+1*1.663+1*1.663+1*0.934+1*1.663)/7=1.11; after word-to-sentence reinforcement, the second weight of the sentence numbered 12 is therefore 1.11. The second weights finally obtained for all sentences are as follows:
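Analogously, the second weight of a sentence is the count-weighted average of the weights of the words it contains. A minimal Python sketch, with illustrative names:

```python
def second_weight(word_counts, word_weights):
    """WW(S_j) = sum_i n_ji * WS(W_i) / sum_i n_ji:
    the count-weighted average of the word weights of the words
    appearing in the sentence. `word_counts` maps word -> count
    in the sentence; `word_weights` maps word -> WS(W_i)."""
    total = sum(word_counts.values())
    if total == 0:
        return 0.0
    return sum(n * word_weights[t] for t, n in word_counts.items()) / total
```

A word occurring twice contributes its weight twice, which is what the n_ji factor in the formula expresses.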
S240: perform the sentence-to-sentence reinforcement calculation.
According to the formula under S150, the first weight and the second weight of each sentence in the current round of iteration are linearly combined, with the regulatory factor α set to 0.5. Take the sentence numbered 12 as an example: its first weight is 1.142 and its second weight is 1.11, so its final weight is 0.5*1.142+0.5*1.11=1.126; the final weight of the sentence numbered 12 in this round of iteration is therefore 1.126.
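The linear combination itself is a one-liner; a hedged Python sketch with illustrative names:

```python
def final_weight(first, second, alpha=0.5):
    """W'(S_j) = alpha * W(S_j) + (1 - alpha) * WW(S_j),
    with the regulatory factor alpha in [0, 1]."""
    return alpha * first + (1.0 - alpha) * second
```

With alpha = 0.5 the two scores are averaged, reproducing the worked example: final_weight(1.142, 1.11) gives 1.126.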
The final weights of all sentences in the text after this round of iteration are as follows:
S250: repeat steps S210-S240 and generate the automatic abstract.
The convergence threshold ε is set to 0.0001. After the 41st iteration, the difference between each sentence's final weight and its final weight after the 40th iteration is less than 0.0001, and the difference between each word's weight and its weight after the 40th iteration is also less than 0.0001, so the algorithm terminates. The sentence final weights obtained after 41 iterations are listed below:
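Putting steps S210-S240 together, the iterate-until-converged loop can be sketched end to end as follows. The initialization to 1.0 (the patent draws random initial values), the data layout, and all names are illustrative assumptions:

```python
def iterate_summary_weights(w, sent_words, alpha=0.5, d=0.85,
                            eps=1e-4, max_iter=1000):
    """Repeat steps S210-S240: recompute first weights, word weights,
    second weights and final weights each round, stopping once both the
    sentence final weights and the word weights move by less than eps."""
    n = len(w)
    vocab = sorted({t for sw in sent_words for t in sw})
    final = [1.0] * n                      # stand-in for the random init
    prev_words = {t: 0.0 for t in vocab}
    for _ in range(max_iter):
        # S210: first weights from the sentence network (damped update).
        first = []
        for j in range(n):
            acc = 0.0
            for i in range(n):
                if i != j and w[i][j] > 0.0:
                    out_sum = sum(w[i][k] for k in range(n) if k != i)
                    if out_sum > 0.0:
                        acc += w[i][j] / out_sum * final[i]
            first.append((1.0 - d) + d * acc)
        # S220: word weights (occurrence-weighted average of first weights).
        words = {}
        for t in vocab:
            num = sum(sw.count(t) * first[j] for j, sw in enumerate(sent_words))
            den = sum(sw.count(t) for sw in sent_words)
            words[t] = num / den if den else 0.0
        # S230: second weights (average word weight per sentence).
        second = [sum(words[t] for t in sw) / len(sw) if sw else 0.0
                  for sw in sent_words]
        # S240: linear combination into final weights.
        new_final = [alpha * f + (1.0 - alpha) * s
                     for f, s in zip(first, second)]
        done = (max(abs(a - b) for a, b in zip(new_final, final)) < eps and
                max(abs(words[t] - prev_words[t]) for t in vocab) < eps)
        final, prev_words = new_final, words
        if done:
            break
    return final
```

On a small chain-shaped network the middle sentence, which is linked to both neighbours, ends up with the largest final weight, as expected for a centrality-style score.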
The sentence final weights are sorted. Assuming the abstract is limited to 150 words, the sentences numbered 2, 3, 4, 5, 6, 7, 8, 10, 11, 12 and 16, chosen in order of final weight, are used to generate the abstract. Arranged in their original order in the text, the generated abstract is as follows:
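The selection step can be sketched as a greedy pick by descending final weight that then re-emits the chosen sentences in their original order; character count is used here as a stand-in for the 150-word limit, and all names are illustrative:

```python
def build_summary(sentences, final_weights, limit=150):
    """Greedy selection: take sentences in descending final-weight order
    while they fit within the length limit, then output the chosen
    sentences in their original order in the text."""
    by_weight = sorted(range(len(sentences)),
                       key=lambda j: final_weights[j], reverse=True)
    chosen, used = [], 0
    for j in by_weight:
        if used + len(sentences[j]) <= limit:
            chosen.append(j)
            used += len(sentences[j])
    return "".join(sentences[j] for j in sorted(chosen))
```

Sorting the chosen indices before joining is what restores the original text order, keeping the abstract readable.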
The algorithm of the present invention extracts 11 sentences in total; the abstract composed of the selected sentences is fluent, covers the main content of the text, and is of high quality.
Correspondingly, the present invention provides an automatic abstract generating device. Fig. 3 is a schematic structural diagram of the automatic abstract generating device provided by embodiment 2 of the present invention. The device includes: a generation unit 310, a computing unit 320 and a comparing unit 330.
The generation unit 310 is configured to model the text and generate a sentence network, the sentence network including edge weights;
The computing unit 320 is configured to calculate the first weight of each sentence in the text according to the edge weights and the initial weight of each sentence in the text;
The computing unit 320 is further configured to calculate the word weight of each word according to the frequency of the word in each sentence and the first weights of the sentences containing the word;
The computing unit 320 is further configured to calculate the second weight of each sentence according to the frequency of each word in the sentence and the word weight of each word;
The computing unit 320 is further configured to calculate the final weight of each sentence according to the first weight and the second weight;
The comparing unit 330 is configured to compare whether the difference between the final weights after the Nth iteration and after the (N-1)th iteration is less than a preset threshold, and whether the difference between the word weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold;
If the difference between the final weights after the Nth iteration and after the (N-1)th iteration is not less than the preset threshold and the difference between the word weights after the Nth iteration and after the (N-1)th iteration is not less than the preset threshold, the final weights after the Nth iteration are used as the initial weights of the (N+1)th iteration;
If the difference between the final weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold and the difference between the word weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold, the automatic abstract is generated.
Preferably, the computing unit 320 is specifically configured to calculate the first weight of each sentence in the text using the formula W(S_j) = (1-d) + d · Σ_{S_i∈Link(S_j)} (w_ij / Σ_{S_k∈Link(S_i)} w_ik) · W(S_i);
wherein W(S_j) is the first weight of the j-th sentence, W(S_i) is the first weight of the i-th sentence, d is the damping coefficient, Link(S_j) is the set of sentences connected to sentence S_j, and w_ij is the edge weight between sentence S_i and sentence S_j.
Preferably, the computing unit 320 is specifically configured to calculate the word weight of each word using the formula WS(W_i) = Σ_{j=1}^{m} n_ji · W(S_j) / Σ_{j=1}^{m} n_ji;
wherein WS(W_i) is the word weight of the i-th word, W(S_j) is the first weight of the j-th sentence, and n_ji is the frequency of the i-th word in the j-th sentence.
Preferably, the computing unit 320 is specifically configured to calculate the second weight of each sentence using the formula WW(S_j) = Σ_{i=1}^{n} n_ji · WS(W_i) / Σ_{i=1}^{n} n_ji;
wherein WW(S_j) is the second weight of the j-th sentence, WS(W_i) is the word weight of the i-th word, and n_ji is the frequency of the i-th word in the j-th sentence.
Preferably, the computing unit 320 is specifically configured to calculate the final weight of each sentence using the formula W'(S_j) = αW(S_j) + (1-α)WW(S_j);
wherein W'(S_j) is the final weight of the j-th sentence, W(S_j) is the first weight of the j-th sentence, WW(S_j) is the second weight of the j-th sentence, and α is the regulatory factor, α ∈ [0,1].
With the automatic abstract generating device provided in this embodiment, the generation unit models the text and generates a sentence network that includes edge weights; the computing unit calculates the first weight of each sentence in the text according to the edge weights and the initial weight of each sentence, calculates the word weight of each word according to the frequency of the word in each sentence and the first weights of the sentences containing it, calculates the second weight of each sentence according to the frequency of each word in the sentence and the word weight of each word, and calculates the final weight of each sentence according to the first weight and the second weight; the comparing unit compares whether the difference between the final weights after the Nth iteration and after the (N-1)th iteration is less than a preset threshold and whether the difference between the word weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold; if neither difference is less than the preset threshold, the final weights after the Nth iteration are used as the initial weights of the (N+1)th iteration; if both differences are less than the preset threshold, the automatic abstract is generated. By considering the mutual influence between words and sentences in the text and incorporating the influence of words on sentence ranking scores into the sentence relation network, the device solves the problem of poor abstract quality for single long texts, alleviates information overload, and improves abstract quality.
Those skilled in the art should further appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein can be implemented in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium well known in the technical field.
The specific implementations described above further explain the objectives, technical solutions and beneficial effects of the present invention in detail. It should be understood that the foregoing are merely specific implementations of the present invention and are not intended to limit the protection scope of the present invention. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. An automatic abstract generation method, characterized in that the method comprises:
modeling a text and generating a sentence network, the sentence network comprising edge weights;
calculating a first weight of each sentence in the text according to the edge weights and an initial weight of each sentence in the text;
calculating a word weight of each word according to a frequency of the word in each sentence and the first weights of the sentences containing the word;
calculating a second weight of each sentence according to the frequency of each word in the sentence and the word weight of each word;
calculating a final weight of each sentence according to the first weight and the second weight;
comparing whether a difference between the final weights after an Nth iteration and after an (N-1)th iteration is less than a preset threshold and whether a difference between the word weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold;
if the difference between the final weights after the Nth iteration and after the (N-1)th iteration is not less than the preset threshold and the difference between the word weights after the Nth iteration and after the (N-1)th iteration is not less than the preset threshold, using the final weights after the Nth iteration as the initial weights of an (N+1)th iteration;
if the difference between the final weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold and the difference between the word weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold, generating the automatic abstract.
2. The method according to claim 1, characterized in that calculating the first weight of each sentence in the text according to the edge weights and the initial weight of each sentence in the text specifically comprises:
calculating the first weight of each sentence in the text using the formula W(S_j) = (1-d) + d · Σ_{S_i∈Link(S_j)} (w_ij / Σ_{S_k∈Link(S_i)} w_ik) · W(S_i);
wherein W(S_j) is the first weight of the j-th sentence, W(S_i) is the first weight of the i-th sentence, d is the damping coefficient, Link(S_j) is the set of sentences connected to sentence S_j, and w_ij is the edge weight between sentence S_i and sentence S_j.
3. The method according to claim 1, characterized in that calculating the word weight of each word according to the frequency of the word in each sentence and the first weights of the sentences containing the word specifically comprises:
calculating the word weight of the word using the formula WS(W_i) = Σ_{j=1}^{m} n_ji · W(S_j) / Σ_{j=1}^{m} n_ji;
wherein WS(W_i) is the word weight of the i-th word, W(S_j) is the first weight of the j-th sentence, n_ji is the frequency of the i-th word in the j-th sentence, and m is the number of sentences in the text.
4. The method according to claim 1, characterized in that calculating the second weight of each sentence according to the frequency of each word in the sentence and the word weight of each word specifically comprises:
calculating the second weight of each sentence using the formula WW(S_j) = Σ_{i=1}^{n} n_ji · WS(W_i) / Σ_{i=1}^{n} n_ji;
wherein WW(S_j) is the second weight of the j-th sentence, WS(W_i) is the word weight of the i-th word, n_ji is the frequency of the i-th word in the j-th sentence, and n is the number of distinct words in the text.
5. The method according to claim 1, characterized in that calculating the final weight of each sentence according to the first weight and the second weight specifically comprises:
calculating the final weight of each sentence using the formula W'(S_j) = αW(S_j) + (1-α)WW(S_j);
wherein W'(S_j) is the final weight of the j-th sentence, W(S_j) is the first weight of the j-th sentence, WW(S_j) is the second weight of the j-th sentence, and α is the regulatory factor, α ∈ [0,1].
6. An automatic abstract generating device, characterized in that the device comprises:
a generation unit, configured to model a text and generate a sentence network, the sentence network comprising edge weights;
a computing unit, configured to calculate a first weight of each sentence in the text according to the edge weights and an initial weight of each sentence in the text;
the computing unit being further configured to calculate a word weight of each word according to a frequency of the word in each sentence and the first weights of the sentences containing the word;
the computing unit being further configured to calculate a second weight of each sentence according to the frequency of each word in the sentence and the word weight of each word;
the computing unit being further configured to calculate a final weight of each sentence according to the first weight and the second weight;
a comparing unit, configured to compare whether a difference between the final weights after an Nth iteration and after an (N-1)th iteration is less than a preset threshold and whether a difference between the word weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold;
wherein, if the difference between the final weights after the Nth iteration and after the (N-1)th iteration is not less than the preset threshold and the difference between the word weights after the Nth iteration and after the (N-1)th iteration is not less than the preset threshold, the final weights after the Nth iteration are used as the initial weights of an (N+1)th iteration;
and if the difference between the final weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold and the difference between the word weights after the Nth iteration and after the (N-1)th iteration is less than the preset threshold, the automatic abstract is generated.
7. The device according to claim 6, characterized in that the computing unit is specifically configured to calculate the first weight of each sentence in the text using the formula W(S_j) = (1-d) + d · Σ_{S_i∈Link(S_j)} (w_ij / Σ_{S_k∈Link(S_i)} w_ik) · W(S_i);
wherein W(S_j) is the first weight of the j-th sentence, W(S_i) is the first weight of the i-th sentence, d is the damping coefficient, Link(S_j) is the set of sentences connected to sentence S_j, and w_ij is the edge weight between sentence S_i and sentence S_j.
8. The device according to claim 6, characterized in that the computing unit is specifically configured to calculate the word weight of each word using the formula WS(W_i) = Σ_{j=1}^{m} n_ji · W(S_j) / Σ_{j=1}^{m} n_ji;
wherein WS(W_i) is the word weight of the i-th word, W(S_j) is the first weight of the j-th sentence, n_ji is the frequency of the i-th word in the j-th sentence, and m is the number of sentences in the text.
9. The device according to claim 6, characterized in that the computing unit is specifically configured to calculate the second weight of each sentence using the formula WW(S_j) = Σ_{i=1}^{n} n_ji · WS(W_i) / Σ_{i=1}^{n} n_ji;
wherein WW(S_j) is the second weight of the j-th sentence, WS(W_i) is the word weight of the i-th word, n_ji is the frequency of the i-th word in the j-th sentence, and n is the number of distinct words in the text.
10. The device according to claim 6, characterized in that the computing unit is specifically configured to calculate the final weight of each sentence using the formula W'(S_j) = αW(S_j) + (1-α)WW(S_j);
wherein W'(S_j) is the final weight of the j-th sentence, W(S_j) is the first weight of the j-th sentence, WW(S_j) is the second weight of the j-th sentence, and α is the regulatory factor, α ∈ [0,1].
CN201511026171.2A 2015-12-31 2015-12-31 Autoabstract generation method and device Expired - Fee Related CN105653704B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511026171.2A CN105653704B (en) 2015-12-31 2015-12-31 Autoabstract generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201511026171.2A CN105653704B (en) 2015-12-31 2015-12-31 Autoabstract generation method and device

Publications (2)

Publication Number Publication Date
CN105653704A CN105653704A (en) 2016-06-08
CN105653704B true CN105653704B (en) 2018-10-12

Family

ID=56491043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511026171.2A Expired - Fee Related CN105653704B (en) 2015-12-31 2015-12-31 Autoabstract generation method and device

Country Status (1)

Country Link
CN (1) CN105653704B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284357B (en) * 2018-08-29 2022-07-19 腾讯科技(深圳)有限公司 Man-machine conversation method, device, electronic equipment and computer readable medium
CN110287280B (en) * 2019-06-24 2023-09-29 腾讯科技(深圳)有限公司 Method and device for analyzing words in article, storage medium and electronic equipment
CN110750976A (en) * 2019-09-26 2020-02-04 平安科技(深圳)有限公司 Language model construction method, system, computer device and readable storage medium
CN117891933A (en) * 2023-12-11 2024-04-16 北京万物可知技术有限公司 Book abstract generation system based on large model
CN117648917B (en) * 2024-01-30 2024-03-29 北京点聚信息技术有限公司 Layout file comparison method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101008941A (en) * 2007-01-10 2007-08-01 复旦大学 Successive principal axes filter method of multi-document automatic summarization
CN103699525A (en) * 2014-01-03 2014-04-02 江苏金智教育信息技术有限公司 Method and device for automatically generating abstract on basis of multi-dimensional characteristics of text
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101008941A (en) * 2007-01-10 2007-08-01 复旦大学 Successive principal axes filter method of multi-document automatic summarization
CN103699525A (en) * 2014-01-03 2014-04-02 江苏金智教育信息技术有限公司 Method and device for automatically generating abstract on basis of multi-dimensional characteristics of text
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Survey of Research on Automatic Summarization of Social Short Texts; Liu Dexi et al.; Journal of Chinese Computer Systems (《小型微型计算机系统》); 20131231; Vol. 34, No. 12, pp. 2764-2771 *

Also Published As

Publication number Publication date
CN105653704A (en) 2016-06-08

Similar Documents

Publication Publication Date Title
CN105653704B (en) Autoabstract generation method and device
Ruder et al. Character-level and multi-channel convolutional neural networks for large-scale authorship attribution
Hu et al. Review sentiment analysis based on deep learning
Paisley et al. Bayesian Nonnegative Matrix Factorization with Stochastic Variational Inference.
CN103207899B (en) Text recommends method and system
CN104484343B (en) It is a kind of that method of the motif discovery with following the trail of is carried out to microblogging
Thorsrud Nowcasting using news topics. Big Data versus big bank
US9881059B2 (en) Systems and methods for suggesting headlines
CN108363790A (en) For the method, apparatus, equipment and storage medium to being assessed
US20150193535A1 (en) Identifying influencers for topics in social media
CN109241412A (en) A kind of recommended method, system and electronic equipment based on network representation study
Lee Unsupervised and supervised learning to evaluate event relatedness based on content mining from social-media streams
Gao et al. SeCo-LDA: Mining service co-occurrence topics for recommendation
CN106294418B (en) Search method and searching system
CN106202065A (en) A kind of across language topic detecting method and system
WO2022183923A1 (en) Phrase generation method and apparatus, and computer readable storage medium
CN109710762B (en) Short text clustering method integrating multiple feature weights
Mair et al. The grand old party–a party of values?
CN107766419B (en) Threshold denoising-based TextRank document summarization method and device
Yan et al. Two Diverging roads: a semantic network analysis of chinese social connection (“guanxi”) on Twitter
WO2021035955A1 (en) Text news processing method and device and storage medium
CN103514167B (en) Data processing method and equipment
Wang Extracting latent topics from user reviews using online LDA
Jiang et al. Parallel dynamic topic modeling via evolving topic adjustment and term weighting scheme
Bansal et al. Cryptocurrency price prediction using Twitter and news articles analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20181012