Embodiment
The present invention is described below with reference to the accompanying drawings and embodiments.
(first embodiment)
In this embodiment, the text stream of a text document named "A Dream of Red Mansions" is read; in this document, each title consists of a single paragraph. For ease of description, assume that the content read is as follows:
Discriminate the latent illusion of scholar for first time and know travel fatigue bosom, Tong Lingjiayu village young lady
Do row position reader: what you come by this book of road from? Though talk about root by near absurd, thin by then interesting deeply.Treat down this origin is being indicated, it is without doubt that the side is clear the person of readding.
The 5th migration dreamland refers to be confused 12 hairpins and drinks celestial wine with dregs song and drill A Dream of Red Mansions
Heavy back Gao Tian may sigh at all times feelings not to the utmost,
Pining lovers, pitiful wind and moon-scene debt difficulty is repaid.
The tenth time the greedy economic rights of golden widow are abused an imperial physician by a sick thin poor source
Nowadays existing quite a few masters imperial physician of our family are looking, all can not when vivid so saying.The happiness of saying so is arranged, the disease of saying so is arranged, this position says and is afraid of Winter Solstice always do not have an accurate work.Ask the grandfather to understand indication.
The 13 time but the minister in ancient times of the Qin extremely seals the luxuriant Wang Xi phoenix assistant manager of dragon taboo Ningguo mansion
Gold purple multifarious who manage state affairs, petticoats and hairpins one or two can be tame together.
The 14 time woods such as tax official Yangzhou, sea become the Jia Baoyu road to call on northern quiet king
Only lived one, must go back tomorrow.
Second early in the morning, just has the female Wangfu people of merchant to dismiss people's precious jade, and life is worn two clothes more again, and impunity would rather go.
Fig. 1 is a flowchart of the document title extraction method according to the first embodiment of the present invention. The method is described in detail below with reference to Fig. 1.
First, in step S100, the key symbols and the maximum length value of the titles in the above text document are preset.
In general, every title in a digital document contains a common key symbol that identifies it as a title, such as a keyword or another symbol. For a Chinese document, the key symbols may include at least one of the ordinal prefix "第", the chapter marker "回", the words for "chapter", "volume", "section" and "part", bullet symbols (for example, §) and numbering (for example, (一), (二), ... or 一), 二), ...). For an English document, the key symbols may include at least one of "Chapter", "Section", bullet symbols and numbering (for example, (i), (ii), ... or i), ii), ...). In the above text document, every title begins with an ordinal of the form "第…回", so every title contains the characters "第" and "回", and "回" appears after "第". Therefore, in this embodiment, the preset key symbols are the first keyword "第" and the second keyword "回", with the first keyword preceding the second.
The maximum title length should not be too large; otherwise it would contradict the concise nature of titles. In general, a title does not wrap onto a new line and forms a paragraph on its own. For the above text document, the maximum length value of a title can therefore be set to 40 characters.
In this embodiment, the preset keywords and maximum length value are stored in the following XML file:
It should be understood that the above XML file is merely exemplary; the preset key symbols and maximum length value can also be stored in any other known manner.
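As one such alternative to an XML file, the preset parameters could be held in a small in-memory structure. The following Python sketch is illustrative only: the name `TitlePreset` and its fields are hypothetical, and it assumes that the two keywords of this embodiment are the Chinese ordinal prefix 第 and the chapter marker 回.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TitlePreset:
    """Preset parameters of step S100 (hypothetical field names)."""
    keywords: Tuple[str, ...]  # key symbols, listed in the order they must appear
    max_length: int            # maximum title length, in characters


# Values of the first embodiment: "第" must precede "回", at most 40 characters.
PRESET = TitlePreset(keywords=("第", "回"), max_length=40)
```

Keeping the keywords in an ordered tuple records the required order of appearance without a separate parameter.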
Then, in step S101, the text stream read is divided into 12 paragraph text streams using the line break as the separator, and these 12 paragraph text streams form the paragraph text stream set {T}, with the paragraph as the unit.
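The splitting of step S101 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `split_paragraphs` is hypothetical, and discarding empty lines is an assumption that the description does not state.

```python
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split a text stream into paragraph text streams at line breaks.

    Assumption: blank lines carry no paragraph and are dropped.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]
```

`str.splitlines` handles both `\n` and `\r\n` line endings, so the same sketch applies to text files from different platforms.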
Then, in step S102, the number of characters in each paragraph text stream is read. If the number of characters exceeds the preset maximum length value of 40, the paragraph text stream is determined to be a pseudo-title text stream and is removed from the set {T}; the paragraph text streams remaining in {T} form the candidate title text stream set {S}. Specifically, the 2nd paragraph text stream in {T} (beginning "Do row position reader: ...") has 50 characters in the original Chinese text, which exceeds the preset maximum length value of 40, so it is removed from {T}. The 7th paragraph text stream (beginning "Nowadays existing quite a few masters ...") has 68 characters, which also exceeds 40, so it is likewise removed from {T}. This yields the following candidate title text stream set {S}:
Discriminate the latent illusion of scholar for first time and know travel fatigue bosom, Tong Lingjiayu village young lady
The 5th migration dreamland refers to be confused 12 hairpins and drinks celestial wine with dregs song and drill A Dream of Red Mansions
Heavy back Gao Tian may sigh at all times feelings not to the utmost,
Pining lovers, pitiful wind and moon-scene debt difficulty is repaid.
The tenth time the greedy economic rights of golden widow are abused an imperial physician by a sick thin poor source
The 13 time but the minister in ancient times of the Qin extremely seals the luxuriant Wang Xi phoenix assistant manager of dragon taboo Ningguo mansion
Gold purple multifarious who manage state affairs, petticoats and hairpins one or two can be tame together.
The 14 time woods such as tax official Yangzhou, sea become the Jia Baoyu road to call on northern quiet king
Only lived one, must go back tomorrow.
Second early in the morning, just has the female Wangfu people of merchant to dismiss people's precious jade, and life is worn two clothes more again, and impunity would rather go.
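The length filter of step S102 can be sketched as follows. The function name `filter_by_length` is hypothetical; a paragraph is kept when its character count does not exceed the preset maximum, matching the removal rule stated in step S102.

```python
from typing import List


def filter_by_length(paragraphs: List[str], max_length: int) -> List[str]:
    """Keep paragraphs whose character count does not exceed max_length.

    Longer paragraphs are treated as pseudo-title text streams and dropped.
    """
    return [p for p in paragraphs if len(p) <= max_length]
```

In Python, `len` counts Unicode code points, so one Chinese character counts as one character, consistent with the character counts used in this embodiment.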
Then, in step S103, because the candidate title text stream set {S} contains 10 candidate title text streams and the preset maximum length value is 40, a matrix $L$ of size 10 × 40 is created to represent the positions of the preset keywords in each candidate title text stream. Here, 10 is the number of candidate title text streams in {S} and 40 is the preset maximum length value of the title. The element $L_{i,j}$ of matrix $L$ corresponds to the position of the $j$-th character in the $i$-th candidate title text stream, where $i = 1, \ldots, 10$ and $j = 1, \ldots, 40$, and every element $L_{i,j}$ is initialized to 0. The initialized matrix $L$ is as follows:
$$L = \begin{pmatrix} 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{pmatrix}_{10 \times 40}$$
Then, in step S104, the candidate title text stream set {S} is traversed to obtain the positions of the keywords "第" and "回" in each candidate title text stream, and the elements at the corresponding positions in matrix L are set to 1. If a candidate title text stream does not contain the keyword "第" or "回", or if "第" and "回" appear in an order inconsistent with the preset order, that stream is a pseudo-title text stream and all elements of the corresponding row of matrix L remain 0. The following matrix L is thereby obtained:
Then, in step S105, because 2 keywords are preset, a matrix $A$ of size 1 × 2 is created to represent how the positions of the preset keywords are arranged across the candidate title text stream set {S}, and all of its elements are initialized to 0. Here, 2 is the number of preset keywords, and the element $A_i$ ($i = 1, 2$) of matrix $A$ represents how the position of the $i$-th keyword is arranged across all candidate title text streams in the set: $A_i = -1$ indicates that the positions of the $i$-th keyword form a descending order, $A_i = 0$ indicates that the position of the $i$-th keyword is fixed, and $A_i = 1$ indicates that the positions of the $i$-th keyword form an ascending order.
Then, in step S106, the position arrangement of each keyword across all candidate title text streams in the set is determined from matrix $L$, and the corresponding element of matrix $A$ is set to -1, 0 or 1 according to whether that arrangement is descending, fixed or ascending. Specifically, the first keyword "第" is always at the 1st position in matrix $L$, so its arrangement is fixed, whereas the position of the second keyword "回" in matrix $L$ changes from the 3rd position to the 4th position (for example, in going from the tenth to the thirteenth chapter title), so the positions of the second keyword "回" form an ascending order. Therefore, the following 1 × 2 matrix $A$ is created:
$$A = (0 \quad 1)$$
Here, the first element of matrix $A$ indicates that the position arrangement of the first keyword "第" in the candidate title text stream set {S} is fixed, and the second element indicates that the position arrangement of the second keyword "回" in {S} is ascending.
At this point, the position arrangement of the preset keywords across all candidate title text streams in the set {S} has been obtained and stored in matrix $A$.
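Steps S103 to S106 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: all function names are hypothetical, the keywords are assumed to be 第 and 回, only the first occurrence of each keyword is considered, and a keyword whose positions are neither fixed nor monotonic (a case the description does not cover) is reported as `None`.

```python
from typing import List, Optional, Sequence


def keyword_positions(stream: str, keywords: Sequence[str]) -> Optional[List[int]]:
    """1-based position of each keyword, or None if one is missing or out of order."""
    idx = [stream.find(k) for k in keywords]
    if any(i < 0 for i in idx) or any(a >= b for a, b in zip(idx, idx[1:])):
        return None
    return [i + 1 for i in idx]


def build_position_matrix(streams: Sequence[str], keywords: Sequence[str],
                          max_length: int) -> List[List[int]]:
    """Matrix L: one row per candidate stream, 1 at each keyword position."""
    L = [[0] * max_length for _ in streams]  # step S103: all zeros
    for i, stream in enumerate(streams):     # step S104: mark keyword positions
        positions = keyword_positions(stream, keywords)
        if positions is None:
            continue                         # pseudo-title: row stays all zero
        for p in positions:
            if p <= max_length:
                L[i][p - 1] = 1
    return L


def arrangement(L: Sequence[Sequence[int]], n_keywords: int) -> List[Optional[int]]:
    """Matrix A: 0 fixed, 1 ascending, -1 descending (None if mixed)."""
    columns: List[List[int]] = [[] for _ in range(n_keywords)]
    for row in L:
        ones = [j + 1 for j, v in enumerate(row) if v == 1]
        if len(ones) == n_keywords:          # the k-th 1 belongs to the k-th keyword
            for k, p in enumerate(ones):
                columns[k].append(p)
    A: List[Optional[int]] = []
    for seq in columns:
        pairs = list(zip(seq, seq[1:]))
        if all(a == b for a, b in pairs):
            A.append(0)
        elif all(a <= b for a, b in pairs):
            A.append(1)
        elif all(a >= b for a, b in pairs):
            A.append(-1)
        else:
            A.append(None)
    return A
```

For example, with the hypothetical stand-in streams `["第一回 甲", "诗句", "第十回 丙", "第十三回 丁"]`, the row for "诗句" stays all zero, and `arrangement` returns `[0, 1]`: the position of 第 is fixed at 1, while the position of 回 moves from 3 to 4.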
Then, in step S107, the candidate title text stream set {S} is traversed and the following steps are performed until the first title text stream is found. Taking the keyword positions in the current candidate title text stream as the reference, count the number nNum of candidate title text streams in {S} that are located at or after the current candidate title text stream (the current one included) and satisfy the position arrangement given by matrix A. Then compare nNum with m, the number of candidate title text streams in {S} that satisfy the position arrangement given by matrix A (here, m = 5); that is, m is the number of candidate title text streams remaining in {S} after the pseudo-title text streams (those that do not contain the preset key symbols, or in which the key symbols appear in an order inconsistent with the preset order) have been removed. If nNum < m/2 (that is, 3 after rounding up), the current candidate title text stream is determined to be a pseudo-title text stream and all elements of the corresponding row of matrix L are set to 0; if nNum ≥ m/2, the current candidate title text stream is determined to be the first title text stream. Specifically, the traversal starts with the first candidate title text stream in {S}. Taking the keyword positions in this stream as the reference, the number nNum of candidate title text streams in which the first keyword "第" has a fixed position and the second keyword "回" is in ascending order is 5, which is greater than 3, so the first candidate title text stream is the first title text stream.
Then, in step S108, taking the first title text stream determined in step S107 as the reference, all candidate title text streams in {S} located after that title text stream are traversed. Any candidate title text stream that is inconsistent with the determined position arrangement of the key symbols across the candidate title text streams in the set is determined to be a pseudo-title text stream, and all elements of the corresponding row of matrix L are set to 0. Then, the candidate title text streams corresponding to the rows of matrix L that contain a 1 are extracted as title text streams, forming the following title text stream set {E}:
Discriminate the latent illusion of scholar for first time and know travel fatigue bosom, Tong Lingjiayu village young lady
The 5th migration dreamland refers to be confused 12 hairpins and drinks celestial wine with dregs song and drill A Dream of Red Mansions
The tenth time the greedy economic rights of golden widow are abused an imperial physician by a sick thin poor source
The 13 time but the minister in ancient times of the Qin extremely seals the luxuriant Wang Xi phoenix assistant manager of dragon taboo Ningguo mansion
The 14 time woods such as tax official Yangzhou, sea become the Jia Baoyu road to call on northern quiet king
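The selection performed in steps S107 and S108 can be sketched in Python as follows. This is one possible reading of the description, not a definitive implementation: the function names are hypothetical, each stream is represented by its list of 1-based keyword positions (or `None` for a stream without the keywords in the preset order), and a stream is taken to "satisfy" the arrangement when each keyword position equals the reference position (arrangement 0), is not smaller than it (arrangement 1), or is not larger than it (arrangement -1).

```python
from typing import List, Optional, Sequence


def consistent(positions: Sequence[int], reference: Sequence[int],
               A: Sequence[int]) -> bool:
    """Check one stream against the reference stream under arrangement A."""
    for p, r, a in zip(positions, reference, A):
        if a == 0 and p != r:
            return False
        if a == 1 and p < r:
            return False
        if a == -1 and p > r:
            return False
    return True


def select_titles(positions: Sequence[Optional[Sequence[int]]],
                  A: Sequence[int]) -> List[int]:
    """Return the indices of the streams accepted as title text streams."""
    candidates = [i for i, p in enumerate(positions) if p is not None]
    m = len(candidates)
    for n, first in enumerate(candidates):
        reference = positions[first]
        later = candidates[n:]  # current stream and everything after it
        accepted = [i for i in later if consistent(positions[i], reference, A)]
        n_num = len(accepted)
        if n_num >= m / 2:      # step S107: first title text stream found
            return accepted     # step S108: drop inconsistent later streams
    return []
```

Because `n_num` is an integer, the test `n_num >= m / 2` with m = 5 is equivalent to requiring at least 3 consistent streams, which matches the threshold used in this embodiment.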
Finally, in step S109, the above title text stream set {E} can be displayed in a given format, which yields the table of contents of "A Dream of Red Mansions".
(second embodiment)
In this embodiment, the text stream of a text document named "long-living boundary" is read; in this document, each title consists of two paragraphs. For ease of description, assume that the content read is as follows:
The first volume
Chapter 1, military broken hollow
Can who be not dead in the world?
Either you absolute beauty, brilliant world crown, in the end is the Pink Skull; either you Genghis, sitting on rivers and mountains, in the end will eventually into a loess!
The first volume
Chapter 2, ancient upright stone tablet sky maps
Mo Yun rolls, and instant is unglazed in this world, and endless dark is shrouded and descended, and has hung as the curtain of death, and breath moment of a burst of dense terror fills the air in this world.
The first volume
Chapter 3, eight arms are disliked dragon
Mortified mysterious stone inscription in 10 years, Xiao Chen is benefited a great deal, and his physique constantly changes, and the sensation of remoulding oneself thoroughly has been arranged between indistinct.But, finally let him be, come from a fierce Great War in the Kun Lun Mountain what this width of cloth training of qi figure produced confidence.
The first volume
Chapter 4, savage and wild island
Xiao Chen walks out from coconut palm woods depths, is watching endless vast sea attentively, is imagining that huge monster, the fearful picture of in seawater, acting violently, this peerless really fierce beast!
The first volume
Chapter 5, beautiful shell dragon egg
Flashed over three, Xiao Chen hinders very fast that body recovers, and has six or seven to be preordained and can to return to one's perfect health again.
Sunset clouds have disappeared, and The night screen has hung down, but the seashore does not but have quietly to get off, and noise is increasing.
The first volume
Chapter 6, rough beast goes mad
Xiao Chen cried one bad, wash away towards longshore thick forest fast.Eight arms are disliked dragon and have been returned, and he must conceal figure as early as possible, not so will die without a burial place!
Fortunately, when disliking the huge fierce shadow of dragon in big marine manifesting, Xiao Chen has rushed in primitive area.
Not not for a long time, the make a whistling sound shake day of seashore dragon, though be separated by several in, the roar that is huge is still worn gold and is split stone, turns over like the people's qi and blood that shakes as the space mine and gushes.
The first volume
Chapter 7, heavenly steed walks in the moonlight
Xiao Chen climbs fast and flies on the ancient tree, and that fine gauze has been caught in the hand, and this seemingly hides the yarn of face, and is smooth, soft incomparable.Light if empty.Be rated as tops ground silk goods.
Embroidering a phoenix gleamingly above, careful survey ought really be life-like, and suddenly, what Xiao Chen remembered, a secondary familiar ground picture leaps to brain.The graceful beautiful woman's face of stature hides fine gauze
As can be seen from the content read above, the first paragraph of each title contains the keywords "第" and "卷" (volume), and the second paragraph contains the keywords "第" and "章" (chapter). Therefore, in this embodiment, the preset key symbols are the first keyword "第" and the second keyword "卷" in the first paragraph, and the third keyword "第" and the fourth keyword "章" in the second paragraph. In this case, in addition to the key symbols and the maximum length value of the title, the following preset parameters must be added: 1) the number of paragraphs that a title comprises; 2) the paragraph position of each key symbol.
Fig. 2 is a flowchart of the document title extraction method according to the second embodiment of the present invention. The method is described in detail below with reference to Fig. 2.
First, in step S200, as described above, the key symbols, the maximum length value, the number of paragraphs and the paragraph position of each key symbol are preset for the title. In this embodiment, the preset maximum length value of the title is 40 characters and the number of paragraphs is 2. The first keyword "第" and the second keyword "卷" are in the first paragraph, so the paragraph position preset for these two keywords is 1, and within the first paragraph the first keyword "第" must appear before the second keyword "卷". The third keyword "第" and the fourth keyword "章" are in the second paragraph, so the paragraph position preset for these two keywords is 2, and within the second paragraph the third keyword "第" must appear before the fourth keyword "章".
Then, in step S201, the above text stream is divided into 26 paragraph text streams using the line break as the separator, and these 26 paragraph text streams form the paragraph text stream set {T}, with the paragraph as the unit.
Then, in step S202, the paragraph text stream set {T} is filtered according to the preset maximum title length value of 40: paragraph text streams longer than 40 characters are removed from {T}, which yields the following candidate title text stream set {S}:
The first volume
Chapter 1, military broken hollow
Can who be not dead in the world?
The first volume
Chapter 2, ancient upright stone tablet sky maps
The first volume
Chapter 3, eight arms are disliked dragon
The first volume
Chapter 4, savage and wild island
The first volume
Chapter 5, beautiful shell dragon egg
Flashed over three, Xiao Chen hinders very fast that body recovers, and has six or seven to be preordained and can to return to one's perfect health again.
Sunset clouds have disappeared, and The night screen has hung down, but the seashore does not but have quietly to get off, and noise is increasing.
The first volume
Chapter 6, rough beast goes mad
Fortunately, when disliking the huge fierce shadow of dragon in big marine manifesting, Xiao Chen has rushed in primitive area.
The first volume
Chapter 7, heavenly steed walks in the moonlight
Then, in step S203, candidate title text streams are extracted from the candidate title text stream set {S} according to the preset number of title paragraphs and the paragraph positions of the key symbols. In this embodiment, a candidate title text stream consists of two adjacent paragraph text streams. Specifically, the first paragraph text stream of a candidate title text stream is extracted according to the first keyword "第", the second keyword "卷" and their paragraph position; taking this first paragraph text stream as the reference, the second paragraph text stream of the candidate title text stream is then extracted according to the third keyword "第", the fourth keyword "章" and their paragraph position. This yields the following further-extracted candidate title text stream set {S}:
The first volume
Chapter 1, military broken hollow
The first volume
Chapter 2, ancient upright stone tablet sky maps
The first volume
Chapter 3, eight arms are disliked dragon
The first volume
Chapter 4, savage and wild island
The first volume
Chapter 5, beautiful shell dragon egg
The first volume
Chapter 6, rough beast goes mad
The first volume
Chapter 7, heavenly steed walks in the moonlight
The above candidate title text stream set {S} contains 7 candidate title text streams in total, each consisting of two adjacent paragraph text streams. For example, the first candidate title text stream consists of the first paragraph text stream "The first volume" and the second paragraph text stream "Chapter 1, military broken hollow", and so on.
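The pairing of adjacent paragraphs in step S203 can be sketched in Python as follows. The function names are hypothetical, the keywords are assumed to be 第 and 卷 for the first paragraph and 第 and 章 for the second, and only the first occurrence of each keyword within a paragraph is checked, which is a simplification.

```python
from typing import List, Sequence, Tuple


def has_keywords_in_order(paragraph: str, keywords: Sequence[str]) -> bool:
    """True if every keyword occurs and first occurrences follow the preset order."""
    idx = [paragraph.find(k) for k in keywords]
    return all(i >= 0 for i in idx) and all(a < b for a, b in zip(idx, idx[1:]))


def pair_title_paragraphs(paragraphs: Sequence[str],
                          first_keywords: Sequence[str],
                          second_keywords: Sequence[str]) -> List[Tuple[str, str]]:
    """Collect adjacent paragraph pairs that match the two-paragraph title pattern."""
    pairs: List[Tuple[str, str]] = []
    i = 0
    while i + 1 < len(paragraphs):
        if (has_keywords_in_order(paragraphs[i], first_keywords)
                and has_keywords_in_order(paragraphs[i + 1], second_keywords)):
            pairs.append((paragraphs[i], paragraphs[i + 1]))
            i += 2  # both paragraphs are consumed by this candidate title
        else:
            i += 1
    return pairs
```

Short body paragraphs that survive the length filter are discarded at this stage, because they do not start a pair matching both keyword patterns.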
Then, in step S204, because the candidate title text stream set {S} contains 7 candidate title text streams and the preset maximum length value is 40, a matrix $L$ of size 7 × 40 is created and every element $L_{i,j}$ of matrix $L$ is initialized to 0.
Then, in step S205, the candidate title text stream set {S} is traversed to obtain the positions, in each candidate title text stream, of the first keyword "第" and the second keyword "卷" (paragraph position 1) and of the third keyword "第" and the fourth keyword "章" (paragraph position 2), and the elements at the corresponding positions in matrix L are set to 1. The following matrix L is thereby obtained:
As can be seen from the above matrix, the positions of the first keyword "第" and the second keyword "卷" (paragraph position 1) in the candidate title text streams always remain the 1st and the 3rd, and the positions of the third keyword "第" and the fourth keyword "章" (paragraph position 2) always remain the 4th and the 6th.
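The positions 1, 3, 4 and 6 can be reproduced with a short Python sketch. It assumes that the two paragraphs of a candidate title are concatenated without a separator and that every keyword is present; the function name `paired_positions` and the sample strings are hypothetical stand-ins.

```python
from typing import List, Sequence


def paired_positions(first: str, second: str,
                     first_keywords: Sequence[str],
                     second_keywords: Sequence[str]) -> List[int]:
    """1-based keyword positions in the concatenation of two title paragraphs.

    Assumption: every keyword occurs in its paragraph.
    """
    positions = [first.find(k) + 1 for k in first_keywords]
    offset = len(first)  # the second paragraph starts right after the first
    positions += [offset + second.find(k) + 1 for k in second_keywords]
    return positions
```

For a title such as "第一卷" followed by "第一章 甲", the sketch yields `[1, 3, 4, 6]`: the keywords of the second paragraph are shifted by the three characters of the first paragraph.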
Then, in step S206, because 4 key symbols are preset, a matrix $A$ of size 1 × 4 is created and all of its elements are initialized to 0.
Then, in step S207, because the positions in matrix L of the first keyword "第" and the second keyword "卷" (paragraph position 1) and of the third keyword "第" and the fourth keyword "章" (paragraph position 2) are all fixed, the following 1 × 4 matrix $A$ is created:
$$A = (0 \quad 0 \quad 0 \quad 0)$$
Then, in step S208, the candidate title text stream set {S} is traversed and the following steps are performed until the first title text stream is found. Taking the keyword positions in the current candidate title text stream as the reference, count the number nNum of candidate title text streams in {S}, starting from the current candidate title text stream, that satisfy the position arrangement given by matrix A, and compare nNum with m, the number of candidate title text streams in {S} that satisfy the position arrangement given by matrix A (here, m = 7). If nNum < m/2 (that is, 4 after rounding up), all elements of the corresponding row of matrix L are set to 0; if nNum ≥ m/2, the current candidate title text stream is determined to be the first title text stream. Specifically, the traversal starts with the first candidate title text stream in {S}. Taking the keyword positions in this candidate title text stream as the reference, the number nNum of candidate title text streams in which the positions of the first keyword "第", the second keyword "卷", the third keyword "第" and the fourth keyword "章" are fixed is 7, which is greater than 4, so the first candidate title text stream is the first title text stream.
Then, in step S209, taking the first title text stream determined in step S208 as the reference, all candidate title text streams in {S} located after that title text stream are traversed. Any candidate title text stream that is inconsistent with the determined position arrangement of the key symbols across the candidate title text streams in the set is determined to be a pseudo-title text stream, and all elements of the corresponding row of matrix L are set to 0. Then, the candidate title text streams corresponding to the rows of matrix L that contain a 1 are extracted as title text streams, forming the following title text stream set {E}:
The first volume
Chapter 1, military broken hollow
The first volume
Chapter 2, ancient upright stone tablet sky maps
The first volume
Chapter 3, eight arms are disliked dragon
The first volume
Chapter 4, savage and wild island
The first volume
Chapter 5, beautiful shell dragon egg
The first volume
Chapter 6, rough beast goes mad
The first volume
Chapter 7, heavenly steed walks in the moonlight
Finally, in step S210, the above title text stream set {E} can be displayed in a given format, which yields the table of contents of "long-living boundary".
It should be understood that the above embodiments are merely exemplary. The method of the invention is applicable not only to text documents but also to structured documents such as PDF, DOC and HTML. Once such a document has been converted into a paragraph text stream set by suitable processing (for example, using one of the many existing paragraph recognition techniques), titles can likewise be extracted using the preset keywords and maximum length value of the title, although the information characteristics of the structured document itself (for example, pagination information) should be taken into account. In addition, the method of the invention is applicable not only to Chinese but also to text in other languages. Moreover, the extracted titles can be used not only to create a table of contents but also in any other similar application that requires title extraction, such as document structuring or reading order recognition.
In addition, it should be understood that, although the above description uses titles comprising two paragraphs as an example, the invention is equally applicable to titles comprising more than two paragraphs. Moreover, for a title in which only the first paragraph contains keywords, the paragraph text stream containing the keywords can be extracted according to the flowchart shown in Fig. 1, and then, taking that paragraph text stream as the reference, the adjacent paragraph text streams are extracted as the text streams of the remaining paragraphs of the title.
A document title extraction device according to an embodiment of the present invention is described below with reference to Fig. 3.
Referring to Fig. 3, the device comprises a text input module 100, a preset module 200 and a text analysis module 300. The text input module 100 is used to read the text stream of the document to be processed. The preset module 200 is used to preset the key symbols and the maximum length value of the titles in the document read by the text input module. The text analysis module 300 is used to extract the title text streams from the text stream read by the text input module, according to the key symbols and maximum length value preset by the preset module.
The text analysis module 300 further comprises a paragraph parsing unit 301, a title text stream verification unit 302 and a key symbol analysis unit 303. The paragraph parsing unit 301 is used to divide the text stream read by the text input module into one or more paragraph text streams using the line break as the separator, and to form these paragraph text streams into the paragraph text stream set {T}, with the paragraph as the unit. The title text stream verification unit 302 is used to extract from the paragraph text stream set {T} the paragraph text streams whose length does not exceed the preset maximum length value, forming the candidate title text stream set {S}. The key symbol analysis unit 303 is used to filter out the pseudo-title text streams in the candidate title text stream set {S} according to the preset key symbols, and to extract the candidate title text streams remaining in {S} to form the title text stream set {E}.
It should be noted that, when a title comprises several paragraphs, the preset module 200 must preset not only the key symbols and maximum length value of the title but also the number of paragraphs that the title comprises and the paragraph position of each key symbol. In that case, after forming the candidate title text stream set {S} according to the maximum length value, the title text stream verification unit 302 must also extract from that set, according to the preset number of paragraphs and paragraph positions, the candidate title text streams each made up of that number of paragraph text streams, forming the further-extracted candidate title text stream set {S}.
In addition, the device may also comprise a table-of-contents forming module 304, which is used to form a table of contents using the extracted title text streams as its entries. It should be understood that the table-of-contents forming module 304 is only one example of an application of the extracted title text streams; it may be replaced by any other module that displays, records or otherwise uses the extracted title text streams.
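The module structure of the device can be sketched in Python as follows. This is an illustrative skeleton only: all class and function names are hypothetical, the keywords are assumed to be 第 and 回, and the key symbol analysis is reduced to a keyword-order check without the position arrangement statistics of the method embodiments.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Preset:
    """Preset module 200: key symbols and maximum title length."""
    keywords: Tuple[str, ...]
    max_length: int


def read_text(source: str) -> str:
    """Text input module 100 (here the text is simply passed in)."""
    return source


class TextAnalysis:
    """Text analysis module 300 with its three units."""

    def __init__(self, preset: Preset) -> None:
        self.preset = preset

    def parse_paragraphs(self, text: str) -> List[str]:
        """Paragraph parsing unit 301: split at line breaks."""
        return [p.strip() for p in text.splitlines() if p.strip()]

    def verify_length(self, paragraphs: List[str]) -> List[str]:
        """Title text stream verification unit 302: apply the length limit."""
        return [p for p in paragraphs if len(p) <= self.preset.max_length]

    def analyse_key_symbols(self, candidates: List[str]) -> List[str]:
        """Key symbol analysis unit 303 (simplified to a keyword-order check)."""
        kept = []
        for c in candidates:
            idx = [c.find(k) for k in self.preset.keywords]
            if all(i >= 0 for i in idx) and idx == sorted(idx):
                kept.append(c)
        return kept

    def extract_titles(self, text: str) -> List[str]:
        return self.analyse_key_symbols(
            self.verify_length(self.parse_paragraphs(text)))


def form_catalogue(titles: List[str]) -> str:
    """Table-of-contents forming module 304: one entry per line."""
    return "\n".join(titles)
```

A caller would chain the modules, for example `form_catalogue(TextAnalysis(Preset(("第", "回"), 40)).extract_titles(read_text(text)))`.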
The present invention has been described in detail above with reference to the accompanying drawings and embodiments. It should be understood, however, that the present invention is not limited to the specific embodiments disclosed above; any modification or variation that a person skilled in the art can readily conceive on this basis shall fall within the protection scope of the present invention.
For example, in the flow shown in Fig. 1, the order of step S103 and step S105 is not limited to that shown in Fig. 1; matrices L and A can be created at any suitable time. In addition, the positions of the preset key symbols in each candidate title text stream, and the position arrangement of the preset key symbols across the candidate title text stream set, need not be represented as matrices; other forms can also be used, for example data structures such as one-dimensional arrays, queues, stacks or graphs. Moreover, besides filtering out pseudo-title text streams according to the position arrangement of the key symbols in the candidate title text streams, filtering can also be based on other attributes of the key symbols, such as the position of a key symbol relative to the candidate title text stream or the positional relationship among several key symbols.
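As an illustration of such an alternative representation, the 0/1 position matrix can be replaced by one list of keyword positions per candidate stream. The Python sketch below converts between the two forms; the function names are hypothetical and the conversion is only one of many possible encodings.

```python
from typing import List, Sequence


def to_position_lists(matrix: Sequence[Sequence[int]]) -> List[List[int]]:
    """Replace each 0/1 row by the list of 1-based positions that hold a 1."""
    return [[j + 1 for j, v in enumerate(row) if v == 1] for row in matrix]


def to_matrix(position_lists: Sequence[Sequence[int]], width: int) -> List[List[int]]:
    """Rebuild the 0/1 matrix of the given width from the position lists."""
    matrix = [[0] * width for _ in position_lists]
    for i, positions in enumerate(position_lists):
        for p in positions:
            matrix[i][p - 1] = 1
    return matrix
```

The list form stores only the keyword positions, so an empty list plays the role of an all-zero row for a pseudo-title text stream.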