CN102375806A - Document title extraction method and device - Google Patents

Document title extraction method and device

Info

Publication number
CN102375806A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010102612682A
Other languages
Chinese (zh)
Other versions
CN102375806B (en)
Inventor
李松峰
邓姿
王长桥
张军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Leade Technology Development Co., Ltd.
Original Assignee
BEIJING FOUNDER FEIYUE MEDIA TECHNOLOGY Co Ltd
Peking University Founder Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING FOUNDER FEIYUE MEDIA TECHNOLOGY Co Ltd, Peking University Founder Group Co Ltd filed Critical BEIJING FOUNDER FEIYUE MEDIA TECHNOLOGY Co Ltd
Priority to CN201010261268.2A priority Critical patent/CN102375806B/en
Publication of CN102375806A publication Critical patent/CN102375806A/en
Application granted granted Critical
Publication of CN102375806B publication Critical patent/CN102375806B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention provides a document title extraction method comprising the following steps: presetting a key symbol and a maximum length value for the titles in a document to be processed; and extracting title text streams from the character stream of the document according to the preset key symbol and maximum length value. The invention also provides a corresponding document title extraction device. With the invention, titles in various digital documents can be extracted simply by presetting the key symbols and maximum length value of the document titles. This applies in particular to plain text documents, where the limitation of having no attribute settings is overcome, which is a great convenience for applications that need to extract titles.

Description

Document title extraction method and device
Technical Field
The present invention relates to the field of text data processing, and in particular to a document title extraction method and device.
Background
With the development of computer technology, many information resources are now stored as electronic documents. How to effectively extract the logical structure (for example, document titles) and structural information from electronic document data is a key problem in many current digital document structure analysis and application technologies.
For example, in e-book reading, which is being accepted by more and more people, the data formats used are, from a technical standpoint, either compressed image formats (such as djvu/djv and pdf) or text (that is, ASCII code) formats (such as pdf, doc, and txt). At present, however, e-books in both kinds of format are generally produced without a table of contents, which is very inconvenient for reading and for searching by table of contents. It is therefore desirable to extract document titles from the images or text of an e-book as table-of-contents entries, and thereby form a table of contents.
To address this problem in e-book reading, the paper "Implementation of an OCR-based algorithm for automatically generating e-book tables of contents" (Modern Information, Vol. 24, No. 9) proposes a method that generates an e-book table of contents automatically. The method first uses OCR (optical character recognition) to form the logical state of the digitized bibliography, and then builds a directory tree by separating information such as leading-space values, titles, and page numbers. This method works only for image-format e-books, and the quality of the OCR directly affects the extraction result. For text-format e-books that carry no attributes such as font, font size, or position, it cannot extract document titles or build a table of contents.
Similarly, widely used office software such as WORD and WPS can extract document titles to form a table of contents, but it also requires that titles have attributes that distinguish them from body text, such as bold type or font size. Such office software therefore cannot extract titles from plain text documents either.
In addition, Chinese patent application No. 200710179938.4, "An indexing method for complex layouts based on PDF", proposes a method for indexing document titles in PDF documents. The method analyzes the PDF to obtain its text together with position, font, font size, and similar information, automatically groups the text into blocks according to adjacency and similarity, and then determines document titles from information such as font size. It can extract chapter titles whose font, font size, and so on differ from those of the body text. However, if a chapter title has the same font and font size as the body text, and the same adjacency relationship to it, the title is difficult to extract. Moreover, because plain text documents carry no font or font-size attributes, the method cannot extract titles from plain text documents.
As the above shows, for applications such as e-book reading that need to extract titles from digital documents, there is currently no effective way to extract document titles from plain text documents that lack attributes such as font and font size, or from documents in which those attributes do not distinguish titles from body text.
Summary of the Invention
To solve the above problems, the present invention provides a document title extraction method and device that can extract titles from various digital documents.
To this end, the document title extraction method provided by the invention comprises the following steps: presetting key symbols and a maximum length value for the titles in a document to be processed; and extracting title text streams from the character stream of the document according to the preset key symbols and maximum length value.
Preferably, for a Chinese document, the key symbols of the title include at least one of "第" (ordinal prefix), "回" (chapter, as used in classical novels), "章" (chapter), "卷" (volume), "节" (section), the word for "part", bullet symbols, and numbering; for an English document, the key symbols of the title include at least one of "Chapter", "Section", bullet symbols, and numbering.
Preferably, the step of extracting title text streams from the character stream of the document comprises: splitting the character stream into one or more paragraph character streams, using the line break as the separator, and forming a set of paragraph character streams; extracting from that set the paragraph character streams whose length is less than the preset maximum length value, to form a set of candidate title text streams; and filtering pseudo-title text streams out of the candidate set according to the preset key symbols, and extracting the remaining candidate title text streams to form a set of title text streams.
Preferably, when a title comprises multiple paragraphs, the number of paragraphs in a title and the paragraph in which each key symbol is located are preset in addition to the key symbols and the maximum length value. After the set of candidate title text streams is formed, candidate title text streams each consisting of that number of paragraph character streams are extracted from the set according to the preset number of paragraphs and paragraph positions, forming a further-refined set of candidate title text streams.
Preferably, the step of filtering pseudo-title text streams out of the candidate set comprises: determining the positional arrangement of the preset key symbols across all candidate title text streams in the set, the arrangement being ascending, descending, or fixed; traversing the candidate set and performing the following analysis step until a first title text stream is found: determining the positional arrangement of the preset key symbols across the candidate title text streams from the current one onward, and, if this arrangement is consistent with the arrangement determined across all candidate title text streams in the set, determining the current candidate to be the first title text stream, and otherwise determining it to be a pseudo-title text stream; and traversing all candidate title text streams after the first title text stream found, and determining as pseudo-title text streams those that are inconsistent with the arrangement determined across all candidate title text streams in the set.
Preferably, the step of determining the positional arrangement of the preset key symbols across all candidate title text streams in the set comprises: creating a matrix $L$ of size $m \times \mathit{maxLength}$ that represents the positions of the preset key symbols in each candidate title text stream, where $m$ is the number of candidate title text streams in the set, $\mathit{maxLength}$ is the preset maximum title length, and element $L_{i,j}$ corresponds to the $j$-th character position of the $i$-th candidate title text stream, $i = 1, \ldots, m$, $j = 1, \ldots, \mathit{maxLength}$, each element $L_{i,j}$ being initialized to 0; traversing the candidate set and, for each candidate, traversing all preset key symbols, obtaining the position of each key symbol in the candidate title text stream, and setting the corresponding element $L_{i,j}$ to 1; creating a matrix $A$ of size $1 \times n$ that represents the positional arrangement of the preset key symbols in the candidate set, with all elements initialized to 0, where $n$ is the number of preset key symbols and element $A_i$, $i = 1, \ldots, n$, represents the arrangement of the $i$-th key symbol: $A_i = -1$ means its positions form a descending order, $A_i = 0$ means its position is fixed, and $A_i = 1$ means its positions form an ascending order; and determining from matrix $L$ the positional arrangement of each key symbol in the candidate set, and setting the corresponding element of $A$ to $-1$, $0$, or $1$ according to whether that arrangement is descending, fixed, or ascending.
Preferably; Said analytical procedure may further comprise the steps: add up similar caption text adfluxion satisfies the similar caption text stream of the positional alignment mode shown in the matrix A in closing from the similar caption text stream of current similar caption text stream beginning number n Num; And the number m that satisfies the similar caption text stream of the positional alignment mode shown in the matrix A during nNum and similar caption text adfluxion closed compares; If nNum<m/2; Confirm that then current similar caption text stream is pseudo-caption text stream, and all elements of corresponding line is set to 0 in the matrix L; If nNum >=m/2 confirms that then current similar caption text stream is first caption text stream;
Preferably;, ergodic classes is positioned at the similar caption texts stream of afterwards all of first caption text stream that finds in closing like the caption text adfluxion; The inconsistent similar caption text stream of positional alignment mode in will flowing with the similar caption text of key symbol all in similar caption text adfluxion is closed of statistics is confirmed as after the pseudo-caption text stream; The all elements of the corresponding line in the matrix L is set to 0; And the corresponding similar caption text stream of the row that has element 1 in extraction and the matrix L forms the caption text adfluxion and closes as caption text stream.
Preferably, also comprise the caption text stream that extracts as the step of catalogue entry with the formation catalogue.
In another aspect, the invention provides a document title extraction device comprising: a text input module for reading the character stream of a document to be processed; a preset module for presetting the key symbols and maximum length value of the titles in the document read by the text input module; and a text analysis module for extracting title text streams from the character stream read by the text input module, according to the key symbols and maximum length value preset by the preset module.
Preferably, the text analysis module comprises: a paragraph parsing unit for splitting the character stream read by the text input module into one or more paragraph character streams, using the line break as the separator, and forming a set of paragraph character streams; a title text stream checking unit for extracting from that set the paragraph character streams whose length is less than the preset maximum length value, to form a set of candidate title text streams; and a key symbol analysis unit for filtering pseudo-title text streams out of the candidate set according to the preset key symbols, and extracting the remaining candidate title text streams to form a set of title text streams.
Preferably, the key symbol analysis unit performs the following steps: determining the positional arrangement of the preset key symbols across all candidate title text streams in the set, the arrangement being ascending, descending, or fixed; traversing the candidate set and performing the following analysis step until a first title text stream is found: determining the positional arrangement of the preset key symbols across the candidate title text streams from the current one onward, and, if this arrangement is consistent with the arrangement determined across all candidate title text streams in the set, determining the current candidate to be the first title text stream, and otherwise determining it to be a pseudo-title text stream; and traversing all candidate title text streams after the first title text stream found, and determining as pseudo-title text streams those that are inconsistent with the arrangement determined across all candidate title text streams in the set.
Preferably, the device further comprises a table-of-contents forming module for using the extracted title text streams as entries to form a table of contents.
With the above technical scheme, chapter titles in various digital documents can be extracted simply by presetting the key symbols and maximum length value of the document titles. For plain text documents in particular, this overcomes the limitation of having no attribute settings, and it is a great convenience for applications that need to extract document titles (such as e-book reading and reading-order recognition).
Brief Description of the Drawings
Fig. 1 is a flowchart of the document title extraction method according to the first embodiment of the invention;
Fig. 2 is a flowchart of the document title extraction method according to the second embodiment of the invention;
Fig. 3 is a block diagram of the document title extraction device according to an embodiment of the invention.
Detailed Description of the Embodiments
The invention is described below with reference to the drawings and embodiments.
(First embodiment)
In this embodiment, the character stream of a text document named "A Dream of Red Mansions" is read. In this text document, each title consists of a single paragraph. For ease of description, assume the content read is as follows:
第一回 甄士隐梦幻识通灵 贾雨村风尘怀闺秀 (Chapter 1: Zhen Shiyin in a dream encounters the spiritual jade; Jia Yucun in hard times longs for a maiden)
Dear readers: where, you ask, does this book come from? The story of its origin may sound absurd, yet on closer examination it is deeply interesting. Let me set out its origin so that the reader may understand it without confusion.
第五回 游幻境指迷十二钗 饮仙醪曲演红楼梦 (Chapter 5: Touring the Land of Illusion, the riddle of the Twelve Beauties; drinking immortal wine, the songs of A Dream of Red Mansions)
Firm as the earth and lofty as heaven, passion from time immemorial knows no end;
pity the lovesick men and women, whose debts of love are hard to repay.
第十回 金寡妇贪利权受辱 张太医论病细穷源 (Chapter 10: Widow Jin, greedy for gain, swallows an insult; Doctor Zhang traces an illness to its source)
We now have several imperial physicians attending her, yet none can speak with any certainty. One says it is a pregnancy, another says it is an illness, and another fears for the winter solstice; there is no definite word. Pray, sir, give us a clear indication.
第十三回 秦可卿死封龙禁尉 王熙凤协理宁国府 (Chapter 13: Qin Keqing dies and is granted the rank of Imperial Guard; Wang Xifeng helps manage the Ningguo mansion)
Of all the officials in gold and purple, who can govern the state? One or two in skirts and hairpins can manage a household.
第十四回 林如海捐馆扬州城 贾宝玉路谒北静王 (Chapter 14: Lin Ruhai dies in Yangzhou; Jia Baoyu meets the Prince of Beijing on the road)
Stay only one day and be sure to come back tomorrow.
Early the next morning, Grandmother Jia and Lady Wang sent someone to see Baoyu, telling him to put on two more garments and to come back if there was nothing to do.
Fig. 1 is a flowchart of the document title extraction method according to the first embodiment of the invention. The method is described in detail below with reference to Fig. 1.
First, in step S100, the key symbols and maximum length value of the titles in the above text document are preset.
In general, every title in a digital document contains a common key symbol that identifies it as a title, such as a key character, a keyword, or another symbol. For a Chinese document, the key symbols may include at least one of "第" (ordinal prefix), "回" (chapter, as used in classical novels), "章" (chapter), "卷" (volume), "节" (section), the word for "part", bullet symbols (such as §), and numbering (such as (一), (二), ... or 一), 二), ...). For an English document, the key symbols may include at least one of "Chapter", "Section", bullet symbols, and numbering (such as (i), (ii), ... or i), ii), ...). In the above text document, every title contains the characters "第" and "回", with "回" appearing after "第". In this embodiment, therefore, the preset key symbols are the first keyword "第" and the second keyword "回", with the first keyword preceding the second.
The maximum title length should not be too large, since titles are by nature short and clear; in general, a title does not wrap onto further lines and stands as a paragraph of its own. For the above text document, the maximum title length can therefore be set to 40 characters.
In this embodiment, the preset keywords and maximum length are stored in the following XML file:
[Figure BSA00000241110300081: XML file holding the preset keywords and maximum length; image not reproduced]
It should be understood that the above XML file is only an example; the preset key symbols and maximum length value may be stored in any other known manner.
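As a rough illustration of such a configuration, the sketch below parses a hypothetical XML layout with Python's standard library. The element and attribute names (`titleConfig`, `maxLength`, `keySymbol`, `order`) are assumptions made for this example only; they are not taken from the file shown in the original figure.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration layout (names are assumptions for this sketch).
CONFIG_XML = """
<titleConfig>
  <maxLength>40</maxLength>
  <keySymbols>
    <keySymbol order="1">第</keySymbol>
    <keySymbol order="2">回</keySymbol>
  </keySymbols>
</titleConfig>
"""

def load_config(xml_text):
    """Return (key_symbols, max_length) from the hypothetical XML layout."""
    root = ET.fromstring(xml_text.strip())
    max_length = int(root.findtext("maxLength"))
    # Sort by the "order" attribute so the preset order of appearance is kept.
    elements = sorted(root.iter("keySymbol"), key=lambda e: int(e.get("order")))
    return [e.text for e in elements], max_length

key_symbols, max_length = load_config(CONFIG_XML)
print(key_symbols, max_length)
```

Storing the order explicitly keeps the constraint that the first keyword must precede the second, which the later steps rely on.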
Then, in step S101, the character stream read is split into 12 paragraph character streams, using the line break as the separator, and these 12 paragraph character streams form the set of paragraph character streams {T}, with the paragraph as the unit.
Then, in step S102, the character count of each paragraph character stream is read. If the count exceeds the preset maximum length value of 40, the paragraph character stream is determined to be a pseudo-title text stream and is removed from the set {T}; the paragraph character streams remaining in {T} form the set of candidate title text streams {S}. Specifically, the 2nd paragraph character stream in {T}, "Dear readers: where, you ask, does this book come from? The story of its origin may sound absurd, yet on closer examination it is deeply interesting. Let me set out its origin so that the reader may understand it without confusion.", has 50 characters in the original text, which exceeds the preset maximum of 40, so it is removed from {T}. The 7th paragraph character stream, "We now have several imperial physicians attending her, yet none can speak with any certainty. One says it is a pregnancy, another says it is an illness, and another fears for the winter solstice; there is no definite word. Pray, sir, give us a clear indication.", has 68 characters in the original text, which also exceeds 40, so it too is removed from {T}. This yields the following set of candidate title text streams {S}:
第一回 甄士隐梦幻识通灵 贾雨村风尘怀闺秀 (Chapter 1: Zhen Shiyin in a dream encounters the spiritual jade; Jia Yucun in hard times longs for a maiden)
第五回 游幻境指迷十二钗 饮仙醪曲演红楼梦 (Chapter 5: Touring the Land of Illusion, the riddle of the Twelve Beauties; drinking immortal wine, the songs of A Dream of Red Mansions)
Firm as the earth and lofty as heaven, passion from time immemorial knows no end;
pity the lovesick men and women, whose debts of love are hard to repay.
第十回 金寡妇贪利权受辱 张太医论病细穷源 (Chapter 10: Widow Jin, greedy for gain, swallows an insult; Doctor Zhang traces an illness to its source)
第十三回 秦可卿死封龙禁尉 王熙凤协理宁国府 (Chapter 13: Qin Keqing dies and is granted the rank of Imperial Guard; Wang Xifeng helps manage the Ningguo mansion)
Of all the officials in gold and purple, who can govern the state? One or two in skirts and hairpins can manage a household.
第十四回 林如海捐馆扬州城 贾宝玉路谒北静王 (Chapter 14: Lin Ruhai dies in Yangzhou; Jia Baoyu meets the Prince of Beijing on the road)
Stay only one day and be sure to come back tomorrow.
Early the next morning, Grandmother Jia and Lady Wang sent someone to see Baoyu, telling him to put on two more garments and to come back if there was nothing to do.
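Steps S101 and S102 can be sketched in a few lines of Python. This is a minimal sketch, not the patented implementation: the function names are hypothetical, the sample text is an abridged stand-in for the document (the long body paragraph is placeholder text), and the comparison keeps paragraphs of at most `max_length` characters, following the embodiment's "exceeds 40" wording.

```python
def split_paragraphs(text):
    """S101: split the character stream into paragraph streams at line breaks."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def candidate_titles(paragraphs, max_length):
    """S102: drop paragraphs longer than max_length; the rest are candidates."""
    return [p for p in paragraphs if len(p) <= max_length]

# Abridged, partly synthetic sample (assumption: not the full document).
sample = "\n".join([
    "第一回 甄士隐梦幻识通灵 贾雨村风尘怀闺秀",
    "正文" * 30,  # placeholder body paragraph, 60 characters long
    "第五回 游幻境指迷十二钗 饮仙醪曲演红楼梦",
    "厚地高天，堪叹古今情不尽，",
])

T = split_paragraphs(sample)
S = candidate_titles(T, max_length=40)
print(len(T), len(S))
```

Note that short body lines, such as the line of verse, survive this step; they are only removed later by the key symbol analysis.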
Then, in step S103, because the candidate set {S} contains 10 candidate title text streams and the preset maximum length value is 40, a matrix $L$ of size $10 \times 40$ is created to represent the positions of the preset keywords in each candidate title text stream. Here 10 is the number of candidate title text streams in {S}, 40 is the preset maximum title length, and element $L_{i,j}$ corresponds to the $j$-th character position of the $i$-th candidate title text stream, $i = 1, \ldots, 10$, $j = 1, \ldots, 40$. Each element $L_{i,j}$ is initialized to 0, so the initialized matrix is the $10 \times 40$ zero matrix:
$$L = \mathbf{0}_{10 \times 40}$$
Then, in step S104, the candidate set {S} is traversed, the positions of the keywords "第" and "回" in each candidate title text stream are obtained, and the elements at the corresponding positions of $L$ are set to 1. If a candidate title text stream does not contain the keyword "第" or "回", or if "第" and "回" appear in an order that differs from the preset order, that candidate is a pseudo-title text stream and all elements of its row in $L$ are kept at 0. This yields the following matrix, in which columns 5 through 40 are all zero:
$$
L = \begin{pmatrix}
1 & 0 & 1 & 0 & \cdots & 0 \\
1 & 0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 0 & 0 & \cdots & 0 \\
0 & 0 & 0 & 0 & \cdots & 0 \\
1 & 0 & 1 & 0 & \cdots & 0 \\
1 & 0 & 0 & 1 & \cdots & 0 \\
0 & 0 & 0 & 0 & \cdots & 0 \\
1 & 0 & 0 & 1 & \cdots & 0 \\
0 & 0 & 0 & 0 & \cdots & 0 \\
0 & 0 & 0 & 0 & \cdots & 0
\end{pmatrix}
$$
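The construction of matrix L in steps S103 and S104 might look like the following sketch. The function name is hypothetical, and one simplification is assumed: each key symbol is searched for after the previous one, so a candidate whose symbols are missing or out of order keeps an all-zero row. The sample strings are illustrative stand-ins for entries of {S}, and positions are 0-based in the code (1-based in the text).

```python
def build_position_matrix(candidates, key_symbols, max_length):
    """Return an m x max_length 0/1 matrix marking key-symbol positions.

    A row stays all zeros when a candidate lacks a key symbol or when the
    symbols do not occur in the preset order.
    """
    matrix = []
    for text in candidates:
        row = [0] * max_length
        positions = []
        start = 0
        for symbol in key_symbols:
            idx = text.find(symbol, start)  # search after the previous symbol
            if idx < 0 or idx >= max_length:
                positions = None            # missing or out of order
                break
            positions.append(idx)
            start = idx + len(symbol)
        if positions is not None:
            for idx in positions:
                row[idx] = 1
        matrix.append(row)
    return matrix

# Illustrative candidates (assumed sample strings).
candidates = [
    "第一回 甄士隐梦幻识通灵 贾雨村风尘怀闺秀",
    "厚地高天，堪叹古今情不尽，",
    "第十三回 秦可卿死封龙禁尉 王熙凤协理宁国府",
    "只住一日，明日必回。",
]
L = build_position_matrix(candidates, ["第", "回"], max_length=40)
for row in L:
    print([j for j, v in enumerate(row) if v == 1])
```

The last sample contains "回" but not "第", so its row remains zero, matching the rule that a candidate must contain every preset keyword in the preset order.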
Then, in step S105, because 2 keywords are preset, a matrix $A$ of size $1 \times 2$ is created to represent the positional arrangement of the preset keywords in the candidate set {S}, and all elements of $A$ are initialized to 0. Here 2 is the number of preset keywords, and element $A_i$, $i = 1, 2$, represents the positional arrangement of the $i$-th keyword across all candidate title text streams in the set: $A_i = -1$ means the positions of the $i$-th keyword form a descending order, $A_i = 0$ means its position is fixed, and $A_i = 1$ means its positions form an ascending order.
Then, in step S106, the positional arrangement of each keyword across all candidate title text streams in the set is determined from matrix $L$, and the corresponding element of $A$ is set to $-1$, $0$, or $1$ according to whether that arrangement is descending, fixed, or ascending. Specifically, the first keyword "第" is always at the first position in $L$, so its arrangement is fixed, whereas the position of the second keyword "回" changes from the 3rd position to the 4th, so its positions form an ascending order. The resulting $1 \times 2$ matrix is:
$$A = \begin{pmatrix} 0 & 1 \end{pmatrix}$$
The first element of $A$ indicates that the positional arrangement of the first keyword "第" in the candidate set {S} is fixed, and the second element indicates that the positional arrangement of the second keyword "回" in {S} is ascending.
At this point, the positional arrangement of the preset keywords across all candidate title text streams in {S} has been obtained and stored in matrix $A$.
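A possible way to derive matrix A from matrix L (steps S105 and S106) is sketched below. The helper names are hypothetical, and two assumptions go beyond the text: all-zero rows are ignored, and a keyword whose positions are neither fixed nor monotone is reported as `None`, a case the description does not cover.

```python
def key_positions(row):
    """Return the 0-based column indices that hold a 1 in one row of L."""
    return [j for j, value in enumerate(row) if value == 1]

def arrangement_vector(matrix, n_symbols):
    """Classify each key symbol's positions as -1 (descending), 0 (fixed), 1 (ascending)."""
    rows = [key_positions(r) for r in matrix]
    rows = [r for r in rows if len(r) == n_symbols]  # skip pseudo-title rows
    vector = []
    for k in range(n_symbols):
        column = [r[k] for r in rows]
        steps = [b - a for a, b in zip(column, column[1:])]
        if all(s == 0 for s in steps):
            vector.append(0)
        elif all(s >= 0 for s in steps):
            vector.append(1)
        elif all(s <= 0 for s in steps):
            vector.append(-1)
        else:
            vector.append(None)  # mixed order: not covered by the description
    return vector

def row(*ones, width=40):
    """Build one row of L with 1s at the given 0-based columns."""
    return [1 if j in ones else 0 for j in range(width)]

# Matrix L of the first embodiment, written with 0-based columns.
L = [row(0, 2), row(0, 2), row(), row(), row(0, 2),
     row(0, 3), row(), row(0, 3), row(), row()]
A = arrangement_vector(L, n_symbols=2)
print(A)  # [0, 1]
```

The k-th 1 in a row is taken as the position of the k-th keyword, which holds here because the keywords occur in a fixed preset order.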
Then, in step S107, the candidate set {S} is traversed and the following is carried out until the first title text stream is found. Taking the keyword positions in the current candidate title text stream as the reference, count the number nNum of candidate title text streams in {S} at or after the current one (the current one included) that satisfy the positional arrangement given by matrix A, and compare nNum with m, the number of candidate title text streams in {S} that satisfy the positional arrangement given by matrix A (namely, 5). Here m is the number of candidate title text streams left in {S} after the pseudo-title text streams have been removed, that is, after removing candidates that lack a preset key symbol or whose key symbols appear in an order different from the preset order. If nNum < m/2 (taken here as 3), the current candidate is determined to be a pseudo-title text stream and all elements of its row in matrix L are set to 0; if nNum >= m/2, the current candidate is determined to be the first title text stream. Specifically, the traversal starts at the first candidate title text stream in {S}. With the keyword positions in this title as the reference, the number nNum of candidate title text streams in which the first keyword "第" is in a fixed position and the second keyword "回" is in ascending order is 5, which is greater than 3, so the first candidate title text stream is the first title text stream.
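The search for the first title in step S107 can be sketched as follows. This is one reading of the step, with stated assumptions: a later candidate "satisfies" matrix A when each keyword position compares to the reference position as A dictates (equal for 0, not smaller for 1, not larger for -1), and the threshold is the unrounded m/2, which gives the same outcome as the value 3 in the example. All function names are hypothetical.

```python
def key_positions(row):
    """Return the 0-based column indices that hold a 1 in one row of L."""
    return [j for j, value in enumerate(row) if value == 1]

def satisfies(ref, pos, A):
    """Check keyword positions `pos` against reference `ref` under arrangement A."""
    if len(pos) != len(A):
        return False  # pseudo-title row (all zeros)
    for r, p, a in zip(ref, pos, A):
        if (a == 0 and p != r) or (a == 1 and p < r) or (a == -1 and p > r):
            return False
    return True

def find_first_title(matrix, A):
    """Return the index of the first title row, zeroing rejected rows in place."""
    m = sum(1 for row in matrix if any(row))  # non-pseudo candidates
    for i, row in enumerate(matrix):
        ref = key_positions(row)
        if len(ref) != len(A):
            continue  # already a pseudo-title
        n_num = sum(1 for later in matrix[i:]
                    if satisfies(ref, key_positions(later), A))
        if n_num >= m / 2:
            return i
        matrix[i] = [0] * len(row)  # current candidate is a pseudo-title
    return None

def row(*ones, width=40):
    """Build one row of L with 1s at the given 0-based columns."""
    return [1 if j in ones else 0 for j in range(width)]

L = [row(0, 2), row(0, 2), row(), row(), row(0, 2),
     row(0, 3), row(), row(0, 3), row(), row()]
first = find_first_title(L, A=[0, 1])
print(first)  # 0
```

With the example matrix, m is 5 and nNum for the first row is 5, so the first candidate is accepted immediately.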
Then; In step S108; First caption text stream to confirm among the step S107 is reference; Ergodic classes closes like the caption text adfluxion and { is positioned at the similar caption texts stream of afterwards all of this caption text stream among the S}, will flow with the inconsistent similar caption text of positional alignment mode in the similar caption text stream of adding up of key symbol all in similar caption text adfluxion is closed and confirm as pseudo-caption text stream, and all elements of the corresponding line in the matrix L is set to 0.Then, extract with matrix L in have 1 the corresponding similar caption text stream of row as caption text stream, form caption text adfluxion as follows close E}:
Chapter 1: Zhen Shiyin in a dream vision recognizes the numinous jade; Jia Yucun in humble circumstances longs for a fair maiden
Chapter 5: Touring the Land of Illusion, the twelve beauties are pointed out; drinking immortal wine, the songs of "A Dream of Red Mansions" are performed
Chapter 10: Widow Jin, greedy for gain, swallows an insult; Imperial Physician Zhang traces an illness to its source
Chapter 13: Qin Keqing dies and is granted the title of officer of the Imperial Guard; Wang Xifeng assists in managing the Ningguo mansion
Chapter 14: Lin Ruhai passes away in Yangzhou; Jia Baoyu meets the Prince of Beijing on the road
Finally, in step S109, the caption text stream set {E} above can be displayed in a given format, which yields the catalogue of "A Dream of Red Mansions".
(second embodiment)
In this embodiment, the word flow of a text document named "long-living boundary" is read. In this text document, each title comprises two paragraphs. For ease of description, suppose the content read is as follows:
The first volume
Chapter 1, military broken hollow
Can who be not dead in the world?
Either you absolute beauty, brilliant world crown, in the end is the Pink Skull; either you Genghis, sitting on rivers and mountains, in the end will eventually into a loess!
The first volume
Chapter 2, ancient upright stone tablet sky maps
Mo Yun rolls, and instant is unglazed in this world, and endless dark is shrouded and descended, and has hung as the curtain of death, and breath moment of a burst of dense terror fills the air in this world.
The first volume
Chapter 3, eight arms are disliked dragon
Mortified mysterious stone inscription in 10 years, Xiao Chen is benefited a great deal, and his physique constantly changes, and the sensation of remoulding oneself thoroughly has been arranged between indistinct.But, finally let him be, come from a fierce Great War in the Kun Lun Mountain what this width of cloth training of qi figure produced confidence.
The first volume
Chapter 4, savage and wild island
Xiao Chen walks out from coconut palm woods depths, is watching endless vast sea attentively, is imagining that huge monster, the fearful picture of in seawater, acting violently, this peerless really fierce beast!
The first volume
Chapter 5, beautiful shell dragon egg
Flashed over three, Xiao Chen hinders very fast that body recovers, and has six or seven to be preordained and can to return to one's perfect health again.
Sunset clouds have disappeared, and The night screen has hung down, but the seashore does not but have quietly to get off, and noise is increasing.
The first volume
Chapter 6, rough beast goes mad
Xiao Chen cried one bad, wash away towards longshore thick forest fast.Eight arms are disliked dragon and have been returned, and he must conceal figure as early as possible, not so will die without a burial place!
Fortunately, when disliking the huge fierce shadow of dragon in big marine manifesting, Xiao Chen has rushed in primitive area.
Not not for a long time, the make a whistling sound shake day of seashore dragon, though be separated by several in, the roar that is huge is still worn gold and is split stone, turns over like the people's qi and blood that shakes as the space mine and gushes.
The first volume
Chapter 7, heavenly steed walks in the moonlight
Xiao Chen climbs fast and flies on the ancient tree, and that fine gauze has been caught in the hand, and this seemingly hides the yarn of face, and is smooth, soft incomparable.Light if empty.Be rated as tops ground silk goods.
Embroidering a phoenix gleamingly above, careful survey ought really be life-like, and suddenly, what Xiao Chen remembered, a secondary familiar ground picture leaps to brain.The graceful beautiful woman's face of stature hides fine gauze
As can be seen from the content read above, the first paragraph of the title contains the key words "the" (the ordinal prefix) and "volume", and the second paragraph contains the key words "the" and "chapter". Therefore, in this embodiment, the preset key symbols are the first key word "the" and the second key word "volume" in the first paragraph, and the third key word "the" and the fourth key word "chapter" in the second paragraph. In this case, in addition to the key symbols and the maximum length value of the title, the following preset parameters are also needed: 1) the number of paragraphs the title comprises; 2) the paragraph position of each key symbol.
Fig. 2 is the flowchart of the document title extraction method according to the second embodiment of the invention. The method is described in detail below with reference to Fig. 2.
First, in step S200, as described above, the key symbols, the maximum length value, the paragraph number and the paragraph position of each key symbol of the title are preset. In this embodiment, the preset maximum title length is 40 characters and the paragraph number is 2. The first key word "the" and the second key word "volume" are in the first paragraph, so their paragraph position is preset to 1, and the first key word "the" must appear before the second key word "volume". The third key word "the" and the fourth key word "chapter" are in the second paragraph, so their paragraph position is preset to 2, and the third key word "the" must appear before the fourth key word "chapter".
Then, in step S201, the word flow above is divided into 26 paragraph word flows using the line break as the separator, and these 26 paragraph word flows, one per paragraph, form the paragraph word flow set {T}.
Then, in step S202, the paragraph word flow set {T} is filtered according to the preset maximum title length of 40: paragraph word flows longer than 40 characters are removed from {T}, which gives the similar caption text stream set {S} as follows:
The first volume
Chapter 1, military broken hollow
Can who be not dead in the world?
The first volume
Chapter 2, ancient upright stone tablet sky maps
The first volume
Chapter 3, eight arms are disliked dragon
The first volume
Chapter 4, savage and wild island
The first volume
Chapter 5, beautiful shell dragon egg
Flashed over three, Xiao Chen hinders very fast that body recovers, and has six or seven to be preordained and can to return to one's perfect health again.
Sunset clouds have disappeared, and The night screen has hung down, but the seashore does not but have quietly to get off, and noise is increasing.
The first volume
Chapter 6, rough beast goes mad
Fortunately, when disliking the huge fierce shadow of dragon in big marine manifesting, Xiao Chen has rushed in primitive area.
The first volume
Chapter 7, heavenly steed walks in the moonlight
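Steps S201 and S202 amount to a split on line breaks followed by a length filter. A minimal Python sketch is given below; the function names `split_paragraphs` and `filter_by_length` and the sample input are illustrative assumptions, not part of the patent. Whether a paragraph of exactly the maximum length is kept is also an assumption here (the description removes paragraphs longer than the limit, so `<=` is used).

```python
def split_paragraphs(text: str) -> list[str]:
    """Split a word flow into paragraph word flows at line breaks (step S201)."""
    # Assumption: blank lines carry no content and are dropped.
    return [line.strip() for line in text.splitlines() if line.strip()]


def filter_by_length(paragraphs: list[str], max_length: int) -> list[str]:
    """Keep only paragraphs no longer than max_length characters (step S202)."""
    return [p for p in paragraphs if len(p) <= max_length]


# Hypothetical input: two short headings, one long body paragraph, one short body line.
sample = "Volume One\nChapter 1, A Short Heading\n" + "x" * 60 + "\nA short body line."
paragraphs = split_paragraphs(sample)
candidates = filter_by_length(paragraphs, max_length=40)
print(candidates)
```

As in the embodiment, short body paragraphs survive this filter, which is why the later key-symbol steps are still needed.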
Then, in step S203, similar caption text streams are extracted from the similar caption text stream set {S} according to the preset title paragraph number and the paragraph positions of the key symbols. In this embodiment, a similar caption text stream consists of two adjacent paragraph word flows. Specifically, the first paragraph word flow of a similar caption text stream is extracted according to the first key word "the", the second key word "volume" and their paragraph position; then, taking this first paragraph word flow as a reference, the second paragraph word flow of the similar caption text stream is extracted according to the third key word "the", the fourth key word "chapter" and their paragraph position. This gives the further-extracted similar caption text stream set {S} as follows:
The first volume
Chapter 1, military broken hollow
The first volume
Chapter 2, ancient upright stone tablet sky maps
The first volume
Chapter 3, eight arms are disliked dragon
The first volume
Chapter 4, savage and wild island
The first volume
Chapter 5, beautiful shell dragon egg
The first volume
Chapter 6, rough beast goes mad
The first volume
Chapter 7, heavenly steed walks in the moonlight
The similar caption text stream set {S} above contains 7 similar caption text streams in total, each consisting of two adjacent paragraph word flows. For example, the first similar caption text stream consists of the first paragraph word flow "The first volume" and the second paragraph word flow "Chapter 1, military broken hollow", and so on.
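The pairing in step S203 can be sketched as a scan over adjacent paragraphs. The sketch below is one possible reading, not the patent's implementation: the names `contains_in_order` and `pair_candidates` are hypothetical, and the sample headings use the Chinese characters 第, 卷 and 章 as the assumed originals of the key words rendered "the", "volume" and "chapter" in this translation.

```python
def contains_in_order(text: str, keys: list[str]) -> bool:
    """Return True if every key occurs in text, each one after the previous key."""
    start = 0
    for key in keys:
        idx = text.find(key, start)
        if idx < 0:
            return False
        start = idx + len(key)
    return True


def pair_candidates(paragraphs, first_keys, second_keys):
    """Return (first, second) pairs of adjacent paragraphs matching the preset keys."""
    pairs = []
    i = 0
    while i < len(paragraphs) - 1:
        first, second = paragraphs[i], paragraphs[i + 1]
        if contains_in_order(first, first_keys) and contains_in_order(second, second_keys):
            pairs.append((first, second))
            i += 2  # both paragraphs are consumed by this candidate
        else:
            i += 1
    return pairs


# Hypothetical input: two-paragraph headings with a short body line in between.
paragraphs = ["第一卷", "第一章 甲", "短正文", "第一卷", "第二章 乙"]
pairs = pair_candidates(paragraphs, ["第", "卷"], ["第", "章"])
print(pairs)
```

The short body line is dropped because it cannot start or complete a pair, which mirrors how the short body paragraphs disappear from {S} in this step.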
Then, in step S204, because the similar caption text stream set {S} contains 7 similar caption text streams and the preset maximum length value is 40, a matrix L of size 7 × 40 is created and each element $L_{i,j}$ of matrix L is initialized to 0.
Then, in step S205, the similar caption text stream set {S} is traversed to obtain, in each similar caption text stream, the positions of the first key word "the" and the second key word "volume" (paragraph position 1) and of the third key word "the" and the fourth key word "chapter" (paragraph position 2). The elements at the corresponding positions of matrix L are set to 1, which gives the following matrix L:
$$
L = \begin{pmatrix}
1 & 0 & 1 & 1 & 0 & 1 & 0 & \cdots & 0 \\
1 & 0 & 1 & 1 & 0 & 1 & 0 & \cdots & 0 \\
1 & 0 & 1 & 1 & 0 & 1 & 0 & \cdots & 0 \\
1 & 0 & 1 & 1 & 0 & 1 & 0 & \cdots & 0 \\
1 & 0 & 1 & 1 & 0 & 1 & 0 & \cdots & 0 \\
1 & 0 & 1 & 1 & 0 & 1 & 0 & \cdots & 0 \\
1 & 0 & 1 & 1 & 0 & 1 & 0 & \cdots & 0
\end{pmatrix}_{7 \times 40}
$$
Each of the 7 rows has 40 elements, with 1 in columns 1, 3, 4 and 6 and 0 in all other columns.
As can be seen from the matrix above, the first key word "the" and the second key word "volume", whose paragraph position is 1, always occupy the 1st and 3rd positions in a similar caption text stream, and the third key word "the" and the fourth key word "chapter", whose paragraph position is 2, always occupy the 4th and 6th positions.
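The construction of matrix L in steps S204 and S205 can be illustrated with plain Python lists. This is a sketch under assumptions: `build_position_matrix` is a hypothetical name, positions in the second paragraph are counted as a continuation of the first (which is what the 4th and 6th positions above imply), and the sample headings again use 第, 卷 and 章 as the assumed original key characters.

```python
def build_position_matrix(candidates, keys_by_paragraph, max_length):
    """Build an m x max_length 0/1 matrix marking where the key symbols occur."""
    matrix = []
    for paragraphs in candidates:
        row = [0] * max_length           # step S204: initialize the row to 0
        offset = 0                       # characters contributed by earlier paragraphs
        for paragraph, keys in zip(paragraphs, keys_by_paragraph):
            start = 0
            for key in keys:
                idx = paragraph.find(key, start)
                if idx < 0:
                    break                # key symbol missing: leave the rest unmarked
                if offset + idx < max_length:
                    row[offset + idx] = 1  # step S205: mark the key position
                start = idx + len(key)
            offset += len(paragraph)
        matrix.append(row)
    return matrix


# Hypothetical two-paragraph candidates.
candidates = [("第一卷", "第一章 甲"), ("第一卷", "第二章 乙")]
L = build_position_matrix(candidates, [["第", "卷"], ["第", "章"]], max_length=40)
print(L[0][:8])
```

With these inputs each row is marked at the 1st, 3rd, 4th and 6th positions (indices 0, 2, 3 and 5), matching the pattern of the matrix in the embodiment.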
Then, in step S206, because 4 key symbols are preset, a matrix A of size 1 × 4 is created and all elements of A are initialized to 0.
Then, in step S207, because the positions in matrix L of the first key word "the" and the second key word "volume" (paragraph position 1) and of the third key word "the" and the fourth key word "chapter" (paragraph position 2) are fixed, the 1 × 4 matrix A takes the following values:
$$
A = \begin{pmatrix} 0 & 0 & 0 & 0 \end{pmatrix}
$$
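The values of matrix A (-1 for descending, 0 for fixed, 1 for ascending) can be derived from the key positions of each candidate. The text here does not spell out how the statistic is computed, so the sketch below makes an assumption: it compares each key's position in consecutive candidates and takes the predominant direction. The name `arrangement_modes` is hypothetical.

```python
def arrangement_modes(key_positions: list[list[int]]) -> list[int]:
    """Return one mode per key symbol: -1 descending, 0 fixed, 1 ascending."""
    n_keys = len(key_positions[0])
    modes = []
    for k in range(n_keys):
        column = [row[k] for row in key_positions]
        # Assumed statistic: count rises and falls between consecutive candidates.
        ups = sum(1 for a, b in zip(column, column[1:]) if b > a)
        downs = sum(1 for a, b in zip(column, column[1:]) if b < a)
        if ups > downs:
            modes.append(1)
        elif downs > ups:
            modes.append(-1)
        else:
            modes.append(0)
    return modes


# Second embodiment: all four key symbols sit at positions 1, 3, 4 and 6 in all 7 titles.
A_second = arrangement_modes([[1, 3, 4, 6]] * 7)
# Hypothetical case in which the second key symbol moves right as the numbering grows.
A_growing = arrangement_modes([[1, 3], [1, 3], [1, 3], [1, 4], [1, 4]])
print(A_second, A_growing)
```

A majority count tolerates a few pseudo-caption text streams in {S}; a strict monotonicity test would be a stricter alternative.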
Then, in step S208, the similar caption text stream set {S} is traversed and the following steps are carried out until the first caption text stream is found. Taking the positions of the key words in the current similar caption text stream as a reference, count the number nNum of similar caption text streams in {S}, starting from the current similar caption text stream, that satisfy the positional arrangement shown in matrix A, and compare nNum with m, the number of similar caption text streams in {S} that satisfy the positional arrangement shown in matrix A (here, 7). If nNum < m/2 (in effect, nNum < 4), all elements of the corresponding row of matrix L are set to 0; if nNum ≥ m/2 (in effect, nNum ≥ 4), the current similar caption text stream is determined to be the first caption text stream. Specifically, the first similar caption text stream in {S} is visited first. Taking the positions of the key words in this similar caption text stream as a reference, the number nNum of similar caption text streams in which the positions of the first key word "the", the second key word "volume", the third key word "the" and the fourth key word "chapter" are fixed is 7, which is greater than 4, so the first similar caption text stream is the first caption text stream.
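The search for the first caption text stream can be sketched as follows. The way a single candidate "satisfies" matrix A relative to the reference is an interpretation made for this sketch (equal position for 0, not smaller for 1, not larger for -1), and the names `satisfies` and `find_first_title` are hypothetical. Because nNum is an integer, comparing it with m/2 directly gives the same result as the rounded thresholds quoted in the embodiments.

```python
def satisfies(reference, candidate, modes):
    """Check a candidate's key positions against the reference under matrix A."""
    for ref_pos, pos, mode in zip(reference, candidate, modes):
        if mode == 0 and pos != ref_pos:
            return False
        if mode == 1 and pos < ref_pos:
            return False
        if mode == -1 and pos > ref_pos:
            return False
    return True


def find_first_title(key_positions, modes, m):
    """Return (index of the first title or None, indices rejected as pseudo-titles)."""
    pseudo = []
    for i, reference in enumerate(key_positions):
        # nNum counts candidates from the current one onward, itself included.
        n_num = sum(1 for cand in key_positions[i:] if satisfies(reference, cand, modes))
        if n_num >= m / 2:
            return i, pseudo
        pseudo.append(i)  # the corresponding row of L would be zeroed here
    return None, pseudo


# Second embodiment: 7 candidates, all with key positions 1, 3, 4 and 6.
first, rejected = find_first_title([[1, 3, 4, 6]] * 7, [0, 0, 0, 0], m=7)
# Hypothetical case: a stray candidate precedes six consistent ones.
first2, rejected2 = find_first_title([[2, 5, 6, 9]] + [[1, 3, 4, 6]] * 6, [0, 0, 0, 0], m=6)
print(first, rejected, first2, rejected2)
```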
Then, in step S209, taking the first caption text stream determined in step S208 as a reference, all similar caption text streams located after it in {S} are traversed. Any similar caption text stream that is inconsistent with the positional arrangement of the key symbols counted over all similar caption text streams in the set is determined to be a pseudo-caption text stream, and all elements of the corresponding row of matrix L are set to 0. Then the similar caption text streams corresponding to the rows of matrix L that contain a 1 are extracted as caption text streams, forming the caption text stream set {E} as follows:
The first volume
Chapter 1, military broken hollow
The first volume
Chapter 2, ancient upright stone tablet sky maps
The first volume
Chapter 3, eight arms are disliked dragon
The first volume
Chapter 4, savage and wild island
The first volume
Chapter 5, beautiful shell dragon egg
The first volume
Chapter 6, rough beast goes mad
The first volume
Chapter 7, heavenly steed walks in the moonlight
Finally, in step S210, the caption text stream set {E} above can be displayed in a given format, which yields the catalogue of "long-living boundary".
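The final extraction and display can be sketched in a few lines: rows of matrix L that still contain a 1 identify the caption text streams, and the catalogue is a formatted listing of them. The function names and the choice of joining the paragraphs of a title with a space are assumptions for illustration; the patent leaves the display format open.

```python
def extract_titles(candidates, matrix):
    """Keep the candidates whose row in matrix L still contains a 1."""
    return [cand for cand, row in zip(candidates, matrix) if any(row)]


def format_catalogue(titles, separator=" "):
    """Render each multi-paragraph title on one line of the catalogue."""
    return "\n".join(separator.join(paragraphs) for paragraphs in titles)


# Hypothetical data: the middle candidate was rejected, so its row is all zeros.
candidates = [("The first volume", "Chapter 1"), ("x", "y"), ("The first volume", "Chapter 2")]
L = [[1, 0, 1], [0, 0, 0], [1, 0, 1]]
titles = extract_titles(candidates, L)
catalogue = format_catalogue(titles)
print(catalogue)
```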
It should be understood that the above embodiments are merely exemplary. The method of the invention can be applied not only to text documents but also to structured documents such as PDF, DOC and HTML. Such documents can first be converted into a paragraph word flow set by appropriate processing (for example, with one of the many existing paragraph recognition techniques), after which titles can likewise be extracted using the preset key words or key symbols of the title and the maximum length value, although the information characteristics of the structured document itself (for example, pagination information) should be taken into account. In addition, the method of the invention can be applied not only to Chinese but also to text in various languages. Moreover, the extracted titles can be used not only to create a catalogue but also in any other application that requires title extraction, such as document structuring and reading order recognition.
In addition, it should be understood that although the description above uses a title comprising two paragraphs as an example, the invention applies equally to titles comprising more than two paragraphs. Moreover, for a title in which only the first paragraph contains key words, the paragraph word flow containing the key words can be extracted according to the flowchart shown in Fig. 1; then, taking that paragraph word flow as a reference, the adjacent paragraph word flows can be extracted as the remaining paragraphs of the title.
The document title extraction device according to an embodiment of the invention is described below with reference to Fig. 3.
Referring to Fig. 3, the device comprises a words input module 100, a preset module 200 and a literal analysis module 300. The words input module 100 reads the word flow of the document to be processed. The preset module 200 presets the key symbols and the maximum length value of the titles in the document read by the words input module. The literal analysis module 300 extracts the caption text streams from the word flow read by the words input module, according to the key symbols and the maximum length value preset by the preset module.
The literal analysis module 300 further comprises a paragraph resolution unit 301, a caption text stream verification unit 302 and a key symbol analysis unit 303. The paragraph resolution unit 301 divides the word flow read by the words input module into one or more paragraph word flows using the line break as the separator, and forms the paragraph word flow set {T} from these paragraph word flows, one per paragraph. The caption text stream verification unit 302 extracts from the paragraph word flow set {T} the paragraph word flows whose length is less than the preset maximum length value, forming the similar caption text stream set {S}. The key symbol analysis unit 303 filters out the pseudo-caption text streams in the similar caption text stream set {S} according to the preset key symbols, and extracts the remaining similar caption text streams in {S} to form the caption text stream set {E}.
It should be noted that when a title comprises a plurality of paragraphs, the preset module 200 must preset not only the key symbols and the maximum length value of the title but also the number of paragraphs the title comprises and the paragraph position of each key symbol. In that case, after forming the similar caption text stream set {S} according to the maximum length value, the caption text stream verification unit 302 must also extract from this set, according to the preset paragraph number and paragraph positions, the similar caption text streams each composed of a number of paragraph word flows equal to the paragraph number, forming the further-extracted similar caption text stream set {S}.
In addition, the device may further comprise a catalogue forming module 304, which uses the extracted caption text streams as catalogue entries to form a catalogue. It should be understood that the catalogue forming module 304 is only one example of applying the extracted caption text streams; it could equally be any other module that displays, records or otherwise uses the extracted caption text streams.
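The module structure of Fig. 3 can be mirrored in a small class sketch. Everything below is illustrative: the class and method names are hypothetical, the input module is replaced by a plain string, and the key symbol analysis is reduced to its first test (dropping candidates that lack a key symbol or contain the key symbols out of order), without the positional arrangement analysis.

```python
from dataclasses import dataclass


@dataclass
class Presets:
    """Stands in for preset module 200."""
    key_symbols: list[str]
    max_length: int


class TitleAnalysis:
    """Stands in for literal analysis module 300 (simplified)."""

    def __init__(self, presets: Presets) -> None:
        self.presets = presets

    def parse_paragraphs(self, stream: str) -> list[str]:
        # Paragraph resolution unit 301: split at line breaks.
        return [p for p in stream.splitlines() if p]

    def check_length(self, paragraphs: list[str]) -> list[str]:
        # Caption text stream verification unit 302: apply the length limit.
        return [p for p in paragraphs if len(p) <= self.presets.max_length]

    def filter_by_keys(self, candidates: list[str]) -> list[str]:
        # Key symbol analysis unit 303, reduced to the presence-and-order test.
        kept = []
        for text in candidates:
            start = 0
            for key in self.presets.key_symbols:
                idx = text.find(key, start)
                if idx < 0:
                    break
                start = idx + len(key)
            else:
                kept.append(text)  # every key symbol was found, in order
        return kept

    def extract(self, stream: str) -> list[str]:
        return self.filter_by_keys(self.check_length(self.parse_paragraphs(stream)))


def form_catalogue(titles: list[str]) -> str:
    """Stands in for catalogue forming module 304."""
    return "\n".join(titles)


# Hypothetical word flow: two headings, one short body line, one long body paragraph.
stream = "Chapter 1 Start\nSome short body text.\nChapter 2 Next\n" + "long " * 20
analysis = TitleAnalysis(Presets(key_symbols=["Chapter"], max_length=40))
catalogue = form_catalogue(analysis.extract(stream))
print(catalogue)
```

Keeping the three units as separate methods follows the division of labour described above and makes it easy to swap in a fuller key symbol analysis later.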
The invention has been described in detail above with reference to the drawings and embodiments. It should be understood, however, that the invention is not limited to the specific embodiments disclosed above; any modification or variation that a person skilled in the art can readily conceive on this basis falls within the protection scope of the invention.
For example, in the flow shown in Fig. 1, the order of step S103 and step S105 is not limited to that shown in Fig. 1; matrices L and A can be created at any suitable time. In addition, the positions of the preset key symbols in each similar caption text stream and the positional arrangement of the preset key symbols in the similar caption text stream set need not be represented as matrices; other forms can be used, for example data structures such as one-dimensional arrays, queues, stacks and graphs. Moreover, besides filtering out pseudo-caption text streams by the positional arrangement of the key symbols in the similar caption text streams, filtering can also be based on other attributes of the key symbols, such as the position of a key symbol relative to the similar caption text stream or the positional relationship among multiple key symbols.

Claims (13)

1. A document title extraction method, comprising the following steps:
Presetting the key symbols and the maximum length value of the titles in the document to be processed;
Extracting the caption text streams from the word flow of the document according to the preset key symbols and maximum length value.
2. The method according to claim 1, characterized in that:
For a Chinese document, the key symbols of the title comprise at least one of "the", "returning", "chapter", "volume", "section", "part", bullet symbols and numbering;
And/or
For an English document, the key symbols of the title comprise at least one of "Chapter", "Section", bullet symbols and numbering.
3. The method according to claim 1, characterized in that the step of extracting the caption text streams from the word flow of the document comprises the following steps:
Dividing the word flow into one or more paragraph word flows using the line break as the separator, and forming the paragraph word flows into a paragraph word flow set;
Extracting from the paragraph word flow set the paragraph word flows whose length is less than the preset maximum length value, forming a similar caption text stream set;
Filtering out the pseudo-caption text streams in the similar caption text stream set according to the preset key symbols, and extracting the remaining similar caption text streams in the similar caption text stream set to form a caption text stream set.
4. The method according to claim 3, characterized in that, when a title comprises a plurality of paragraphs, the number of paragraphs the title comprises and the paragraph position of each key symbol are preset in addition to the key symbols and the maximum length value of the title; and, after the similar caption text stream set is formed, similar caption text streams each composed of a number of paragraph word flows equal to the paragraph number are extracted from the similar caption text stream set according to the preset paragraph number and paragraph positions, forming a further-extracted similar caption text stream set.
5. The method according to claim 3 or 4, characterized in that the step of filtering out the pseudo-caption text streams in the similar caption text stream set comprises the following steps:
Counting the positional arrangement of the preset key symbols across all similar caption text streams in the similar caption text stream set, the positional arrangement being ascending, descending or fixed;
Traversing the similar caption text stream set and performing the following analysis step until the first caption text stream is found: counting the positional arrangement of the preset key symbols across the similar caption text streams in the set starting from the current similar caption text stream; if this positional arrangement is consistent with the positional arrangement counted across all similar caption text streams in the set, determining the current similar caption text stream to be the first caption text stream; otherwise, determining the current similar caption text stream to be a pseudo-caption text stream;
Traversing all similar caption text streams in the set located after the first caption text stream found, and determining as a pseudo-caption text stream any similar caption text stream that is inconsistent with the positional arrangement of the key symbols counted across all similar caption text streams in the set.
6. The method according to claim 5, characterized in that:
The step of counting the positional arrangement of the preset key symbols across all similar caption text streams in the similar caption text stream set comprises the following steps:
Creating a matrix L of size m × maxLength that represents the positions of the preset key symbols in each similar caption text stream, where m is the number of similar caption text streams in the similar caption text stream set, maxLength is the preset maximum length value of the title, and element $L_{i,j}$ of matrix L corresponds to the position of the j-th character in the i-th similar caption text stream, i = 1, ..., m, j = 1, ..., maxLength, and initializing each element $L_{i,j}$ of matrix L to 0;
Traversing the similar caption text stream set and performing the following step: traversing all preset key symbols, obtaining the position of each key symbol in each similar caption text stream, and setting the element $L_{i,j}$ at the corresponding position of matrix L to 1;
Creating a matrix A of size 1 × n that represents the positional arrangement of the preset key symbols in the similar caption text stream set, and initializing all elements of A to 0, where n is the number of preset key symbols and element $A_i$ of matrix A, i = 1, ..., n, represents the positional arrangement of the i-th key symbol in the similar caption text stream set: $A_i = -1$ means that the positions of the i-th key symbol in the set are in descending order, $A_i = 0$ means that the position of the i-th key symbol in the set is fixed, and $A_i = 1$ means that the positions of the i-th key symbol in the set are in ascending order;
According to matrix L, counting the positional arrangement of each key symbol in the similar caption text stream set, and setting the corresponding element of matrix A to -1, 0 or 1 according to whether the counted positional arrangement is descending, fixed or ascending.
7. The method according to claim 6, characterized in that the analysis step comprises the following steps:
Counting the number nNum of similar caption text streams in the similar caption text stream set, starting from the current similar caption text stream, that satisfy the positional arrangement shown in matrix A, and comparing nNum with the number m of similar caption text streams in the set that satisfy the positional arrangement shown in matrix A; if nNum < m/2, determining the current similar caption text stream to be a pseudo-caption text stream and setting all elements of the corresponding row of matrix L to 0; if nNum ≥ m/2, determining the current similar caption text stream to be the first caption text stream.
8. The method according to claim 7, characterized in that, after all similar caption text streams in the set located after the first caption text stream found have been traversed and the similar caption text streams inconsistent with the positional arrangement of the key symbols counted across all similar caption text streams in the set have been determined to be pseudo-caption text streams, all elements of the corresponding rows of matrix L are set to 0, and the similar caption text streams corresponding to the rows of matrix L that contain an element equal to 1 are extracted as caption text streams to form the caption text stream set.
9. The method according to claim 1, characterized by further comprising the step of using the extracted caption text streams as catalogue entries to form a catalogue.
10. A document title extraction device, comprising:
A words input module, which reads the word flow of the document to be processed;
A preset module, which presets the key symbols and the maximum length value of the titles in the document read by the words input module;
A literal analysis module, which extracts the caption text streams from the word flow read by the words input module according to the key symbols and the maximum length value preset by the preset module.
11. The device according to claim 10, characterized in that the literal analysis module comprises:
A paragraph resolution unit, which divides the word flow read by the words input module into one or more paragraph word flows using the line break as the separator, and forms the paragraph word flows into a paragraph word flow set;
A caption text stream verification unit, which extracts from the paragraph word flow set the paragraph word flows whose length is less than the preset maximum length value, forming a similar caption text stream set;
A key symbol analysis unit, which filters out the pseudo-caption text streams in the similar caption text stream set according to the preset key symbols, and extracts the remaining similar caption text streams in the similar caption text stream set to form a caption text stream set.
12. The device according to claim 11, characterized in that the key symbol analysis unit performs the following steps:
Counting the positional arrangement of the preset key symbols across all similar caption text streams in the similar caption text stream set, the positional arrangement being ascending, descending or fixed;
Traversing the similar caption text stream set and performing the following analysis step until the first caption text stream is found: counting the positional arrangement of the preset key symbols across the similar caption text streams in the set starting from the current similar caption text stream; if this positional arrangement is consistent with the positional arrangement counted across all similar caption text streams in the set, determining the current similar caption text stream to be the first caption text stream; otherwise, determining the current similar caption text stream to be a pseudo-caption text stream;
Traversing all similar caption text streams in the set located after the first caption text stream found, and determining as a pseudo-caption text stream any similar caption text stream that is inconsistent with the positional arrangement of the key symbols counted across all similar caption text streams in the set.
13. The device according to claim 10, characterized by further comprising a catalogue forming module, which uses the extracted caption text streams as catalogue entries to form a catalogue.
CN201010261268.2A 2010-08-23 2010-08-23 Document title extraction method and device Expired - Fee Related CN102375806B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010261268.2A CN102375806B (en) 2010-08-23 2010-08-23 Document title extraction method and device

Publications (2)

Publication Number Publication Date
CN102375806A true CN102375806A (en) 2012-03-14
CN102375806B CN102375806B (en) 2014-05-07

Family

ID=45794433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010261268.2A Expired - Fee Related CN102375806B (en) 2010-08-23 2010-08-23 Document title extraction method and device

Country Status (1)

Country Link
CN (1) CN102375806B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942182A (en) * 2014-04-29 2014-07-23 百度在线网络技术(北京)有限公司 English text format optimization method and device
CN105302778A (en) * 2015-10-23 2016-02-03 北京奇虎科技有限公司 Article chapter generation method and system and electronic book reader
CN106469143A (en) * 2015-08-21 2017-03-01 国际商业机器公司 The estimation of file structure
CN106815202A (en) * 2015-12-01 2017-06-09 北大方正集团有限公司 Header checksum method and system
CN107291677A (en) * 2017-07-14 2017-10-24 北京神州泰岳软件股份有限公司 A kind of PDF document header syntax tree generation method, device, terminal and system
CN109977366A (en) * 2017-12-27 2019-07-05 珠海金山办公软件有限公司 A kind of catalogue generation method and device
CN113204951A (en) * 2021-05-27 2021-08-03 广州文石信息科技有限公司 Document processing method, document processing device, storage medium and computer equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0753833B1 (en) * 1995-06-30 1999-11-24 Océ-Technologies B.V. Apparatus and method for extracting articles from a document
CN1955952A (en) * 2005-10-25 2007-05-02 国际商业机器公司 System and method for automatically extracting by-line information
CN101183362A (en) * 2006-11-14 2008-05-21 株式会社理光 Method and apparatus for entity of searching target based on document and entity relation
CN101458680A (en) * 2008-09-03 2009-06-17 北京大学 Method and apparatus capable of auto identifying digital document catalog

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942182A (en) * 2014-04-29 2014-07-23 百度在线网络技术(北京)有限公司 English text format optimization method and device
CN103942182B (en) * 2014-04-29 2018-04-27 百度在线网络技术(北京)有限公司 English text format optimization method and device
CN106469143A (en) * 2015-08-21 2017-03-01 国际商业机器公司 Estimation of document structure
CN106469143B (en) * 2015-08-21 2019-11-19 国际商业机器公司 Estimation of document structure
US10572579B2 (en) 2015-08-21 2020-02-25 International Business Machines Corporation Estimation of document structure
US11030393B2 (en) 2015-08-21 2021-06-08 International Business Machines Corporation Estimation of document structure
CN105302778A (en) * 2015-10-23 2016-02-03 北京奇虎科技有限公司 Article chapter generation method and system and electronic book reader
CN106815202A (en) * 2015-12-01 2017-06-09 北大方正集团有限公司 Header checksum method and system
CN107291677A (en) * 2017-07-14 2017-10-24 北京神州泰岳软件股份有限公司 PDF document header syntax tree generation method, device, terminal and system
CN109977366A (en) * 2017-12-27 2019-07-05 珠海金山办公软件有限公司 Catalog generation method and device
CN109977366B (en) * 2017-12-27 2023-10-31 珠海金山办公软件有限公司 Catalog generation method and device
CN113204951A (en) * 2021-05-27 2021-08-03 广州文石信息科技有限公司 Document processing method, document processing device, storage medium and computer equipment

Also Published As

Publication number Publication date
CN102375806B (en) 2014-05-07

Similar Documents

Publication Publication Date Title
CN102375806B (en) Document title extraction method and device
Upton et al. An atlas of English dialects: region and dialect
Landmann Twentieth century borrowings from French to English: Their reception and development
Mühleisen Towards global diglossia? English in the sciences and the humanities
Ikeda Two Versions of Buddhist Karen History of the Late British Colonial Period in Burma: Kayin Chronicle (1929) and Kuyin Great Chronicle (1931) (Special Issue: De-institutionalizing Religion in Southeast Asia)
McBride II Enchanting Monks and Efficacious Spells: Rhetoric and the Role of Dhāraṇī in Medieval Chinese Buddhism
Masiola Roses and Peonies: Flower Poetics in Western and Eastern Translation
Tebaldi ‘Bad hombres’, ‘aloha snackbar’, and ‘le cuck’: Mock translanguaging and the production of whiteness
Kaplan Magi, Winds, Continents: Dark Skin and Global Allegory in Early Modern Images
Walz Dance as Third Space
Kroll Li Bo and the zan
McLaughlin The coming of paper: Aesthetic value from Ruskin to Benjamin
TW480416B (en) Ten-simple-code Chinese input method
Logan et al. Pre-digital Forms of Hypertext
Kandiyoti Enduring concerns, resilient tropes, and new departures: reading the companion
Bender Figure and Flight in the Songs of Chu
Tebaldi Mock translanguaging and the production of whiteness
Stainton The Guru as Śiva: Govinda Kaula’s Gurustutiratnāvalī and a Lineage of Devotion in Kashmir
Schmiedl Script and Divination Intertwined
Roy Chowdhury et al. Effectiveness and safety of apixaban vs. rivaroxaban in patients with atrial fibrillation and type 2 diabetes mellitus
Guillermo The Temper of the Times: A Critical Introduction
Robertson The French Tradition and the Literature of Medieval England
Wijsen et al. The limitations of an ecumenical language: The case of Ki-Swahili
Sellin et al. An introduction to Maghrebian literature
Heil 1. The term and its meaning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
ASS Succession or assignment of patent right

Owner name: LIDE TECHNOLOGY DEVELOPMENT CO., LTD.

Free format text: FORMER OWNER: BEIJING FOUNDER FEIYUE MEDIA TECHNOLOGY CO., LTD.

Effective date: 20120301

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20120301

Address after: 5th Floor, Zhongguancun Founder Building, No. 298 Chengfu Road, Haidian District, Beijing 100871

Applicant after: Peking University Founder Group Co., Ltd.

Co-applicant after: Leade Technology Development Co., Ltd.

Address before: 5th Floor, Zhongguancun Founder Building, No. 298 Chengfu Road, Haidian District, Beijing 100871

Applicant before: Peking University Founder Group Co., Ltd.

Co-applicant before: Beijing Founder Feiyue Media Technology Co., Ltd.

C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140507

Termination date: 20190823
