CN104050158B - Automatic quotation extraction method and device with semantic integrity kept - Google Patents
- Publication number: CN104050158B
- Application number: CN201410301560.0A
- Authority: CN (China)
- Prior art keywords: text, quotation, alternative, semantic primitive, character
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Machine Translation (AREA)
Abstract
The invention provides a method and device for automatically extracting quotations. Taking a character or character string in a text as the reading focus, the method and device automatically extract the surrounding context so that the length of the extracted quotation falls within a predetermined range while the quotation remains semantically complete. A passage of suitable length and complete meaning, centered on the selected character or character string, is thus extracted from the text, allowing the user to recover the correct meaning of the reading focus in its context.
Description
Technical field
This application relates to text analysis and extraction technology, and more particularly to a method and device for automatically extracting quotations while preserving semantic integrity.
Background art
In many application scenarios involving electronic structured documents, there is a strong need to extract quotation text centered on a reading focus, that is, a keyword, phrase, or sentence selected manually by the user or automatically according to a predetermined rule (such as a matching rule). For example, while reading a web page or similar document, a user may use a marking tool to select reading focuses of interest for later reference. When the user wishes to share these reading focuses through a social network such as a microblog, the marked keywords, phrases, and sentences alone are not enough for readers to reconstruct the context in which the focus appeared or to understand its point, so the context of the reading focus must be extracted to form a complete quotation. Similarly, when a user wants to save reading excerpts based on marked reading focuses, a quotation must be extracted. Quotation extraction is therefore an essential basic technology for many products and applications built on electronic structured documents.
For example, Chinese patent application publication CN102955820A discloses a system and method for accumulating foreign-language vocabulary, in which a user can mark vocabulary while reading a foreign-language text electronically, and the system submits the context paragraph containing the marked vocabulary to a back-end service subsystem for storage. However, that solution extracts the entire paragraph containing the mark as the quotation, and the paragraph may be too long. In most application environments the text length of a quotation is limited, and a quotation extracted in units of paragraphs may exceed the limit, so the technique is not generally applicable where quotation length is restricted. Moreover, if the extracted paragraph is too long, the marked reading focus does not stand out within the quotation, which impairs reading.
Chinese patent publication CN101192231B discloses a method for setting a bookmark on a specific part of a resource in a data processing system. In response to a request to bookmark the current screen of the resource, the method collects screen context information from the actual text of the current screen, and stores the address information of the resource together with the screen context information as a bookmark for returning to that specific part of the resource. This solution extracts context in units of screens, so the quotation text may again be too long for some application environments. Compared with extraction in units of paragraphs, extraction in units of screens also makes it harder to guarantee semantic integrity, because the text on the top or bottom line of the screen may be only part of a sentence, with the rest of the sentence lying off screen. A quotation obtained this way may contain incomplete sentences or even incomplete words, which seriously impairs reading.
The prior art also includes solutions that form a quotation from the marked object together with the content of the web page elements immediately before and after it on the current page, such as Chinese patent publication CN101866342. Extraction in units of web page elements clearly suffers from the same problems: the quotation may be too long or semantically incomplete.
To meet quotation length requirements, existing methods and devices also include schemes that simply cut text by character count, for example taking a few dozen characters before and after the reading focus to form the quotation. The obvious defect is that the resulting quotation often lacks semantic integrity: half of a sentence is included while the other half is not, or a word is cut in two, leaving readers confused. In some cases such truncation also affects how the quotation can be used. For example, if the text contains an e-mail address, URL, or telephone number and the quotation cuts through it, the quotation loses its practical value.
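The defect of cutting by character count is easy to reproduce. The sketch below is an illustration only and is not taken from any of the cited documents; the function name `naive_quote` and the sample sentence are hypothetical.

```python
def naive_quote(text, focus, radius):
    """Take `radius` characters on each side of the focus, ignoring all boundaries."""
    start = text.index(focus)
    end = start + len(focus)
    return text[max(0, start - radius):end + radius]


sample = "Contact alice@example.com for details."
# The fixed window cuts through both the word "Contact" and the e-mail address.
print(naive_quote(sample, "alice", 4))
```

With a radius of 4 the result starts in the middle of "Contact" and ends in the middle of the address, which is exactly the kind of fragment the background section criticizes.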
In short, existing quotation extraction techniques cannot keep a quotation semantically complete while holding its length within a threshold; they cut through integral strings such as complete sentences, words, and e-mail addresses, and their results do not meet users' needs.
Summary of the invention
In view of the above state and defects of the prior art, the present invention provides an automatic quotation extraction method and device. The invention automatically extracts context from a text around a character or character string serving as the reading focus, keeping the length of the extracted quotation within a predetermined length range while preserving its semantic integrity. A passage of suitable length and complete meaning, with the character or character string as its reading focus, can thus be extracted from the text, so that the user can readily recover the correct meaning of the reading focus in context.
The automatic quotation extraction method of the present invention is characterized by comprising:
A focus setting step of selecting, from a text, a character or character string to serve as the reading focus;
A context extraction step of extracting the context centered on the reading focus by expanding and/or truncating the text in units of complete semantic units, thereby obtaining quotation text whose length lies within a predetermined length interval.
As a first aspect of the invention, the complete semantic units preferably include: text fragments of various scales delimited by different types of boundary symbols contained in the text, and minimal semantic units consisting of characters or character strings that carry independent meaning in the text. The types of boundary symbols and the set of minimal semantic units are predefined in a symbol table. More preferably, the minimal semantic units include: English words, Chinese characters, URLs, e-mail addresses, time formats, text fragments enclosed by paired punctuation marks, and text fragments with a specific font format.
The context extraction step comprises: an expansion step of selecting candidate text, starting from the character or character string serving as the reading focus and proceeding along an expansion direction, in units of larger-scale complete semantic units delimited by certain types of boundary symbols; a truncation step of truncating the candidate text along a truncation direction in units of smaller-scale complete semantic units delimited by other types of boundary symbols; and a minimal-semantic-unit expansion/truncation step of, for the candidate text produced by the expansion and truncation steps, expanding it along the expansion direction and/or truncating it along the truncation direction in units of minimal semantic units. In the expansion step, the truncation step, and the minimal-semantic-unit expansion/truncation step, whether to change the expansion direction and the truncation direction is decided according to whether the ratio of the text lengths before and after the reading focus in the candidate text reaches a predetermined direction-change threshold.
Preferably, before the context extraction step, the method further comprises a step of predefining the predetermined length interval for the quotation text.
Preferably, before the context extraction step, the method further comprises: an initial extraction step of extracting initial candidate text that lies between valid structure nodes of the text and contains the character or character string serving as the reading focus; and a text analysis step of analyzing the initial candidate text to determine the boundary symbol types and the set of minimal semantic units used to divide the complete semantic units. In the text analysis step, the boundary symbol types and the minimal semantic unit set are determined according to the language type of the initial candidate text, which can be judged from the ratio of Chinese to English characters in it. More preferably, the length of the initial candidate text extracted in the initial extraction step lies within an allowed candidate-quotation length interval, which is calculated from the predetermined length interval.
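The language check in the text analysis step could be sketched as follows. This is a minimal illustration under stated assumptions: the source only says that the language type is judged from the ratio of Chinese to English characters, so the 0.5 threshold, the Unicode range used to recognize Chinese characters, the names `detect_language` and `symbol_tables_for`, and the contents of `SYMBOL_TABLES` are all hypothetical.

```python
# Hypothetical symbol tables, one per language type.
SYMBOL_TABLES = {
    "zh": {"expand": "。？！", "truncate": "，；"},
    "en": {"expand": ".?!", "truncate": ",;"},
}


def detect_language(text, threshold=0.5):
    """Return 'zh' if Chinese characters make up at least `threshold` of the letters."""
    chinese = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    english = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = chinese + english
    if total == 0:
        return "en"  # arbitrary fallback when the text contains no letters at all
    return "zh" if chinese / total >= threshold else "en"


def symbol_tables_for(text):
    """Pick the boundary symbol tables that match the language of the text."""
    return SYMBOL_TABLES[detect_language(text)]
```

Digits, spaces, and punctuation are ignored when computing the ratio, so a mostly Chinese passage that quotes a short English term is still treated as Chinese.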
More preferably, the initial extraction step specifically comprises: taking the structure node corresponding to the reading focus as the starting point, traversing the structure nodes before and after it, excluding invalid structure nodes and the text they contain, and then selecting, as the initial candidate text, the text that lies between valid structure nodes and whose length is within the allowed candidate-quotation length interval. The types of valid structure nodes are predefined in a valid node table.
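One possible reading of the initial extraction step is sketched below. The source does not give the traversal order or the node types, so this is an assumption-laden illustration: the document is modeled as a flat list of `(node_type, text)` pairs, `VALID_NODE_TYPES` stands in for the valid node table, neighbors are visited alternately after and before the focus node, and collection stops as soon as the next valid node would exceed `max_len`.

```python
# Hypothetical valid node table; a real one would list the node types of the document format.
VALID_NODE_TYPES = {"p", "h1", "h2", "li"}


def initial_candidate(nodes, focus_index, max_len):
    """Collect text from valid nodes around the focus node, up to `max_len` characters."""
    chosen = {focus_index}
    length = len(nodes[focus_index][1])
    for distance in range(1, len(nodes)):
        for idx in (focus_index + distance, focus_index - distance):
            if not 0 <= idx < len(nodes):
                continue
            node_type, text = nodes[idx]
            if node_type not in VALID_NODE_TYPES:
                continue  # invalid node: skip it together with its text
            if length + len(text) > max_len:
                return "".join(nodes[i][1] for i in sorted(chosen))
            chosen.add(idx)
            length += len(text)
    return "".join(nodes[i][1] for i in sorted(chosen))
```

The result keeps document order, and text belonging to excluded nodes (scripts, navigation, and so on) never enters the candidate.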
As a second aspect of the invention, the complete semantic units can preferably be divided into: expansion units, which are text fragments delimited by expansion-type boundary symbols contained in the text; truncation units, which are text fragments delimited by truncation-type boundary symbols contained in the text; and minimal semantic units, which are the smallest units consisting of characters or character strings that carry independent meaning in the text. The scale of a text fragment delimited by expansion-type boundary symbols is larger than that of a text fragment delimited by truncation-type boundary symbols.
More preferably, the types of expansion-type boundary symbols are predefined in an expansion boundary symbol table, the types of truncation-type boundary symbols are predefined in a truncation boundary symbol table, and the minimal semantic units are predefined in a minimal semantic unit set.
The context extraction step specifically comprises:
Expansion operation: taking the character or character string serving as the reading focus as the initial data, extract text along the expansion direction in units of expansion units and add it to the candidate text until the length of the candidate text exceeds the lower limit of the predetermined length interval; then judge whether the length of the candidate text exceeds the upper limit of the predetermined length interval, and if it does not, take the candidate text as the extracted quotation text;
Truncation operation: if the candidate text obtained by the expansion operation is longer than the upper limit of the predetermined length interval, then, starting from a non-boundary-symbol character at the head or tail of the candidate text, truncate the candidate text along the truncation direction in units of truncation units until its length is below the upper limit of the predetermined length interval;
Minimal-semantic-unit expansion/truncation operation: if the length of the candidate text after the truncation operation is below the lower limit of the predetermined length interval, then, starting from a non-boundary-symbol character at the head or tail of the candidate text, expand the candidate text along the expansion direction in units of minimal semantic units until its length exceeds the lower limit; if, after this minimal-semantic-unit expansion, the length of the candidate text exceeds the upper limit of the predetermined length interval, then, starting from a non-boundary-symbol character at the head or tail of the candidate text, truncate the candidate text along the truncation direction in units of minimal semantic units; through repeated iterations of minimal-semantic-unit expansion and truncation, a candidate text whose length lies within the predetermined length interval is obtained and taken as the quotation text.
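The three operations above can be sketched in a few dozen lines. This is a simplified illustration, not the patented implementation, and it rests on several assumptions: the text is Chinese, so each character is treated as one minimal semantic unit; the candidate starts from the whole sentence containing the focus; the side to grow or shrink is chosen by comparing the context lengths on each side of the focus rather than by a configurable threshold; and all names (`cuts`, `grow`, `shrink`, `repeat`, `extract_quote`) are hypothetical. The focus is given as a `(start, end)` index pair.

```python
import bisect

SENTENCE_END = "。？！"  # expansion-type boundary symbols (larger scale)
CLAUSE_END = "，；"      # truncation-type boundary symbols (smaller scale)


def cuts(text, symbols):
    """Positions just after each boundary symbol, plus both ends of the text."""
    marks = {i + 1 for i, ch in enumerate(text) if ch in symbols}
    return sorted(marks | {0, len(text)})


def grow(lo, hi, focus, cut_list):
    """Add one unit, preferring the side where the focus has less context."""
    head_first = focus[0] - lo <= hi - focus[1]
    for head in (head_first, not head_first):
        if head and lo > cut_list[0]:
            return cut_list[bisect.bisect_left(cut_list, lo) - 1], hi
        if not head and hi < cut_list[-1]:
            return lo, cut_list[bisect.bisect_right(cut_list, hi)]
    return lo, hi


def shrink(lo, hi, focus, cut_list):
    """Drop one unit, preferring the side with more context; never cut the focus."""
    head_first = focus[0] - lo >= hi - focus[1]
    for head in (head_first, not head_first):
        if head:
            new_lo = cut_list[bisect.bisect_right(cut_list, lo)]
            if new_lo <= focus[0]:
                return new_lo, hi
        else:
            new_hi = cut_list[bisect.bisect_left(cut_list, hi) - 1]
            if new_hi >= focus[1]:
                return lo, new_hi
    return lo, hi


def repeat(step, span, focus, cut_list, done):
    """Apply `step` until `done(length)` holds or no further progress is possible."""
    while not done(span[1] - span[0]):
        new_span = step(span[0], span[1], focus, cut_list)
        if new_span == span:
            break
        span = new_span
    return span


def extract_quote(text, focus, min_len, max_len):
    sentences = cuts(text, SENTENCE_END)
    clauses = cuts(text, SENTENCE_END + CLAUSE_END)
    chars = list(range(len(text) + 1))  # one Chinese character = one minimal unit
    # Assumption: start from the whole sentence that contains the focus.
    span = (sentences[bisect.bisect_right(sentences, focus[0]) - 1],
            sentences[bisect.bisect_left(sentences, focus[1])])
    long_enough = lambda n: n >= min_len
    short_enough = lambda n: n <= max_len
    span = repeat(grow, span, focus, sentences, long_enough)    # expansion operation
    span = repeat(shrink, span, focus, clauses, short_enough)   # truncation operation
    span = repeat(grow, span, focus, chars, long_enough)        # minimal-unit expansion
    span = repeat(shrink, span, focus, chars, short_enough)     # minimal-unit truncation
    return text[span[0]:span[1]]
```

Because every cut position is the end of a unit at the relevant scale, the candidate only ever grows or shrinks by whole sentences, whole clauses, or whole minimal units. Text containing URLs or e-mail addresses would need a real minimal-unit tokenizer in place of the per-character list.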
In the expansion operation, the truncation operation, and the minimal-semantic-unit expansion/truncation operation, an expansion-direction flag bit indicates whether the expansion direction is head expansion or tail expansion, and a truncation-direction flag bit indicates whether the truncation direction is head truncation or tail truncation.
More preferably, the context extraction step further comprises: in the expansion operation, the truncation operation, and the minimal-semantic-unit expansion/truncation operation, each time an expansion unit or minimal semantic unit has been added, and each time a truncation unit or minimal semantic unit has been removed, deciding whether to change the expansion direction or the truncation direction according to whether the ratio of the text lengths before and after the reading focus in the candidate text reaches a predetermined direction-change threshold.
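A direction flag with a ratio-based switch might look like the sketch below. The source does not define how the ratio is computed or what threshold value is used, so the comparison rule, the default threshold of 1.5, and the names `HEAD`, `TAIL`, and `update_expand_direction` are assumptions made for illustration.

```python
HEAD, TAIL = "head", "tail"  # hypothetical values of the expansion-direction flag


def update_expand_direction(direction, before_len, after_len, threshold=1.5):
    """Flip the flag once the side being expanded outweighs the other by `threshold`."""
    if direction == HEAD and before_len >= threshold * max(after_len, 1):
        return TAIL
    if direction == TAIL and after_len >= threshold * max(before_len, 1):
        return HEAD
    return direction
```

Calling this after every added unit keeps the reading focus near the middle of the candidate text; a truncation-direction flag could be handled symmetrically by removing text from the heavier side.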
The present invention also provides an automatic quotation extraction device, characterized by comprising:
A focus setting module for selecting, from a text, a character or character string to serve as the reading focus;
A content extraction module for extracting the context centered on the reading focus by expanding and/or truncating the text in units of complete semantic units of various scales, thereby obtaining quotation text whose length lies within a predetermined length interval.
As a first aspect of the invention, the complete semantic units preferably include: text fragments of various scales delimited by different types of boundary symbols contained in the text, and minimal semantic units consisting of characters or character strings that carry independent meaning in the text.
More preferably, the automatic quotation extraction device further includes a symbol table, which stores the predefined types of boundary symbols and the set of minimal semantic units.
More preferably, the minimal semantic units include: English words, Chinese characters, URLs, e-mail addresses, time formats, text fragments enclosed by paired punctuation marks, and text fragments with a specific font format.
Preferably, the content extraction module performs the following operations: an expansion operation of selecting candidate text, starting from the character or character string serving as the reading focus and proceeding along an expansion direction, in units of larger-scale complete semantic units delimited by certain types of boundary symbols; a truncation operation of truncating the candidate text along a truncation direction in units of smaller-scale complete semantic units delimited by other types of boundary symbols; and a minimal-semantic-unit expansion/truncation operation of, for the candidate text produced by the expansion and truncation operations, expanding it along the expansion direction and/or truncating it along the truncation direction in units of minimal semantic units.
Preferably, the automatic quotation extraction device further includes a direction detection module, which decides whether to change the expansion direction and the truncation direction according to whether the ratio of the text lengths before and after the reading focus in the candidate text reaches a predetermined direction-change threshold.
Preferably, the automatic quotation extraction device further includes a quotation length setting module for predefining the predetermined length interval for the quotation text.
Preferably, the content extraction module also extracts initial candidate text that lies between valid structure nodes of the text and contains the character or character string serving as the reading focus; and the automatic quotation extraction device further includes a text analysis module, which analyzes the initial candidate text to determine the boundary symbol types and the set of minimal semantic units used to divide the complete semantic units.
More preferably, the text analysis module determines the boundary symbol types and the minimal semantic unit set according to the language type of the initial candidate text, which it can judge from the ratio of Chinese to English characters in the initial candidate text.
More preferably, the length of the initial candidate text extracted by the content extraction module lies within an allowed candidate-quotation length interval, and the quotation length setting module of the device calculates the allowed candidate-quotation length interval from the predetermined length interval.
More preferably, the content extraction module takes the structure node corresponding to the reading focus as the starting point, traverses the structure nodes before and after it, excludes invalid structure nodes and the text they contain, and then selects, as the initial candidate text, the text that lies between valid structure nodes and whose length is within the allowed candidate-quotation length interval.
More preferably, the automatic quotation extraction device further includes a valid node table, which stores the predefined types of valid structure nodes.
As a second aspect of the invention, the complete semantic units can preferably be divided into: expansion units, which are text fragments delimited by expansion-type boundary symbols contained in the text; truncation units, which are text fragments delimited by truncation-type boundary symbols contained in the text; and minimal semantic units, which are the smallest units consisting of characters or character strings that carry independent meaning in the text. The scale of a text fragment delimited by expansion-type boundary symbols is larger than that of a text fragment delimited by truncation-type boundary symbols.
More preferably, the automatic quotation extraction device stores the predefined types of expansion-type boundary symbols in an expansion boundary symbol table, the predefined types of truncation-type boundary symbols in a truncation boundary symbol table, and the predefined minimal semantic units in a minimal semantic unit set.
More preferably, the content extraction module performs the following operations:
Expansion operation: taking the character or character string serving as the reading focus as the initial data, extract text along the expansion direction in units of expansion units and add it to the candidate text until the length of the candidate text exceeds the lower limit of the predetermined length interval; then judge whether the length of the candidate text exceeds the upper limit of the predetermined length interval, and if it does not, take the candidate text as the extracted quotation text;
Truncation operation: if the candidate text obtained by the expansion operation is longer than the upper limit of the predetermined length interval, then, starting from a non-boundary-symbol character at the head or tail of the candidate text, truncate the candidate text along the truncation direction in units of truncation units until its length is below the upper limit of the predetermined length interval;
Minimal-semantic-unit expansion/truncation operation: if the length of the candidate text after the truncation operation is below the lower limit of the predetermined length interval, then, starting from a non-boundary-symbol character at the head or tail of the candidate text, expand the candidate text along the expansion direction in units of minimal semantic units until its length exceeds the lower limit; if, after this minimal-semantic-unit expansion, the length of the candidate text exceeds the upper limit of the predetermined length interval, then, starting from a non-boundary-symbol character at the head or tail of the candidate text, truncate the candidate text along the truncation direction in units of minimal semantic units; through repeated iterations of minimal-semantic-unit expansion and truncation, a candidate text whose length lies within the predetermined length interval is obtained and taken as the quotation text.
More preferably, the direction detection module of the automatic quotation extraction device uses an expansion-direction flag bit to indicate whether the expansion direction is head expansion or tail expansion, and a truncation-direction flag bit to indicate whether the truncation direction is head truncation or tail truncation.
More preferably, each time the content extraction module has added an expansion unit or minimal semantic unit, and each time it has removed a truncation unit or minimal semantic unit, the direction detection module of the device decides whether to change the expansion direction or the truncation direction according to whether the ratio of the text lengths before and after the reading focus in the candidate text reaches a predetermined direction-change threshold.
The beneficial effects of the invention include the following. Starting from a reading focus selected by the user in a text, the invention automatically extracts the context centered on that focus to form a quotation fragment. The fragment is semantically complete, which improves the readability of the quotation and lets readers correctly reconstruct the full semantic scene, overcoming the prior-art defect that text fragments with complete meaning are cut off midway and the quotation becomes hard to read and use. The length of the extracted fragment also satisfies the predetermined length interval, which makes the quotation suitable for a wider range of application environments.
Description of the drawings
Fig. 1 is a schematic flowchart of the automatic quotation extraction method of the present invention;
Fig. 2 is a schematic diagram of a target text prepared for automatic quotation extraction;
Fig. 3 is a schematic diagram of the reading focus selected in the target text and the predetermined length interval that has been set;
Fig. 4 is a schematic diagram of the structured document of the target text in an embodiment of the invention;
Fig. 5 is a detailed flowchart of the context extraction step;
Fig. 6 is a schematic diagram of the candidate text after the expansion operation;
Fig. 7 is a schematic diagram of a page showing the finally obtained quotation text;
Fig. 8 is a schematic structural diagram of the automatic quotation extraction device of the present invention.
Detailed description of the embodiments
Preferred embodiments of the invention are described in detail below with reference to the accompanying drawings. The preferred embodiments are described in order to show the features and benefits of the various aspects of the invention more fully. They are therefore illustrative only and should not be understood as limiting the scope of the invention, which is defined by the claims.
The present invention is a method and device that perform context extraction in units of complete semantic units. The meaning of "complete semantic unit" is introduced first. A complete semantic unit is a text fragment with independent and complete meaning; it is a unit of expression inherent in natural language. In Chinese, for example, punctuation marks such as the full stop ("。"), question mark ("？"), and exclamation mark ("！") serve as boundaries that divide a text into whole sentences. A whole sentence, as a fragment of the full text, expresses a complete semantic scene and constitutes a complete semantic unit. With the comma ("，") or semicolon ("；") as the boundary, a clause can be divided out of a whole sentence; the clause also expresses a relatively complete semantic scene and likewise constitutes a complete semantic unit. Similarly, in English, complete semantic units can be divided using the period (".") or comma (",") as the boundary. Text thus naturally contains various boundary symbols, of which the punctuation marks above are examples, and the text fragments delimited by boundary symbols are the complete semantic units; these fragments inherently carry relatively complete meaning. Not every punctuation mark can serve as a boundary symbol, however. Opening and closing quotation marks, parentheses, and title marks, as well as the Chinese enumeration comma, are not meant to divide whole sentences or clauses and therefore generally cannot be used as boundary symbols. To identify boundary symbols correctly, the invention provides a symbol table that stores the predefined symbol types that can serve as boundary symbols. In the context extraction step described below, symbols in the text are compared against the symbol table to determine which of them are boundary symbols, and the text fragments they delimit are taken as complete semantic units.
Complete semantic units delimited by different types of boundary symbols may have different scales. In the example above, text fragments delimited by boundary symbols such as the full stop ("。"), question mark ("？"), and exclamation mark ("！") are whole sentences and have a larger scale, while text fragments delimited by the comma ("，") or semicolon ("；") are clauses and have a smaller scale. "Scale" here refers to the level of semantic division, not to text length. A clause delimited by commas may contain far more characters than a whole sentence delimited by a full stop, but in terms of scale the sentence is still the larger unit.
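The two scales can be illustrated by splitting the same text with two symbol sets. This is a minimal sketch: the constants hold only the Chinese symbols named above, and the names `SENTENCE_SYMBOLS`, `CLAUSE_SYMBOLS`, and `split_units` are hypothetical.

```python
SENTENCE_SYMBOLS = "。？！"  # larger scale: whole sentences
CLAUSE_SYMBOLS = "，；"      # smaller scale: clauses


def split_units(text, symbols):
    """Split text into fragments that each end with one of the boundary symbols."""
    units, start = [], 0
    for i, ch in enumerate(text):
        if ch in symbols:
            units.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        units.append(text[start:])  # trailing fragment without a closing symbol
    return units


sample = "你好吗？我很好，谢谢。"
sentences = split_units(sample, SENTENCE_SYMBOLS)
clauses = split_units(sample, SENTENCE_SYMBOLS + CLAUSE_SYMBOLS)
```

Clause-level splitting uses both symbol sets, because a sentence terminator also ends the last clause of the sentence. Here the first sentence has four characters while the second has seven, yet both belong to the same scale.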
A text fragment made up of a run of consecutive characters that is not separated by any such boundary symbol can still be split into several complete semantic primitives. These are the minimum semantic primitives, each consisting of a character or character string with independent meaning. For example, an English word or a Chinese character may constitute a minimum semantic primitive. In addition, character strings such as URLs, e-mail addresses, and time formats clearly cannot be split further without destroying their integrity, so they too constitute minimum semantic primitives. A text fragment enclosed by paired punctuation marks (for example, text between opening and closing brackets, between opening and closing quotation marks, or between opening and closing title marks) forms a whole and should not be split further, so it also constitutes a minimum semantic primitive. In some cases a text fragment with a specific font format constitutes a minimum semantic primitive as well. For example, in English writing, proper nouns such as personal names and place names are often set in a font different from the rest of the text, so a text fragment whose font differs from that of the rest of the text can be defined as a minimum semantic primitive; likewise, a text fragment with a specific font format such as underlining, bold, italics, blackening, or red marking can be defined as a minimum semantic primitive. To identify minimum semantic primitives during text extraction, the present invention stores the predefined minimum semantic primitives in a minimum semantic primitive set, and any character or character string in the text that matches an entry in that set is treated as a complete semantic primitive. In terms of scale, the minimum semantic primitive is obviously the smallest.
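The splitting of unpunctuated text into minimum semantic primitives can be illustrated with a minimal sketch. It assumes the minimum semantic primitive set is expressed as an ordered list of regular expressions; the patterns, the name `MIN_UNIT_PATTERNS`, and the function `minimal_units` are hypothetical and are not taken from the patent. Font-based primitives are not covered because plain strings carry no font information.

```python
import re

# Hypothetical minimum semantic primitive set, tried in order. Longer
# structured strings come first so that a URL or an e-mail address is
# never broken up into individual words.
MIN_UNIT_PATTERNS = [
    r"https?://[^\s\u4e00-\u9fff]+",                        # URL
    r"[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+",  # e-mail address
    r"\d{1,2}:\d{2}(?::\d{2})?",                            # time such as 12:30
    r"\([^()]*\)|（[^（）]*）|“[^“”]*”|《[^《》]*》",           # paired punctuation
    r"[A-Za-z]+(?:'[A-Za-z]+)?",                            # English word
    r"\d+",                                                 # plain number
    r"[\u4e00-\u9fff]",                                     # single Chinese character
]
MIN_UNIT_RE = re.compile("|".join(MIN_UNIT_PATTERNS))


def minimal_units(text):
    """Return the minimum semantic primitives found in text, in order."""
    return [match.group(0) for match in MIN_UNIT_RE.finditer(text)]
```

Characters that match none of the patterns, such as spaces and unpaired punctuation, are simply skipped in this sketch.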
Fig. 1 is a flowchart of the automatic quotation extraction method in an embodiment of the present invention. The method first includes a focus setting step 101, in which a single character or a continuous character string in the target text shown in Fig. 2 may be selected as the reading focus. For example, the user may select a single character in the target text, or a keyword, phrase, or short sentence made up of a character string, as the reading focus. The length of the selected character string should be below a predetermined threshold, for example no longer than eight characters. As shown in Fig. 3, the user selects the character string "structuring between information" in the above target text as the reading focus. The user may select the reading focus directly while reading text such as a web page or an e-book, or may select it in some other way. For example, the user may enter the character or character string serving as the reading focus into a search engine as a search term; the search engine then automatically matches text containing the search term and provides the user with a quotation centered on the reading focus.
While the character or character string serving as the reading focus is being selected, in step 102 the user may predefine the upper limit and lower limit of the length range of the final quotation text extracted by the method of the present invention; the upper limit and lower limit together form the predetermined length interval. The box in Fig. 3 shows the upper limit of the length range of the final quotation text formed around the reading focus. In this example, the lower limit of the quotation text length range is the text length of the reading focus itself. The predetermined length interval may also be preset.
In the initial extraction step 103, an initial alternative text is extracted that lies between the effective structure nodes of the text and contains the character or character string serving as the reading focus. Digitized text is mostly in the form of structured documents, which contain the text content itself together with structural nodes such as tags. Fig. 4 is a schematic diagram of the structured document of the target text; the structured document is the source file of a web page, in which web page tags such as <p>, <a>, and <img> serve as the structural nodes. In this step, the structural node corresponding to the reading focus is first obtained from the position, within the structured document, of the text that contains the reading focus. In Fig. 4, the text containing the reading focus "structuring between information" follows the second <p> node of the structured document, so the structural node associated with the reading focus is determined to be the second <p> node. Taking this second <p> node as the starting point, the structural nodes before and after the starting point are traversed in turn. In this example the traversal first proceeds backward and encounters, in order, the third <p> node, an <a> node, and an <img> node. Whether each of these is an effective structure node is determined by querying the effective node table, which stores the types of all predefined effective structure nodes. In this example, the query shows that the <a> node is an effective structure node, so the text fragment "(microblogging)" between the <a> nodes is added to the initial alternative text, whereas the <img> node is found not to be an effective structure node, so the <img> node and the content it contains are excluded. By iterating forward and backward in turn, the text contained between effective structure nodes is added to the initial alternative text until the length of the initial alternative text lies within the alternative quotation allowed length interval. This interval defines the maximum and minimum text lengths allowed for the initial alternative quotation extracted from the text, and its upper limit can be calculated from the predetermined length interval defined for the final quotation text. For example, in a microblog-sharing application where the upper limit of the predetermined length interval is 120 characters, the upper limit of the alternative quotation allowed length interval is greater than 120 characters, for example several times 120 characters. Extracting an initial alternative quotation that is larger than the final quotation text gives the text analysis step 104 described below a fuller range to analyze, which helps ensure the accuracy of the analysis. In this example, the lower limit of the alternative quotation allowed length interval is likewise the text length of the reading focus itself.
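The node traversal of step 103 can be sketched as follows. This is an illustration only: it assumes the structured document has already been flattened into an ordered list of (tag, text) pairs, and the names `EFFECTIVE_NODE_TYPES` and `initial_alternative_text`, as well as the contents of the table, are hypothetical.

```python
# Hypothetical effective node table: tag types whose text may be collected.
EFFECTIVE_NODE_TYPES = {"p", "a", "span"}


def initial_alternative_text(nodes, focus_index, max_len):
    """Collect text around nodes[focus_index] from effective nodes only.

    nodes is an ordered list of (tag, text) pairs; max_len stands for the
    upper limit of the alternative quotation allowed length interval.
    """
    parts = [nodes[focus_index][1]]
    total = len(parts[0])
    before, after = focus_index - 1, focus_index + 1
    while before >= 0 or after < len(nodes):
        if after < len(nodes):                      # one node toward the tail
            tag, text = nodes[after]
            after += 1
            if tag in EFFECTIVE_NODE_TYPES:
                if total + len(text) > max_len:
                    break
                parts.append(text)
                total += len(text)
        if before >= 0:                             # one node toward the head
            tag, text = nodes[before]
            before -= 1
            if tag in EFFECTIVE_NODE_TYPES:
                if total + len(text) > max_len:
                    break
                parts.insert(0, text)
                total += len(text)
    return "".join(parts)
```

For a final quotation limit of 120 characters, `max_len` might be set to a multiple such as 360; the exact multiple is an assumption, since the description only says "several times".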
After the initial alternative text has been determined in step 103, it is analyzed in the text analysis step 104 to determine the language type of the target text. The language type in turn determines which types of boundary symbol, and which minimum semantic primitive set, are used to divide the complete semantic primitives in the context extraction step described below, which operates in units of complete semantic primitives. In this step, the proportions of Chinese characters and English characters in the extracted initial alternative text are counted. In the target text of this embodiment Chinese characters clearly account for the larger share, so the language type is determined to be Chinese. The boundary symbols that divide complete semantic primitives may differ between language types. For example, in Chinese the full stop ("."), exclamation mark ("!"), and question mark ("?") can serve as boundary symbols delimiting a text fragment with complete meaning, whereas in English the period (". ") can serve as a boundary symbol delimiting a text fragment with complete meaning. The contents of the minimum semantic primitive set naturally differ between language types as well. In English, a large number of English words belong to the minimum semantic primitives, whereas for Chinese, apart from a small number of abbreviations formed from English character strings, the minimum semantic primitive set does not contain most other English words. Therefore, according to the language type detected in this step, different boundary symbol types and minimum semantic primitive sets can be selected for Chinese text and for English text.
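The character-ratio test of step 104 can be sketched as below. The Unicode range used for Chinese characters, the tie-breaking rule, and the contents of `BOUNDARY_TABLES` are assumptions made for illustration; the description only states that the proportions of Chinese and English characters are compared and that each language type has its own tables.

```python
def detect_language(text):
    """Return "zh" or "en" depending on which character class dominates."""
    chinese = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    english = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    # Ties go to Chinese here; this choice is arbitrary.
    return "zh" if chinese >= english else "en"


# Hypothetical per-language boundary symbol tables.
BOUNDARY_TABLES = {
    "zh": {"extend": set("。！？"), "intercept": set("，；")},
    "en": {"extend": set(".!?"), "intercept": set(",;")},
}


def tables_for(text):
    """Select the boundary symbol tables that match the detected language."""
    return BOUNDARY_TABLES[detect_language(text)]
```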
Next, in the context extraction step 105, text is extended and intercepted around the reading focus in units of complete semantic primitives of various scales. The text is first extended in units of large-scale complete semantic primitives to obtain the alternative text; the alternative text is then intercepted in units of smaller-scale complete semantic primitives; and, over multiple iterations, extension and interception finally proceed in units of the minimum semantic primitive, which has the smallest scale. The result is a quotation text that is centered on the character or character string serving as the reading focus and whose length lies within the predetermined length interval. During extension and interception, an extension direction flag and an intercepting direction flag are set to indicate whether extension and interception, respectively, are directed toward the head or the tail, and thus to determine the direction of each extension and interception. Fig. 5 is a detailed flowchart of the context extraction step.
In the context extraction step, the extension operation of step 501 is performed first: starting from the selected reading focus, text fragments before or after the reading focus are chosen along the extension direction and added to the alternative text. Importantly, every extension is performed in units of large-scale complete semantic primitives; that is, each extension adds one complete semantic primitive to the alternative text. A large-scale complete semantic primitive here is a text fragment delimited by certain predefined types of boundary symbol. The boundary symbols of these types used in the extension operation are called extended pattern boundary symbols, and a text fragment delimited by extended pattern boundary symbols is called an expanding element. In this example, a whole sentence delimited by a full stop (".") has a large scale and is a large-scale complete semantic primitive, so the full stop serves as an extended pattern boundary symbol and the whole sentence it delimits serves as an expanding element in the extension operation. A clause delimited by a comma (", "), by contrast, has a relatively small scale and is not used as an expanding element. Which types of boundary symbol serve as extended pattern boundary symbols is defined in advance; for ease of lookup and management, the types of extended pattern boundary symbol are stored in the extended boundary symbol table.
When the extension operation 501 is carried out, the reading focus is taken as the starting point, and the extension direction flag is first set to head extension or tail extension according to the offset of the reading focus within the initial alternative quotation. If the reading focus lies in the first half of the initial alternative quotation, the flag is set to tail extension; if it lies in the second half, the flag is set to head extension. In this example the reading focus "structuring between information", taken as the starting point, lies in the first half of the initial alternative quotation, so the extension direction flag is initially set to tail extension.
Starting from the reading focus, characters are read one by one toward the tail, beginning with the first character after "structuring between information". When the first non-text character (such as a punctuation mark) is encountered, the extended boundary symbol table is queried to determine whether that character is an extended pattern boundary symbol. If it is, the text fragment between the reading focus and that character is added to the alternative quotation as one expanding element (the initial alternative text having already been cleared); if it is not, reading continues toward the tail until the first extended pattern boundary symbol is encountered. Next, within the alternative text, it is checked whether the ratio between the text length before the reading focus and the text length after it has reached a predetermined direction change threshold. If the threshold has been reached, the extension direction flag is changed to head extension; otherwise it remains tail extension. In this example, the first character after "structuring between information" is the full stop ("."), an extended pattern boundary symbol, so the text length before "structuring between information" (0 characters) is compared with the text length after it (1 character) to determine the extension direction; the result is that tail extension continues.
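The handling of the extension direction flag can be sketched as two small functions. The description does not give the exact formula for the ratio or a value for the direction change threshold, so the add-one smoothing and the default threshold of 3.0 below are assumptions, chosen so that the worked example (0 characters before, 1 character after) keeps tail extension while a whole sentence on one side triggers a change. The function names are hypothetical.

```python
def initial_direction(alternative_len, focus_start, focus_len):
    """Set the first extension direction from the offset of the reading focus."""
    focus_mid = focus_start + focus_len / 2
    return "tail" if focus_mid < alternative_len / 2 else "head"


def next_direction(current, before_len, after_len, threshold=3.0):
    """Flip the flag once the side being extended outweighs the other side.

    before_len and after_len are the text lengths before and after the
    reading focus. Add-one smoothing (an assumption) keeps the ratio
    defined when one side is still empty.
    """
    if current == "tail":
        grown, other = after_len, before_len
    else:
        grown, other = before_len, after_len
    if (grown + 1) / (other + 1) >= threshold:
        return "head" if current == "tail" else "tail"
    return current
```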
Next, it is determined whether the text length of the alternative quotation exceeds the lower limit of the predetermined length interval. If it does, the extension operation stops. If it does not, the current alternative quotation has not yet reached the required minimum quotation length, so the extension operation continues and the next expanding element along the extension direction is added to the alternative quotation. In this example, extension continues toward the tail until the full stop (".") after "accurate push" is reached. The text fragment delimited by this full stop as an extended pattern boundary symbol, namely the whole sentence "in this process, the user can ... accurate push.", is thus added to the alternative text as an expanding element. Boundary symbols encountered along the way, such as the pause mark and the comma, are not extended pattern boundary symbols, so the text they delimit is included in this extension as part of the expanding element.
After this text fragment has been added, the extension direction is evaluated and adjusted as described above. The ratio of the text lengths before and after the reading focus is found to exceed the direction change threshold, so the extension direction flag is changed to head extension. The length of the current alternative text is found still not to exceed the lower limit of the predetermined length interval, so extension proceeds toward the head from the reading focus until the text fragment "know in this period ... allow information and" has been added to the alternative text as an expanding element. The lower limit has now been reached, so the extension operation 501 ends, and no further extension in units of expanding elements toward the tail or the head is needed. The alternative text obtained by the extension operation 501 is shown in Fig. 6.
Because extension proceeds in units of large-scale expanding elements, the alternative text obtained by the extension operation 501 may exceed the upper limit of the predetermined length interval. The length of that alternative text is therefore checked: if it is not greater than the upper limit, the alternative text is used as the extracted quotation text; if it is greater than the upper limit, the intercept operation 502 described below must be performed on the alternative text.
The intercept operation 502 intercepts the alternative text along the intercepting direction indicated by the intercepting direction flag, in units of complete semantic primitives whose scale is smaller than that of the expanding elements above. Similarly, each interception removes one complete semantic primitive. The removed complete semantic primitive of relatively small scale is likewise a text fragment delimited by certain predefined types of boundary symbol. The boundary symbols of these types used in the intercept operation are called intercepting type boundary symbols, and a text fragment delimited by intercepting type boundary symbols is called an interception unit. In this example, a clause delimited by a comma (", ") serves as an interception unit; a clause serving as an interception unit is clearly smaller in semantic scale than a whole sentence serving as an expanding element. Which types of boundary symbol serve as intercepting type boundary symbols is defined in advance and stored in the intercepting boundary symbol table for lookup during execution.
In execution, as in the extension operation, the intercepting direction flag is first set to head interception or tail interception according to the ratio of the text lengths before and after the reading focus. Because the reading focus lies toward the front of the current alternative text, the ratio indicates tail interception. Starting from the character "send", which is the last non-boundary-symbol character at the tail of the alternative text, characters are read one by one toward the head until the comma after "complicated information" is reached. A query shows that the comma (", ") is an intercepting type boundary symbol stored in the intercepting boundary symbol table, so the text fragment after that comma is intercepted from the alternative text as one interception unit.
After each interception, whether the intercepting direction needs to change is determined in the manner described above. It is then determined whether the length of the alternative text is already less than the upper limit of the predetermined length interval. If it is not, interception continues along the intercepting direction in units of interception units delimited by intercepting type boundary symbols until the length is less than the upper limit. Once the length is less than the upper limit, the intercept operation 502 ends.
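The intercept operation 502 can be sketched in the same style. The symbol tables and the function `intercept` are hypothetical, and the direction rule used here (cut on the side that currently holds more text) is a simplified stand-in for the ratio test against the direction change threshold. The sketch never cuts into the reading focus and stops early when no intercepting type boundary symbol remains on the chosen side.

```python
# Hypothetical symbol tables for the sketch.
SENTENCE_END = set(".!?。！？")
INTERCEPT_SYMBOLS = set(",;，；")
BOUNDARY = SENTENCE_END | INTERCEPT_SYMBOLS


def intercept(text, start, end, f_start, f_end, upper):
    """Shrink text[start:end] one interception unit at a time.

    Stops once the length is below upper or no clause boundary is left
    outside the reading focus text[f_start:f_end].
    """
    while end - start >= upper:
        cut_tail = (end - f_end) >= (f_start - start)
        if cut_tail:
            i = end - 1
            # Start from the last character that is not a boundary symbol.
            while i >= f_end and text[i] in BOUNDARY:
                i -= 1
            # Read toward the head until a clause boundary is reached.
            while i >= f_end and text[i] not in INTERCEPT_SYMBOLS:
                i -= 1
            if i < f_end:
                break
            end = i + 1                    # keep the boundary symbol itself
        else:
            i = start
            while i < f_start and text[i] not in INTERCEPT_SYMBOLS:
                i += 1
            if i >= f_start:
                break
            start = i + 1                  # drop the clause and its symbol
    return start, end
```

For the text "aa,bbbb,ccFOCUSdd,eeee,ffff." with the focus at offsets 10 to 15 and an upper limit of 20, the sketch first drops "ffff." from the tail, then "aa," from the head, then "eeee," from the tail, leaving "bbbb,ccFOCUSdd,".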
After the intercept operation 502 is complete, it is determined whether the intercepted alternative text has fallen below the lower limit of the predetermined length interval. In this example, the quotation text obtained after the intercept operation is not below the lower limit and can therefore serve as the final quotation text. In some cases, however, the alternative text after the intercept operation may again be below the lower limit of the predetermined length interval, and the minimum semantic primitive extension and intercept operation of step 503 must then be performed.
The meaning of the minimum semantic primitive, examples of it, and the minimum semantic primitive set that stores the predefined minimum semantic primitives have been described in detail above. In step 503, the alternative text is extended along the extension direction in units of the minimum semantic primitives defined in the minimum semantic primitive set, starting from the non-boundary-symbol character at the head or tail of the alternative text, one minimum semantic primitive per extension. After each extension, whether the extension direction needs adjusting is determined in the manner described above, and whether the resulting alternative quotation length exceeds the lower limit of the predetermined length interval is checked again, until the length exceeds the lower limit. Some minimum semantic primitives (such as a URL or an e-mail address) may contain many characters. If, after a minimum semantic primitive extension, the alternative quotation length again exceeds the upper limit of the predetermined length interval, the alternative quotation is intercepted along the intercepting direction in units of the minimum semantic primitive, starting from the non-boundary-symbol character at the head or tail of the alternative text. Through multiple iterations of minimum semantic primitive extension and interception, an alternative text whose length lies within the predetermined length interval is finally obtained and used as the quotation text.
Fig. 7 shows a page with the finally obtained quotation text.
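Step 503 can be sketched on a list of minimum semantic primitives rather than on raw characters. The sketch assumes the source text has already been split into an ordered list `units`, that `units[lo:hi]` is the current alternative text, and that `units[f_lo:f_hi]` is the reading focus. The function name, the side-selection rule, and the `max_rounds` guard are assumptions; lengths are summed over the units only, ignoring separators.

```python
def fit_with_min_units(units, lo, hi, f_lo, f_hi, lower, upper, max_rounds=50):
    """Adjust units[lo:hi] one minimum semantic primitive at a time.

    Extends while the length does not exceed lower and intercepts while it
    exceeds upper. max_rounds guards against oscillation when one long
    primitive (for example a URL) cannot fit inside the interval.
    """
    def size(a, b):
        return sum(len(u) for u in units[a:b])

    for _ in range(max_rounds):
        n = size(lo, hi)
        before, after = size(lo, f_lo), size(f_hi, hi)
        if n <= lower:
            # Extend on the side that currently holds less context.
            if hi < len(units) and (after <= before or lo == 0):
                hi += 1
            elif lo > 0:
                lo -= 1
            else:
                break
        elif n > upper:
            # Intercept on the side that holds more context, never the focus.
            if hi > f_hi and (after >= before or lo == f_lo):
                hi -= 1
            elif lo < f_lo:
                lo += 1
            else:
                break
        else:
            break
    return lo, hi
```

If the loop ends because `max_rounds` was reached, the returned span may still lie outside the interval; how to resolve that case is not specified in the description.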
To implement the above method, the present invention also provides an automatic quotation extraction device. Fig. 8 is a schematic diagram of the overall structure of the device, which includes a focus setting module 801, a quotation length setting module 802, a text analysis module 803, a content extraction module 804, a direction detection module 805, an effective node table 806, an extended boundary symbol table 807, an intercepting boundary symbol table 808, and a minimum semantic primitive set 809. The focus setting module 801 is used to select, manually or automatically, a character or character string in the text as the reading focus. The quotation length setting module 802 is used to predefine, manually or automatically, the predetermined length interval for the quotation text, and to calculate the alternative quotation allowed length interval from the predetermined length interval. The content extraction module 804, following the method described above, queries the effective node types in the effective node table 806 and extracts the initial alternative text that lies between the effective structure nodes of the text and contains the character or character string serving as the reading focus. The text analysis module 803 determines the language type of the text by analyzing the initial alternative text, for example by counting Chinese and English characters. For different language types such as Chinese and English, the device may predefine and store a separate extended boundary symbol table 807, intercepting boundary symbol table 808, and minimum semantic primitive set 809 suited to the features of each language; according to the language type obtained by the text analysis module 803, the content extraction module 804 selects the extended boundary symbol table 807, intercepting boundary symbol table 808, and minimum semantic primitive set 809 corresponding to that language type. Further, following the method described above, the content extraction module 804 reads and analyzes the text and, by querying the extended boundary symbol table 807, the intercepting boundary symbol table 808, and the minimum semantic primitive set 809, performs the corresponding extension operation, intercept operation, and minimum semantic primitive extension and intercept operation in units of complete semantic primitives of different scales, namely expanding elements, interception units, and minimum semantic primitives. While these operations are performed, the direction detection module 805 determines the extension or intercepting direction and updates the extension direction flag and the intercepting direction flag accordingly. Finally, an alternative text whose length lies within the predetermined length interval is obtained and used as the quotation text. The content extraction module 804 reserves an interface for connecting to other product systems, which involves an association mechanism. Specifically, after being extracted from a paragraph of the original article, a quotation is not a fully independent entity but remains associated with the original document, so the results of some operations on the quotation may affect content in the original document. Through the interface module 810, the system can be coupled to such other product systems and provide them with the extracted quotation text.
A quotation text automatically extracted by the above method of the present invention has a length that meets the predetermined length range, and is therefore suited to application environments, such as microblogs and search engine pages, that strictly limit quotation text length. Moreover, because every extension and intercept operation proceeds in units of complete semantic primitives, semantic integrity and readability are preserved to the greatest extent. This helps the reader correctly reconstruct the context around the reading focus and overcomes the defect in the prior art whereby a text fragment with complete meaning is cut off midway, impairing the reading and use of the quotation.
Claims (32)
1. An automatic quotation extraction method, characterized by comprising:
a focus setting step of selecting, from a text, a character or character string serving as a reading focus;
a context extraction step of extracting the context centered on the reading focus by text extension and/or interception carried out in units of complete semantic primitives, thereby obtaining a quotation text whose text length lies within a predetermined length interval and which has semantic integrity;
wherein the context extraction step comprises:
an extension step of choosing alternative text, starting from the character or character string serving as the reading focus and along an extension direction, in units of large-scale complete semantic primitives delimited by extended pattern boundary symbols; and/or
an intercepting step of intercepting the alternative text, along an intercepting direction, in units of smaller-scale complete semantic primitives delimited by intercepting type boundary symbols;
and
a minimum semantic primitive extension and intercepting step of, for the alternative text processed by the extension step and/or the intercepting step, extending the alternative text along the extension direction and/or intercepting it along the intercepting direction in units of minimum semantic primitives;
wherein the extended pattern boundary symbols and the intercepting type boundary symbols are each boundary symbols of predefined types.
2. The automatic quotation extraction method according to claim 1, characterized in that the complete semantic primitives include: text fragments of various scales delimited by the different types of boundary symbol contained in the text, and minimum semantic primitives consisting of characters or character strings in the text that have independent meaning.
3. The automatic quotation extraction method according to claim 2, characterized in that the types of the boundary symbols and the set of minimum semantic primitives are predefined by symbol tables.
4. The automatic quotation extraction method according to claim 2, characterized in that the minimum semantic primitives include: English words, Chinese characters, URLs, e-mail addresses, time formats, text fragments enclosed by paired punctuation marks, and text fragments with a specific font format.
5. The automatic quotation extraction method according to claim 1, characterized in that whether to change the extension direction and the intercepting direction is decided according to whether the ratio of the text lengths before and after the character or character string serving as the reading focus in the alternative text reaches a predetermined direction change threshold.
6. The automatic quotation extraction method according to claim 2, characterized in that the predetermined length interval for the quotation text is predefined.
7. The automatic quotation extraction method according to claim 2, characterized in that the method further comprises, before the context extraction step: an initial extraction step of extracting an initial alternative text that lies between the effective structure nodes of the text and contains the character or character string serving as the reading focus; and a text analysis step of determining, by analyzing the initial alternative text, the boundary symbol types and the minimum semantic primitive set used to divide the complete semantic primitives.
8. The automatic quotation extraction method according to claim 7, characterized in that the boundary symbol types and the minimum semantic primitive set are determined according to the language type of the initial alternative text.
9. The automatic quotation extraction method according to claim 7, characterized in that the length of the initial alternative text extracted in the initial extraction step lies within an alternative quotation allowed length interval, and the alternative quotation allowed length interval is calculated from the predetermined length interval.
10. The automatic quotation extraction method according to claim 7, characterized in that the initial extraction step comprises: taking the structural node corresponding to the character or character string serving as the reading focus as a starting point, traversing the structural nodes before and after the starting point, excluding the invalid structure nodes among them together with the text they contain, and selecting, as the initial alternative text, text that lies between effective structure nodes and whose length lies within the alternative quotation allowed length interval.
11. The automatic quotation extraction method according to claim 10, characterized in that the types of the effective structure nodes are predefined by an effective node table.
12. The automatic quotation extraction method according to claim 1, characterized in that the complete semantic primitives are divided into: expanding elements, which are text fragments delimited by the extended pattern boundary symbols contained in the text; interception units, which are text fragments delimited by the intercepting type boundary symbols contained in the text; and minimum semantic primitives, which are the smallest units consisting of characters or character strings in the text that have independent meaning; and the scale of the text fragments delimited by the extended pattern boundary symbols is larger than the scale of the text fragments delimited by the intercepting type boundary symbols.
13. The automatic quotation extraction method according to claim 12, characterized in that the types of the extended pattern boundary symbols are predefined by an extended boundary symbol table, the types of the intercepting type boundary symbols are predefined by an intercepting boundary symbol table, and the minimum semantic primitives are predefined by a minimum semantic primitive set.
14. quotation extraction methods according to claim 12, it is characterised in that the context extraction step bag
Include:Extended operation, it is single with the extension along propagation direction with described as the character or character string of focus is read as origination data
Unit extracts text and adds alternative text for unit, until the alternative text size is more than under predetermined length interval
Limit;Judge whether interval more than the predetermined length upper limit of the length of the alternative text, if the no more than upper limit, should
Alternative text is used as the quotation text for being extracted;
Intercept operation, if the alternative text that extended operation is obtained is more than the interval upper limit of the predetermined length, with positioned at alternative
The initial and end portion of text and the character of non-boundary symbol are starting point, along intercepting direction in units of interception unit to described alternative
Text is intercepted, until the alternative text size upper limit interval less than the predetermined length;
Minimum semantic primitive extends intercept operation, if the alternative text size is less than described pre- after the intercept operation
The interval lower limit of measured length, then with the character positioned at the initial and end portion of alternative text and non-boundary symbol as starting point, along extension side
The alternative text is extended in units of the minimum semantic primitive, until the alternative text size is more than described
The interval lower limit of predetermined length;If after minimum semantic primitive extension described in Jing, the alternative text size is more than described pre-
The interval upper limit of measured length, then with the character positioned at the initial and end portion of alternative text and non-boundary symbol as starting point, along intercepting side
The alternative text is intercepted in units of the minimum semantic primitive;Extended by minimum semantic primitive and intercepted
Successive ignition obtains length in the interval interior alternative text of predetermined length as the quotation text.
15. The quotation extraction method according to claim 14, characterized in that an extension direction flag bit identifies whether the extension direction is head
extension or tail extension, and an intercepting direction flag bit identifies whether the intercepting direction is head intercepting or tail intercepting.
16. The quotation extraction method according to claim 14, characterized in that, each time an expansion unit or a minimum semantic primitive has been extended, and each
time an interception unit or a minimum semantic primitive has been intercepted, whether to change the extension direction or the intercepting direction is decided
according to whether the ratio of the text lengths located before and after the character or character string serving as the reading focus in the alternative text reaches
a predetermined direction change threshold.
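The direction check of claims 15 and 16 can be illustrated with a minimal Python sketch. The claims do not state how the ratio is compared with the threshold, so the rule below (switch sides once the side currently being extended is `threshold` times longer than the other side) is an assumption, as are the names `Direction`, `next_direction` and the default threshold of 1.5.

```python
from enum import Enum


class Direction(Enum):
    """Stands in for the direction flag bit of claim 15 (hypothetical name)."""
    HEAD = "head"  # operate at the start of the alternative text
    TAIL = "tail"  # operate at the end of the alternative text


def next_direction(before_len: int, after_len: int,
                   current: Direction, threshold: float = 1.5) -> Direction:
    """Return the direction for the next extension step.

    before_len and after_len are the text lengths before and after the
    reading focus. Assumed rule: switch sides once the side being extended
    is at least `threshold` times as long as the other side.
    """
    if current is Direction.HEAD and before_len >= threshold * max(after_len, 1):
        return Direction.TAIL
    if current is Direction.TAIL and after_len >= threshold * max(before_len, 1):
        return Direction.HEAD
    return current
```

A check of this kind would be called after every extended or intercepted unit, which keeps the reading focus roughly centered in the quotation.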
17. An automatic quotation extraction device, characterized by comprising:
a focus setting module, configured to select from a text a character or character string serving as a reading focus;
a content extraction module, configured to extract the context centered on the reading focus by extending and/or intercepting the text in units of complete semantic
primitives, so as to obtain a quotation text whose length lies within a predetermined length interval and which has semantic integrity;
wherein the content extraction module is configured to perform the following operations:
an extension step of selecting alternative text along the extension direction, starting from the character or character string serving as the reading focus, in units of
larger-scale complete semantic primitives delimited by extension-type boundary symbols; and/or
an intercepting step of intercepting the alternative text along the intercepting direction in units of smaller-scale complete semantic primitives delimited by
intercepting-type boundary symbols;
and
a minimum semantic primitive extension and intercepting step of, for the alternative text processed by the extension step and/or the intercepting step, extending the
alternative text along the extension direction and/or intercepting it along the intercepting direction in units of the minimum semantic primitive;
wherein the extension-type boundary symbols and the intercepting-type boundary symbols are each boundary symbols of a predefined type.
18. The automatic quotation extraction device according to claim 17, characterized in that the complete semantic primitives include:
text fragments of various scales delimited by the different types of boundary symbols contained in the text, and minimum semantic primitives consisting of characters or
character strings in the text that have independent semantics.
19. The automatic quotation extraction device according to claim 18, characterized in that the device includes a symbol table, the symbol table being used to store the
predefined types of boundary symbols and the set of minimum semantic primitives.
20. The automatic quotation extraction device according to claim 18, characterized in that the minimum semantic primitives include:
English words, Chinese characters, URL addresses, e-mail addresses, time formats, text fragments between punctuation marks used in pairs,
and text fragments with a specific font format.
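As an illustration of the minimum semantic primitives listed in claim 20, the following Python sketch splits a string into such units with a single regular expression. The patterns are simplified assumptions rather than the patent's own definitions, the name `split_min_units` is hypothetical, and paired punctuation and font-formatted fragments are left out because they need structural information that a plain string does not carry.

```python
import re

# Alternatives are tried in order, so the longer units (URL, e-mail, time)
# take precedence over plain words and single characters.
MIN_UNIT = re.compile(
    r"""
    https?://[^\s<>"]+            # URL address
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+  # e-mail address
  | \d{1,2}:\d{2}(?::\d{2})?      # time such as 09:30 or 09:30:15
  | [A-Za-z]+(?:'[A-Za-z]+)?      # English word, optionally with an apostrophe
  | [\u4e00-\u9fff]               # a single Chinese character
  | \S                            # any other non-space character
    """,
    re.VERBOSE,
)


def split_min_units(text: str) -> list[str]:
    """Return the minimum units of `text` in order; whitespace is dropped."""
    return MIN_UNIT.findall(text)
```

For example, `split_min_units("Mail bob@example.com at 09:30")` keeps the e-mail address and the time as single units instead of breaking them at `@`, `.` or `:`.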
21. The automatic quotation extraction device according to claim 18, wherein the device further includes a direction detection module, configured to decide whether to
change the extension direction and the intercepting direction according to whether the ratio of the text lengths located before and after the character or character string
serving as the reading focus in the alternative text reaches a predetermined direction change threshold.
22. The automatic quotation extraction device according to claim 18, characterized in that the device further includes a quotation length setting module, configured to
predefine the predetermined length interval of the quotation text.
23. The automatic quotation extraction device according to claim 18, characterized in that the content extraction module is further configured to perform the following
operation: extracting an initial alternative text that lies between valid structure nodes of the text and contains the character or character string serving as the reading
focus; and the device further includes a text analysis module, which analyzes the initial alternative text to determine the boundary symbol types used to delimit the
complete semantic primitives and the set of minimum semantic primitives.
24. The automatic quotation extraction device according to claim 23, characterized in that the text analysis module determines the boundary symbol types and the set of
minimum semantic primitives according to the language type of the initial alternative text.
25. The automatic quotation extraction device according to claim 23, characterized in that the length of the initial alternative text extracted by the content extraction
module lies within an allowed alternative quotation length interval, and the quotation length setting module of the device is configured to calculate the allowed
alternative quotation length interval from the predetermined length interval.
26. The automatic quotation extraction device according to claim 23, characterized in that the content extraction module takes the structure node corresponding to the
character or character string serving as the reading focus as a starting point, traverses the structure nodes before and after the starting point, excludes the invalid
structure nodes among them together with the text they contain, and then selects, as the initial alternative text, the text that lies between valid structure nodes and
whose length is within the allowed alternative quotation length interval.
27. The automatic quotation extraction device according to claim 26, characterized in that the device includes a valid node table, the valid node table being used to
store the predefined types of valid structure nodes.
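The node traversal of claims 26 and 27 can be sketched as follows in Python. This is only one possible reading: the document is assumed to be already flattened into `(node_type, text)` pairs, the entries of `VALID_NODE_TYPES` are hypothetical, and the allowed alternative quotation length interval is reduced to a single upper bound `max_len`.

```python
# Hypothetical valid node table; the claims only state that such a table exists.
VALID_NODE_TYPES = frozenset({"p", "li", "blockquote", "h1", "h2", "h3"})


def initial_alternative_text(nodes, focus_index, max_len):
    """Collect text around the focus node from valid structure nodes only.

    nodes is a list of (node_type, text) pairs in document order and
    focus_index points at the node containing the reading focus. Neighbours
    are added alternately before and after the focus; invalid nodes and their
    text are skipped, and a side is closed once the next valid node would
    push the total length above max_len or the document edge is reached.
    """
    chosen = [focus_index]
    total = len(nodes[focus_index][1])
    cursor = {-1: focus_index - 1, 1: focus_index + 1}
    open_sides = [-1, 1]
    while open_sides:
        for step in list(open_sides):
            i = cursor[step]
            # Skip invalid structure nodes together with the text they contain.
            while 0 <= i < len(nodes) and nodes[i][0] not in VALID_NODE_TYPES:
                i += step
            if 0 <= i < len(nodes) and total + len(nodes[i][1]) <= max_len:
                chosen.append(i)
                total += len(nodes[i][1])
                cursor[step] = i + step
            else:
                open_sides.remove(step)
    return "".join(nodes[i][1] for i in sorted(chosen))
```

With nodes such as a `script` block before the focus paragraph and a `nav` block after it, the sketch returns only the surrounding paragraph text, which then serves as input for the extension and intercepting operations.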
28. The automatic quotation extraction device according to claim 17, characterized in that the complete semantic primitives are divided into:
expansion units, being text fragments delimited by the extension-type boundary symbols contained in the text; interception units, being text fragments delimited by the
intercepting-type boundary symbols contained in the text; and minimum semantic primitives, being the smallest units consisting of characters or character strings in the
text that have independent semantics; and the scale of the text fragments delimited by the extension-type boundary symbols is larger than the scale of the text fragments
delimited by the intercepting-type boundary symbols.
29. The automatic quotation extraction device according to claim 28, characterized in that the device stores the predefined types of extension-type boundary symbols in an
extension boundary symbol table, stores the predefined types of intercepting-type boundary symbols in an intercepting boundary symbol table, and stores the predefined
minimum semantic primitives in a minimum semantic primitive set.
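The two boundary symbol tables of claims 28 and 29 can be pictured with a short Python sketch. The concrete symbols are assumptions chosen for illustration (sentence-ending punctuation for the extension type, clause-level punctuation plus the sentence enders for the intercepting type), and `split_units` is a hypothetical helper name.

```python
# Assumed table contents; the claims leave the concrete symbols to configuration.
EXTENSION_BOUNDARIES = frozenset(".!?。！？")                          # larger scale
INTERCEPT_BOUNDARIES = EXTENSION_BOUNDARIES | frozenset(",;:，；：")  # smaller scale


def split_units(text: str, boundaries: frozenset) -> list[str]:
    """Split text into units that each end with a boundary symbol.

    The trailing piece is kept as a unit even if no boundary closes it.
    """
    units, start = [], 0
    for i, ch in enumerate(text):
        if ch in boundaries:
            units.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        units.append(text[start:])
    return units
```

Splitting the same text with both tables shows the difference in scale: `"A, b. C; d!"` yields two expansion units but four interception units. A real table would also need rules for symbols that are not always boundaries, such as the period inside a URL or a decimal number.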
30. The automatic quotation extraction device according to claim 28, characterized in that the content extraction module is configured to perform the following
operations:
an extension operation: taking the character or character string serving as the reading focus as the initial data, extracting text along the extension direction in units
of the expansion unit and adding it to the alternative text, until the length of the alternative text exceeds the lower limit of the predetermined length interval; then
judging whether the length of the alternative text exceeds the upper limit of the predetermined length interval and, if it does not exceed the upper limit, using the
alternative text as the extracted quotation text;
an intercept operation: if the alternative text obtained by the extension operation is longer than the upper limit of the predetermined length interval, intercepting the
alternative text along the intercepting direction in units of the interception unit, starting from the character at the head or tail of the alternative text that is not a
boundary symbol, until the length of the alternative text is less than the upper limit of the predetermined length interval;
a minimum semantic primitive extension and intercept operation: if, after the intercept operation, the length of the alternative text is less than the lower limit of the
predetermined length interval, extending the alternative text along the extension direction in units of the minimum semantic primitive, starting from the character at the
head or tail of the alternative text that is not a boundary symbol, until the length of the alternative text exceeds the lower limit of the predetermined length interval;
if, after the minimum semantic primitive extension, the length of the alternative text exceeds the upper limit of the predetermined length interval, intercepting the
alternative text along the intercepting direction in units of the minimum semantic primitive, starting from the character at the head or tail of the alternative text that
is not a boundary symbol; and, through repeated iterations of minimum semantic primitive extension and interception, obtaining an alternative text whose length lies within
the predetermined length interval as the quotation text.
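The three operations of claim 30 can be put together in a simplified Python sketch. It is not the patented implementation: the symbol tables `EXT` and `CUT`, the tokenizer, the inclusive treatment of the length interval, the simple head/tail alternation (in place of the ratio threshold of claim 32), the single-token focus, and the choice to start from the whole sentence containing the focus are all assumptions made to keep the example short.

```python
import re
from functools import partial

EXT = frozenset(".!?。！？")         # assumed extension-type boundary symbols
CUT = EXT | frozenset(",;:，；：")   # assumed intercepting-type boundary symbols


def tokenize(text):
    """Minimum units: an alphanumeric run or one other character, each keeping
    its trailing whitespace so that the tokens rejoin losslessly."""
    return re.findall(r"[A-Za-z0-9]+\s*|[^A-Za-z0-9]\s*", text)


def grow(toks, is_end, lo, hi, head):
    """Move one edge of toks[lo:hi] outward by one unit closed by is_end."""
    if head:
        j = lo - 1
        if j >= 0 and is_end(toks[j]):      # step over the previous unit's closer
            j -= 1
        while j >= 0 and not is_end(toks[j]):
            j -= 1
        return j + 1, hi
    j = hi
    while j < len(toks) and not is_end(toks[j]):
        j += 1
    return lo, min(j + 1, len(toks))


def shrink(toks, f_start, f_end, is_end, lo, hi, head):
    """Drop one unit from one edge without cutting into the focus."""
    if head:
        for j in range(lo, f_start):
            if is_end(toks[j]):
                return j + 1, hi
        return lo, hi
    for j in range(hi - 2, f_end - 2, -1):
        if is_end(toks[j]):
            return lo, j + 1
    return lo, hi


def run(op, keep_going, lo, hi, head):
    """Apply op while keep_going holds, alternating between head and tail."""
    while keep_going(lo, hi):
        for side in (head, not head):
            new = op(lo, hi, side)
            if new != (lo, hi):
                break
        else:
            return lo, hi, head             # neither side can move any further
        (lo, hi), head = new, not side
    return lo, hi, head


def extract_quotation(text, focus, min_len, max_len):
    toks = tokenize(text)
    f_start = next(i for i, t in enumerate(toks) if t.strip() == focus)
    f_end = f_start + 1
    is_ext = lambda t: t.strip() in EXT
    is_cut = lambda t: t.strip() in CUT
    is_min = lambda t: True                 # every token closes a minimum unit
    size = lambda lo, hi: sum(len(t) for t in toks[lo:hi])
    short = lambda lo, hi: size(lo, hi) < min_len
    long_ = lambda lo, hi: size(lo, hi) > max_len

    # Start from the sentence-scale unit that contains the focus.
    lo = f_start
    while lo > 0 and not is_ext(toks[lo - 1]):
        lo -= 1
    lo, hi = grow(toks, is_ext, lo, f_end, False)
    # Extension operation: add expansion units until the lower limit is met.
    lo, hi, head = run(partial(grow, toks, is_ext), short, lo, hi, False)
    # Intercept operation: drop interception units while above the upper limit.
    lo, hi, head = run(partial(shrink, toks, f_start, f_end, is_cut), long_, lo, hi, head)
    # Minimum-unit extension and interception, iterated a bounded number of times.
    for _ in range(len(toks)):
        if not (short(lo, hi) or long_(lo, hi)):
            break
        lo, hi, head = run(partial(grow, toks, is_min), short, lo, hi, head)
        lo, hi, head = run(partial(shrink, toks, f_start, f_end, is_min), long_, lo, hi, head)
    return "".join(toks[lo:hi]).strip()
```

For the text `"It rained all day. We stayed inside, played cards, and read books. Then we slept."` with focus `"cards"` and an interval of 20 to 40 characters, the containing sentence is too long, so the last clause is intercepted and the sketch returns `"We stayed inside, played cards,"`. With an upper limit of 60 the whole sentence is returned unchanged.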
31. The automatic quotation extraction device according to claim 30, characterized in that the direction detection module of the device uses an extension direction flag
bit to identify whether the extension direction is head extension or tail extension, and uses an intercepting direction flag bit to identify whether the intercepting
direction is head intercepting or tail intercepting.
32. The automatic quotation extraction device according to claim 30, characterized in that, each time the content extraction module has extended an expansion unit or a
minimum semantic primitive, and each time the content extraction module has intercepted an interception unit or a minimum semantic primitive, the direction detection
module of the device decides whether to change the extension direction or the intercepting direction according to whether the ratio of the text lengths located before and
after the character or character string serving as the reading focus in the alternative text reaches a predetermined direction change threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410301560.0A CN104050158B (en) | 2014-06-27 | 2014-06-27 | Automatic quotation extraction method and device with semantic integrity kept |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410301560.0A CN104050158B (en) | 2014-06-27 | 2014-06-27 | Automatic quotation extraction method and device with semantic integrity kept |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104050158A (en) | 2014-09-17 |
CN104050158B (en) | 2017-05-17 |
Family
ID=51503012
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201410301560.0A | Active | 2014-06-27 | 2014-06-27 | Automatic quotation extraction method and device with semantic integrity kept |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104050158B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105068992B (en) * | 2015-07-29 | 2019-04-26 | 魅族科技(中国)有限公司 | A kind of search result display methods and device |
US11500535B2 (en) | 2015-10-29 | 2022-11-15 | Lenovo (Singapore) Pte. Ltd. | Two stroke quick input selection |
CN105955616B (en) * | 2016-04-29 | 2019-05-07 | 北京小米移动软件有限公司 | A kind of method and apparatus for choosing document content |
CN106970847A (en) * | 2017-03-28 | 2017-07-21 | 飞驰镁物(北京)信息服务有限公司 | Content cuts Tibetan method and system, user terminal and server |
CN108388664A (en) * | 2018-03-14 | 2018-08-10 | 深圳市网域科技股份有限公司 | Integration method, device, computer equipment and the storage medium of sentence segment |
CN114817520A (en) * | 2021-01-19 | 2022-07-29 | 华为技术有限公司 | Method and device for determining abstract of search result and electronic equipment |
CN115080170A (en) * | 2022-06-29 | 2022-09-20 | 维沃移动通信有限公司 | Information processing method, information processing apparatus, and electronic device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101201841A (en) * | 2007-02-15 | 2008-06-18 | 刘二中 | Convenient method and system for electronic text-processing and searching |
CN101539904A (en) * | 2009-04-21 | 2009-09-23 | 武汉大学 | Automatic indexing method of quotations |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8645125B2 (en) * | 2010-03-30 | 2014-02-04 | Evri, Inc. | NLP-based systems and methods for providing quotations |
Non-Patent Citations (1)
Title |
---|
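Precise extraction of link-context-related text from web pages; Xu Qingyang (徐晴阳); China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology; 2004-12-15 (No. 4); I139-329 * |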
Also Published As
Publication number | Publication date |
---|---|
CN104050158A (en) | 2014-09-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104050158B (en) | Automatic quotation extraction method and device with semantic integrity kept | |
WO2017177809A1 (en) | Word segmentation method and system for language text | |
CN109033282B (en) | Webpage text extraction method and device based on extraction template | |
CN106776555B (en) | A kind of comment text entity recognition method and device based on word model | |
TW201804341A (en) | Character string segmentation method, apparatus and device | |
CN106055667A (en) | Method for extracting core content of webpage based on text-tag density | |
US10599748B2 (en) | Systems and methods for asymmetrical formatting of word spaces according to the uncertainty between words | |
JP2010134922A (en) | Similar word determination method and system | |
CN111199151A (en) | Data processing method and data processing device | |
US11074306B2 (en) | Web content extraction method, device, storage medium | |
US9996508B2 (en) | Input assistance device, input assistance method and storage medium | |
CN111435405A (en) | Method and device for automatically labeling key sentences of article | |
CN115546815A (en) | Table identification method, device, equipment and storage medium | |
CN108664522A (en) | Web page processing method and device | |
JP2009265770A (en) | Significant sentence presentation system | |
KR102182248B1 (en) | System and method for checking grammar and computer program for the same | |
CN108132919A (en) | A kind of method of webpage content extraction | |
KR20220113075A (en) | Word cloud system based on korean noun extraction tokenizer | |
JP2006053866A (en) | Detection method of notation variability of katakana character string | |
KR101909537B1 (en) | System and method for classifying social data | |
JP2008225566A (en) | Device and method for extracting related information | |
CN111444716A (en) | Title word segmentation method, terminal and computer readable storage medium | |
KR101634681B1 (en) | Method and program for searching quoted phrase in document | |
JP2007148630A (en) | Patent analyzing device, patent analyzing system, patent analyzing method and program | |
JP2014112306A (en) | Demand sentence extract device, demand content identification model learning device, method and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C53 | Correction of patent for invention or patent application | ||
CB02 | Change of applicant information | Address after: Yuhua Road, Qinhuai District of Nanjing City, Jiangsu province 210000 No. 22 treasure garden 22-302; Applicant after: Wu Taojun; Address before: 200000 West Yan'an Road 900 Road, Changning District, Shanghai; Applicant before: Wu Taojun |
GR01 | Patent grant | ||