CN104050158B - Automatic quotation extraction method and device with semantic integrity kept - Google Patents

Automatic quotation extraction method and device with semantic integrity kept

Info

Publication number
CN104050158B
Authority
CN
China
Prior art keywords
text
quotation
alternative
semantic primitive
character
Legal status
Active
Application number
CN201410301560.0A
Other languages
Chinese (zh)
Other versions
CN104050158A
Inventor
吴涛军
Current Assignee
Individual
Original Assignee
Individual
Application filed by Individual
Priority to CN201410301560.0A
Publication of CN104050158A
Publication of CN104050158B

Abstract

The invention provides a method and device for automatically extracting a quotation. Taking a character or character string in a text that serves as the reading focus as the center, the method and device automatically extract its context. The length of the extracted quotation stays within a predetermined length range and the quotation remains semantically complete, so that a passage of suitable length and complete meaning, centered on the selected character or character string, can be extracted from the text. This lets a user conveniently recover the correct meaning of the reading focus in its context.

Description

An automatic quotation extraction method and device that preserve semantic integrity
Technical field
The application relates to text analysis and extraction technology, and more particularly to an automatic quotation extraction method and device that preserve semantic integrity.
Background
In structured electronic documents, many applications need to extract a quotation centered on a reading focus, that is, a keyword, phrase, or sentence selected manually by the user or automatically according to predetermined rules (such as matching rules). For example, while reading a document such as a web page, a user may use a marking tool to select reading focuses of interest for later reference. When the user wants to share these reading focuses through a social network such as a microblog, the marked keywords, phrases, and sentences alone are not enough for a reader to reconstruct the context in which the reading focus appears or to understand its meaning, so the context of the reading focus must be extracted to form a complete quotation. A similar need arises when the user wants to save reading excerpts based on the marked reading focuses. Quotation extraction is therefore an indispensable basic technology for many products and applications built on structured electronic documents.
For example, Chinese patent application publication CN102955820A discloses a system and method for accumulating foreign-language vocabulary, in which a user can mark vocabulary while reading a foreign-language text electronically, and the system submits the context paragraph containing the vocabulary mark to a back-end service subsystem for storage. However, that technical solution extracts the whole paragraph containing the mark as the quotation, and this paragraph may be too long. In most application environments the text length of a quotation is limited, and a quotation extracted in units of paragraphs may exceed the limit, so the technique is not generally applicable where quotation length is restricted. Moreover, if the extracted paragraph is too long, the mark that serves as the reading focus is not prominent enough within the quotation, which impairs readability.
Chinese patent publication CN101192231B discloses a method for bookmarking a specific part of a resource in a data processing system. In response to a request to bookmark the current screen of the resource, the method collects screen context information from the actual text of the current screen and stores the address information of the resource together with the screen context information as a bookmark for returning to that specific part. This solution extracts context in units of screens, so the quotation text may likewise be too long for some application environments. Furthermore, compared with extraction in units of paragraphs, extraction in units of screens makes it harder to guarantee the semantic integrity of the quotation, because the text on the top or bottom line of the screen may be only part of a sentence whose remainder lies off screen. A quotation obtained this way may contain incomplete sentences or even incomplete words, which seriously impairs its readability.
The prior art also includes solutions that form a quotation from the marked object together with the content of the web page elements immediately before and after it on the current page, such as Chinese patent publication CN101866342. Extraction in units of web page elements clearly suffers from the same problems: the quotation may be too long or semantically incomplete.
To satisfy limits on quotation length, existing quotation extraction methods and devices also include schemes that simply cut the text by character count, for example taking a few dozen characters before and after the reading focus to form the quotation. The obvious defect of this approach is that the resulting quotation often lacks semantic integrity: half of a sentence may be included while the other half is left out, or a word may even be cut in two, leaving the reader confused. In some cases such truncation also makes the quotation unusable. For example, if the text contains information such as an e-mail address, a URL, or a telephone number and the quotation truncates it, the quotation has no practical value.
It can be seen that existing quotation extraction techniques cannot keep the quotation semantically complete while holding its length within a threshold, nor avoid cutting through indivisible strings such as complete sentences, words, and e-mail addresses. Their results therefore do not meet users' needs.
Summary of the invention
In view of the above state and defects of the prior art, the invention provides an automatic quotation extraction method and device. Taking the character or character string in a text that serves as the reading focus as the center, the invention automatically extracts its context, keeps the length of the extracted quotation within a predetermined length interval, and keeps the quotation semantically complete. A passage of suitable length and complete meaning, centered on the reading focus, can thus be extracted from the text, making it easy for the user to recover the correct meaning of the reading focus in its context.
The automatic quotation extraction method of the invention is characterized by comprising:
A focus setting step of selecting, from the text, a character or character string as the reading focus;
A context extraction step of extracting the context centered on the reading focus by extending and/or intercepting the text in units of complete semantic units, thereby obtaining a quotation text whose length lies within a predetermined length interval.
According to a first aspect of the invention, the complete semantic units preferably include: text fragments of various scales delimited by the different types of boundary symbols contained in the text, and minimum semantic units composed of characters or character strings that have independent meaning in the text. The types of boundary symbols and the set of minimum semantic units are predefined in a symbol table. More preferably, the minimum semantic units include: English words, Chinese characters, URL addresses, e-mail addresses, time formats, text fragments between paired punctuation marks, and text fragments with a specific font format.
The context extraction step includes: an extension step of selecting alternative text, starting from the character or character string that serves as the reading focus and proceeding along the extension direction, in units of larger-scale complete semantic units delimited by certain specific types of boundary symbols; an interception step of trimming the alternative text, along the interception direction, in units of smaller-scale complete semantic units delimited by other specific types of boundary symbols; and a minimum-semantic-unit extension and interception step of extending the alternative text along the extension direction and/or trimming it along the interception direction, in units of minimum semantic units, after the extension and interception steps. In all three steps, whether to change the extension direction and interception direction is decided according to whether the ratio of the text lengths before and after the reading focus in the alternative text reaches a predetermined direction-change threshold.
Preferably, before the context extraction step, the automatic quotation extraction method further includes a step of predefining the predetermined length interval for the quotation text.
Preferably, before the context extraction step, the automatic quotation extraction method further includes: an initial extraction step of extracting an initial alternative text that lies between effective structural nodes of the text and contains the reading focus; and a text analysis step of analyzing the initial alternative text to determine the boundary symbol types and the set of minimum semantic units used to divide the complete semantic units. In the text analysis step, the boundary symbol types and the set of minimum semantic units are determined according to the language of the initial alternative text, which can be judged from the ratio of Chinese to English characters in it. More preferably, the length of the initial alternative text extracted in the initial extraction step lies within an allowed length interval for the alternative quotation, and this allowed length interval is calculated from the predetermined length interval.
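As a rough illustration of the language judgment described above, the sketch below classifies a text by the share of Chinese characters among its Chinese and English letters and then picks a symbol table. The function names, the 0.5 threshold, and the table contents are assumptions made for this example; the source only states that the ratio of Chinese to English characters is used.

```python
# Hypothetical per-language boundary symbols; the source only says that
# such tables are predefined, not what they contain.
SYMBOL_TABLES = {
    "zh": {"sentence": "。？！", "clause": "，；"},
    "en": {"sentence": ".?!", "clause": ",;"},
}


def detect_language(text, threshold=0.5):
    """Return 'zh', 'en', or 'unknown' from the Chinese/English character ratio."""
    chinese = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    english = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = chinese + english
    if total == 0:
        return "unknown"
    # The 0.5 cut-off is an assumed value, not one given in the source.
    return "zh" if chinese / total >= threshold else "en"


def symbol_table_for(text):
    """Pick the symbol table for the detected language, defaulting to English."""
    return SYMBOL_TABLES.get(detect_language(text), SYMBOL_TABLES["en"])
```

Counting only letters and ideographs keeps digits, spaces, and punctuation from skewing the ratio; a real implementation might weigh mixed-language text differently.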
More preferably, the initial extraction step specifically includes: taking the structural node corresponding to the reading focus as the starting point, traversing the structural nodes before and after it, excluding the invalid structural nodes among them together with the text they contain, and selecting as the initial alternative text the text that lies between effective structural nodes and whose length is within the allowed length interval for the alternative quotation. The types of effective structural nodes are predefined in an effective node table.
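The traversal described above can be sketched as follows, under simplifying assumptions: the structured document is modeled as a flat list of `(tag, text)` pairs in document order, the effective node table is a small set of tag names, and traversal stops once an assumed target length is reached. None of these names or choices come from the source, which specifies only that invalid nodes and their text are excluded.

```python
# Hypothetical effective node table: tag types whose text may be collected.
EFFECTIVE_NODES = {"p", "a"}


def initial_alternative_text(nodes, focus_index, target_len):
    """Collect text around nodes[focus_index] from effective nodes only.

    nodes: list of (tag, text) pairs in document order (an assumed model).
    target_len: assumed stop condition; collection ends once it is reached.
    """
    text = nodes[focus_index][1]
    before, after = focus_index - 1, focus_index + 1
    while len(text) < target_len and (before >= 0 or after < len(nodes)):
        if after < len(nodes):
            tag, piece = nodes[after]
            after += 1
            if tag in EFFECTIVE_NODES:      # invalid nodes are skipped
                text = text + piece
        if len(text) < target_len and before >= 0:
            tag, piece = nodes[before]
            before -= 1
            if tag in EFFECTIVE_NODES:
                text = piece + text
    return text
```

For example, with nodes `[("p", "第一段。"), ("p", "第二段含有焦点。"), ("p", "第三段。"), ("a", "（微博）"), ("img", "logo.png")]` and the focus in the second node, the image node contributes nothing while the link text is kept.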
According to a second aspect of the invention, the complete semantic units can preferably be divided into: extension units, which are text fragments delimited by extension-type boundary symbols contained in the text; interception units, which are text fragments delimited by interception-type boundary symbols contained in the text; and minimum semantic units, which are the smallest units composed of characters or character strings that have independent meaning in the text. The scale of a text fragment delimited by extension-type boundary symbols is larger than that of a text fragment delimited by interception-type boundary symbols.
More preferably, the types of extension-type boundary symbols are predefined in an extension boundary symbol table, the types of interception-type boundary symbols are predefined in an interception boundary symbol table, and the minimum semantic units are predefined in a minimum semantic unit set.
The context extraction step specifically includes:
An extension operation: taking the character or character string that serves as the reading focus as the initial data, text is extracted along the extension direction in units of extension units and added to the alternative text until the length of the alternative text exceeds the lower limit of the predetermined length interval. It is then determined whether the length of the alternative text exceeds the upper limit of the predetermined length interval; if it does not, the alternative text is taken as the extracted quotation text;
An interception operation: if the alternative text obtained by the extension operation exceeds the upper limit of the predetermined length interval, then, starting from a character at the head or tail of the alternative text that is not a boundary symbol, the alternative text is trimmed along the interception direction in units of interception units until its length is below the upper limit of the predetermined length interval;
A minimum-semantic-unit extension and interception operation: if the length of the alternative text after the interception operation is below the lower limit of the predetermined length interval, then, starting from a character at the head or tail of the alternative text that is not a boundary symbol, the alternative text is extended along the extension direction in units of minimum semantic units until its length exceeds the lower limit. If, after this extension, the length exceeds the upper limit of the predetermined length interval, the alternative text is trimmed along the interception direction in units of minimum semantic units, again starting from a non-boundary character at its head or tail. Repeated extension and interception in units of minimum semantic units yields an alternative text whose length lies within the predetermined length interval, which is taken as the quotation text.
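The first two operations above can be sketched as follows. This is a simplified illustration, not the patented procedure itself: the boundary symbol sets are assumed, directions simply alternate instead of following the direction-change threshold, the minimum-semantic-unit stage is omitted, and all function names are hypothetical. Windows are tracked as character offsets into the full text.

```python
import re

SENTENCE = "。？！.?!"           # assumed extension-type boundary symbols
CLAUSE = SENTENCE + "，；,;"      # assumed interception-type boundary symbols


def spans(text, boundaries):
    """Character spans of units, each ending at a boundary symbol if present."""
    cls = re.escape(boundaries)
    return [m.span() for m in re.finditer(f"[^{cls}]+[{cls}]?|[{cls}]", text)]


def extend(start, end, units, lower):
    """Grow the window by whole units, tail first, until it exceeds `lower`."""
    tail = True
    while end - start <= lower:
        ends = [e for _, e in units if e > end]
        starts = [s for s, _ in units if s < start]
        if not ends and not starts:
            break                          # the whole text is already selected
        if (tail and ends) or not starts:
            end = min(ends)                # extend to the next unit end
        else:
            start = max(starts)            # extend to the previous unit start
        tail = not tail                    # simple alternation (an assumption)
    return start, end


def intercept(start, end, units, upper, fs, fe):
    """Drop whole units from the ends, never cutting into the focus [fs, fe)."""
    tail = True
    while end - start > upper:
        tail_cuts = [s for s, _ in units if fe <= s < end]
        head_cuts = [e for _, e in units if start < e <= fs]
        if not tail_cuts and not head_cuts:
            break                          # nothing left to remove safely
        if (tail and tail_cuts) or not head_cuts:
            end = max(tail_cuts)
        else:
            start = min(head_cuts)
        tail = not tail
    return start, end


def extract_quotation(text, fs, fe, lower, upper):
    """Extend by sentences, then trim by clauses if the result is too long."""
    start, end = extend(fs, fe, spans(text, SENTENCE), lower)
    if end - start > upper:
        start, end = intercept(start, end, spans(text, CLAUSE), upper, fs, fe)
    return text[start:end]
```

Because both stages move only to unit boundaries, the result never ends in the middle of a sentence or clause, which is the property the operations above are meant to guarantee. A period inside a URL or number would be treated as a boundary here; the minimum semantic units described in the source exist to prevent exactly that.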
In the extension operation, the interception operation, and the minimum-semantic-unit extension and interception operation, an extension direction flag indicates whether the extension direction is toward the head or toward the tail, and an interception direction flag indicates whether interception takes place at the head or at the tail.
More preferably, the context extraction step further includes: in the above extension operation, interception operation, and minimum-semantic-unit extension and interception operation, each time an extension unit or minimum semantic unit is added, and each time an interception unit or minimum semantic unit is removed, deciding whether to change the extension direction or interception direction according to whether the ratio of the text lengths before and after the reading focus in the alternative text reaches the predetermined direction-change threshold.
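One possible reading of the direction-change rule is sketched below: when the text on the side currently being extended has grown to a given multiple of the text on the other side, the direction flag flips. The function name, the string flags, and the threshold value of 2.0 are assumptions for illustration; the source does not state how the ratio is computed or what threshold is used.

```python
def next_direction(before_len, after_len, current, threshold=2.0):
    """Return the direction flag ('head' or 'tail') for the next operation.

    before_len / after_len: lengths of alternative text before and after
    the reading focus. The ratio test and threshold are assumed.
    """
    if current == "tail" and after_len >= threshold * max(before_len, 1):
        return "head"      # tail side dominates, switch to the head
    if current == "head" and before_len >= threshold * max(after_len, 1):
        return "tail"      # head side dominates, switch to the tail
    return current
```

A check of this kind would run after every unit is added or removed, keeping the reading focus roughly centered in the quotation.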
The invention also provides an automatic quotation extraction device, characterized by comprising:
A focus setting module for selecting, from the text, a character or character string as the reading focus;
A content extraction module for extracting the context centered on the reading focus by extending and/or intercepting the text in units of complete semantic units of various scales, thereby obtaining a quotation text whose length lies within the predetermined length interval.
According to a first aspect of the invention, the complete semantic units preferably include: text fragments of various scales delimited by the different types of boundary symbols contained in the text, and minimum semantic units composed of characters or character strings that have independent meaning in the text.
More preferably, the automatic quotation extraction device further includes a symbol table for storing the predefined types of boundary symbols and the set of minimum semantic units.
More preferably, the minimum semantic units include: English words, Chinese characters, URL addresses, e-mail addresses, time formats, text fragments between paired punctuation marks, and text fragments with a specific font format.
Preferably, the content extraction module performs the following operations: an extension operation of selecting alternative text, starting from the character or character string that serves as the reading focus and proceeding along the extension direction, in units of larger-scale complete semantic units delimited by certain specific types of boundary symbols; an interception operation of trimming the alternative text, along the interception direction, in units of smaller-scale complete semantic units delimited by other specific types of boundary symbols; and a minimum-semantic-unit extension and interception operation of extending the alternative text along the extension direction and/or trimming it along the interception direction, in units of minimum semantic units, after the extension and interception operations.
Preferably, the automatic quotation extraction device further includes a direction detection module for deciding whether to change the extension direction and interception direction according to whether the ratio of the text lengths before and after the reading focus in the alternative text reaches a predetermined direction-change threshold.
Preferably, the automatic quotation extraction device further includes a quotation length setting module for predefining the predetermined length interval for the quotation text.
Preferably, the content extraction module also extracts an initial alternative text that lies between effective structural nodes of the text and contains the reading focus, and the automatic quotation extraction device further includes a text analysis module that analyzes the initial alternative text to determine the boundary symbol types and the set of minimum semantic units used to divide the complete semantic units.
More preferably, the text analysis module determines the boundary symbol types and the set of minimum semantic units according to the language of the initial alternative text, which it can judge from the ratio of Chinese to English characters in the initial alternative text.
More preferably, the length of the initial alternative text extracted by the content extraction module lies within an allowed length interval for the alternative quotation, and the quotation length setting module of the device calculates this allowed length interval from the predetermined length interval.
More preferably, the content extraction module takes the structural node corresponding to the reading focus as the starting point, traverses the structural nodes before and after it, excludes the invalid structural nodes among them together with the text they contain, and selects as the initial alternative text the text that lies between effective structural nodes and whose length is within the allowed length interval for the alternative quotation.
More preferably, the automatic quotation extraction device further includes an effective node table for storing the predefined types of effective structural nodes.
According to a second aspect of the invention, the complete semantic units can preferably be divided into: extension units, which are text fragments delimited by extension-type boundary symbols contained in the text; interception units, which are text fragments delimited by interception-type boundary symbols contained in the text; and minimum semantic units, which are the smallest units composed of characters or character strings that have independent meaning in the text. The scale of a text fragment delimited by extension-type boundary symbols is larger than that of a text fragment delimited by interception-type boundary symbols.
More preferably, the automatic quotation extraction device stores the predefined types of extension-type boundary symbols in an extension boundary symbol table, the predefined types of interception-type boundary symbols in an interception boundary symbol table, and the predefined minimum semantic units in a minimum semantic unit set.
More preferably, the content extraction module performs the following operations:
An extension operation: taking the character or character string that serves as the reading focus as the initial data, text is extracted along the extension direction in units of extension units and added to the alternative text until the length of the alternative text exceeds the lower limit of the predetermined length interval. It is then determined whether the length of the alternative text exceeds the upper limit of the predetermined length interval; if it does not, the alternative text is taken as the extracted quotation text;
An interception operation: if the alternative text obtained by the extension operation exceeds the upper limit of the predetermined length interval, then, starting from a character at the head or tail of the alternative text that is not a boundary symbol, the alternative text is trimmed along the interception direction in units of interception units until its length is below the upper limit of the predetermined length interval;
A minimum-semantic-unit extension and interception operation: if the length of the alternative text after the interception operation is below the lower limit of the predetermined length interval, then, starting from a character at the head or tail of the alternative text that is not a boundary symbol, the alternative text is extended along the extension direction in units of minimum semantic units until its length exceeds the lower limit. If, after this extension, the length exceeds the upper limit of the predetermined length interval, the alternative text is trimmed along the interception direction in units of minimum semantic units, again starting from a non-boundary character at its head or tail. Repeated extension and interception in units of minimum semantic units yields an alternative text whose length lies within the predetermined length interval, which is taken as the quotation text.
More preferably, the direction detection module of the automatic quotation extraction device uses an extension direction flag to indicate whether the extension direction is toward the head or toward the tail, and an interception direction flag to indicate whether interception takes place at the head or at the tail.
More preferably, each time the content extraction module adds an extension unit or minimum semantic unit, and each time it removes an interception unit or minimum semantic unit, the direction detection module decides whether to change the extension direction or interception direction according to whether the ratio of the text lengths before and after the reading focus in the alternative text reaches the predetermined direction-change threshold.
The beneficial effects of the invention include the following. Starting from a reading focus that the user selects in the text, the invention automatically extracts the context centered on that focus to form a quotation fragment. The quotation fragment is semantically complete, which improves its readability and lets the reader correctly reconstruct the complete semantic scene, overcoming the prior-art defect that text fragments with complete meaning are cut off midway and the quotation becomes hard to read and use. The length of the extracted quotation fragment satisfies the predetermined length interval, which makes the quotation suitable for a wide range of application environments.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the automatic quotation extraction method of the invention;
Fig. 2 is a schematic diagram of a target text on which automatic quotation extraction is to be performed;
Fig. 3 is a schematic diagram of the reading focus selected in the target text and the predetermined length interval that has been set;
Fig. 4 is a schematic diagram of the structured document of the target text in an embodiment of the invention;
Fig. 5 is a detailed flow chart of the context extraction step;
Fig. 6 is a schematic diagram of the alternative text after the extension operation;
Fig. 7 is a schematic diagram of the page showing the quotation text finally obtained;
Fig. 8 is a schematic structural diagram of the automatic quotation extraction device of the invention.
Detailed description of the embodiments
Preferred embodiments of the invention are described in detail below with reference to the accompanying drawings. The purpose of describing the preferred embodiments is to show the features and beneficial effects of each aspect of the invention more fully. The preferred embodiments are therefore only examples and should not be understood as limiting the scope of the invention. The scope of protection of the invention is defined by the claims.
The invention is a method and device that perform context extraction in units of complete semantic units. The meaning of a complete semantic unit is introduced first. A complete semantic unit is a text fragment with independent and complete meaning; it is a unit of expression inherent in natural language. For example, in Chinese, punctuation marks such as the full stop ("。"), question mark ("？"), and exclamation mark ("！") serve as boundaries that divide a text into whole sentences. Each whole sentence, as a text fragment of the full text, expresses a complete semantic scene and constitutes a complete semantic unit. The comma ("，") or semicolon ("；") in turn serves as a boundary that divides a whole sentence into clauses; a clause also expresses a relatively complete semantic scene and, as a text fragment, likewise constitutes a complete semantic unit. Similarly, in English, complete semantic units can be divided using the period (".") or comma (",") as boundaries. Text thus naturally contains various boundary symbols, of which the above punctuation marks are examples, and the text fragments delimited by boundary symbols are the complete semantic units, which naturally have relatively complete meaning. Of course, not all punctuation marks can serve as boundary symbols. Opening and closing quotation marks, parentheses, book title marks, and the pause mark (enumeration comma) are not meant to divide whole sentences or clauses and therefore generally cannot be used as boundary symbols. To identify boundary symbols correctly, the invention provides a symbol table that stores the predefined symbol types that can serve as boundary symbols. In the context extraction step described below, symbols in the text are compared against the symbol table to determine which of them are boundary symbols, and the text fragments they delimit are treated as complete semantic units.
Complete semantic units delimited by different types of boundary symbols may have different scales. In the example above, a text fragment delimited by boundary symbols such as the full stop ("。"), question mark ("？"), or exclamation mark ("！") is a whole sentence and has a larger scale, whereas a text fragment delimited by the comma ("，") or semicolon ("；") is a clause and has a smaller scale. "Scale" here refers to the level of semantic division, not to text length. A clause delimited by commas may contain far more characters than a whole sentence delimited by a full stop, yet in terms of scale the sentence is still the larger of the two.
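The two scales can be illustrated with a short sketch that splits a text first into whole sentences and then into clauses. The symbol sets and the function name are assumptions for this example; the source says only that boundary symbol types are predefined in a symbol table.

```python
import re

# Hypothetical symbol tables for the two scales.
SENTENCE_BOUNDARIES = "。？！.?!"
CLAUSE_BOUNDARIES = "，；,;"


def split_units(text, boundaries):
    """Split text into units, each keeping its closing boundary symbol."""
    cls = re.escape(boundaries)
    pattern = f"[^{cls}]*[{cls}]?"
    # findall also yields an empty match at the end, which is dropped here.
    return [unit for unit in re.findall(pattern, text) if unit]
```

Splitting "今天下雨，我没出门。明天呢？" with `SENTENCE_BOUNDARIES` gives two whole sentences; splitting the first sentence again with `CLAUSE_BOUNDARIES` gives its two clauses. The clause scale is finer even where a clause happens to be longer than some sentence.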
A text fragment made up of a run of consecutive characters not separated by any boundary symbol can still be split into several complete semantic units. These are the minimum semantic units, composed of characters or character strings with independent meaning. For example, an English word or a Chinese character may constitute a minimum semantic unit. In addition, character strings such as URL addresses, e-mail addresses, and time formats obviously cannot be split further without destroying their integrity, so such strings also constitute minimum semantic units. Text fragments between paired punctuation marks (such as the text between opening and closing parentheses, between opening and closing quotation marks, or between opening and closing book title marks) form a whole and should not be split either, so they also constitute minimum semantic units. In some cases a text fragment with a specific font format also constitutes a minimum semantic unit. For example, in English, proper nouns such as personal names and place names are often set in a font different from the rest of the text, so a text fragment whose font differs from that of the surrounding text can be defined as a minimum semantic unit. Likewise, text fragments with specific formatting such as underline, bold, italic, or red highlighting can be defined as minimum semantic units. To identify minimum semantic units during text extraction, the invention stores the predefined minimum semantic units in a minimum semantic unit set, so that characters or character strings in the text that match a minimum semantic unit are treated as complete semantic units. In terms of scale, the minimum semantic unit is clearly the smallest.
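A minimal tokenizer for some of the minimum semantic units listed above might look like the sketch below. The regular expressions are deliberately rough assumptions (for instance, the URL pattern takes every printable ASCII character up to the next space), font-based units are not covered, and the names are hypothetical.

```python
import re

# Alternatives are tried in order, so indivisible strings come first.
MIN_UNIT = re.compile(
    r"https?://[!-~]+"                                      # URL address
    r"|[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"  # e-mail address
    r"|\d{1,2}:\d{2}(?::\d{2})?"                            # time such as 09:30
    r"|《[^》]*》|“[^”]*”|（[^）]*）|\([^)]*\)"              # paired punctuation
    r"|[A-Za-z]+(?:'[A-Za-z]+)?"                            # English word
    r"|[\u4e00-\u9fff]"                                     # one Chinese character
    r"|\d+"                                                 # run of digits
    r"|\S"                                                  # any other symbol
)


def min_units(text):
    """Return the minimum semantic units of text, skipping whitespace."""
    return MIN_UNIT.findall(text)
```

Extending or trimming a quotation one element of this list at a time can never split a word, an address, or a bracketed phrase, which is the purpose of the minimum semantic unit set.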
Fig. 1 is a schematic flowchart of the automatic quotation extraction method in an embodiment of the present invention. The method first includes a focus setting step 101. In this step, for the target text shown in Fig. 2, a single character or a continuous character string in the text may be selected as the reading focus. For example, the user can select a single character in the target text, or a keyword, phrase, or short sentence made up of a character string, as the reading focus; the length of the selected character string should be less than a predetermined threshold, for example no longer than eight characters. As shown in Fig. 3, the user selects the character string "structuring between information" in the target text as the reading focus. The user can select the reading focus directly while reading text such as a web page or an e-book, or can select it in other ways. For example, the user can enter the character or character string serving as the reading focus into a search engine as a search term; the search engine then automatically matches text containing the term and provides the user with a quotation centered on the reading focus.
While the character or character string serving as the reading focus is selected, in step 102 the user can predefine the upper limit and the lower limit of the length range of the final quotation text obtained by the method of the present invention; the upper limit and the lower limit together constitute the predetermined length interval. The box in Fig. 3 shows the upper limit of the length range of the final quotation text formed around the reading focus; in this example, the lower limit of the length range is the text length of the reading focus itself. The predetermined length interval may also be preset.
In the initial extraction step 103, the initial alternative text is extracted; it lies between effective structure nodes of the text and contains the character or character string serving as the reading focus. Most digitized text is a structured document, which contains the text content itself together with structure nodes such as tags. Fig. 4 is a schematic diagram of the structured document of the target text; the structured document is the source file of a web page, in which web page tags such as <p>, <a>, and <img> serve as the structure nodes. In this step, the structure node corresponding to the reading focus is first obtained according to the position of the reading focus in the structured document. In Fig. 4, the text containing the reading focus "structuring between information" follows the second <p> node of the structured document, so the structure node associated with the reading focus is determined to be the second <p> node. Taking this second <p> node as the starting point, the structure nodes before and after it are traversed in turn. In this example the traversal first proceeds toward the end of the document and successively encounters the third <p> node, an <a> node, and an <img> node. Whether each of these structure nodes is an effective structure node can be judged by querying the effective node table, which stores the types of all predefined effective structure nodes. In this example, the query shows that the <a> node is an effective structure node, so the text fragment "(microblogging)" between the <a> tags is added to the initial alternative text, whereas the <img> node is found not to be an effective structure node, so the <img> node and its content are excluded. By iterating toward the head and toward the tail in turn, the text contained between effective structure nodes is added to the initial alternative text until the length of the initial alternative text falls within the alternative quotation allowed length interval.
The alternative quotation allowed length interval defines the maximum and minimum text lengths permitted for the initial alternative quotation extracted from the text. Its upper limit can be calculated from the predetermined length interval defined for the final quotation text. For example, in a microblog sharing environment where the upper limit of the predetermined length interval is 120 characters, the upper limit of the alternative quotation allowed length interval is greater than 120 characters, for example several times 120 characters. Extracting an initial alternative quotation larger than the final quotation text gives the text analysis step 104 described below a fuller scope for analysis and thus helps ensure the accuracy of the analysis. In this example, the lower limit of the alternative quotation allowed length interval is likewise the text length of the reading focus itself.
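The traversal of step 103 can be sketched as follows. The sketch assumes that the structured document has already been flattened into a list of (tag, text) pairs in document order; the names `EFFECTIVE_NODES` and `initial_alternative_text`, the default multiple of 3, and the rule of stopping just before the allowed upper limit would be exceeded are all illustrative assumptions.

```python
EFFECTIVE_NODES = {"p", "a"}   # hypothetical effective node table


def initial_alternative_text(nodes, start, final_upper, factor=3):
    """Collect text around nodes[start] from effective structure nodes only.

    nodes: list of (tag, text) pairs in document order (an assumed model).
    The allowed upper limit is taken as factor times the final upper limit.
    """
    max_len = final_upper * factor
    text = nodes[start][1]
    back, fwd = start - 1, start + 1
    while back >= 0 or fwd < len(nodes):
        if fwd < len(nodes):                 # one step toward the tail
            tag, piece = nodes[fwd]
            fwd += 1
            if tag in EFFECTIVE_NODES:       # ineffective nodes are skipped
                if len(text) + len(piece) > max_len:
                    break
                text = text + piece
        if back >= 0:                        # one step toward the head
            tag, piece = nodes[back]
            back -= 1
            if tag in EFFECTIVE_NODES:
                if len(text) + len(piece) > max_len:
                    break
                text = piece + text
    return text


nodes = [("p", "AAAA"), ("p", "BBfocusBB"), ("p", "CCCC"),
         ("a", "(link)"), ("img", "logo.png")]
print(initial_alternative_text(nodes, 1, final_upper=10))
```

In the sample data the content of the `img` node never enters the result, in the same way that the <img> node is excluded in the embodiment.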
After the initial alternative text has been determined in step 103, the text analysis step 104 analyzes it to determine the language type of the target text. The language type in turn determines which types of boundary symbol, and which minimum semantic primitive set, are used to divide the complete semantic primitives in the context extraction step described below, which operates in units of complete semantic primitives. In this step, the respective proportions of Chinese characters and English characters in the extracted initial alternative text are counted. In the target text of this embodiment the proportion of Chinese characters is clearly larger, so the language type is determined to be Chinese. The boundary symbols that divide complete semantic primitives can differ between language types. For example, in Chinese, the full stop ("。"), the exclamation mark ("！"), and the question mark ("？") can serve as boundary symbols that delimit a text fragment with complete meaning, whereas in English the period (".") can serve as such a boundary symbol. The content of the minimum semantic primitive set naturally also differs between language types. In English, a large number of English words are minimum semantic primitives; in Chinese, apart from a small number of abbreviations formed from English character strings, the minimum semantic primitive set does not contain most other English words. Therefore, according to the language type detected in this step, different boundary symbol types and minimum semantic primitive sets can be selected for Chinese text and for English text.
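A minimal sketch of the character statistics of step 104 follows. The names `BOUNDARY_TABLES`, `detect_language`, and `tables_for` are hypothetical, ties are resolved in favor of Chinese by assumption, and the intercepting symbols beyond the comma are guesses; the text above only names the full stop, exclamation mark, and question mark for Chinese and the period for English.

```python
# Hypothetical per-language boundary symbol tables.
BOUNDARY_TABLES = {
    "zh": {"extension": {"。", "！", "？"}, "intercepting": {"，", "；"}},
    "en": {"extension": {"."}, "intercepting": {",", ";"}},
}


def detect_language(text):
    """Compare the counts of Chinese characters and English letters."""
    chinese = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    english = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return "zh" if chinese >= english else "en"


def tables_for(text):
    """Select the boundary symbol tables that match the detected language."""
    return BOUNDARY_TABLES[detect_language(text)]


print(detect_language("信息之间的结构化 RSS"))
print(tables_for("A quotation with 中文")["extension"])
```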
Next, in the context extraction step 105, with the reading focus as the center and complete semantic primitives of various scales as the units, the text is first extended in units of large-scale complete semantic primitives to obtain the alternative text; the alternative text is then intercepted in units of smaller-scale complete semantic primitives; and, through multiple iterations, extension and interception continue down to the minimum semantic primitive, which has the smallest scale. The quotation text finally extracted is centered on the character or character string serving as the reading focus and has a length within the predetermined length interval. During the extension and interception described above, an extension direction flag and an interception direction flag are set; they identify whether extension and interception, respectively, proceed toward the head or toward the tail, and thus determine the direction of each extension and each interception. Fig. 5 is a detailed flowchart of the context extraction step.
In the context extraction step, the extension operation of step 501 is performed first: starting from the selected reading focus, text fragments before or after the reading focus are chosen along the extension direction and added to the alternative text. Importantly, every extension operation works in units of large-scale complete semantic primitives; that is, each extension adds one complete semantic primitive to the alternative text. A large-scale complete semantic primitive here is a text fragment delimited by certain predefined types of boundary symbol. The specific types of boundary symbol used in the extension operation are called extension-type boundary symbols, and a text fragment delimited by extension-type boundary symbols is called an expansion unit. In this example, a whole sentence delimited by a full stop ("。") has a large scale and is a large-scale complete semantic primitive, so the full stop is used as an extension-type boundary symbol and the whole sentence it delimits is used as an expansion unit in the extension operation; a clause delimited by a comma ("，"), being relatively small in scale, is not used as an expansion unit. Which types of boundary symbol serve as extension-type boundary symbols is predefined; for ease of querying and management, the types of extension-type boundary symbol are stored in the extended boundary symbol table.
In the concrete execution of extension operation 501, with the reading focus as the starting point, the extension direction flag is first set to head extension or tail extension according to the offset of the reading focus within the initial alternative quotation. If the reading focus lies in the first half of the initial alternative quotation, the flag is set to tail extension; if it lies in the second half, the flag is set to head extension. In this example the reading focus "structuring between information" lies in the first half of the initial alternative quotation, so the extension direction flag is initially set to tail extension.
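The initial setting of the extension direction flag can be sketched in a few lines. The function name `initial_direction` and the use of the center of the focus to decide which half it lies in are assumptions; the text only speaks of the offset of the reading focus.

```python
def initial_direction(alternative_len, focus_start, focus_len):
    """Return "tail" if the focus lies in the first half, otherwise "head"."""
    focus_center = focus_start + focus_len / 2   # assumed measure of the offset
    return "tail" if focus_center < alternative_len / 2 else "head"


print(initial_direction(100, 10, 8))
print(initial_direction(100, 80, 8))
```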
Starting from the reading focus, characters are read one by one toward the tail, beginning with the first character after "structuring between information". When the first non-text character (such as a punctuation mark) is encountered, the extended boundary symbol table is queried to judge whether the character is an extension-type boundary symbol. If it is, the text fragment between the reading focus and this character is added to the alternative quotation as one expansion unit (the initial alternative text has already been cleared from it); if it is not, reading continues toward the tail until the first extension-type boundary symbol is encountered. Then, within the alternative text, the ratio of the text length before the reading focus to the text length after it is compared against a predetermined direction change threshold. If the threshold has been reached, the extension direction flag is changed to head extension; otherwise tail extension is kept. In this example, the first character after "structuring between information" is the full stop ("。"), an extension-type boundary symbol, so the text length before "structuring between information" (0 characters) is compared with the text length after it (1 character) to determine the extension direction; the result of the comparison is that tail extension continues.
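The direction test can be sketched as below. The text does not define the ratio exactly, so this sketch assumes an add-one smoothed ratio of the side currently being extended to the other side (which keeps the 0-character case well defined) and a hypothetical threshold of 3; the name `update_direction` is likewise illustrative.

```python
def update_direction(before_len, after_len, direction, threshold=3.0):
    """Flip the direction once the side being extended dominates the other side."""
    if direction == "tail":
        grown, other = after_len, before_len
    else:
        grown, other = before_len, after_len
    ratio = (grown + 1) / (other + 1)        # assumed, add-one smoothed ratio
    if ratio >= threshold:
        return "head" if direction == "tail" else "tail"
    return direction


# The case from the embodiment: 0 characters before the focus, 1 after it.
print(update_direction(0, 1, "tail"))
```

With these assumed values the call for the embodiment's first extension keeps tail extension, and a much longer tail side (for example 40 characters against 0) switches the flag to head extension.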
Further, it is judged whether the text length of the alternative quotation exceeds the lower limit of the predetermined length interval. If it does, the extension operation stops. If it does not, the current alternative quotation has not yet reached the required minimum quotation length, so the extension operation continues and the next expansion unit along the extension direction is added to the alternative quotation. In this example, extension continues toward the tail until the full stop ("。") after "accurate push" is reached. The text fragment delimited by this full stop as an extension-type boundary symbol, namely the whole sentence "In this process, the user can ... accurate push.", is thus added to the alternative text as an expansion unit. Boundary symbols encountered along the way, such as the enumeration comma and the comma, are not extension-type boundary symbols, so the text they delimit is included in this extension as part of the expansion unit.
After the above text fragment has been added, the judgment and adjustment of the extension direction described above are performed. The ratio of the text lengths before and after the reading focus is found to exceed the direction change threshold, so the extension direction flag is changed to head extension. The length of the current alternative text is found still not to exceed the lower limit of the predetermined length interval, so extension proceeds toward the head from the reading focus until the text fragment "Know in this period ... allow information and" has been added to the alternative text as an expansion unit. At this point the lower limit has been reached, so extension operation 501 ends and no further extension toward the tail or the head in units of expansion units is needed. The alternative text obtained through extension operation 501 is shown in Fig. 6.
Because extension proceeds in units of large-scale expansion units, the alternative text obtained through extension operation 501 may exceed the upper limit of the predetermined length interval. The length of the alternative text obtained through extension operation 501 is therefore checked: if it is not greater than the upper limit, the alternative text is used as the extracted quotation text; if it is greater than the upper limit, the interception operation 502 described below must be performed on the alternative text.
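Extension operation 501 as a whole can be sketched as a loop over expansion units. The sample text, the names `grow_tail`, `grow_head`, and `extend`, the smoothed ratio, the threshold of 3, and the default initial direction are assumptions made for illustration; only the overall control flow follows the description above.

```python
EXTENSION_SYMBOLS = set("。！？")   # hypothetical extended boundary symbol table


def grow_tail(text, end):
    """Index just past the next extension-type symbol at or after end."""
    i = end
    while i < len(text) and text[i] not in EXTENSION_SYMBOLS:
        i += 1
    return min(i + 1, len(text))


def grow_head(text, start):
    """Index of the first character of the expansion unit that ends at start."""
    i = start
    if i > 0 and text[i - 1] in EXTENSION_SYMBOLS:
        i -= 1                       # step over the symbol closing the previous unit
    while i > 0 and text[i - 1] not in EXTENSION_SYMBOLS:
        i -= 1
    return i


def extend(text, focus_start, focus_end, lower, upper, threshold=3.0):
    """Return (start, end, needs_interception) for the alternative text."""
    start, end, direction = focus_start, focus_end, "tail"
    while end - start <= lower and (start > 0 or end < len(text)):
        if direction == "tail" and end == len(text):
            direction = "head"       # nothing left toward the tail
        elif direction == "head" and start == 0:
            direction = "tail"       # nothing left toward the head
        if direction == "tail":
            end = grow_tail(text, end)
        else:
            start = grow_head(text, start)
        before, after = focus_start - start, end - focus_end
        grown, other = (after, before) if direction == "tail" else (before, after)
        if (grown + 1) / (other + 1) >= threshold:   # assumed smoothed ratio
            direction = "head" if direction == "tail" else "tail"
    return start, end, end - start > upper


sample = "aaaa。bbbFOCUSccc，ddd。eeee。"     # made-up text with Chinese punctuation
s, e, too_long = extend(sample, sample.index("FOCUS"), sample.index("FOCUS") + 5,
                        lower=14, upper=15)
print(sample[s:e], too_long)
```

The third element of the result corresponds to the check against the upper limit: when it is true, the alternative text still has to go through interception operation 502.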
Interception operation 502 intercepts the alternative text along the interception direction identified by the interception direction flag, in units of complete semantic primitives whose scale is smaller than that of the expansion units described above. Similarly, each interception removes one complete semantic primitive. A complete semantic primitive that is intercepted and has a relatively small scale is likewise a text fragment delimited by certain predefined types of boundary symbol. The specific types of boundary symbol used in the interception operation are called intercepting-type boundary symbols, and a text fragment delimited by intercepting-type boundary symbols is called an interception unit. In this example, a clause delimited by a comma ("，") serves as an interception unit; such a clause is clearly smaller in semantic scale than a whole sentence serving as an expansion unit. Which types of boundary symbol serve as intercepting-type boundary symbols is predefined and stored in the intercepting boundary symbol table for querying during execution.
In concrete execution, as in the extension operation, the interception direction flag is first set to head interception or tail interception according to the ratio of the text lengths before and after the reading focus. Because the reading focus lies toward the front of the current alternative text, the ratio leads to tail interception. Starting from the character "send", which is the last non-boundary-symbol character at the tail of the alternative text, characters are read one by one toward the head until the comma after "complicated information" is reached. A query shows that the comma ("，") is an intercepting-type boundary symbol stored in the intercepting boundary symbol table, so the text fragment after this comma is intercepted from the alternative text as one interception unit.
After each interception, whether the interception direction needs to change is judged by the method described above; it is then judged whether the length of the alternative text is already less than the upper limit of the predetermined length interval. If it is not less than the upper limit, interception continues along the interception direction in units of the interception units delimited by intercepting-type boundary symbols until the length is less than the upper limit; once it is less than the upper limit, interception operation 502 ends.
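Interception operation 502 can be sketched in the same style. The symbol table, the names `cut_tail`, `head_cut_len`, and `intercept`, the simplification of always cutting the longer side instead of applying a ratio threshold, and the choice to drop the boundary symbol when cutting at the head are assumptions made for illustration.

```python
INTERCEPT_SYMBOLS = set("，；。！？")   # hypothetical intercepting boundary symbol table


def cut_tail(text, focus_end):
    """Remove one interception unit from the tail, never cutting into the focus."""
    i = len(text)
    while i > focus_end and text[i - 1] in INTERCEPT_SYMBOLS:
        i -= 1            # reach the last character that is not a boundary symbol
    while i > focus_end and text[i - 1] not in INTERCEPT_SYMBOLS:
        i -= 1            # read toward the head until a boundary symbol
    return text[:i]


def head_cut_len(text, focus_start):
    """Number of characters to remove from the head for one interception unit."""
    i = 0
    while i < focus_start and text[i] in INTERCEPT_SYMBOLS:
        i += 1
    while i < focus_start and text[i] not in INTERCEPT_SYMBOLS:
        i += 1
    if i < focus_start:
        i += 1            # assumed: the boundary symbol is removed as well
    return i


def intercept(text, focus_start, focus_end, upper):
    """Cut interception units until the length is less than upper."""
    while len(text) >= upper:
        before, after = focus_start, len(text) - focus_end
        if after >= before and after > 0:     # assumed rule: cut the longer side
            text = cut_tail(text, focus_end)
        elif before > 0:
            n = head_cut_len(text, focus_start)
            text, focus_start, focus_end = text[n:], focus_start - n, focus_end - n
        else:
            break                             # only the focus itself is left
    return text, focus_start, focus_end


print(intercept("bbbFOCUSccc，ddd。", 3, 8, upper=14)[0])
```

As in the embodiment, a tail interception removes the text after the comma and leaves the comma itself at the end of the alternative text.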
After interception operation 502 is completed, it is judged whether the alternative text resulting from interception has again fallen below the lower limit of the predetermined length interval. In this example, the quotation text after the interception operation is not less than the lower limit, so it can be used as the final quotation text. In some cases, however, the alternative text after the interception operation may again be less than the lower limit of the predetermined length interval; the minimum semantic primitive extension and interception operation of step 503 must then be performed.
The meaning and examples of the minimum semantic primitive, and the minimum semantic primitive set that stores the predefined minimum semantic primitives, have been described in detail above. In step 503, taking the minimum semantic primitives defined in the minimum semantic primitive set as the units and starting from the non-boundary-symbol character at the head or tail of the alternative text, the alternative text is extended along the extension direction, one minimum semantic primitive per extension. After each extension, whether the extension direction needs adjusting is judged by the direction judgment described above, and it is judged again whether the length of the resulting alternative quotation exceeds the lower limit of the predetermined length interval; extension continues until the lower limit is exceeded. Because some minimum semantic primitives (such as a URL or an e-mail address) may contain many characters, if after a minimum semantic primitive extension the length of the alternative quotation is again greater than the upper limit of the predetermined length interval, the alternative quotation is intercepted along the interception direction in units of minimum semantic primitives, starting from the non-boundary-symbol character at the head or tail of the alternative text. Through multiple iterations of minimum semantic primitive extension and interception, an alternative text whose length lies within the predetermined length interval is finally obtained and used as the quotation text. Fig. 7 shows a page displaying the finally obtained quotation text.
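Step 503 can be approximated by the sketch below. It assumes that the text on each side of the alternative text has already been split into minimum semantic primitives, that the sides simply alternate instead of following the ratio test, and that a primitive which would push the length past the upper limit closes its side (extending by it and then intercepting it again would leave the text unchanged). The function name `extend_by_min_units` and the sample data are hypothetical.

```python
def extend_by_min_units(text, head_units, tail_units, lower, upper):
    """Grow text one minimum semantic primitive at a time until len(text) > lower.

    head_units: primitives before the text, in document order.
    tail_units: primitives after the text, in document order.
    """
    head, tail = list(head_units), list(tail_units)
    side = "tail"
    while len(text) <= lower and (head or tail):
        if side == "tail" and not tail:
            side = "head"
        elif side == "head" and not head:
            side = "tail"
        pool = tail if side == "tail" else head
        unit = pool[0] if side == "tail" else pool[-1]
        if len(text) + len(unit) > upper:
            pool.clear()              # assumed: an oversized primitive closes this side
            continue
        if side == "tail":
            text = text + pool.pop(0)
        else:
            text = pool.pop() + text
        side = "head" if side == "tail" else "tail"   # assumed simple alternation
    return text


print(extend_by_min_units("焦点", ["前", "文"],
                          ["后", "http://a.cn/long-url", "文"], lower=4, upper=10))
```

In the sample call the long URL is never split: it is either taken whole or not at all, which is the property the minimum semantic primitive is meant to guarantee.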
To implement the above method, the present invention also provides an automatic quotation extraction device. Fig. 8 is a schematic diagram of the overall structure of the device, which includes a focus setting module 801, a quotation length setting module 802, a text analysis module 803, a content extraction module 804, a direction detection module 805, an effective node table 806, an extended boundary symbol table 807, an intercepting boundary symbol table 808, and a minimum semantic primitive set 809. The focus setting module 801 is used to select, manually or automatically, a character or character string in the text as the reading focus. The quotation length setting module 802 is used to predefine, manually or automatically, the predetermined length interval for the quotation text and to calculate the alternative quotation allowed length interval from the predetermined length interval. Following the method described above, the content extraction module 804 queries the effective node types in the effective node table 806 and extracts the initial alternative text, which lies between effective structure nodes of the text and contains the character or character string serving as the reading focus. The text analysis module 803 judges the language type of the text by analyzing the initial alternative text, for example by counting Chinese and English characters. For the linguistic features of each language type, such as Chinese and English, the device can predefine and store a separate extended boundary symbol table 807, intercepting boundary symbol table 808, and minimum semantic primitive set 809. According to the language type obtained by the text analysis module 803, the content extraction module 804 selects the extended boundary symbol table 807, intercepting boundary symbol table 808, and minimum semantic primitive set 809 corresponding to that language type.
Further, following the method described above, the content extraction module 804 reads and analyzes the text and, by querying the extended boundary symbol table 807, the intercepting boundary symbol table 808, and the minimum semantic primitive set 809, performs the corresponding extension operation, interception operation, and minimum semantic primitive extension and interception operation in units of complete semantic primitives of different scales, namely expansion units, interception units, and minimum semantic primitives. While these operations are performed, the direction detection module 805 judges the extension or interception direction and updates the extension direction flag and the interception direction flag accordingly. Finally, an alternative text whose length lies within the predetermined length interval is obtained and used as the quotation text. The content extraction module 804 reserves interfaces for connecting to other product systems, which involves an association mechanism. Specifically, after a quotation has been pulled out of a paragraph of the original article, it is not a completely independent entity but remains associated with the original document, so the results of some operations on the quotation may affect the content of the original document. Through the interface module 810, the system can be coupled to such other product systems and provide them with the extracted quotation text.
The quotation text automatically extracted by the above method of the present invention has a length within the predetermined length range, and is therefore suited to application environments that strictly limit quotation length, such as microblogs and search engine result pages. Moreover, because every extension and interception operation works in units of complete semantic primitives, semantic integrity and readability are preserved to the greatest extent, which helps the reader correctly reconstruct the context around the reading focus and overcomes the defect in the prior art that a text fragment with complete meaning is cut off midway, impairing the reading and use of the quotation.

Claims (32)

1. An automatic quotation extraction method, characterized by comprising:
A focus setting step of selecting, from a text, a character or character string serving as a reading focus;
A context extraction step of extracting the context centered on the reading focus by text extension and/or interception performed in units of complete semantic primitives, so as to obtain a quotation text whose length lies within a predetermined length interval and which has semantic integrity;
Wherein the context extraction step comprises:
An extension step of choosing alternative text, starting from the character or character string serving as the reading focus and along an extension direction, in units of large-scale complete semantic primitives delimited by extension-type boundary symbols; and/or
An interception step of intercepting the alternative text, along an interception direction, in units of smaller-scale complete semantic primitives delimited by intercepting-type boundary symbols;
And
A minimum semantic primitive extension and interception step of, for the alternative text processed by the extension step and/or the interception step, extending the alternative text along the extension direction and/or intercepting it along the interception direction in units of minimum semantic primitives;
Wherein the extension-type boundary symbols and the intercepting-type boundary symbols are each boundary symbols of predefined types.
2. The automatic quotation extraction method according to claim 1, characterized in that the complete semantic primitives comprise: text fragments of various scales delimited by the different types of boundary symbol contained in the text, and minimum semantic primitives composed of characters or character strings in the text that have independent meaning.
3. The automatic quotation extraction method according to claim 2, characterized in that the types of the boundary symbols and the set of minimum semantic primitives are predefined by means of a symbol table.
4. The automatic quotation extraction method according to claim 2, characterized in that the minimum semantic primitives comprise: English words, Chinese characters, URL addresses, e-mail addresses, time formats, text fragments located between punctuation marks used in pairs, and text fragments with a specific font format.
5. The automatic quotation extraction method according to claim 1, characterized in that whether to change the extension direction and the interception direction is decided according to whether the ratio of the text lengths before and after the character or character string serving as the reading focus in the alternative text reaches a predetermined direction change threshold.
6. The automatic quotation extraction method according to claim 2, characterized in that the predetermined length interval for the quotation text is predefined.
7. The automatic quotation extraction method according to claim 2, characterized in that the method further comprises, before the context extraction step: an initial extraction step of extracting an initial alternative text that lies between effective structure nodes of the text and contains the character or character string serving as the reading focus; and a text analysis step of determining, by analyzing the initial alternative text, the boundary symbol types and the minimum semantic primitive set used to divide the complete semantic primitives.
8. The automatic quotation extraction method according to claim 7, characterized in that the boundary symbol types and the minimum semantic primitive set are determined according to the language type of the initial alternative text.
9. The automatic quotation extraction method according to claim 7, characterized in that the length of the initial alternative text extracted in the initial extraction step lies within an alternative quotation allowed length interval, and the alternative quotation allowed length interval is calculated from the predetermined length interval.
10. The automatic quotation extraction method according to claim 7, characterized in that the initial extraction step comprises: taking the structure node corresponding to the character or character string serving as the reading focus as a starting point, traversing the structure nodes before and after the starting point, excluding the ineffective structure nodes among them together with the text they contain, and then selecting, as the initial alternative text, text that lies between effective structure nodes and whose length lies within the alternative quotation allowed length interval.
11. The automatic quotation extraction method according to claim 10, characterized in that the types of the effective structure nodes are predefined by means of an effective node table.
12. The automatic quotation extraction method according to claim 1, characterized in that the complete semantic primitives are divided into: expansion units, being text fragments delimited by the extension-type boundary symbols contained in the text; interception units, being text fragments delimited by the intercepting-type boundary symbols contained in the text; and minimum semantic primitives, being the smallest units composed of characters or character strings in the text that have independent meaning; and the scale of a text fragment delimited by the extension-type boundary symbols is larger than that of a text fragment delimited by the intercepting-type boundary symbols.
13. The automatic quotation extraction method according to claim 12, characterized in that the types of the extension-type boundary symbols are predefined by means of an extended boundary symbol table, the types of the intercepting-type boundary symbols are predefined by means of an intercepting boundary symbol table, and the minimum semantic primitives are predefined by means of a minimum semantic primitive set.
14. The automatic quotation extraction method according to claim 12, characterized in that the context extraction step comprises: an extension operation of, taking the character or character string serving as the reading focus as the starting point, extracting text along the extension direction in units of the expansion units and adding it to the alternative text until the length of the alternative text is greater than the lower limit of the predetermined length interval, and judging whether the length of the alternative text is greater than the upper limit of the predetermined length interval and, if it is not greater than the upper limit, using the alternative text as the extracted quotation text;
An interception operation of, if the alternative text obtained by the extension operation is longer than the upper limit of the predetermined length interval, intercepting the alternative text along the interception direction in units of interception units, starting from the non-boundary-symbol character at the head or tail of the alternative text, until the length of the alternative text is less than the upper limit of the predetermined length interval;
A minimum semantic primitive extension and interception operation of, if the length of the alternative text after the interception operation is less than the lower limit of the predetermined length interval, extending the alternative text along the extension direction in units of the minimum semantic primitives, starting from the non-boundary-symbol character at the head or tail of the alternative text, until the length of the alternative text is greater than the lower limit of the predetermined length interval; if, after the minimum semantic primitive extension, the length of the alternative text is greater than the upper limit of the predetermined length interval, intercepting the alternative text along the interception direction in units of the minimum semantic primitives, starting from the non-boundary-symbol character at the head or tail of the alternative text; and obtaining, through multiple iterations of minimum semantic primitive extension and interception, an alternative text whose length lies within the predetermined length interval as the quotation text.
15. The automatic quotation extraction method according to claim 14, characterized in that an extension direction flag identifies the extension direction as head extension or tail extension, and an interception direction flag identifies the interception direction as head interception or tail interception.
16. The automatic quotation extraction method according to claim 14, characterized in that each time an expansion unit or a minimum semantic primitive is extended, and each time an interception unit or a minimum semantic primitive is intercepted, whether to change the extension direction or the interception direction is decided according to whether the ratio of the text lengths before and after the character or character string serving as the reading focus in the alternative text reaches a predetermined direction change threshold.
17. An automatic quotation extraction device, characterized by comprising:
A focus setting module for selecting, from a text, a character or character string serving as a reading focus;
A content extraction module for extracting the context centered on the reading focus by text extension and/or interception performed in units of complete semantic primitives, so as to obtain a quotation text whose length lies within a predetermined length interval and which has semantic integrity;
Wherein the content extraction module is used to perform the following operations:
An extension step of choosing alternative text, starting from the character or character string serving as the reading focus and along an extension direction, in units of large-scale complete semantic primitives delimited by extension-type boundary symbols; and/or
An interception step of intercepting the alternative text, along an interception direction, in units of smaller-scale complete semantic primitives delimited by intercepting-type boundary symbols;
And
A minimum semantic primitive extension and interception step of, for the alternative text processed by the extension step and/or the interception step, extending the alternative text along the extension direction and/or intercepting it along the interception direction in units of minimum semantic primitives;
Wherein the extension-type boundary symbols and the intercepting-type boundary symbols are each boundary symbols of predefined types.
18. quotation automatic extracting devices according to claim 17, it is characterised in that the complete semantic primitive includes: The text fragment with various yardsticks limited by the different types of boundary symbol included in text, and had by text The minimum semantic primitive being made up of independent semantic character or character string.
19. quotation automatic extracting devices according to claim 18, it is characterised in that described device includes symbol table, institute Symbol table is stated for preserving the type of the predefined boundary symbol and the set of minimum semantic primitive.
20. quotation automatic extracting devices according to claim 18, it is characterised in that the minimum semantic primitive includes: English word, Chinese character, URL addresses, E-mail address, time format, between the punctuation mark for using in pairs Text fragment, the text fragment with specific font form.
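As an illustration of claim 20, a text could be split into minimum semantic units with a single regular expression. This is a sketch under assumptions, not the patented implementation: the patterns, the name `split_min_units`, and the extra rule for digit runs are hypothetical, and text between paired punctuation marks and font-formatted segments are not handled.

```python
import re

# Order matters: the more specific patterns are tried first at each position.
MIN_UNIT = re.compile(
    r"https?://[^\s\u4e00-\u9fff]+"                          # URL
    r"|[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"  # e-mail address
    r"|\d{1,2}:\d{2}(?::\d{2})?"                             # time such as 10:30 or 10:30:05
    r"|[A-Za-z]+(?:'[A-Za-z]+)*"                             # English word
    r"|\d+"                                                  # run of digits (assumption)
    r"|[\u4e00-\u9fff]"                                      # single Chinese character
    r"|\S"                                                   # any other visible character
)


def split_min_units(text):
    """Split text into minimum semantic units, dropping whitespace."""
    return MIN_UNIT.findall(text)
```

Because a URL or an e-mail address is matched as one unit, a later extension or truncation step that works unit by unit would never cut it in half.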
21. The automatic quotation extraction device according to claim 18, the device further comprising a direction detection module, configured to decide whether to change the extension direction and the truncation direction according to whether the ratio of the lengths of the text located before and after the character or character string serving as the reading focus in the candidate text reaches a predetermined direction-change threshold.
22. The automatic quotation extraction device according to claim 18, characterized in that the device further includes a quotation length setting module, configured to predefine the predetermined length interval of the quotation text.
23. The automatic quotation extraction device according to claim 18, characterized in that the content extraction module is further configured to extract an initial candidate text that is located between valid structural nodes of the text and contains the character or character string serving as the reading focus; and the device further includes a text analysis module, which determines, by analyzing the initial candidate text, the boundary symbol types used to delimit the complete semantic units and the set of minimum semantic units.
24. The automatic quotation extraction device according to claim 23, characterized in that the text analysis module determines the boundary symbol types and the set of minimum semantic units according to the language type of the initial candidate text.
25. The automatic quotation extraction device according to claim 23, characterized in that the length of the initial candidate text extracted by the content extraction module is within a permitted candidate-quotation length interval, and the quotation length setting module of the device is configured to calculate the permitted candidate-quotation length interval from the predetermined length interval.
26. The automatic quotation extraction device according to claim 23, characterized in that the content extraction module takes the structured node corresponding to the character or character string serving as the reading focus as a starting point, traverses the structured nodes before and after the starting point, excludes the invalid structural nodes among them together with the text they contain, and then selects, as the initial candidate text, text that is located between valid structural nodes and whose length is within the permitted candidate-quotation length interval.
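One possible reading of the node traversal in claim 26 is sketched below. It assumes the document has already been flattened into a list of `(node_type, text)` pairs in reading order; the `VALID_TYPES` table, the function name `initial_candidate`, and the policy of alternating between the following and preceding nodes are hypothetical choices that the claim itself does not specify.

```python
from collections import deque

VALID_TYPES = {"p", "span", "li", "#text"}  # hypothetical valid-node table


def initial_candidate(nodes, focus_idx, max_len, valid_types=VALID_TYPES):
    """Collect text around the focus node, skipping invalid nodes.

    nodes: list of (node_type, text) pairs in reading order.
    max_len: upper end of the permitted candidate-quotation length interval.
    """
    picked = deque([nodes[focus_idx][1]])
    total = len(picked[0])
    left, right = focus_idx - 1, focus_idx + 1
    while left >= 0 or right < len(nodes):
        if right < len(nodes):
            kind, text = nodes[right]
            right += 1
            if kind in valid_types:
                if total + len(text) > max_len:
                    right = len(nodes)   # no room left on this side
                else:
                    picked.append(text)
                    total += len(text)
        if left >= 0:
            kind, text = nodes[left]
            left -= 1
            if kind in valid_types:
                if total + len(text) > max_len:
                    left = -1            # no room left on this side
                else:
                    picked.appendleft(text)
                    total += len(text)
    return "".join(picked)
```

Invalid nodes (for example script or style elements in a web page) contribute no text, so the initial candidate text only contains readable content near the reading focus.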
27. The automatic quotation extraction device according to claim 26, characterized in that the device includes a valid node table for storing the predefined types of valid structural nodes.
28. The automatic quotation extraction device according to claim 17, characterized in that the complete semantic units are divided into: extension units, which are text segments delimited by the extension-type boundary symbols contained in the text; truncation units, which are text segments delimited by the truncation-type boundary symbols contained in the text; and minimum semantic units, which are the smallest units consisting of characters or character strings in the text that have independent semantics; and the scale of the text segments delimited by the extension-type boundary symbols is larger than the scale of the text segments delimited by the truncation-type boundary symbols.
29. The automatic quotation extraction device according to claim 28, characterized in that the device stores the predefined types of extension-type boundary symbols in an extension boundary symbol table, stores the predefined types of truncation-type boundary symbols in a truncation boundary symbol table, and stores the predefined minimum semantic units in a minimum semantic unit set.
30. The automatic quotation extraction device according to claim 28, characterized in that the content extraction module is configured to perform the following operations:
an extension operation: taking the character or character string serving as the reading focus as initial data, text is extracted along the extension direction in units of the extension unit and added to the candidate text, until the length of the candidate text is greater than the lower limit of the predetermined length interval; whether the length of the candidate text exceeds the upper limit of the predetermined length interval is then judged, and if it does not exceed the upper limit, the candidate text is used as the extracted quotation text;
a truncation operation: if the candidate text obtained by the extension operation is longer than the upper limit of the predetermined length interval, the candidate text is truncated along the truncation direction in units of the truncation unit, starting from a non-boundary-symbol character at the head or tail of the candidate text, until the length of the candidate text is less than the upper limit of the predetermined length interval;
a minimum-semantic-unit extension and truncation operation: if, after the truncation operation, the length of the candidate text is less than the lower limit of the predetermined length interval, the candidate text is extended along the extension direction in units of the minimum semantic unit, starting from a non-boundary-symbol character at the head or tail of the candidate text, until the length of the candidate text is greater than the lower limit of the predetermined length interval; if, after the minimum-semantic-unit extension, the length of the candidate text is greater than the upper limit of the predetermined length interval, the candidate text is truncated along the truncation direction in units of the minimum semantic unit, starting from a non-boundary-symbol character at the head or tail of the candidate text; through repeated iterations of minimum-semantic-unit extension and truncation, a candidate text whose length lies within the predetermined length interval is obtained and used as the quotation text.
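The three operations of claim 30 can be illustrated with the following minimal sketch. Everything in it is an assumption made for illustration: the boundary symbol sets, the simple tokenizer, the requirement that the focus be a single unit, the rule of always working on the shorter side when growing and the longer side when cutting, the inclusive length bounds, and all function names.

```python
import re

EXT_BOUNDS = set(".!?。!?")   # assumed extension-type boundary symbols
CUT_BOUNDS = set(",;:,;:")   # assumed truncation-type boundary symbols


def tokenize(text):
    # Assumed minimum units: runs of ASCII letters/digits, otherwise single characters.
    return re.findall(r"[A-Za-z0-9']+|.", text, flags=re.S)


def grow(toks, lo, hi, stops, head):
    """Extend the token range [lo, hi) by one unit at the head or tail."""
    if head:
        if lo > 0:
            lo -= 1
            while stops and lo > 0 and toks[lo - 1] not in stops:
                lo -= 1
    elif hi < len(toks):
        hi += 1
        while stops and hi < len(toks) and toks[hi - 1] not in stops:
            hi += 1
    return lo, hi


def shrink(toks, lo, hi, flo, fhi, stops, head):
    """Cut one unit from the head or tail without touching the focus [flo, fhi)."""
    if head:
        if lo < flo:
            lo += 1
            while stops and lo < flo and toks[lo - 1] not in stops:
                lo += 1
    elif hi > fhi:
        hi -= 1
        while stops and hi > fhi and toks[hi - 1] not in stops:
            hi -= 1
    return lo, hi


def extract_quotation(text, focus, min_len, max_len, max_rounds=50):
    toks = tokenize(text)
    flo = toks.index(focus)        # simplification: the focus is a single unit
    fhi = flo + 1
    lo, hi = flo, fhi

    def size(a, b):
        return sum(len(t) for t in toks[a:b])

    def head_not_longer():
        return size(lo, flo) <= size(fhi, hi)

    def apply(op, head):
        # Try the preferred side first, then fall back to the other side.
        new = op(head)
        return new if new != (lo, hi) else op(not head)

    def grow_by(stops):
        return apply(lambda h: grow(toks, lo, hi, stops, h), head_not_longer())

    def shrink_by(stops):
        return apply(lambda h: shrink(toks, lo, hi, flo, fhi, stops, h), not head_not_longer())

    # Extension operation: add large units until longer than the lower limit.
    while size(lo, hi) <= min_len:
        new = grow_by(EXT_BOUNDS)
        if new == (lo, hi):
            break                  # the whole text is already included
        lo, hi = new
    # Truncation operation: drop smaller units while over the upper limit.
    while size(lo, hi) > max_len:
        new = shrink_by(CUT_BOUNDS)
        if new == (lo, hi):
            break
        lo, hi = new
    # Minimum-unit iteration: nudge the length into [min_len, max_len].
    for _ in range(max_rounds):
        n = size(lo, hi)
        if min_len <= n <= max_len:
            break
        new = grow_by(None) if n < min_len else shrink_by(None)
        if new == (lo, hi):
            break
        lo, hi = new
    return "".join(toks[lo:hi])
```

A call such as `extract_quotation(text, "fox", 30, 50)` would return a span of `text` around the word "fox". If the whole text is shorter than `min_len`, the sketch simply returns the whole text; the `max_rounds` cap guards against oscillation when the interval is narrower than a single unit.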
31. The automatic quotation extraction device according to claim 30, characterized in that the direction detection module of the device uses an extension-direction flag bit to indicate whether the extension direction is head extension or tail extension, and uses a truncation-direction flag bit to indicate whether the truncation direction is head truncation or tail truncation.
32. The automatic quotation extraction device according to claim 30, characterized in that each time the content extraction module extends an extension unit or a minimum semantic unit, and each time the content extraction module truncates a truncation unit or a minimum semantic unit, the direction detection module of the device decides whether to change the extension direction or the truncation direction according to whether the ratio of the lengths of the text located before and after the character or character string serving as the reading focus in the candidate text reaches a predetermined direction-change threshold.
CN201410301560.0A 2014-06-27 2014-06-27 Automatic quotation extraction method and device with semantic integrity kept Active CN104050158B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410301560.0A CN104050158B (en) 2014-06-27 2014-06-27 Automatic quotation extraction method and device with semantic integrity kept

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410301560.0A CN104050158B (en) 2014-06-27 2014-06-27 Automatic quotation extraction method and device with semantic integrity kept

Publications (2)

Publication Number Publication Date
CN104050158A CN104050158A (en) 2014-09-17
CN104050158B true CN104050158B (en) 2017-05-17

Family

ID=51503012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410301560.0A Active CN104050158B (en) 2014-06-27 2014-06-27 Automatic quotation extraction method and device with semantic integrity kept

Country Status (1)

Country Link
CN (1) CN104050158B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105068992B (en) * 2015-07-29 2019-04-26 魅族科技(中国)有限公司 Search result display method and device
US11500535B2 (en) 2015-10-29 2022-11-15 Lenovo (Singapore) Pte. Ltd. Two stroke quick input selection
CN105955616B (en) * 2016-04-29 2019-05-07 北京小米移动软件有限公司 Method and apparatus for selecting document content
CN106970847A (en) * 2017-03-28 2017-07-21 飞驰镁物(北京)信息服务有限公司 Content cuts Tibetan method and system, user terminal and server
CN108388664A (en) * 2018-03-14 2018-08-10 深圳市网域科技股份有限公司 Integration method, device, computer equipment and the storage medium of sentence segment
CN114817520A (en) * 2021-01-19 2022-07-29 华为技术有限公司 Method and device for determining abstract of search result and electronic equipment
CN115080170A (en) * 2022-06-29 2022-09-20 维沃移动通信有限公司 Information processing method, information processing apparatus, and electronic device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101201841A (en) * 2007-02-15 2008-06-18 刘二中 Convenient method and system for electronic text-processing and searching
CN101539904A (en) * 2009-04-21 2009-09-23 武汉大学 Automatic indexing method of quotations

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8645125B2 (en) * 2010-03-30 2014-02-04 Evri, Inc. NLP-based systems and methods for providing quotations

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101201841A (en) * 2007-02-15 2008-06-18 刘二中 Convenient method and system for electronic text-processing and searching
CN101539904A (en) * 2009-04-21 2009-09-23 武汉大学 Automatic indexing method of quotations

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
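Accurate extraction of link-context-related text from web pages; 徐晴阳 (Xu Qingyang); China Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology; 2004-12-15 (No. 4); I139-329 *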

Also Published As

Publication number Publication date
CN104050158A (en) 2014-09-17

Similar Documents

Publication Publication Date Title
CN104050158B (en) Automatic quotation extraction method and device with semantic integrity kept
WO2017177809A1 (en) Word segmentation method and system for language text
CN109033282B (en) Webpage text extraction method and device based on extraction template
CN106776555B (en) Comment text entity recognition method and device based on word model
TW201804341A (en) Character string segmentation method, apparatus and device
CN106055667A (en) Method for extracting core content of webpage based on text-tag density
US10599748B2 (en) Systems and methods for asymmetrical formatting of word spaces according to the uncertainty between words
JP2010134922A (en) Similar word determination method and system
CN111199151A (en) Data processing method and data processing device
US11074306B2 (en) Web content extraction method, device, storage medium
US9996508B2 (en) Input assistance device, input assistance method and storage medium
CN111435405A (en) Method and device for automatically labeling key sentences of article
CN115546815A (en) Table identification method, device, equipment and storage medium
CN108664522A (en) Web page processing method and device
JP2009265770A (en) Significant sentence presentation system
KR102182248B1 (en) System and method for checking grammar and computer program for the same
CN108132919A (en) Method of webpage content extraction
KR20220113075A (en) Word cloud system based on korean noun extraction tokenizer
JP2006053866A (en) Detection method of notation variability of katakana character string
KR101909537B1 (en) System and method for classifying social data
JP2008225566A (en) Device and method for extracting related information
CN111444716A (en) Title word segmentation method, terminal and computer readable storage medium
KR101634681B1 (en) Method and program for searching quoted phrase in document
JP2007148630A (en) Patent analyzing device, patent analyzing system, patent analyzing method and program
JP2014112306A (en) Demand sentence extract device, demand content identification model learning device, method and program

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent for invention or patent application
CB02 Change of applicant information

Address after: Yuhua Road, Qinhuai District of Nanjing City, Jiangsu province 210000 No. 22 treasure garden 22-302

Applicant after: Wu Taojun

Address before: 200000 West Yan'an Road 900 Road, Changning District, Shanghai

Applicant before: Wu Taojun

GR01 Patent grant