CN104615654B - Text summary acquisition method and device - Google Patents
Text summary acquisition method and device
- Publication number
- CN104615654B (application number CN201410850654.3A)
- Authority
- CN
- China
- Prior art keywords
- feature
- word segmentation
- chinese word
- text
- vocabulary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
Abstract
An embodiment of the invention discloses a text summary acquisition method. The method includes: obtaining text data from a target file; judging whether the text data contains a summary keyword, where the summary keyword indicates the position of the text summary in the text data; if the text data contains the summary keyword, counting the number of words in the text paragraph containing the summary keyword; comparing the counted number of words with a first preset threshold; and, if the number of words is less than the first preset threshold, determining the text paragraph containing the summary keyword to be the text summary. With this embodiment, a text summary can be identified quickly from a summary keyword, which improves the efficiency, precision, and comprehensiveness of text summary identification and acquisition.
Description
Technical field
The present invention relates to the field of communication technology, and in particular to a text summary acquisition method and device.
Background
The volume of text on the Internet is enormous, its content is varied and complex, and a large number of web pages cover the same subject matter. Obtaining key information from text quickly and effectively requires text summary acquisition technology. Extracting the key information in a document accurately and rapidly is the core of any text summary acquisition method and is of great practical significance. There are currently two main approaches. The first analyzes the key terms in the web page text, counts term frequencies to compute a weight for each sentence, and builds the web page summary from the sentences according to their weights. The second divides the text into several semantic sections, measures the importance of each sentence within its semantic section, extracts the more important sentences as the representative sentences of each section, and composes the document summary from those representative sentences.
Both approaches have drawbacks. The former merely pieces several sentences together mechanically, so the coherence and logic of the resulting sentences cannot be guaranteed. The latter selects only the central sentences of the semantic sections to form the summary, so the semantic transitions between them can be incoherent.
Summary of the invention
Embodiments of the invention provide a text summary acquisition method and device that can identify a text summary quickly from a summary keyword, improving the efficiency, precision, and comprehensiveness of text summary identification and acquisition.
An embodiment of the invention discloses a text summary acquisition method, including:
obtaining text data from a target file;
judging whether the text data contains a summary keyword, where the summary keyword indicates the position of the text summary in the text data;
if the text data contains the summary keyword, counting the number of words in the text paragraph containing the summary keyword;
comparing the counted number of words with a first preset threshold; and
if the comparison shows that the number of words is less than the first preset threshold, determining the text paragraph containing the summary keyword to be the text summary.
Correspondingly, an embodiment of the invention also discloses a text summary acquisition device, including:
an acquiring unit, configured to obtain text data from a target file;
a judging unit, configured to judge whether the text data contains a summary keyword, where the summary keyword indicates the position of the text summary in the text data;
a counting unit, configured to count the number of words in the text paragraph containing the summary keyword when the judging unit judges that the text data contains the summary keyword;
a comparing unit, configured to compare the number of words counted by the counting unit with a first preset threshold; and
a determining unit, configured to determine the text paragraph containing the summary keyword to be the text summary when the comparing unit finds that the number of words is less than the first preset threshold.
In the embodiments of the invention, text data is obtained from a target file; it is judged whether the text data contains a summary keyword, where the summary keyword indicates the position of the text summary in the text data; if the text data contains the summary keyword, the number of words in the text paragraph containing the summary keyword is counted; the counted number of words is compared with a first preset threshold; and, if the number of words is less than the first preset threshold, the text paragraph containing the summary keyword is determined to be the text summary. By implementing the embodiments of the invention, a text summary can be identified quickly from a summary keyword, which improves the efficiency, precision, and comprehensiveness of text summary identification and acquisition.
Brief description of the drawings
To explain the technical solutions of the embodiments of the invention more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a text summary acquisition method disclosed in the first embodiment of the invention;
Fig. 2a is a schematic flowchart of a text summary acquisition method disclosed in the second embodiment of the invention;
Fig. 2b is a schematic flowchart of a text summary acquisition method disclosed in the third embodiment of the invention;
Fig. 3a is a schematic flowchart of a text summary acquisition method disclosed in the fourth embodiment of the invention;
Fig. 3b is a schematic flowchart of a text summary acquisition method disclosed in the fifth embodiment of the invention;
Fig. 4 is a schematic structural diagram of a text summary acquisition device disclosed in an embodiment of the invention.
Detailed description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.
An embodiment of the invention discloses a text summary acquisition method. The method includes: obtaining text data from a target file; judging whether the text data contains a summary keyword, where the summary keyword indicates the position of the text summary in the text data; if the text data contains the summary keyword, counting the number of words in the text paragraph containing the summary keyword; comparing the counted number of words with a first preset threshold; and, if the number of words is less than the first preset threshold, determining the text paragraph containing the summary keyword to be the text summary. The embodiment can identify a text summary quickly from a summary keyword, improving the efficiency, precision, and comprehensiveness of text summary identification and acquisition.
The technical solutions of the embodiments of the invention are described in detail below with reference to the accompanying drawings and the embodiments.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of a text summary acquisition method disclosed in the first embodiment of the invention. As shown in Fig. 1, the text summary acquisition method of this embodiment may include the following steps:
Step S101: obtain the text data in the target file.
Step S102: judge whether the text data contains a summary keyword, where the summary keyword indicates the position of the text summary in the text data.
Step S103: if the text data contains the summary keyword, count the number of words in the text paragraph containing the summary keyword.
Step S104: compare the counted number of words with the first preset threshold.
Step S105: if the comparison shows that the number of words is less than the first preset threshold, determine the text paragraph containing the summary keyword to be the text summary.
In this embodiment, the text summary acquisition method can be applied to a terminal with text processing capability, or to various document servers. A document server can process the text data in the target file to obtain the text summary and then forward the text summary and the text data to a terminal device, whose display screen outputs the text summary and the text data content. The terminal may include, but is not limited to, a mobile device, a notebook computer, a tablet computer, a smart device, or a wearable device, and may run an operating system such as Symbian, Android, Windows, or iOS (the operating system developed by Apple Inc.). The embodiments of the invention do not specifically limit the application scenarios of the text summary acquisition method.
Optionally, in this embodiment, step S101 (obtaining the text data in the target file) may be implemented as follows. A text processing unit with text processing capability obtains the text data in the target file over a wired or wireless communication link. The target file may be a file on a static storage device of the terminal such as a hard disk, a file stored in a network database, or a file stored on any kind of mobile device. The content of the target file may include text, pictures, video, and so on. The text processing unit is mainly used to obtain the text data in the target file. For data such as pictures and video, image recognition, video analysis, and similar techniques may be used to convert the non-text data in the target file into text data corresponding to its content, so that the content of the target file is analyzed more comprehensively when the text summary is extracted. For example, the text processing unit extracts from a network database a target file containing text and a picture, specifically 200 words of text and 1 picture that itself contains 20 words. The text processing unit recognizes the 20 words in the picture and thus obtains 220 words of content from the target file in total.
Optionally, in this embodiment, step S102 (judging whether the text data contains a summary keyword, where the summary keyword indicates the position of the text summary in the text data) may be implemented as follows. The text processing unit scans the text data of the target file and judges whether it contains a summary keyword. The summary keyword may include, but is not limited to, "document summary", "summary", "core notice", and so on. The text paragraph containing the summary keyword is generally the summary paragraph of the target file.
Optionally, in this embodiment, step S103 (counting the number of words in the text paragraph containing the summary keyword once the text data is judged to contain the summary keyword) may be implemented as follows. After the text processing unit judges that the text data of a target file contains a summary keyword, it also counts the number of words in the text paragraph containing that keyword, so that the word count can be used to further determine whether that text paragraph is the summary paragraph.
Optionally, in this embodiment, step S104 (comparing the counted number of words with the first preset threshold) may be implemented as follows. After the text processing unit obtains the number of words counted in step S103, it compares that number with the first preset threshold, which is a reference word count set in advance by the user. For example, if the first preset threshold is set to 200, the text processing unit compares the counted number of words with the preset value 200.
Optionally, in this embodiment, step S105 (determining the text paragraph containing the summary keyword to be the text summary if the number of words is less than the first preset threshold) may be implemented as follows. When the text processing unit finds that the number of words counted in step S103 is less than the first preset threshold, it determines that the text paragraph containing the summary keyword is the text summary of the target file.
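Steps S102 to S105 can be illustrated with a minimal Python sketch. It is not the patented implementation: the keyword tuple, the function name `find_keyword_summary`, splitting paragraphs on newlines, and counting whitespace-separated words are all assumptions made for illustration (for Chinese text a character count would be the natural choice). The threshold of 200 is the example value given for step S104.

```python
from typing import Optional, Sequence

# Hypothetical keyword list, modeled on the examples given for step S102.
SUMMARY_KEYWORDS = ("document summary", "summary", "core notice")
FIRST_THRESHOLD = 200  # example value used for step S104


def find_keyword_summary(
    text: str,
    keywords: Sequence[str] = SUMMARY_KEYWORDS,
    threshold: int = FIRST_THRESHOLD,
) -> Optional[str]:
    """Return the paragraph that contains a summary keyword if it is short enough."""
    for paragraph in text.split("\n"):
        paragraph = paragraph.strip()
        lowered = paragraph.lower()
        # S102: does this paragraph contain a summary keyword?
        if any(keyword in lowered for keyword in keywords):
            # S103: count the words of that paragraph (assumed: whitespace tokens).
            word_count = len(paragraph.split())
            # S104/S105: accept the paragraph only if it is below the threshold.
            return paragraph if word_count < threshold else None
    return None  # no summary keyword found


document = "Title line\nSummary: a short overview of the document.\nBody text goes here."
print(find_keyword_summary(document))
```

Returning `None` when the keyword paragraph is too long is a choice of this sketch; the first embodiment does not say what happens in that case.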
In the text summary acquisition method shown in Fig. 1, text data is obtained from a target file; it is judged whether the text data contains a summary keyword, where the summary keyword indicates the position of the text summary in the text data; if the text data contains the summary keyword, the number of words in the text paragraph containing the summary keyword is counted; the counted number of words is compared with the first preset threshold; and, if the number of words is less than the first preset threshold, the text paragraph containing the summary keyword is determined to be the text summary. By implementing the method shown in Fig. 1, a text summary can be identified quickly from a summary keyword, which improves the efficiency, precision, and comprehensiveness of text summary identification and acquisition.
Referring to Fig. 2a, Fig. 2a is a schematic flowchart of a text summary acquisition method disclosed in the second embodiment of the invention. As shown in Fig. 2a, the text summary acquisition method of this embodiment may include the following steps:
Step S201: obtain the text data in the target file.
Step S202: judge whether the text data contains a summary keyword, where the summary keyword indicates the position of the text summary in the text data.
Step S203: if the text data contains the summary keyword, count the number of words in the text paragraph containing the summary keyword.
Step S204: compare the counted number of words with the first preset threshold.
Step S205: if the text data does not contain the summary keyword, extract first feature Chinese word segments from the first text paragraph of the text data.
Step S206: generate a first feature vocabulary from the first feature Chinese word segments.
Step S207: extract second feature Chinese word segments from the text data.
Step S208: generate a second feature vocabulary from the second feature Chinese word segments.
Step S209: determine a first weight for each second feature Chinese word segment, where the first weight is determined by the number of occurrences of that individual feature Chinese word segment and the sum of the numbers of occurrences of all the second feature Chinese word segments.
Step S2010: arrange the second feature Chinese word segments in the second feature vocabulary in a first specified order of the first weight.
Step S2011: extract the second feature Chinese word segments whose first weights rank in the top N1 of the second feature vocabulary, where N1 is determined by the number of first feature Chinese word segments and a first preset coefficient, and is an integer greater than or equal to 1.
Step S2012: generate a third feature vocabulary from the second feature Chinese word segments whose first weights rank in the top N1 of the second feature vocabulary.
Step S2013: determine the goodness of fit between the first feature vocabulary and the third feature vocabulary.
Step S2014: compare the goodness of fit with a second preset threshold.
Step S2015: if the comparison shows that the goodness of fit is greater than or equal to the second preset threshold, determine the first text paragraph to be the text summary.
In this embodiment, the text summary acquisition method can be applied to a terminal with text processing capability, or to various document servers. A document server can process the text data in the target file to obtain the text summary and then forward the text summary and the text data to a terminal device, whose display screen outputs the text summary and the text data content. The terminal may include, but is not limited to, a mobile device, a notebook computer, a tablet computer, a smart device, or a wearable device, and may run an operating system such as Symbian, Android, Windows, or iOS (the operating system developed by Apple Inc.). The embodiments of the invention do not specifically limit the application scenarios of the text summary acquisition method.
Optionally, in this embodiment, steps S201 to S204 are the same as steps S101 to S104 in the first embodiment and are not described again here.
Optionally, in this embodiment, step S205 (extracting first feature Chinese word segments from the first text paragraph of the text data if the text data does not contain the summary keyword) may be implemented as follows. The text processing unit judges that the text data of the target file does not contain a summary keyword. Because most texts give a broad description of their content in the first text paragraph, that is, the summary part generally appears in the first paragraph of a text, the text processing unit first analyzes the first text paragraph of the text data and extracts multiple first feature Chinese word segments from it. The first feature Chinese word segments may include, but are not limited to, nouns, verbs, and so on.
Optionally, in this embodiment, step S206 (generating the first feature vocabulary from the first feature Chinese word segments) may be implemented as follows. The text processing unit generates the first feature vocabulary from the first feature Chinese word segments extracted in step S205, together with the number of occurrences of each first feature Chinese word segment and its order of appearance in the text data.
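A feature vocabulary that records occurrence counts and order of appearance can be sketched as follows. This is an illustrative assumption, not the patent's data structure: the function name `build_feature_vocabulary` and the dictionary layout are hypothetical, and the input is assumed to be a list of words that has already been segmented and filtered to feature words (for example nouns and verbs) by some upstream tool.

```python
from typing import Dict, List


def build_feature_vocabulary(tokens: List[str]) -> Dict[str, Dict[str, int]]:
    """Map each feature word to its occurrence count and first position.

    `tokens` is assumed to be pre-segmented and pre-filtered; segmentation
    and part-of-speech filtering are outside the scope of this sketch.
    """
    vocabulary: Dict[str, Dict[str, int]] = {}
    for position, word in enumerate(tokens):
        if word not in vocabulary:
            # Record where the word first appears; dicts keep insertion order,
            # so iterating the vocabulary also yields order of appearance.
            vocabulary[word] = {"count": 0, "first_pos": position}
        vocabulary[word]["count"] += 1
    return vocabulary


first_paragraph_tokens = ["soybean", "grinding", "soybean", "soy milk"]
print(build_feature_vocabulary(first_paragraph_tokens))
```

The same helper could be reused for the second feature vocabulary by passing the tokens of the whole text instead of the first paragraph.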
Optionally, in this embodiment, step S207 (extracting second feature Chinese word segments from the text data) may be implemented as follows. After generating the first feature vocabulary, the text processing unit further extracts the second feature Chinese word segments from the whole of the text data. The second feature Chinese word segments therefore also include the first feature Chinese word segments extracted in step S205. Because the extraction covers the text data of the entire target file, the number of second feature Chinese word segments, the number of occurrences of each one, and their order of appearance will generally differ from those in the first feature vocabulary.
Optionally, in this embodiment, step S208 (generating the second feature vocabulary from the second feature Chinese word segments) may be implemented as follows. The text processing unit obtains the second feature Chinese word segments extracted in step S207 and generates the second feature vocabulary from information such as their number, their numbers of occurrences, and their order of appearance.
Optionally, in this embodiment, step S209 (determining the first weight of each second feature Chinese word segment) may be implemented as follows. After generating the second feature vocabulary, the text processing unit determines the weight of each second feature Chinese word segment from its number of occurrences and the sum of the numbers of occurrences of all the second feature Chinese word segments. For example, the second feature vocabulary contains the second feature Chinese word segments "soy milk" (5 occurrences), "soybean" (4 occurrences), and "grinding" (4 occurrences), so the sum of the occurrences is 13 (5+4+4=13). The text processing unit can then determine that the first weight of "soy milk" is 0.38 (5/13), the first weight of "soybean" is 0.31 (4/13), and the first weight of "grinding" is 0.31 (4/13).
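The first weight described for step S209 is a relative frequency: the occurrences of one word divided by the occurrences of all feature words. A minimal sketch under that reading follows; the function name `first_weights` and the pre-segmented token list are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List


def first_weights(tokens: List[str]) -> Dict[str, float]:
    """Weight of each feature word = its occurrences / occurrences of all feature words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # An empty token list yields an empty dict, so no division by zero occurs.
    return {word: count / total for word, count in counts.items()}


# The worked example from the text: 5, 4, and 4 occurrences out of 13.
tokens = ["soy milk"] * 5 + ["soybean"] * 4 + ["grinding"] * 4
for word, weight in first_weights(tokens).items():
    print(f"{word}: {weight:.2f}")
```

Rounded to two decimals, 5/13 is 0.38 and 4/13 is 0.31, and the weights always sum to 1.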
Optionally, in this embodiment, step S2010 (arranging the second feature Chinese word segments in the second feature vocabulary in a first specified order of the first weight) may be implemented as follows. After determining the first weight of each second feature Chinese word segment in the second feature vocabulary, the text processing unit sorts the second feature Chinese word segments in the vocabulary in the first specified order of the first weight. The first specified order may be descending, ascending, and so on. For example, the original order of the second feature Chinese word segments in the second feature vocabulary is "soybean" (4 occurrences), "soy milk" (5 occurrences), "grinding" (4 occurrences). After the first weights of "soybean", "soy milk", and "grinding" are calculated as 0.31, 0.38, and 0.31 respectively, the second feature vocabulary is arranged in descending order of first weight as "soy milk" (5 occurrences), "soybean" (4 occurrences), "grinding" (4 occurrences), or alternatively as "soy milk" (5 occurrences), "grinding" (4 occurrences), "soybean" (4 occurrences), since the last two words have equal weights.
Optionally, in this embodiment, step S2011 (extracting the second feature Chinese word segments whose first weights rank in the top N1 of the second feature vocabulary) may be implemented as follows. After sorting the second feature vocabulary, the text processing unit extracts the second feature Chinese word segments whose first weights rank in the top N1. The value N1 is determined by the number of first feature Chinese word segments and the first preset coefficient, and is an integer greater than or equal to 1. For example, if the first preset coefficient is set to 1.5 and there are 9 first feature Chinese word segments, N1 is 13 or 14 (9*1.5=13.5, rounded down or up), and the text processing unit extracts the second feature Chinese word segments whose first weights rank in the top 13 or top 14 of the second feature vocabulary.
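The ranking of step S2010 and the top-N1 selection of step S2011 can be sketched together. The function names and the `round_up` flag are assumptions; the text only says that N1 is derived from the number of first feature words and the first preset coefficient and is an integer of at least 1, and its example allows either rounding direction.

```python
import math
from typing import Dict, List


def rank_by_weight(weights: Dict[str, float], descending: bool = True) -> List[str]:
    """Order words by weight. sorted() is stable, so words with equal
    weights keep their original relative order."""
    return sorted(weights, key=weights.get, reverse=descending)


def select_top_n1(
    ranked_words: List[str],
    first_feature_count: int,
    coefficient: float = 1.5,
    round_up: bool = False,
) -> List[str]:
    """Keep the first N1 ranked words, with N1 = count * coefficient rounded
    down (or up) and never smaller than 1."""
    raw = first_feature_count * coefficient
    n1 = math.ceil(raw) if round_up else math.floor(raw)
    n1 = max(1, n1)
    return ranked_words[:n1]


weights = {"soybean": 4 / 13, "soy milk": 5 / 13, "grinding": 4 / 13}
print(rank_by_weight(weights))

ranked = [f"w{i}" for i in range(20)]  # placeholder ranked vocabulary
print(len(select_top_n1(ranked, 9)), len(select_top_n1(ranked, 9, round_up=True)))
```

With 9 first feature words and a coefficient of 1.5, the sketch keeps 13 words when rounding down and 14 when rounding up, matching the example in the text. If the ranked vocabulary holds fewer than N1 words, the slice simply returns all of them.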
Optionally, in this embodiment, step S2012 (generating the third feature vocabulary from the second feature Chinese word segments whose first weights rank in the top N1 of the second feature vocabulary) may be implemented as follows. The text processing unit generates the third feature vocabulary from the top-N1 second feature Chinese word segments extracted in step S2011.
Optionally, in this embodiment, step S2013 (determining the goodness of fit between the first feature vocabulary and the third feature vocabulary) may, as shown in Fig. 2b, be implemented through the following steps:
Step S20131: count the number of first feature Chinese word segments in the first feature vocabulary that are identical to second feature Chinese word segments in the third feature vocabulary, and count the total number of first feature Chinese word segments in the first feature vocabulary.
Step S20132: calculate the goodness of fit between the first feature vocabulary and the third feature vocabulary from the counted number of identical segments and the total number of first feature Chinese word segments.
For example, if the text processing unit counts 12 first feature Chinese word segments in the first feature vocabulary that are identical to second feature Chinese word segments in the third feature vocabulary, and the total number of first feature Chinese word segments is 13, then the goodness of fit between the first feature vocabulary and the third feature vocabulary is 92% (12/13).
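Steps S20131 and S20132 amount to dividing the number of shared words by the size of the first feature vocabulary. The sketch below assumes each vocabulary is given as a list of distinct words and that the second preset threshold is 0.90, the example value mentioned for step S2014; the function name `goodness_of_fit` is hypothetical.

```python
from typing import Iterable

SECOND_THRESHOLD = 0.90  # example empirical value from the text


def goodness_of_fit(first_vocab: Iterable[str], third_vocab: Iterable[str]) -> float:
    """Share of first-vocabulary words that also appear in the third vocabulary."""
    first = set(first_vocab)
    if not first:
        return 0.0  # assumed convention for an empty first vocabulary
    shared = first & set(third_vocab)
    return len(shared) / len(first)


# 13 first-paragraph words, 12 of which are among the top-weighted words.
first_vocab = [f"w{i}" for i in range(13)]
third_vocab = first_vocab[:12] + ["x", "y"]
fit = goodness_of_fit(first_vocab, third_vocab)
print(f"{fit:.0%}", fit >= SECOND_THRESHOLD)
```

A ratio such as 12/13 (about 92%) clears a 90% threshold, in which case the first text paragraph would be taken as the text summary.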
Optionally, in this embodiment, step S2014 (comparing the goodness of fit with the second preset threshold) may be implemented as follows. The text processing unit compares the goodness of fit determined in step S2013 with the second preset threshold, which is an empirical value such as 90%.
Optionally, in this embodiment, step S2015 (determining the first text paragraph to be the text summary if the goodness of fit is greater than or equal to the second preset threshold) may be implemented as follows. When the text processing unit finds that the goodness of fit between the first feature vocabulary and the third feature vocabulary is greater than or equal to the second preset threshold, it determines the first text paragraph to be the text summary.
In the text summary acquisition method shown in Fig. 2a and Fig. 2b, text data is obtained from a target file; it is judged whether the text data contains a summary keyword, where the summary keyword indicates the position of the text summary in the text data; if the text data contains the summary keyword, the number of words in the text paragraph containing the summary keyword is counted; the counted number of words is compared with the first preset threshold; and, if the number of words is less than the first preset threshold, the text paragraph containing the summary keyword is determined to be the text summary. By implementing the method shown in Fig. 2a and Fig. 2b, a text summary can be identified quickly from the summary keyword or the first text paragraph, which improves the efficiency, precision, and comprehensiveness of text summary identification and acquisition.
Referring to Fig. 3a, Fig. 3a is a schematic flowchart of a text summary acquisition method disclosed in the fourth embodiment of the invention. As shown in Fig. 3a, the text summary acquisition method of this embodiment may include the following steps:
Step S301, obtain the text data in a target file.
Step S302, judge whether the text data includes a summary keyword, the summary keyword being used to indicate the position of the text snippet in the text data.
Step S303, if the text data is judged to include the summary keyword, count the number of words of the text fragment where the summary keyword is located.
Step S304, compare the counted number of words with a first predetermined threshold.
Step S305, if the text data is judged not to include the summary keyword, extract the first feature Chinese word segmentations in the first text fragment of the text data.
Step S306, generate a first feature vocabulary according to the first feature Chinese word segmentations.
Step S307, extract the second feature Chinese word segmentations in the text data.
Step S308, generate a second feature vocabulary according to the second feature Chinese word segmentations.
Step S309, determine a first weight of each second feature Chinese word segmentation, the first weight being determined by the occurrence count of the single feature Chinese word segmentation among the second feature Chinese word segmentations and the sum of the occurrence counts of all feature Chinese word segmentations among the second feature Chinese word segmentations.
Step S3010, arrange the second feature Chinese word segmentations in the second feature vocabulary in a first specified order of the first weight.
Step S3011, extract the second feature Chinese word segmentations whose first weight ranks in the top N1 of the second feature vocabulary, N1 being determined by the number of first feature Chinese word segmentations and a first predetermined coefficient, and being an integer greater than or equal to 1.
Step S3012, generate a third feature vocabulary according to the second feature Chinese word segmentations whose first weight ranks in the top N1 of the second feature vocabulary.
Step S3013, determine the goodness of fit between the first feature vocabulary and the third feature vocabulary.
Step S3014, compare the goodness of fit with a second predetermined threshold.
Step S3015, if the goodness of fit is less than the second predetermined threshold, reject the stop words and everyday words in the text data.
Step S3016, extract the third feature Chinese word segmentations in the text header of the text data from which the stop words and the everyday words have been rejected.
Step S3017, generate a fourth feature vocabulary according to the third feature Chinese word segmentations.
Step S3018, extract the fourth feature Chinese word segmentations in the data other than the text header of the text data from which the stop words and the everyday words have been rejected.
Step S3019, generate a fifth feature vocabulary according to the fourth feature Chinese word segmentations.
Step S3020, determine a second weight of each fourth feature Chinese word segmentation, the second weight being determined by the occurrence count of the single feature Chinese word segmentation among the fourth feature Chinese word segmentations and the total occurrence count of all feature Chinese word segmentations among the fourth feature Chinese word segmentations.
Step S3021, arrange the fourth feature Chinese word segmentations in the fifth feature vocabulary in a second specified order of the second weight.
Step S3022, judge whether the fifth feature vocabulary contains a fourth feature Chinese word segmentation identical to a third feature Chinese word segmentation.
Step S3023, if the fifth feature vocabulary is judged to contain a fourth feature Chinese word segmentation identical to a third feature Chinese word segmentation, adjust the second weight of that fourth feature Chinese word segmentation in the fifth feature vocabulary to the maximum of the second weights of all fourth feature Chinese word segmentations.
Step S3024, rearrange the fourth feature Chinese word segmentations in the fifth feature vocabulary in the second specified order of the adjusted second weight.
Step S3025, divide the text data into multiple simple sentences according to specified punctuation marks.
Step S3026, extract, from the multiple simple sentences, the first simple sentences whose clause type is the statement clause.
Step S3027, generate a sentence list according to the extracted first simple sentences whose clause type is the statement clause.
Step S3028, judge whether a first simple sentence includes a fourth feature Chinese word segmentation.
Step S3029, if the first simple sentence is judged to include a fourth feature Chinese word segmentation, determine a third weight of the first simple sentence, the third weight being determined by the sum of the second weights of the fourth feature Chinese word segmentations included in the first simple sentence and the length of the first simple sentence.
Step S3030, if the first simple sentence is judged not to include a fourth feature Chinese word segmentation, determine that the third weight of the first simple sentence is zero.
Step S3031, arrange the first simple sentences in the sentence list in a third specified order of the third weight.
Step S3032, determine a quantity reference value of summary simple sentences according to a second predetermined coefficient, the number of simple sentences in the text data and the number of words of the text.
Step S3033, determine the quantity N2 of summary simple sentences according to the quantity reference value of summary simple sentences, N2 being an integer greater than or equal to 1.
Step S3034, determine the first simple sentences whose third weight ranks in the top N2 of the sentence list as the text snippet.
In this embodiment of the present invention, the above text snippet acquisition method can be applied to a terminal with text processing capability, and can also be applied to various document servers. After processing the text data in the target file and obtaining the text snippet, the document server can forward the text snippet and the text data to the display screen of a terminal device, which outputs and displays the text snippet and the text data content. The above terminal may include, but is not limited to: a mobile device, a notebook, a tablet computer, a smart device, a wearable device, etc. The terminal may run an operating system such as Symbian, Android, WINDOWS or IOS (the operating system developed by Apple Inc.). This embodiment of the present invention does not specifically limit the application scenario of the text snippet acquisition method.
Further optionally, in this embodiment of the present invention, steps S301 to S3014 are identical to steps S201 to S2014 in the second embodiment and are not repeated here.
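The first-paragraph check covered by steps S305 to S3014 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, word segmentation is assumed to have been done already, N1 is assumed to be the number of first feature Chinese word segmentations multiplied by the first predetermined coefficient and rounded up, and the goodness of fit is assumed to be the simple ratio of matching segmentations to the total.

```python
import math
from collections import Counter


def top_n1_segments(second_segments, first_count, first_coeff):
    """Return the N1 highest-weighted second feature segmentations
    (the content of the third feature vocabulary)."""
    counts = Counter(second_segments)
    total = sum(counts.values())
    # First weight: occurrences of one segmentation / occurrences of all.
    weights = {seg: c / total for seg, c in counts.items()}
    # Assumption: N1 = ceil(first_count * first_coeff), never below 1.
    n1 = max(1, math.ceil(first_count * first_coeff))
    ranked = sorted(weights, key=lambda seg: (-weights[seg], seg))
    return ranked[:n1]


def goodness_of_fit(first_vocab, third_vocab):
    """Assumed ratio: share of first feature segmentations that also
    appear in the third feature vocabulary."""
    first = set(first_vocab)
    if not first:
        return 0.0
    return len(first & set(third_vocab)) / len(first)


doc_segments = ["patent"] * 3 + ["trademark"] * 2 + ["copyright"]
third_vocab = top_n1_segments(doc_segments, first_count=2, first_coeff=1.0)
fit = goodness_of_fit(["patent", "trademark"], third_vocab)
```

If `fit` is greater than or equal to the second predetermined threshold, the first text fragment would be taken as the text snippet; otherwise the method continues with step S3015.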
Further optionally, in this embodiment of the present invention, a specific implementation of step S3015 (if the goodness of fit is less than the second predetermined threshold, rejecting the stop words and everyday words in the text data) may include: the text-processing unit finds that the goodness of fit determined in step S3013 is less than the second predetermined threshold, determines that the first text fragment in the text data is not the text snippet, and rejects the stop words and everyday words in the text data.
Further optionally, in this embodiment of the present invention, a specific implementation of step S3016 (extracting the third feature Chinese word segmentations in the text header of the text data from which the stop words and the everyday words have been rejected) may include: the text-processing unit extracts the third feature Chinese word segmentations in the text header, the stop words and everyday words in the text header having already been removed.
Further optionally, in this embodiment of the present invention, a specific implementation of step S3017 (generating the fourth feature vocabulary according to the third feature Chinese word segmentations) may include: the text-processing unit generates the fourth feature vocabulary according to the third feature Chinese word segmentations extracted in step S3016.
Further optionally, in this embodiment of the present invention, a specific implementation of step S3018 (extracting the fourth feature Chinese word segmentations in the data other than the text header of the text data from which the stop words and the everyday words have been rejected) may include: the text-processing unit extracts the fourth feature Chinese word segmentations in first text data, the first text data being the text data with the stop words, the everyday words and the text header removed.
Further optionally, in this embodiment of the present invention, a specific implementation of step S3020 (determining the second weight of the fourth feature Chinese word segmentations) may include: the text-processing unit obtains the occurrence count of each single feature Chinese word segmentation among the fourth feature Chinese word segmentations, obtains the total occurrence count of all feature Chinese word segmentations among the fourth feature Chinese word segmentations, and determines the second weight from that occurrence count and the total. For example, if the fourth feature Chinese word segmentations are "car" (occurring 3 times), "family expenses" (occurring 4 times) and "preferential" (occurring 2 times), the second weights are 0.33 (3/(3+4+2)) for "car", 0.44 (4/(3+4+2)) for "family expenses" and 0.22 (2/(3+4+2)) for "preferential".
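The weight calculation in this example can be reproduced with a few lines of Python. This is a minimal sketch; the function name `second_weights` is hypothetical and the input is assumed to be an already segmented list of feature words.

```python
from collections import Counter


def second_weights(segments):
    """Weight of a segmentation = its occurrence count divided by the
    total occurrence count of all segmentations."""
    counts = Counter(segments)
    total = sum(counts.values())
    return {seg: count / total for seg, count in counts.items()}


segments = ["car"] * 3 + ["family expenses"] * 4 + ["preferential"] * 2
weights = second_weights(segments)
```

Rounded to two decimals, the three weights are 0.33, 0.44 and 0.22, and they sum to 1.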
Further optionally, in this embodiment of the present invention, a specific implementation of step S3021 (arranging the fourth feature Chinese word segmentations in the fifth feature vocabulary in the second specified order of the second weight) may include: the text-processing unit sorts the fourth feature Chinese word segmentations in the fifth feature vocabulary in the second specified order of the second weight; the second specified order may include, but is not limited to: descending order, ascending order, etc.
Further optionally, in this embodiment of the present invention, a specific implementation of step S3022 may include: the text-processing unit judges whether the fifth feature vocabulary contains a fourth feature Chinese word segmentation identical to a third feature Chinese word segmentation.
Further optionally, in this embodiment of the present invention, a specific implementation of step S3023 (if the fifth feature vocabulary is judged to contain a fourth feature Chinese word segmentation identical to a third feature Chinese word segmentation, adjusting its second weight to the maximum of the second weights of all fourth feature Chinese word segmentations) may include: the text-processing unit judges that the fifth feature vocabulary contains one or more fourth feature Chinese word segmentations identical to a third feature Chinese word segmentation. If there is only one such fourth feature Chinese word segmentation, its second weight is directly set to the maximum of the second weights of all fourth feature Chinese word segmentations; if there are several, the second weights of all of them are set to that maximum. For example, the fourth feature Chinese word segmentations are "free trade zone" (second weight 0.43), "patent" (second weight 0.17), "trademark" (second weight 0.28) and "copyright" (second weight 0.12), and the one identical to a third feature Chinese word segmentation is "patent" (second weight 0.17); the second weight of "patent" is then set to 0.48, so that it ranks highest (0.48 > 0.43 > 0.28 > 0.12).
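The adjustment in step S3023 can be sketched as below. The text states that the adjusted weight is the maximum of all second weights, while the worked example assigns 0.48, which is above the previous maximum of 0.43; the hypothetical `margin` parameter is an assumption introduced here only to cover both readings. The function names are likewise illustrative.

```python
def boost_title_segments(weights, title_segments, margin=0.0):
    """Raise the weight of every segmentation that also occurs in the
    title to the current maximum weight (plus an optional margin)."""
    if not weights:
        return {}
    top = max(weights.values()) + margin
    return {seg: (top if seg in title_segments else w)
            for seg, w in weights.items()}


def rank_descending(weights):
    """Sort segmentations by weight, highest first (ties by name)."""
    return sorted(weights, key=lambda seg: (-weights[seg], seg))


weights = {"free trade zone": 0.43, "patent": 0.17,
           "trademark": 0.28, "copyright": 0.12}
literal = boost_title_segments(weights, {"patent"})               # margin 0
as_example = boost_title_segments(weights, {"patent"}, margin=0.05)
order = rank_descending(as_example)
```

With `margin=0.05` the segmentation "patent" moves to the top of the rearranged fifth feature vocabulary, which corresponds to step S3024.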
Further optionally, in this embodiment of the present invention, a specific implementation of step S3024 (rearranging the fourth feature Chinese word segmentations in the fifth feature vocabulary in the second specified order of the adjusted second weight) may include: the text-processing unit re-sorts according to the second weights updated in step S3023, obtaining the updated fifth feature vocabulary.
Further optionally, in this embodiment of the present invention, a specific implementation of step S3025 (dividing the text data into multiple simple sentences according to specified punctuation marks) may include: the text-processing unit divides the text data into simple sentences according to the specified punctuation marks, obtaining multiple first simple sentences; the specified punctuation marks may include, but are not limited to: full stop, semicolon, etc.
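A minimal sketch of the sentence division in step S3025, assuming the specified punctuation marks are passed in as a string of single characters. The function name and the default marks (full-width full stop and semicolon) are assumptions for illustration.

```python
import re


def split_simple_sentences(text, marks="。；"):
    """Split text into simple sentences at any of the given marks and
    drop empty pieces. Note: adding an ASCII '.' to marks would also
    split decimal numbers such as 0.33."""
    pattern = "[" + re.escape(marks) + "]"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


chinese = split_simple_sentences("今天下雨。明天晴；后天阴")
english = split_simple_sentences("A is B. C is D; E", marks=".;")
```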
Further optionally, in this embodiment of the present invention, a specific implementation of step S3026 (extracting, from the multiple simple sentences, the first simple sentences whose clause type is the statement clause) may include: the text-processing unit extracts, from the first simple sentences divided in step S3025, those whose clause type is the statement clause.
Further optionally, in this embodiment of the present invention, a specific implementation of step S3027 (generating the sentence list according to the extracted first simple sentences whose clause type is the statement clause) may include: the text-processing unit generates the sentence list according to the first simple sentences, extracted in step S3026, whose clause type is the statement clause.
Further optionally, in this embodiment of the present invention, step S3028 judges whether the first simple sentence includes the fourth feature Chinese word segmentation.
Further optionally, in this embodiment of the present invention, a specific implementation of step S3029 (if the first simple sentence is judged to include the fourth feature Chinese word segmentation, determining the third weight of the first simple sentence) may include: the text-processing unit judges that a first simple sentence includes a fourth feature Chinese word segmentation and calculates the third weight of each such first simple sentence; the third weight is determined by the sum of the second weights of the fourth feature Chinese word segmentations included in the first simple sentence and the length of the first simple sentence.
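The text says only that the third weight is determined by the sum of the second weights and the sentence length; it does not give the formula. The sketch below assumes the ratio sum / length, counts each distinct feature segmentation once, and measures length in segmentations rather than characters. All of these choices, and the function name, are assumptions for illustration.

```python
def third_weight(sentence_segments, second_weights):
    """Assumed formula: sum of the second weights of the distinct
    feature segmentations found in the sentence, divided by the
    sentence length (number of segmentations)."""
    if not sentence_segments:
        return 0.0
    found = set(sentence_segments) & set(second_weights)
    # A sentence without any feature segmentation gets weight zero,
    # which matches step S3030.
    return sum(second_weights[seg] for seg in found) / len(sentence_segments)


second = {"patent": 0.5, "trademark": 0.25}
with_features = third_weight(["the", "patent", "covers", "trademark"], second)
without_features = third_weight(["hello", "world"], second)
```

Dividing by the length keeps long sentences from outranking short ones merely because they contain more words.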
Further optionally, in this embodiment of the present invention, a specific implementation of step S3030 (if the first simple sentence is judged not to include the fourth feature Chinese word segmentation, determining that the third weight of the first simple sentence is zero) may include: the text-processing unit judges that the first simple sentence does not include the fourth feature Chinese word segmentation and sets the third weight of that first simple sentence to zero.
Further optionally, in this embodiment of the present invention, a specific implementation of step S3031 may include: the text-processing unit arranges the first simple sentences in the sentence list in the third specified order of the third weight; the third specified order may include, but is not limited to: descending order, ascending order, etc.
Further optionally, in this embodiment of the present invention, a specific implementation of step S3032 may include: the text-processing unit determines the quantity reference value of summary simple sentences according to the second predetermined coefficient, the number of simple sentences in the text data and the number of words of the text. The second predetermined coefficient can be preset. For example, if the second predetermined coefficient is 200, the number of simple sentences is 18 and the number of words of the text is 242, the quantity reference value of summary simple sentences is determined to be 14.9.
Further optionally, in this embodiment of the present invention, a specific implementation of step S3033 (determining the quantity N2 of summary simple sentences according to the quantity reference value, N2 being an integer greater than or equal to 1) may include: the text-processing unit rounds the quantity reference value of summary simple sentences to the nearest integer to determine the quantity N2 of summary simple sentences.
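The text does not state how the quantity reference value is computed. One formula that reproduces the worked example (coefficient 200, 18 simple sentences, 242 words, reference value about 14.9) is coefficient × sentence count / word count; the sketch below assumes that formula, which is an inference from the numbers and not something the source confirms. The function name is hypothetical.

```python
import math


def summary_sentence_count(second_coeff, sentence_count, word_count):
    """Return (reference value, N2) under the assumed formula
    reference = second_coeff * sentence_count / word_count."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    reference = second_coeff * sentence_count / word_count
    # Round half up, and keep N2 at 1 or more as the method requires.
    n2 = max(1, math.floor(reference + 0.5))
    return reference, n2


reference, n2 = summary_sentence_count(200, 18, 242)
```

`math.floor(x + 0.5)` is used instead of the built-in `round`, because `round` rounds halves to the nearest even integer.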
Further optionally, in this embodiment of the present invention, step S3034, which determines the first simple sentences whose third weight ranks in the top N2 of the sentence list as the text snippet, can specifically be realized through the following steps, as shown in Fig. 3b:
Step S30341, extract the first simple sentences whose third weight ranks in the top N2 of the sentence list.
Step S30342, arrange the top-N2 first simple sentences in the order in which they appear in the text data.
Step S30343, determine the arranged top-N2 first simple sentences as the text snippet.
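Steps S30341 to S30343 can be sketched as follows: pick the N2 highest-weighted sentences, then restore their original order. The function name and the parallel-list input format are assumptions for illustration.

```python
def select_summary(sentences, third_weights, n2):
    """sentences: simple sentences in document order.
    third_weights: third weight of each sentence, same order."""
    # Rank sentence indices by weight, highest first (ties keep the
    # earlier sentence first).
    ranked = sorted(range(len(sentences)),
                    key=lambda i: (-third_weights[i], i))
    # Keep the top N2 and put them back into document order.
    chosen = sorted(ranked[:n2])
    return [sentences[i] for i in chosen]


summary = select_summary(["a", "b", "c", "d"], [0.3, 0.1, 0.4, 0.0], n2=2)
```

Here sentence "c" has the highest weight, but the snippet still reads "a" then "c", following the order of appearance in the text data.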
Text snippet acquisition methods shown in Fig. 3 a, Fig. 3 b, by obtaining the text data in file destination;Judge described
Whether summary keyword is included in text data, and the summary keyword is used to indicate the text snippet institute in the text data
Position;If judging, the text data includes the summary keyword, count it is described summary keyword where
The number of words of text fragment;The number of words of statistics and the first predetermined threshold value are compared;If the number of words is less than described first
Predetermined threshold value, then be defined as text snippet by the text fragment where the summary keyword.Text shown in Fig. 3 a, Fig. 3 b is plucked
Acquisition methods are wanted, text snippet can quickly be recognized according to summary keyword, the first text fragment, the first simple sentence, improve text
Summary identification and efficiency, the precision and comprehensive obtained.
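The keyword branch summarized above (steps S301 to S304 and the final determination) can be sketched as follows. The keyword list, the threshold value and the function name are assumptions for illustration, and the number of words is approximated by the character count of the paragraph.

```python
def find_keyword_summary(paragraphs, keywords=("摘要", "Abstract"),
                         max_words=300):
    """Return the first paragraph that contains a summary keyword and
    is shorter than the threshold, or None if there is none."""
    for paragraph in paragraphs:
        has_keyword = any(keyword in paragraph for keyword in keywords)
        if has_keyword and len(paragraph) < max_words:
            return paragraph
    return None


paragraphs = ["Title", "Abstract: a short overview.", "Body text follows."]
snippet = find_keyword_summary(paragraphs)
```

When the function returns None, the method would fall through to the first-fragment and sentence-weighting branches described in steps S305 onward.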
Referring to Fig. 4, Fig. 4 is a structural diagram of a text snippet acquisition device disclosed in an embodiment of the present invention, used to execute the text snippet acquisition method disclosed in the embodiments of the present invention. In this embodiment, the device may include, but is not limited to: a smart phone, a PC, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA), a media player, a wearable portable device, etc. As shown in Fig. 4, the text snippet acquisition device may specifically include:
Acquiring unit 401, configured to obtain the text data in a target file;
Judging unit 402, configured to judge whether the text data includes a summary keyword, the summary keyword being used to indicate the position of the text snippet in the text data;
Statistic unit 403, configured to count the number of words of the text fragment where the summary keyword is located when the judging unit judges that the text data includes the summary keyword;
Comparing unit 404, configured to compare the number of words counted by the statistic unit with the first predetermined threshold;
Determining unit 405, configured to determine the text fragment where the summary keyword is located as the text snippet when the comparing unit finds that the number of words is less than the first predetermined threshold.
In this embodiment of the present invention, the above text snippet acquisition device may include a mobile device, a tablet computer, a smart ring, a smart watch, a smart home device, a wearable device, etc.
The text snippet acquisition device shown in Fig. 4 obtains the text data in a target file; judges whether the text data includes a summary keyword, the summary keyword being used to indicate the position of the text snippet in the text data; if the text data is judged to include the summary keyword, counts the number of words of the text fragment where the summary keyword is located; compares the counted number of words with the first predetermined threshold; and, if the number of words is less than the first predetermined threshold, determines the text fragment where the summary keyword is located as the text snippet. The text snippet acquisition device shown in Fig. 4 can recognize the text snippet quickly from the summary keyword, which improves the efficiency, precision and comprehensiveness of text snippet recognition and acquisition.
Those of ordinary skill in the art will appreciate that all or part of the steps in the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium can include: a flash disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, etc.
The above disclosure is only a preferred embodiment of the present invention and certainly cannot be used to limit the scope of rights of the present invention; equivalent variations made according to the claims of the present invention therefore still fall within the scope covered by the present invention.
Claims (8)
1. A text snippet acquisition method, characterized by comprising:
obtaining the text data in a target file;
judging whether the text data includes a summary keyword, the summary keyword being used to indicate the position of the text snippet in the text data;
if the text data is judged to include the summary keyword, counting the number of words of the text fragment where the summary keyword is located;
comparing the counted number of words with a first predetermined threshold;
if the number of words is less than the first predetermined threshold, determining the text fragment where the summary keyword is located as the text snippet;
wherein, after the judging whether the text data includes a summary keyword and before the text data is judged to include the summary keyword, the method further comprises:
if the text data is judged not to include the summary keyword, extracting the first feature Chinese word segmentations in the first text fragment of the text data;
generating a first feature vocabulary according to the first feature Chinese word segmentations;
extracting the second feature Chinese word segmentations in the text data;
generating a second feature vocabulary according to the second feature Chinese word segmentations;
determining a first weight of the second feature Chinese word segmentations, the first weight being determined by the occurrence count of the single feature Chinese word segmentation among the second feature Chinese word segmentations and the sum of the occurrence counts of all feature Chinese word segmentations among the second feature Chinese word segmentations;
arranging the second feature Chinese word segmentations in the second feature vocabulary in a first specified order of the first weight;
extracting the second feature Chinese word segmentations whose first weight ranks in the top N1 of the second feature vocabulary, N1 being determined by the number of first feature Chinese word segmentations and a first predetermined coefficient, and being an integer greater than or equal to 1;
generating a third feature vocabulary according to the second feature Chinese word segmentations whose first weight ranks in the top N1 of the second feature vocabulary;
determining the goodness of fit between the first feature vocabulary and the third feature vocabulary;
comparing the goodness of fit with a second predetermined threshold;
if the goodness of fit is greater than or equal to the second predetermined threshold, determining the first text fragment as the text snippet.
2. The method according to claim 1, characterized in that the determining the goodness of fit between the first feature vocabulary and the third feature vocabulary comprises:
counting the number of first feature Chinese word segmentations in the first feature vocabulary that are identical to second feature Chinese word segmentations in the third feature vocabulary, and counting the total number of first feature Chinese word segmentations in the first feature vocabulary;
calculating the goodness of fit between the first feature vocabulary and the third feature vocabulary according to the counted number and the total number of first feature Chinese word segmentations.
3. The method according to claim 2, characterized in that, after the comparing the goodness of fit with the second predetermined threshold and before the goodness of fit is found to be greater than or equal to the second predetermined threshold, the method further comprises:
if the goodness of fit is less than the second predetermined threshold, rejecting the stop words and everyday words in the text data;
Extract the third feature Chinese in the text header for eliminating the stop words and the text data of the everyday words
Participle;
According to the third feature Chinese word segmentation, fourth feature vocabulary is generated;
Extract in the text data for eliminating the stop words and the everyday words except in the data of the text header
Fourth feature Chinese word segmentation;
Fifth feature vocabulary is generated according to the fourth feature Chinese word segmentation;
The second weight of the fourth feature Chinese word segmentation is determined, second weight is in the fourth feature Chinese word segmentation
All feature Chinese word segmentation occurrence numbers in the occurrence number of single feature Chinese word segmentation and the fourth feature Chinese word segmentation
Total degree determine;
Order is specified to arrange according to the second of second weight fourth feature Chinese word segmentation in the fifth feature vocabulary;
judging whether the fifth feature vocabulary contains a fourth feature Chinese word segmentation identical to a third feature Chinese word segmentation;
if the fifth feature vocabulary is judged to contain a fourth feature Chinese word segmentation identical to a third feature Chinese word segmentation, adjusting the second weight of that fourth feature Chinese word segmentation in the fifth feature vocabulary to the maximum of the second weights of all fourth feature Chinese word segmentations;
By the fourth feature Chinese word segmentation in the fifth feature vocabulary according to described in second weight after adjustment
Second specifies order to rearrange;
The text data is divided into by multiple simple sentences according to specified punctuation mark;
Extract first simple sentence of the clause in the multiple simple sentence for statement clause;
According to first simple sentence of the clause of extraction for statement clause, sentence list is generated;
Judge whether include the fourth feature Chinese word segmentation in first simple sentence;
If judging, first simple sentence includes the fourth feature Chinese word segmentation, it is determined that the 3rd power of first simple sentence
Weight, second weight for the fourth feature Chinese word segmentation that the 3rd weight is included in first simple sentence plus and
Determined with the length of first simple sentence;
If judging in first simple sentence not include the fourth feature Chinese word segmentation, it is determined that first simple sentence it is described
3rd weight is zero;
Order is specified to be arranged according to the 3rd of the 3rd weight first simple sentence in the sentence list;
determining a quantity reference value of summary simple sentences according to a second predetermined coefficient, the number of simple sentences in the text data and the number of words of the text;
According to the quantity reference value of the summary simple sentence, the quantity N2, the N2 for determining the summary simple sentence are more than or equal to 1
Integer;
First simple sentence of N2 is defined as text snippet before the 3rd weighted value in the sentence list is come.
4. method according to claim 3, it is characterised in that the 3rd weighted value by the sentence list
First simple sentence of N2, which is defined as text snippet, before coming includes:
Extract the 3rd weighted value in the sentence list and come first N2 first simple sentence;
The order that first simple sentence of N2 occurs in the text data before being come according to the 3rd weighted value, arrangement
3rd weighted value comes first N2 first simple sentence;
First simple sentence of N2 is defined as text snippet before the 3rd weighted value after arrangement is come.
5. a kind of text snippet acquisition device, it is characterised in that including:
Acquiring unit, for obtaining the text data in file destination;
Judging unit, for judging whether include summary keyword in the text data, the summary keyword is used to indicate
The position where text snippet in the text data;
Statistic unit, when judging that the text data includes the summary keyword for the judging unit, counts institute
State summary keyword where text fragment number of words;
Comparing unit, the number of words and the first predetermined threshold value for the statistic unit to be counted are compared;
Determining unit, configured to determine the text fragment where the summary keyword is located as the text snippet when the comparing unit finds that the number of words is less than the first predetermined threshold;
wherein the device further comprises:
Extraction unit, when judging not including the summary keyword in the text data for the judging unit, is extracted
The fisrt feature Chinese word segmentation in the first text fragment in the text data;
First generation unit, for generating fisrt feature vocabulary according to the fisrt feature Chinese word segmentation;
Second extraction unit, for extracting the second feature Chinese word segmentation in the text data;
Second generation unit, for generating second feature vocabulary according to the second feature Chinese word segmentation;
First determining unit, configured to determine a first weight of the second feature Chinese word segmentations, the first weight being determined by the occurrence count of the single feature Chinese word segmentation among the second feature Chinese word segmentations and the sum of the occurrence counts of all feature Chinese word segmentations among the second feature Chinese word segmentations;
First sequencing unit, it is single for the second feature Chinese word segmentation in the second feature vocabulary to be determined according to described first
The first of first weight that member is determined specifies order to arrange;
3rd extraction unit, first weight for extracting in the second feature vocabulary comes first N1 described second
Feature Chinese word segmentation, the N1 determines by quantity and the first predetermined coefficient of the fisrt feature Chinese word segmentation, and be more than or
Integer equal to 1;
3rd generation unit, for the second feature Chinese word segmentation extracted according to the 3rd extraction unit, generation the 3rd
Feature vocabulary;
Second determining unit, the goodness of fit for determining the fisrt feature vocabulary and the third feature vocabulary;
Second comparing unit, for the goodness of fit to be compared with the second predetermined threshold value;
3rd determining unit, the goodness of fit is compared more than or equal to the described second default threshold for second comparing unit
During value, then first text fragment is defined as text snippet.
6. The device according to claim 5, characterized in that the second determining unit further includes:
First statistic unit, for counting the quantity of first feature Chinese word segmentations in the first feature vocabulary that are identical to second feature Chinese word segmentations in the third feature vocabulary, and for counting the total quantity of first feature Chinese word segmentations in the first feature vocabulary;
Computing unit, for calculating the goodness of fit between the first feature vocabulary and the third feature vocabulary according to the quantity and the total quantity counted by the first statistic unit.
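A minimal sketch of the statistic and computing units follows. The claim states only that the goodness of fit is calculated from the matched quantity and the total quantity; taking their ratio is an assumption, and `goodness_of_fit` is a hypothetical name.

```python
def goodness_of_fit(first_vocabulary, third_vocabulary):
    """Share of first-vocabulary words that also occur in the third vocabulary.

    Assumption: the goodness of fit is matched / total. An empty first
    vocabulary is treated as a fit of 0.0 to avoid dividing by zero.
    """
    first_words = list(first_vocabulary)
    if not first_words:
        return 0.0
    third_words = set(third_vocabulary)
    matched = sum(1 for word in first_words if word in third_words)
    return matched / len(first_words)


fit = goodness_of_fit(["soy", "milk", "price"], ["soy", "price", "market"])
```

The returned value would then be compared with the second predetermined threshold.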
7. The device according to claim 6, characterized in that it further includes:
Culling unit, for removing stop words and everyday words from the text data when the second comparing unit finds that the goodness of fit is less than the second predetermined threshold;
Fourth extraction unit, for extracting third feature Chinese word segmentations from the text title of the text data from which the stop words and the everyday words have been removed;
Fourth generation unit, for generating a fourth feature vocabulary according to the third feature Chinese word segmentations;
Fourth extraction unit, for extracting fourth feature Chinese word segmentations from the data other than the text title in the text data from which the stop words and the everyday words have been removed;
Fifth generation unit, for generating a fifth feature vocabulary according to the fourth feature Chinese word segmentations;
Fourth determining unit, for determining a second weight of the fourth feature Chinese word segmentations extracted by the fourth extraction unit, the second weight being determined by the number of occurrences of a single feature Chinese word segmentation among the fourth feature Chinese word segmentations and the total number of occurrences of all feature Chinese word segmentations among the fourth feature Chinese word segmentations;
Second sequencing unit, for arranging the fourth feature Chinese word segmentations in the fifth feature vocabulary generated by the fifth generation unit in a second specified order of the second weight determined by the fourth determining unit;
First judging unit, for judging whether the fifth feature vocabulary contains a fourth feature Chinese word segmentation identical to the third feature Chinese word segmentation;
Adjustment unit, for adjusting, when the first judging unit judges that the fifth feature vocabulary contains a fourth feature Chinese word segmentation identical to the third feature Chinese word segmentation, the second weight of that fourth feature Chinese word segmentation to the maximum of the second weights of all the fourth feature Chinese word segmentations;
Third sequencing unit, for rearranging the fourth feature Chinese word segmentations in the fifth feature vocabulary in the second specified order of the second weight as adjusted by the adjustment unit;
Division unit, for dividing the text data into multiple simple sentences according to specified punctuation marks;
Fifth extraction unit, for extracting, from the multiple simple sentences divided by the division unit, the first simple sentences, which are the simple sentences of declarative sentence type;
Sixth generation unit, for generating a sentence list according to the declarative first simple sentences extracted by the fifth extraction unit;
Second judging unit, for judging whether the first simple sentence includes the fourth feature Chinese word segmentation;
Fifth determining unit, for determining a third weight of the first simple sentence when the second judging unit judges that the first simple sentence includes the fourth feature Chinese word segmentation, the third weight being determined by the sum of the second weights of the fourth feature Chinese word segmentations included in the first simple sentence and the length of the first simple sentence;
Sixth determining unit, for determining that the third weight of the first simple sentence is zero when the second judging unit judges that the first simple sentence does not include the fourth feature Chinese word segmentation;
Fourth sequencing unit, for arranging the first simple sentences in the sentence list in a third specified order of the third weight;
Seventh determining unit, for determining a quantity reference value of summary simple sentences according to a second predetermined coefficient, the quantity of the multiple simple sentences in the text data, and the number of words in the text;
Eighth determining unit, for determining the quantity N2 of the summary simple sentences according to the quantity reference value of the summary simple sentences, N2 being an integer greater than or equal to 1;
Ninth determining unit, for determining the first simple sentences whose third weight ranks in the top N2 in the sentence list as the text snippet.
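The weight adjustment and sentence scoring recited in this claim can be sketched as below. The claim does not give the formula that combines the weight sum with the sentence length, nor the formula for N2, so dividing the sum by the token count is an assumption and N2 is simply passed in as a parameter. All function names are hypothetical, and sentences are assumed to be already segmented into word lists.

```python
def boost_title_words(second_weights, title_words):
    """Raise every body word that also appears in the title to the maximum weight."""
    boosted = dict(second_weights)
    if not boosted:
        return boosted
    top_weight = max(boosted.values())
    for word in title_words:
        if word in boosted:
            boosted[word] = top_weight
    return boosted


def sentence_weight(sentence_words, weights):
    """Assumed third weight: sum of feature-word weights divided by sentence length.

    A sentence with no feature word gets weight zero, as the claim requires.
    """
    if not sentence_words:
        return 0.0
    total = sum(weights[word] for word in sentence_words if word in weights)
    return total / len(sentence_words)


def select_top_sentences(sentences, weights, n2):
    """Return the indices of the N2 highest-weighted sentences, best first."""
    scored = [(sentence_weight(words, weights), index)
              for index, words in enumerate(sentences)]
    # Assumed "third specified order": descending weight, earlier sentence wins ties.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in ranked[:max(1, n2)]]


weights = boost_title_words({"soy": 0.5, "milk": 0.3, "price": 0.2}, ["price"])
sentences = [["soy", "milk", "rose"], ["weather", "fine"], ["price", "soy"]]
top_indices = select_top_sentences(sentences, weights, n2=2)
```

Returning indices rather than sentence text keeps the original positions available for any later reordering.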
8. The device according to claim 7, characterized in that the ninth determining unit further includes:
Sixth extraction unit, for extracting the first simple sentences whose third weight ranks in the top N2 in the sentence list;
Fifth sequencing unit, for arranging the first simple sentences extracted by the sixth extraction unit in the order in which those top N2 first simple sentences appear in the text data;
Tenth determining unit, for determining the arranged top N2 first simple sentences as the text snippet.
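The reordering step of this claim amounts to restoring document order among the selected sentences. A minimal sketch, assuming the selected sentences are identified by their positions in the original text (the function name and the index-based representation are hypothetical):

```python
def assemble_snippet(sentences, top_indices):
    """Emit the selected sentences in the order they appear in the text."""
    return [sentences[index] for index in sorted(top_indices)]


snippet = assemble_snippet(
    ["Soy prices rose.", "The weather was fine.", "Milk prices fell."],
    top_indices=[2, 0],  # ranked by weight, best first
)
```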
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410850654.3A (CN104615654B) | 2014-12-30 | 2014-12-30 | A kind of text snippet acquisition methods and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104615654A | 2015-05-13 |
CN104615654B | 2017-09-22 |
Family
ID=53150098
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201410850654.3A (CN104615654B) | Expired - Fee Related | 2014-12-30 | 2014-12-30 | A kind of text snippet acquisition methods and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104615654B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111694947A (en) * | 2020-06-15 | 2020-09-22 | 中国银行股份有限公司 | Text abstract display method, text abstract display device, storage medium and equipment |
CN112199499A (en) * | 2020-09-29 | 2021-01-08 | 京东方科技集团股份有限公司 | Text division method, text classification method, device, equipment and storage medium |
CN114115670A (en) * | 2021-07-30 | 2022-03-01 | 荣耀终端有限公司 | Method for prompting generation of text abstract and method and device for generating text abstract |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6104990A (en) * | 1998-09-28 | 2000-08-15 | Prompt Software, Inc. | Language independent phrase extraction |
CN101458718A (en) * | 2009-01-05 | 2009-06-17 | 北京大学 | Search engine dynamic summarization extracting method |
CN104156452A (en) * | 2014-08-18 | 2014-11-19 | 中国人民解放军国防科学技术大学 | Method and device for generating webpage text summarization |
Non-Patent Citations (1)
Title |
---|
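"文本自动标引方法研究与实现" [Research and Implementation of Automatic Text Indexing Methods]; 马娟 (Ma Juan); 《中国优秀硕士学位论文全文数据库 信息科技辑》 [China Master's Theses Full-text Database, Information Science and Technology]; 2010-03-15 (No. 03); thesis p. 46, Section 5.1.2 * |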
Also Published As
Publication number | Publication date |
---|---|
CN104615654A (en) | 2015-05-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105893478B (en) | A kind of tag extraction method and apparatus | |
Gu et al. | "What parts of your apps are loved by users?" (T) | |
CN103336766B (en) | Short text garbage identification and modeling method and device | |
Gao et al. | Improving user profile with personality traits predicted from social media content | |
CN108009228A (en) | A kind of method to set up of content tab, device and storage medium | |
CN106970988A (en) | Data processing method, device and electronic equipment | |
CN110134792B (en) | Text recognition method and device, electronic equipment and storage medium | |
CN107766318A (en) | Keyword extraction method and device and electronic equipment | |
CN110737768A (en) | Text abstract automatic generation method and device based on deep learning and storage medium | |
CN109710947A (en) | Power specialty word stock generating method and device | |
CN110334110A (en) | Natural language classification method, device, computer equipment and storage medium | |
AU2019389172A1 (en) | Systems and methods for identifying an event in data | |
CN104615654B (en) | A kind of text snippet acquisition methods and device | |
KR102296931B1 (en) | Real-time keyword extraction method and device in text streaming environment | |
CN103577452A (en) | Website server and method and device for enriching content of website | |
Jmal et al. | Customer review summarization approach using twitter and sentiwordnet | |
CN105843796A (en) | Microblog emotional tendency analysis method and device | |
CN106066867B (en) | A kind of method and device for extracting abstract | |
CN108549697A (en) | Information-pushing method, device, equipment based on semantic association and storage medium | |
Tkachenko et al. | Named entity recognition in estonian | |
CN109947934A (en) | For the data digging method and system of short text | |
Almuqren et al. | Framework for sentiment analysis of Arabic text | |
CN106897290A (en) | A kind of method and device for setting up keyword models | |
Yadav et al. | A comparative study of deep learning methods for hate speech and offensive language detection in textual data | |
CN115034300A (en) | Classification model training method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20170922; Termination date: 20171230 |