CN104615654B

CN104615654B - Text summary acquisition method and device

Info

Publication number: CN104615654B
Application number: CN201410850654.3A
Authority: CN
Inventors: 李慧; 赵瑞龙; 韦正云; 黄茂松; 郭维; 肖国彪; 燕青; 冯烨; 陈维; 吴汉章; 郭伟
Original assignee: Beijing Asialnfo Smart Data Technology Co ltd; China United Network Communications Corp Ltd Guangdong Branch
Current assignee: Beijing Asialnfo Smart Data Technology Co ltd; China United Network Communications Corp Ltd Guangdong Branch
Priority date: 2014-12-30
Filing date: 2014-12-30
Publication date: 2017-09-22
Anticipated expiration: 2034-12-30
Also published as: CN104615654A

Abstract

An embodiment of the invention discloses a text summary acquisition method. The method includes: obtaining the text data in a target file; judging whether the text data includes a summary keyword, the summary keyword being used to indicate the position of the text summary in the text data; if the text data includes the summary keyword, counting the number of words in the text paragraph where the summary keyword is located; comparing the counted number of words with a first preset threshold; and, if the number of words is less than the first preset threshold, determining the text paragraph where the summary keyword is located to be the text summary. By implementing the embodiment, the text summary can be recognized quickly according to the summary keyword, which improves the efficiency, precision and comprehensiveness of text summary recognition and acquisition.

Description

Text summary acquisition method and device

Technical field

The present invention relates to the field of communication technology, and in particular to a text summary acquisition method and device.

Background

The amount of text on the Internet is enormous and its content is diverse, and a large number of web pages share the same subject matter. Obtaining key information from text quickly and effectively therefore requires text summary acquisition technology. Accurately and rapidly extracting the key information in a document is the core of a text summary acquisition method and is of great practical significance. Current text summary acquisition methods are mainly of two kinds. One analyzes the key terms in the web page text, counts term frequencies to calculate the weight of each sentence, and obtains the web page text summary according to the weights. The other divides the text into several semantic sections, measures the importance of each sentence within its semantic section, extracts the more important sentences as the representative sentences of that section, and composes the document summary from the representative sentences.

Of these methods, the former merely pieces several sentences together mechanically, so the coherence and logic of the sentences are not guaranteed; the latter selects only the central sentences of the semantic sections to form the summary, so the semantic transitions may be incoherent.

Summary of the invention

The embodiments of the present invention provide a text summary acquisition method and device that can quickly recognize the text summary according to a summary keyword, improving the efficiency, precision and comprehensiveness of text summary recognition and acquisition.

An embodiment of the invention discloses a text summary acquisition method, including:

obtaining the text data in a target file;

judging whether the text data includes a summary keyword, the summary keyword being used to indicate the position of the text summary in the text data;

if it is judged that the text data includes the summary keyword, counting the number of words in the text paragraph where the summary keyword is located;

comparing the counted number of words with a first preset threshold;

if the comparison shows that the number of words is less than the first preset threshold, determining the text paragraph where the summary keyword is located to be the text summary.

Correspondingly, an embodiment of the invention also discloses a text summary acquisition device, including:

an acquiring unit, configured to obtain the text data in a target file;

a judging unit, configured to judge whether the text data includes a summary keyword, the summary keyword being used to indicate the position of the text summary in the text data;

a counting unit, configured to count the number of words in the text paragraph where the summary keyword is located when the judging unit judges that the text data includes the summary keyword;

a comparing unit, configured to compare the number of words counted by the counting unit with a first preset threshold;

a determining unit, configured to determine the text paragraph where the summary keyword is located to be the text summary when the comparing unit finds that the number of words is less than the first preset threshold.

In the embodiments of the present invention, the text data in a target file is obtained; it is judged whether the text data includes a summary keyword, the summary keyword being used to indicate the position of the text summary in the text data; if the text data includes the summary keyword, the number of words in the text paragraph where the summary keyword is located is counted; the counted number of words is compared with a first preset threshold; and if the number of words is less than the first preset threshold, the text paragraph where the summary keyword is located is determined to be the text summary. By implementing the embodiments of the present invention, the text summary can be recognized quickly according to the summary keyword, which improves the efficiency, precision and comprehensiveness of text summary recognition and acquisition.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a text summary acquisition method disclosed in the first embodiment of the invention;

Fig. 2a is a schematic flowchart of a text summary acquisition method disclosed in the second embodiment of the invention;

Fig. 2b is a schematic flowchart of a text summary acquisition method disclosed in the third embodiment of the invention;

Fig. 3a is a schematic flowchart of a text summary acquisition method disclosed in the fourth embodiment of the invention;

Fig. 3b is a schematic flowchart of a text summary acquisition method disclosed in the fifth embodiment of the invention;

Fig. 4 is a schematic structural diagram of a text summary acquisition device disclosed in an embodiment of the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the invention.

An embodiment of the invention discloses a text summary acquisition method. The method includes: obtaining the text data in a target file; judging whether the text data includes a summary keyword, the summary keyword being used to indicate the position of the text summary in the text data; if the text data includes the summary keyword, counting the number of words in the text paragraph where the summary keyword is located; comparing the counted number of words with a first preset threshold; and, if the number of words is less than the first preset threshold, determining the text paragraph where the summary keyword is located to be the text summary. The embodiment can quickly recognize the text summary according to the summary keyword, improving the efficiency, precision and comprehensiveness of text summary recognition and acquisition.

The technical solutions of the embodiments of the present invention are described in detail below with reference to the drawings and embodiments.

Referring to Fig. 1, Fig. 1 is a schematic flowchart of a text summary acquisition method disclosed in the first embodiment of the invention. As shown in Fig. 1, the text summary acquisition method of this embodiment may include the following steps:

Step S101: obtain the text data in a target file.

Step S102: judge whether the text data includes a summary keyword, the summary keyword being used to indicate the position of the text summary in the text data.

Step S103: if it is judged that the text data includes the summary keyword, count the number of words in the text paragraph where the summary keyword is located.

Step S104: compare the counted number of words with a first preset threshold.

Step S105: if the comparison shows that the number of words is less than the first preset threshold, determine the text paragraph where the summary keyword is located to be the text summary.

In this embodiment, the text summary acquisition method can be applied to a terminal with text processing capability, or to various document servers. A document server can process the text data in the target file to obtain the text summary and then forward the text summary and the text data to the display screen of a terminal device, which outputs and displays the text summary and the content of the text data. The terminal may include, but is not limited to, a mobile device, a notebook computer, a tablet computer, a smart device, a wearable device, and so on. The terminal may run an operating system such as Symbian, Android, Windows or iOS (the operating system developed by Apple Inc.). The embodiment does not specifically limit the application scenario of the text summary acquisition method.

Further optionally, in this embodiment, a specific implementation of step S101 (obtaining the text data in the target file) may include the following. A text processing unit with text processing capability obtains the text data in the target file through a wired or wireless communication link. The target file may be a file on a static storage device such as the terminal's hard disk, a file stored in a network database, or a file stored on a mobile device. The content of the target file may include text, pictures, video, and so on. The text processing unit is mainly used to obtain the text data in the target file; for data such as pictures and video, image recognition, video analysis and similar techniques may be used to convert the non-text data in the target file into text data corresponding to its content, so that the content of the target file is analyzed more comprehensively when the text summary is extracted. For example, the text processing unit extracts from a network database a target file containing text and a picture, specifically 200 words of text and 1 picture carrying 20 words; the text processing unit recognizes the 20 words in the picture and thus obtains 220 words of content from the target file in total.

Further optionally, in this embodiment, a specific implementation of step S102 (judging whether the text data includes a summary keyword that indicates the position of the text summary in the text data) may include the following. The text processing unit scans the text data of the target file and judges whether it includes a summary keyword. The summary keyword may include, but is not limited to: "document summary", "summary", "core notice", and so on. The text paragraph where the summary keyword is located is generally the summary paragraph of the target file.

Further optionally, in this embodiment, a specific implementation of step S103 (counting the number of words in the text paragraph where the summary keyword is located when the text data includes the summary keyword) may include the following. After judging that the text data of a target file includes a summary keyword, the text processing unit also counts the number of words in the text paragraph where the summary keyword is located, so that it can further determine from that word count whether the paragraph is the summary paragraph.

Further optionally, in this embodiment, a specific implementation of step S104 (comparing the counted number of words with the first preset threshold) may include the following. After obtaining the number of words counted in step S103, the text processing unit compares it with the first preset threshold, which is a reference word count set in advance by the user. For example, if the first preset threshold is set to 200, the text processing unit compares the counted number of words with the preset value 200.

Further optionally, in this embodiment, a specific implementation of step S105 (determining the text paragraph where the summary keyword is located to be the text summary when the number of words is less than the first preset threshold) may include the following. When the text processing unit finds that the number of words counted in step S103 is less than the first preset threshold, it determines that the text paragraph where the summary keyword is located is the text summary of the target file.

In the text summary acquisition method shown in Fig. 1, the text data in a target file is obtained; it is judged whether the text data includes a summary keyword, the summary keyword being used to indicate the position of the text summary in the text data; if the text data includes the summary keyword, the number of words in the text paragraph where the summary keyword is located is counted; the counted number of words is compared with a first preset threshold; and if the number of words is less than the first preset threshold, the text paragraph where the summary keyword is located is determined to be the text summary. Implementing the method shown in Fig. 1 makes it possible to recognize the text summary quickly according to the summary keyword, improving the efficiency, precision and comprehensiveness of text summary recognition and acquisition.
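Steps S101 to S105 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `find_summary_by_keyword`, the English keyword list, and the use of whitespace-separated tokens as the "number of words" are all assumptions made for illustration (for Chinese text, a character count would be the more natural measure). The default threshold of 200 is taken from the example given for step S104.

```python
# Hypothetical keyword list; the source names "document summary", "summary"
# and "core notice" as examples of summary keywords.
SUMMARY_KEYWORDS = ("document summary", "summary", "core notice")


def find_summary_by_keyword(paragraphs, keywords=SUMMARY_KEYWORDS, first_threshold=200):
    """Return the first paragraph that contains a summary keyword and whose
    word count is below first_threshold, or None if no paragraph qualifies."""
    for paragraph in paragraphs:
        lowered = paragraph.lower()
        # Step S102: does this paragraph contain a summary keyword?
        if any(keyword in lowered for keyword in keywords):
            # Step S103: count words (assumption: whitespace-separated tokens).
            word_count = len(paragraph.split())
            # Steps S104/S105: accept only if shorter than the threshold.
            if word_count < first_threshold:
                return paragraph
    return None


paragraphs = [
    "Title of the report",
    "Summary: this paper describes a keyword based method.",
    "Body text of the report follows here.",
]
print(find_summary_by_keyword(paragraphs))
```

Returning `None` leaves room for a fallback path, such as the first-paragraph analysis of the second embodiment, when no keyword paragraph is found.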

Referring to Fig. 2a, Fig. 2a is a schematic flowchart of a text summary acquisition method disclosed in the second embodiment of the invention. As shown in Fig. 2a, the text summary acquisition method of this embodiment may include the following steps:

Step S201: obtain the text data in a target file.

Step S202: judge whether the text data includes a summary keyword, the summary keyword being used to indicate the position of the text summary in the text data.

Step S203: if it is judged that the text data includes the summary keyword, count the number of words in the text paragraph where the summary keyword is located.

Step S204: compare the counted number of words with a first preset threshold.

Step S205: if it is judged that the text data does not include the summary keyword, extract the first feature Chinese word segments in the first text paragraph of the text data.

Step S206: generate a first feature vocabulary according to the first feature Chinese word segments.

Step S207: extract the second feature Chinese word segments in the text data.

Step S208: generate a second feature vocabulary according to the second feature Chinese word segments.

Step S209: determine a first weight for each second feature Chinese word segment, the first weight being determined by the occurrence count of that single feature Chinese word segment and the sum of the occurrence counts of all the second feature Chinese word segments.

Step S2010: arrange the second feature Chinese word segments in the second feature vocabulary in a first specified order of the first weight.

Step S2011: extract the second feature Chinese word segments whose first weight ranks in the top N1 of the second feature vocabulary, N1 being determined by the number of first feature Chinese word segments and a first preset coefficient, and being an integer greater than or equal to 1.

Step S2012: generate a third feature vocabulary from the second feature Chinese word segments whose first weight ranks in the top N1 of the second feature vocabulary.

Step S2013: determine the goodness of fit between the first feature vocabulary and the third feature vocabulary.

Step S2014: compare the goodness of fit with a second preset threshold.

Step S2015: if the comparison shows that the goodness of fit is greater than or equal to the second preset threshold, determine the first text paragraph to be the text summary.

Further optionally, in this embodiment, steps S201 to S204 are the same as steps S101 to S104 in the first embodiment and are not described again here.

Further optionally, in this embodiment, a specific implementation of step S205 (extracting the first feature Chinese word segments in the first text paragraph of the text data when the text data does not include the summary keyword) may include the following. The text processing unit judges that the text data of the target file does not include a summary keyword. Most texts give a general description of their content in the first text paragraph, that is, the summary usually appears in the first paragraph of the text. The text processing unit therefore analyzes the first text paragraph of the text data first and extracts multiple first feature Chinese word segments from it. The first feature Chinese word segments may include, but are not limited to, nouns, verbs, and so on.

Further optionally, in this embodiment, a specific implementation of step S206 (generating the first feature vocabulary according to the first feature Chinese word segments) may include the following. The text processing unit generates the first feature vocabulary from the first feature Chinese word segments extracted in step S205, together with the occurrence count of each first feature Chinese word segment and its order of appearance in the text data.

Further optionally, in this embodiment, a specific implementation of step S207 (extracting the second feature Chinese word segments in the text data) may include the following. After generating the first feature vocabulary, the text processing unit further extracts the second feature Chinese word segments from the whole text data. Obviously, the second feature Chinese word segments also include the first feature Chinese word segments extracted in step S205. Because the extraction covers the text data of the whole target file, the number of second feature Chinese word segments, the occurrence count of each, and their order of appearance will generally differ from those in the first feature vocabulary.

Further optionally, in this embodiment, a specific implementation of step S208 (generating the second feature vocabulary according to the second feature Chinese word segments) may include the following. The text processing unit obtains the second feature Chinese word segments extracted in step S207 and generates the second feature vocabulary according to information such as their number, occurrence counts and order of appearance.

Further optionally, in this embodiment, a specific implementation of step S209 (determining the first weight of the second feature Chinese word segments) may include the following. After generating the second feature vocabulary, the text processing unit determines the weight of each second feature Chinese word segment from its own occurrence count and the sum of the occurrence counts of all the second feature Chinese word segments. For example, the second feature vocabulary includes the second feature Chinese word segments "soya-bean milk" (5 occurrences), "soya bean" (4 occurrences) and "grinding" (4 occurrences), so the sum of their occurrence counts is 13 (5 + 4 + 4 = 13). The text processing unit can then determine that the first weight of "soya-bean milk" is about 0.38 (5/13), the first weight of "soya bean" is about 0.31 (4/13), and the first weight of "grinding" is about 0.31.
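The weight calculation of step S209 can be sketched in Python as follows. The function name `first_weights` is hypothetical, and the sketch assumes the feature words have already been extracted into a flat list of tokens; word segmentation and part-of-speech filtering are outside its scope.

```python
from collections import Counter


def first_weights(tokens):
    """Map each feature word to (its occurrence count) / (sum of all counts)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# The example from the text: 5 + 4 + 4 = 13 occurrences in total.
tokens = ["soya-bean milk"] * 5 + ["soya bean"] * 4 + ["grinding"] * 4
weights = first_weights(tokens)
for word, weight in weights.items():
    print(f"{word}: {weight:.4f}")
```

Because each weight is a share of the total occurrence count, the weights of all feature words sum to 1.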

Further optionally, in this embodiment, a specific implementation of step S2010 (arranging the second feature Chinese word segments in the second feature vocabulary in the first specified order of the first weight) may include the following. After determining the first weight of each second feature Chinese word segment in the second feature vocabulary, the text processing unit sorts the second feature Chinese word segments in the second feature vocabulary in the first specified order of the first weight. The first specified order may be descending, ascending, and so on. For example, suppose the original order of the second feature Chinese word segments in the second feature vocabulary is "soya bean" (4 occurrences), "soya-bean milk" (5 occurrences), "grinding" (4 occurrences). After the first weights of "soya bean", "soya-bean milk" and "grinding" are calculated as about 0.31, 0.38 and 0.31 respectively, arranging the second feature vocabulary in descending order of first weight gives "soya-bean milk" (5 occurrences), "soya bean" (4 occurrences), "grinding" (4 occurrences), or alternatively "soya-bean milk" (5 occurrences), "grinding" (4 occurrences), "soya bean" (4 occurrences).

Further optionally, in this embodiment, a specific implementation of step S2011 (extracting the second feature Chinese word segments whose first weight ranks in the top N1 of the second feature vocabulary) may include the following. After sorting the second feature vocabulary, the text processing unit extracts the second feature Chinese word segments whose first weight ranks in the top N1. The value N1 is determined by the number of first feature Chinese word segments and the first preset coefficient, and is an integer greater than or equal to 1. For example, if the first preset coefficient is set to 1.5 and there are 9 first feature Chinese word segments, N1 is determined to be 13 or 14 (9 × 1.5 = 13.5), and the text processing unit extracts the second feature Chinese word segments whose first weight ranks in the top 13 or top 14 of the second feature vocabulary.
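Steps S2010 to S2012 can be sketched as a sort followed by a slice. The function name `top_n1` is hypothetical, and rounding N1 down is an assumption: the text allows either 13 or 14 for 9 × 1.5 and does not fix the rounding rule.

```python
import math


def top_n1(weights, first_feature_count, first_coefficient=1.5):
    """Return the words whose weight ranks in the top N1 (descending order)."""
    # Assumption: round down, but never go below 1, since N1 must be >= 1.
    n1 = max(1, math.floor(first_feature_count * first_coefficient))
    # sorted() is stable, so words with equal weight keep their original order.
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n1]]


weights = {"soya bean": 4 / 13, "soya-bean milk": 5 / 13, "grinding": 4 / 13}
print(top_n1(weights, first_feature_count=9))
```

With a stable sort, tied words such as "soya bean" and "grinding" stay in their original relative order, which corresponds to the first of the two orderings the text permits.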

Further optionally, in this embodiment, a specific implementation of step S2012 (generating the third feature vocabulary from the second feature Chinese word segments whose first weight ranks in the top N1 of the second feature vocabulary) may include the following. The text processing unit generates the third feature vocabulary from the second feature Chinese word segments, extracted in step S2011, whose first weight ranks in the top N1 of the second feature vocabulary.

Further optionally, in this embodiment, step S2013 (determining the goodness of fit between the first feature vocabulary and the third feature vocabulary) can, as shown in Fig. 2b, be implemented by the following steps:

Step S20131: count the number of first feature Chinese word segments in the first feature vocabulary that are identical to second feature Chinese word segments in the third feature vocabulary, and count the total number of first feature Chinese word segments in the first feature vocabulary.

Step S20132: calculate the goodness of fit between the first feature vocabulary and the third feature vocabulary according to the counted number and the total number of first feature Chinese word segments.

Further optionally, for example, if the text processing unit counts 12 first feature Chinese word segments in the first feature vocabulary that are identical to second feature Chinese word segments in the third feature vocabulary, and the total number of first feature Chinese word segments is 13, then the goodness of fit between the first feature vocabulary and the third feature vocabulary is about 92% (12/13).
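Steps S20131 and S20132 amount to a ratio of shared entries to the size of the first feature vocabulary. The following Python sketch uses a hypothetical function name, `goodness_of_fit`, and assumes each vocabulary holds distinct words so that set operations apply; the sample words are made up for illustration.

```python
def goodness_of_fit(first_vocabulary, third_vocabulary):
    """Share of first-vocabulary words that also appear in the third vocabulary."""
    first = set(first_vocabulary)
    if not first:
        return 0.0  # Assumption: an empty first vocabulary matches nothing.
    shared = first & set(third_vocabulary)
    return len(shared) / len(first)


first_vocabulary = ["soya bean", "soya-bean milk", "grinding", "breakfast"]
third_vocabulary = ["soya-bean milk", "soya bean", "grinding", "machine", "water"]
fit = goodness_of_fit(first_vocabulary, third_vocabulary)
print(f"goodness of fit: {fit:.2f}")
```

The resulting value would then be compared with the second preset threshold (an empirical value such as 90% in the text) to decide whether the first text paragraph is taken as the text summary.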

Further optionally, in this embodiment, a specific implementation of step S2014 (comparing the goodness of fit with the second preset threshold) may include the following. The text processing unit compares the goodness of fit determined in step S2013 with the second preset threshold, which is an empirical value, for example 90%.

Further optionally, in this embodiment, a specific implementation of step S2015 (determining the first text paragraph to be the text summary when the goodness of fit is greater than or equal to the second preset threshold) may include the following. When the text processing unit finds that the goodness of fit between the first feature vocabulary and the third feature vocabulary is greater than or equal to the second preset threshold, it determines the first text paragraph to be the text summary.

In the text summary acquisition method shown in Figs. 2a and 2b, the text data in a target file is obtained; it is judged whether the text data includes a summary keyword, the summary keyword being used to indicate the position of the text summary in the text data; if the text data includes the summary keyword, the number of words in the text paragraph where the summary keyword is located is counted; the counted number of words is compared with a first preset threshold; and if the number of words is less than the first preset threshold, the text paragraph where the summary keyword is located is determined to be the text summary. Implementing the method shown in Figs. 2a and 2b makes it possible to recognize the text summary quickly according to the summary keyword and the first text paragraph, improving the efficiency, precision and comprehensiveness of text summary recognition and acquisition.

Referring to Fig. 3a, Fig. 3a is a schematic flowchart of a text summary acquisition method disclosed in the fourth embodiment of the invention. As shown in Fig. 3a, the text summary acquisition method of this embodiment may include the following steps:

Step S301: obtain the text data in a target file.

Step S302: judge whether the text data includes a summary keyword, the summary keyword being used to indicate the position of the text summary in the text data.

Step S303: if it is judged that the text data includes the summary keyword, count the number of words in the text paragraph where the summary keyword is located.

Step S304: compare the counted number of words with a first preset threshold.

Step S305: if it is judged that the text data does not include the summary keyword, extract the first feature Chinese word segments in the first text paragraph of the text data.

Step S306: generate a first feature vocabulary according to the first feature Chinese word segments.

Step S307: extract the second feature Chinese word segments in the text data.

Step S308: generate a second feature vocabulary according to the second feature Chinese word segments.

Step S309: determine a first weight for each second feature Chinese word segment, the first weight being determined by the occurrence count of that single feature Chinese word segment and the sum of the occurrence counts of all the second feature Chinese word segments.

Step S3010: arrange the second feature Chinese word segments in the second feature vocabulary in a first specified order of the first weight.

Step S3011: extract the second feature Chinese word segments whose first weight ranks in the top N1 of the second feature vocabulary, N1 being determined by the number of first feature Chinese word segments and a first preset coefficient, and being an integer greater than or equal to 1.

Step S3012: generate a third feature vocabulary from the second feature Chinese word segments whose first weight ranks in the top N1 of the second feature vocabulary.

Step S3013: determine the goodness of fit between the first feature vocabulary and the third feature vocabulary.

Step S3014: compare the goodness of fit with a second preset threshold.

Step S3015, if comparing the goodness of fit less than second predetermined threshold value, is rejected in the text data Stop words and everyday words.

Step S3016, is extracted in the text header for eliminating the stop words and the text data of the everyday words Third feature Chinese word segmentation.

Step S3017, according to the third feature Chinese word segmentation, generates fourth feature vocabulary.

Step S3018, extracts and the text is removed in the text data for eliminating the stop words and the everyday words Fourth feature Chinese word segmentation in the data of title.

Step S3019, fifth feature vocabulary is generated according to the fourth feature Chinese word segmentation.

Step S3020, determines the second weight of the fourth feature Chinese word segmentation, and second weight is special by the described 4th Levy in all features in the occurrence number and the fourth feature Chinese word segmentation of the single feature Chinese word segmentation in Chinese word segmentation The total degree of literary participle occurrence number is determined.

Step S3021, by the fourth feature Chinese word segmentation in the fifth feature vocabulary according to the of second weight Two specify order to arrange.

Step S3022, judges whether deposited in the fifth feature vocabulary and the third feature Chinese word segmentation identical Four feature Chinese word segmentations.

Step S3023, if judging to exist and the third feature Chinese word segmentation identical in the fifth feature vocabulary Fourth feature Chinese word segmentation, then by the fifth feature vocabulary with the third feature Chinese word segmentation identical fourth feature Second weight of literary participle is adjusted to the maximum of second weight of all fourth feature Chinese word segmentations.

Step S3024, by the fourth feature Chinese word segmentation in the fifth feature vocabulary according to described in after adjustment Described the second of second weight specifies order to rearrange.

Step S3025, divide the text data into multiple simple sentences according to specified punctuation marks.

Step S3026, extract, from the multiple simple sentences, the first simple sentences whose sentence pattern is declarative.

Step S3027, generate a sentence list from the extracted first simple sentences whose sentence pattern is declarative.

Step S3028, judge whether a first simple sentence includes a fourth feature Chinese word segment.

Step S3029, if the first simple sentence includes a fourth feature Chinese word segment, determine the third weight of the first simple sentence, the third weight being determined by the sum of the second weights of the fourth feature Chinese word segments included in the first simple sentence and the length of the first simple sentence.

Step S3030, if the first simple sentence does not include a fourth feature Chinese word segment, determine that the third weight of the first simple sentence is zero.

Step S3031, arrange the first simple sentences in the sentence list in a third specified order of the third weight.

Step S3032, determine the quantity reference value of the summary simple sentences according to the second predetermined coefficient, the number of the multiple simple sentences in the text data, and the text word count.

Step S3033, determine the quantity N2 of the summary simple sentences according to the quantity reference value of the summary simple sentences, N2 being an integer greater than or equal to 1.

Step S3034, determine the first simple sentences ranked in the top N2 by third weight in the sentence list as the text snippet.

Further optionally, in this embodiment of the present invention, steps S301 to S3014 are the same as steps S201 to S2014 in the second embodiment and are not repeated here.

Further optionally, in this embodiment of the present invention, the specific implementation of step S3015, removing the stop words and common words from the text data if the goodness of fit is less than the second predetermined threshold, may include: when the text processing unit finds that the goodness of fit determined in step S3013 is less than the second predetermined threshold, it determines that the first text fragment in the text data is not the text snippet, and removes the stop words and common words from the text data.
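The comparison in steps S3013 to S3015 can be sketched as follows. Claim 2 states that the goodness of fit is calculated from the number of first feature Chinese word segments that also appear in the third feature vocabulary and the total number of first feature Chinese word segments; taking the ratio of the two is an assumption of this sketch, and the function names are hypothetical.

```python
def goodness_of_fit(first_vocab, third_vocab):
    """Share of first-vocabulary segments that also occur in the third vocabulary.

    Assumption: the two counts named in claim 2 are combined as a ratio.
    """
    if not first_vocab:
        return 0.0
    third = set(third_vocab)
    matches = sum(1 for word in first_vocab if word in third)
    return matches / len(first_vocab)


def first_fragment_is_snippet(first_vocab, third_vocab, second_threshold):
    """True when the first text fragment is accepted as the snippet;
    False means the method continues with step S3015."""
    return goodness_of_fit(first_vocab, third_vocab) >= second_threshold
```

With this ratio, a first text fragment whose feature segments are mostly absent from the highest-weighted segments of the whole text falls below the threshold and is rejected.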

Further optionally, in this embodiment of the present invention, the specific implementation of step S3016, extracting the third feature Chinese word segments from the text title of the text data from which the stop words and the common words have been removed, may include: the text processing unit extracts the third feature Chinese word segments from the text title, the stop words and common words in the text title having already been removed.

Further optionally, in this embodiment of the present invention, the specific implementation of step S3017, generating the fourth feature vocabulary according to the third feature Chinese word segments, may include: the text processing unit generates the fourth feature vocabulary from the third feature Chinese word segments extracted in step S3016.

Further optionally, in this embodiment of the present invention, the specific implementation of step S3018, extracting the fourth feature Chinese word segments from the data, other than the text title, of the text data from which the stop words and the common words have been removed, may include: the text processing unit extracts the fourth feature Chinese word segments from first text data, the first text data being the text data with the stop words and common words removed and with the text title excluded.
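A minimal sketch of steps S3015 to S3019 follows. It assumes the title and the body have already been segmented into word lists by some segmenter, which the text does not name; the function names and the de-duplication of each vocabulary are likewise assumptions made for illustration.

```python
def strip_words(tokens, stop_words, common_words):
    """Remove stop words and common words from a list of word segments."""
    excluded = set(stop_words) | set(common_words)
    return [t for t in tokens if t not in excluded]


def title_and_body_features(title_tokens, body_tokens, stop_words, common_words):
    """Return (third segments, fourth vocabulary, fourth segments, fifth vocabulary).

    Third feature segments come from the title, fourth feature segments from
    the rest of the text. Each vocabulary keeps one entry per distinct segment,
    in order of first appearance (an assumption of this sketch).
    """
    third = strip_words(title_tokens, stop_words, common_words)
    fourth = strip_words(body_tokens, stop_words, common_words)
    fourth_vocab = list(dict.fromkeys(third))
    fifth_vocab = list(dict.fromkeys(fourth))
    return third, fourth_vocab, fourth, fifth_vocab
```

The fourth feature segments are returned with repetitions intact because the second weight in step S3020 depends on how often each segment occurs.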

Further optionally, in this embodiment of the present invention, the specific implementation of step S3020, determining the second weight of the fourth feature Chinese word segments, may include: the text processing unit obtains the number of occurrences of a single feature Chinese word segment among the fourth feature Chinese word segments, obtains the total number of occurrences of all feature Chinese word segments among the fourth feature Chinese word segments, and determines the second weight from that number of occurrences and that total. For example, if the fourth feature Chinese word segments are "car" (occurring 3 times), "family use" (occurring 4 times) and "discount" (occurring 2 times), the second weights are determined as follows: the second weight of "car" is approximately 0.33 (3/(3+4+2)), the second weight of "family use" is approximately 0.44 (4/(3+4+2)), and the second weight of "discount" is approximately 0.22 (2/(3+4+2)).
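The weight calculation in the example above can be reproduced with a short sketch. The function name is hypothetical; the formula (occurrences of one segment divided by the total occurrences of all segments) is the one used in the example.

```python
from collections import Counter


def second_weights(fourth_feature_segments):
    """Map each distinct segment to its share of all segment occurrences."""
    counts = Counter(fourth_feature_segments)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# The example from the text: 3, 4 and 2 occurrences out of 9 in total.
segments = ["car"] * 3 + ["family use"] * 4 + ["discount"] * 2
weights = second_weights(segments)
```

Because every weight is a share of the same total, the weights of all distinct segments sum to 1.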

Further optionally, in this embodiment of the present invention, the specific implementation of step S3021, arranging the fourth feature Chinese word segments in the fifth feature vocabulary in the second specified order of the second weight, may include: the text processing unit sorts the fourth feature Chinese word segments in the fifth feature vocabulary in the second specified order of the second weight, where the second specified order may include but is not limited to: descending order, ascending order, and so on.

Further optionally, in this embodiment of the present invention, the specific implementation of step S3022, judging whether the fifth feature vocabulary contains a fourth feature Chinese word segment identical to a third feature Chinese word segment, may include: the text processing unit judges whether the fifth feature vocabulary contains a fourth feature Chinese word segment identical to a third feature Chinese word segment.

Further optionally, in this embodiment of the present invention, the specific implementation of step S3023, adjusting the second weight of a fourth feature Chinese word segment in the fifth feature vocabulary that is identical to a third feature Chinese word segment to the maximum of the second weights of all the fourth feature Chinese word segments, may include: the text processing unit finds that the fifth feature vocabulary contains one or more fourth feature Chinese word segments identical to a third feature Chinese word segment. If there is only one such fourth feature Chinese word segment, its second weight is directly set to the maximum of the second weights of all the fourth feature Chinese word segments; if there are several, the second weight of each of them is set to the maximum of the second weights of all the fourth feature Chinese word segments. For example, the fourth feature Chinese word segments are "free trade zone" (second weight 0.43), "patent" (second weight 0.17), "trademark" (second weight 0.28) and "copyright" (second weight 0.12), and the feature Chinese word segment identical to a third feature Chinese word segment is "patent" (second weight 0.17); the second weight of the feature Chinese word segment "patent" is then set to 0.48 (0.48>0.47>0.28>0.17).

Further optionally, in this embodiment of the present invention, the specific implementation of step S3024, rearranging the fourth feature Chinese word segments in the fifth feature vocabulary in the second specified order of the adjusted second weight, may include: the text processing unit re-sorts according to the second weights updated in step S3023 and obtains the updated fifth feature vocabulary.
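Steps S3022 to S3024 can be sketched as below. The sketch follows the wording of step S3023 (a matching segment receives the maximum of all second weights); the function names are hypothetical, and descending order is assumed for the second specified order.

```python
def boost_title_segments(weights, third_feature_segments):
    """Raise the weight of every segment that also appears in the title
    to the current maximum weight; other weights are left unchanged."""
    if not weights:
        return {}
    top = max(weights.values())
    title = set(third_feature_segments)
    return {w: (top if w in title else v) for w, v in weights.items()}


def ranked(weights, descending=True):
    """Return the segments sorted by weight (ties keep their original order)."""
    return sorted(weights, key=weights.get, reverse=descending)


weights = {"free trade zone": 0.43, "patent": 0.17, "trademark": 0.28, "copyright": 0.12}
boosted = boost_title_segments(weights, ["patent"])
order = ranked(boosted)
```

The effect is that a segment shared with the title is ranked among the most important segments regardless of how rarely it occurs in the body.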

Further optionally, in this embodiment of the present invention, the specific implementation of step S3025, dividing the text data into multiple simple sentences according to specified punctuation marks, may include: the text processing unit divides the text data into simple sentences according to the specified punctuation marks and obtains multiple first simple sentences, where the specified punctuation marks may include but are not limited to: full stop, semicolon, and so on.

Further optionally, in this embodiment of the present invention, the specific implementation of step S3026, extracting from the multiple simple sentences the first simple sentences whose sentence pattern is declarative, may include: the text processing unit extracts the declarative sentences from the first simple sentences obtained by the division in step S3025.

Further optionally, in this embodiment of the present invention, the specific implementation of step S3027, generating the sentence list from the extracted first simple sentences whose sentence pattern is declarative, may include: the text processing unit generates the sentence list from the first simple sentences whose sentence pattern is declarative, as extracted in step S3026.
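A minimal sketch of steps S3025 to S3027 is given below. The text names the full stop and the semicolon as specified punctuation marks; treating question marks and exclamation marks as additional delimiters, and classifying a sentence as declarative when it does not end with one of them, are assumptions of this sketch, as are the function names.

```python
import re

# Assumed delimiter set: full stop and semicolon (from the text) plus
# question and exclamation marks (assumption), in ASCII and full-width forms.
TERMINATORS = "。；.;？！?!"


def split_simple_sentences(text, terminators=TERMINATORS):
    """Split text into simple sentences, keeping each closing mark."""
    cls = re.escape(terminators)
    pattern = "[^" + cls + "]+[" + cls + "]?"
    return [s.strip() for s in re.findall(pattern, text) if s.strip()]


def is_declarative(sentence):
    """Assumption: a sentence is declarative unless it ends with ? or !."""
    return not sentence.endswith(("?", "!", "？", "！"))


def sentence_list(text):
    """Declarative first simple sentences, in document order."""
    return [s for s in split_simple_sentences(text) if is_declarative(s)]
```

For example, `sentence_list("It rains. Is it cold? Stay home; ok")` keeps the three declarative parts and drops the question.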

Further optionally, in this embodiment of the present invention, in step S3028 the text processing unit judges whether a first simple sentence includes a fourth feature Chinese word segment.

Further optionally, in this embodiment of the present invention, the specific implementation of step S3029, determining the third weight of the first simple sentence if the first simple sentence includes a fourth feature Chinese word segment, may include: when the text processing unit finds that a first simple sentence includes a fourth feature Chinese word segment, it calculates the third weight of each such first simple sentence, the third weight being determined by the sum of the second weights of the fourth feature Chinese word segments included in the first simple sentence and the length of the first simple sentence.

Further optionally, in this embodiment of the present invention, the specific implementation of step S3030, determining that the third weight of the first simple sentence is zero if the first simple sentence does not include a fourth feature Chinese word segment, may include: when the text processing unit finds that a first simple sentence does not include a fourth feature Chinese word segment, it determines that the third weight of that first simple sentence is zero.
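Steps S3028 to S3030 can be sketched as follows. The text says only that the third weight is determined by the sum of the second weights and the sentence length; dividing the sum by the length in characters, and detecting a segment by substring search, are assumptions of this sketch, and the function name is hypothetical.

```python
def third_weight(sentence, second_weight_by_segment):
    """Weight of one first simple sentence.

    Assumptions: a segment counts as included when it occurs as a substring,
    and the summed second weights are normalised by the sentence length.
    """
    contained = [w for w in second_weight_by_segment if w in sentence]
    if not sentence or not contained:
        return 0.0  # step S3030: no fourth feature segment, weight is zero
    total = sum(second_weight_by_segment[w] for w in contained)
    return total / len(sentence)
```

Normalising by length keeps long sentences from outranking short ones merely because they contain more segments; other ways of combining the two quantities would be equally compatible with the text.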

Further optionally, in this embodiment of the present invention, the specific implementation of step S3031, arranging the first simple sentences in the sentence list in the third specified order of the third weight, may include: the text processing unit arranges the first simple sentences in the sentence list in the third specified order of the third weight, where the third specified order may include but is not limited to: descending order, ascending order, and so on.

Further optionally, in this embodiment of the present invention, the specific implementation of step S3032, determining the quantity reference value of the summary simple sentences according to the second predetermined coefficient, the number of the multiple simple sentences in the text data, and the text word count, may include: the text processing unit determines the quantity reference value of the summary simple sentences according to the second predetermined coefficient, the number of the multiple simple sentences in the text data, and the text word count. The second predetermined coefficient may be preset. For example, if the second predetermined coefficient is 200, the number of simple sentences is 18, and the text word count is 242, the quantity reference value of the summary simple sentences is determined to be 14.9.

Further optionally, in this embodiment of the present invention, the specific implementation of step S3033, determining the quantity N2 of the summary simple sentences according to the quantity reference value of the summary simple sentences, N2 being an integer greater than or equal to 1, may include: the text processing unit determines the quantity N2 of the summary simple sentences by rounding the quantity reference value of the summary simple sentences to the nearest integer.
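The text does not state the formula behind the quantity reference value. The sketch below assumes coefficient × sentence count ÷ word count, chosen only because it reproduces the example above (200 × 18 / 242 ≈ 14.9); the function names are hypothetical.

```python
import math


def summary_sentence_reference(second_coefficient, sentence_count, word_count):
    """Quantity reference value (assumed formula, see lead-in)."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return second_coefficient * sentence_count / word_count


def summary_sentence_count(reference_value):
    """Round half up to the nearest integer and enforce N2 >= 1."""
    return max(1, math.floor(reference_value + 0.5))


reference = summary_sentence_reference(200, 18, 242)
n2 = summary_sentence_count(reference)
```

`math.floor(x + 0.5)` is used instead of the built-in `round`, which rounds halves to the nearest even integer and would not match the conventional rounding rule named in step S3033.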

Further optionally, in this embodiment of the present invention, step S3034, determining the first simple sentences ranked in the top N2 by third weight in the sentence list as the text snippet, may, as shown in Fig. 3b, specifically be implemented by the following steps:

Step S30341, extract the first simple sentences ranked in the top N2 by third weight in the sentence list.

Step S30342, arrange the first simple sentences ranked in the top N2 by third weight in the order in which they appear in the text data.

Step S30343, determine the arranged first simple sentences ranked in the top N2 by third weight as the text snippet.
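Steps S30341 to S30343 amount to a top-N2 selection followed by a return to document order, which can be sketched as below. The function name and the parallel-list representation are assumptions made for illustration.

```python
def select_snippet(sentences, third_weights, n2):
    """Pick the n2 highest-weighted sentences and return them in document order.

    `sentences` is in document order; `third_weights` is a parallel list.
    """
    by_weight = sorted(range(len(sentences)),
                       key=lambda i: third_weights[i], reverse=True)
    chosen = sorted(by_weight[:n2])  # restore order of appearance
    return [sentences[i] for i in chosen]
```

Restoring the order of appearance keeps the snippet readable, since sentences ranked purely by weight would otherwise be presented out of sequence.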

The text snippet acquisition method shown in Fig. 3a and Fig. 3b obtains the text data in a target file; judges whether the text data includes a summary keyword, the summary keyword being used to indicate the position of the text snippet in the text data; if the text data includes the summary keyword, counts the number of words of the text fragment where the summary keyword is located; compares the counted number of words with a first predetermined threshold; and, if the number of words is less than the first predetermined threshold, determines the text fragment where the summary keyword is located as the text snippet. The text snippet acquisition method shown in Fig. 3a and Fig. 3b can quickly identify the text snippet according to the summary keyword, the first text fragment, and the first simple sentences, improving the efficiency, precision, and comprehensiveness of text snippet identification and acquisition.
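The keyword-based branch summarised above can be sketched as follows. The keyword list, the splitting of text fragments on line breaks, and the use of character count as the number of words are all assumptions of this sketch; the function name is hypothetical.

```python
# Hypothetical keyword list; the text does not enumerate the summary keywords.
SUMMARY_KEYWORDS = ("摘要", "Abstract", "Summary")


def snippet_by_keyword(text_data, first_threshold, keywords=SUMMARY_KEYWORDS):
    """Return the fragment containing a summary keyword if it is short enough.

    Returns None when no keyword is found or the fragment is too long, in
    which case the method falls through to the other branches.
    """
    for fragment in text_data.split("\n"):
        fragment = fragment.strip()
        if any(keyword in fragment for keyword in keywords):
            if len(fragment) < first_threshold:
                return fragment
            return None
    return None
```

A `None` result corresponds to continuing with the first-fragment and simple-sentence branches described in the steps above.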

Referring to Fig. 4, Fig. 4 is a schematic structural diagram of a text snippet acquisition device disclosed in an embodiment of the present invention, used to perform the text snippet acquisition method disclosed in the embodiments of the present invention. In the embodiments of the present invention, the device may include but is not limited to: a smartphone, a PC, a tablet computer, a personal digital assistant (PDA), a media player, a wearable portable device, and the like. As shown in Fig. 4, the text snippet acquisition device may specifically include:

Acquiring unit 401, for obtaining the text data in a target file;

Judging unit 402, for judging whether the text data includes a summary keyword, the summary keyword being used to indicate the position of the text snippet in the text data;

Statistic unit 403, for counting the number of words of the text fragment where the summary keyword is located when the judging unit judges that the text data includes the summary keyword;

Comparing unit 404, for comparing the number of words counted by the statistic unit with a first predetermined threshold;

Determining unit 405, for determining the text fragment where the summary keyword is located as the text snippet when the comparing unit finds that the number of words is less than the first predetermined threshold.

In the embodiments of the present invention, the above text snippet acquisition device may include a mobile device, a tablet computer, a smart ring, a smart watch, a smart home device, a wearable device, and so on.

The text snippet acquisition device shown in Fig. 4 obtains the text data in a target file; judges whether the text data includes a summary keyword, the summary keyword being used to indicate the position of the text snippet in the text data; if the text data includes the summary keyword, counts the number of words of the text fragment where the summary keyword is located; compares the counted number of words with a first predetermined threshold; and, if the number of words is less than the first predetermined threshold, determines the text fragment where the summary keyword is located as the text snippet. The text snippet acquisition device shown in Fig. 4 can quickly identify the text snippet according to the summary keyword, improving the efficiency, precision, and comprehensiveness of text snippet identification and acquisition.

A person of ordinary skill in the art will appreciate that all or part of the steps in the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, or the like.

The above disclosure is only a preferred embodiment of the present invention and certainly cannot be used to limit the scope of rights of the present invention; equivalent variations made in accordance with the claims of the present invention therefore still fall within the scope covered by the present invention.

Claims

1. A text snippet acquisition method, characterized by comprising:

obtaining the text data in a target file;

judging whether the text data includes a summary keyword, the summary keyword being used to indicate the position of the text snippet in the text data;

if the text data includes the summary keyword, counting the number of words of the text fragment where the summary keyword is located;

if the number of words is less than a first predetermined threshold, determining the text fragment where the summary keyword is located as the text snippet;

wherein, after said judging whether the text data includes a summary keyword and before said judging that the text data includes the summary keyword, the method further comprises:

if the text data does not include the summary keyword, extracting the first feature Chinese word segments from the first text fragment in the text data;

generating a first feature vocabulary according to the first feature Chinese word segments;

extracting the second feature Chinese word segments from the text data;

generating a second feature vocabulary according to the second feature Chinese word segments;

determining the first weight of the second feature Chinese word segments, the first weight being determined by the number of occurrences of a single feature Chinese word segment among the second feature Chinese word segments and the sum of the numbers of occurrences of all feature Chinese word segments among the second feature Chinese word segments;

arranging the second feature Chinese word segments in the second feature vocabulary in a first specified order of the first weight;

extracting the second feature Chinese word segments ranked in the top N1 by first weight in the second feature vocabulary, N1 being determined by the number of the first feature Chinese word segments and a first predetermined coefficient and being an integer greater than or equal to 1;

generating a third feature vocabulary according to the second feature Chinese word segments ranked in the top N1 by first weight in the second feature vocabulary;

determining the goodness of fit between the first feature vocabulary and the third feature vocabulary;

comparing the goodness of fit with a second predetermined threshold;

if the goodness of fit is greater than or equal to the second predetermined threshold, determining the first text fragment as the text snippet.

2. The method according to claim 1, characterized in that said calculating the goodness of fit between the first feature vocabulary and the third feature vocabulary comprises:

counting the number of first feature Chinese word segments in the first feature vocabulary that are identical to second feature Chinese word segments in the third feature vocabulary, and counting the total number of first feature Chinese word segments in the first feature vocabulary;

calculating the goodness of fit between the first feature vocabulary and the third feature vocabulary according to the counted number and the total number of first feature Chinese word segments.

3. The method according to claim 2, characterized in that, after said comparing the goodness of fit with the second predetermined threshold and before said finding that the goodness of fit is greater than or equal to the second predetermined threshold, the method further comprises:

if the goodness of fit is less than the second predetermined threshold, removing the stop words and common words from the text data;

extracting the third feature Chinese word segments from the text title of the text data from which the stop words and the common words have been removed;

generating a fourth feature vocabulary according to the third feature Chinese word segments;

extracting the fourth feature Chinese word segments from the data, other than the text title, of the text data from which the stop words and the common words have been removed;

generating a fifth feature vocabulary according to the fourth feature Chinese word segments;

determining the second weight of the fourth feature Chinese word segments, the second weight being determined by the number of occurrences of a single feature Chinese word segment among the fourth feature Chinese word segments and the total number of occurrences of all feature Chinese word segments among the fourth feature Chinese word segments;

arranging the fourth feature Chinese word segments in the fifth feature vocabulary in a second specified order of the second weight;

judging whether the fifth feature vocabulary contains a fourth feature Chinese word segment identical to a third feature Chinese word segment;

if the fifth feature vocabulary contains a fourth feature Chinese word segment identical to a third feature Chinese word segment, adjusting the second weight of that fourth feature Chinese word segment in the fifth feature vocabulary to the maximum of the second weights of all the fourth feature Chinese word segments;

rearranging the fourth feature Chinese word segments in the fifth feature vocabulary in the second specified order of the adjusted second weight;

dividing the text data into multiple simple sentences according to specified punctuation marks;

extracting, from the multiple simple sentences, the first simple sentences whose sentence pattern is declarative;

generating a sentence list from the extracted first simple sentences whose sentence pattern is declarative;

judging whether a first simple sentence includes a fourth feature Chinese word segment;

if the first simple sentence includes a fourth feature Chinese word segment, determining the third weight of the first simple sentence, the third weight being determined by the sum of the second weights of the fourth feature Chinese word segments included in the first simple sentence and the length of the first simple sentence;

if the first simple sentence does not include a fourth feature Chinese word segment, determining that the third weight of the first simple sentence is zero;

arranging the first simple sentences in the sentence list in a third specified order of the third weight;

determining the quantity reference value of the summary simple sentences according to a second predetermined coefficient, the number of the multiple simple sentences in the text data, and the text word count;

determining the quantity N2 of the summary simple sentences according to the quantity reference value of the summary simple sentences, N2 being an integer greater than or equal to 1;

determining the first simple sentences ranked in the top N2 by third weight in the sentence list as the text snippet.

4. The method according to claim 3, characterized in that said determining the first simple sentences ranked in the top N2 by third weight in the sentence list as the text snippet comprises:

extracting the first simple sentences ranked in the top N2 by third weight in the sentence list;

arranging the first simple sentences ranked in the top N2 by third weight in the order in which they appear in the text data;

determining the arranged first simple sentences ranked in the top N2 by third weight as the text snippet.

5. A text snippet acquisition device, characterized by comprising:

an acquiring unit, for obtaining the text data in a target file;

a judging unit, for judging whether the text data includes a summary keyword, the summary keyword being used to indicate the position of the text snippet in the text data;

a statistic unit, for counting the number of words of the text fragment where the summary keyword is located when the judging unit judges that the text data includes the summary keyword;

a determining unit, for determining the text fragment where the summary keyword is located as the text snippet when the comparing unit finds that the number of words is less than the first predetermined threshold;

wherein the device further comprises:

an extraction unit, for extracting the first feature Chinese word segments from the first text fragment in the text data when the judging unit judges that the text data does not include the summary keyword;

a first generation unit, for generating a first feature vocabulary according to the first feature Chinese word segments;

a second extraction unit, for extracting the second feature Chinese word segments from the text data;

a second generation unit, for generating a second feature vocabulary according to the second feature Chinese word segments;

a first determining unit, for determining the first weight of the second feature Chinese word segments, the first weight being determined by the number of occurrences of a single feature Chinese word segment among the second feature Chinese word segments and the sum of the numbers of occurrences of all feature Chinese word segments among the second feature Chinese word segments;

a first sequencing unit, for arranging the second feature Chinese word segments in the second feature vocabulary in a first specified order of the first weight determined by the first determining unit;

a third extraction unit, for extracting the second feature Chinese word segments ranked in the top N1 by first weight in the second feature vocabulary, N1 being determined by the number of the first feature Chinese word segments and a first predetermined coefficient and being an integer greater than or equal to 1;

a third generation unit, for generating a third feature vocabulary according to the second feature Chinese word segments extracted by the third extraction unit;

a second determining unit, for determining the goodness of fit between the first feature vocabulary and the third feature vocabulary;

a second comparing unit, for comparing the goodness of fit with a second predetermined threshold;

a third determining unit, for determining the first text fragment as the text snippet when the second comparing unit finds that the goodness of fit is greater than or equal to the second predetermined threshold.

6. The device according to claim 5, characterized in that the second determining unit further comprises:

a first statistic unit, for counting the number of first feature Chinese word segments in the first feature vocabulary that are identical to second feature Chinese word segments in the third feature vocabulary, and counting the total number of first feature Chinese word segments in the first feature vocabulary;

a computing unit, for calculating the goodness of fit between the first feature vocabulary and the third feature vocabulary according to the number and the total number counted by the first statistic unit.

7. The device according to claim 6, characterized by further comprising:

a culling unit, for removing the stop words and common words from the text data when the second comparing unit finds that the goodness of fit is less than the second predetermined threshold;

a fourth extraction unit, for extracting the third feature Chinese word segments from the text title of the text data from which the stop words and the common words have been removed;

a fourth generation unit, for generating a fourth feature vocabulary according to the third feature Chinese word segments;

a fourth extraction unit, for extracting the fourth feature Chinese word segments from the data, other than the text title, of the text data from which the stop words and the common words have been removed;

a fifth generation unit, for generating a fifth feature vocabulary according to the fourth feature Chinese word segments;

a fourth determining unit, for determining the second weight of the fourth feature Chinese word segments extracted by the fourth extraction unit, the second weight being determined by the number of occurrences of a single feature Chinese word segment among the fourth feature Chinese word segments and the total number of occurrences of all feature Chinese word segments among the fourth feature Chinese word segments;

a second sequencing unit, for arranging the fourth feature Chinese word segments in the fifth feature vocabulary generated by the fifth generation unit in a second specified order of the second weight determined by the fourth determining unit;

a first judging unit, for judging whether the fifth feature vocabulary contains a fourth feature Chinese word segment identical to a third feature Chinese word segment;

an adjustment unit, for adjusting, when the first judging unit judges that the fifth feature vocabulary contains a fourth feature Chinese word segment identical to a third feature Chinese word segment, the second weight of that fourth feature Chinese word segment in the fifth feature vocabulary to the maximum of the second weights of all the fourth feature Chinese word segments;

a third sequencing unit, for rearranging the fourth feature Chinese word segments in the fifth feature vocabulary in the second specified order of the second weight adjusted by the adjustment unit;

a division unit, for dividing the text data into multiple simple sentences according to specified punctuation marks;

a fifth extraction unit, for extracting, from the multiple simple sentences divided by the division unit, the first simple sentences whose sentence pattern is declarative;

a sixth generation unit, for generating a sentence list from the first simple sentences whose sentence pattern is declarative, as extracted by the fifth extraction unit;

a second judging unit, for judging whether a first simple sentence includes a fourth feature Chinese word segment;

a fifth determining unit, for determining the third weight of the first simple sentence when the second judging unit judges that the first simple sentence includes a fourth feature Chinese word segment, the third weight being determined by the sum of the second weights of the fourth feature Chinese word segments included in the first simple sentence and the length of the first simple sentence;

a sixth determining unit, for determining that the third weight of the first simple sentence is zero when the second judging unit judges that the first simple sentence does not include a fourth feature Chinese word segment;

a fourth sequencing unit, for arranging the first simple sentences in the sentence list in a third specified order of the third weight;

a seventh determining unit, for determining the quantity reference value of the summary simple sentences according to a second predetermined coefficient, the number of the multiple simple sentences in the text data, and the text word count;

an eighth determining unit, for determining the quantity N2 of the summary simple sentences according to the quantity reference value of the summary simple sentences, N2 being an integer greater than or equal to 1;

a ninth determining unit, for determining the first simple sentences ranked in the top N2 by third weight in the sentence list as the text snippet.

8. The device according to claim 7, characterized in that the ninth determining unit further comprises:

a sixth extraction unit, for extracting the first simple sentences ranked in the top N2 by third weight in the sentence list;

a fifth sequencing unit, for arranging the first simple sentences extracted by the sixth extraction unit in the order in which the first simple sentences ranked in the top N2 by third weight appear in the text data;

a tenth determining unit, for determining the arranged first simple sentences ranked in the top N2 by third weight as the text snippet.