CN102385574B - Method and device for extracting sentences from document - Google Patents
- Publication number
- CN102385574B CN201010268675.6A
- Authority
- CN
- China
- Prior art keywords
- sentence
- cue
- structure pattern
- predetermined special
- special significance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Machine Translation (AREA)
Abstract
The invention provides a method and a device for extracting sentences with a predetermined special meaning from a document. The method comprises the following steps: obtaining a sentence structure pattern of sentences with the predetermined special meaning; obtaining cue words, wherein a sentence containing a cue word is more likely to be a sentence with the predetermined special meaning than a sentence without it; combining the sentence structure pattern and the cue words to obtain sentence structure pattern-cue word combinations that conform to the grammatical structure of a sentence; determining the score of each sentence in the document based on the sentence structure pattern-cue word combinations it contains; and extracting the sentences with the predetermined special meaning from the document based on the scores. With this method and device, the interference caused by noise sentences can be reduced, and sentences with the predetermined special meaning can be extracted more accurately and efficiently.
Description
Technical field
The present invention relates generally to document processing and information extraction, and more specifically to a method and apparatus for extracting sentences from a document.
Background art
Many techniques have been proposed for automatically extracting sentences from a document or for generating a document summary.
Patent document US7051024 B2, entitled "Document summarizer for word processors" (Microsoft Corp.), proposes a method of automatically generating a document summary. The frequency with which each content word occurs in the document is counted, and the score of a sentence is obtained by summing the frequencies of the content words it contains; the sentences are then ranked by score. In addition, certain potentially problematic phrases or words are predefined; that document calls them cue phrases (cue-phrase). A sentence containing such a phrase or word should not be added to the summary, or should be added only when certain preconditions hold. While the content-word frequencies are being counted, the phrases in each sentence are compared with the predefined cue phrases; if a sentence contains a cue phrase, it is decided whether to exclude the sentence from the summary or to keep it conditionally as a candidate for the summary.
In addition, in US Patent 5924108, "Document summarizer for word processors" (Microsoft Corp.), whether a sentence is a key sentence is judged according to whether it contains a cue word or a combination of cue words.
In addition, in S. Teufel and M. Moens, "Sentence extraction as a classification task", in Proceedings of the ACL '97/EACL '97 Workshop on Intelligent Scalable Text Summarization (July 1997), cue phrases are used to filter meta-discourse. The cue phrases are manually divided into 5 classes, and sentences containing cue phrases of the different classes have different likelihoods of being summary sentences. Each sentence is scored on each of several features (cue phrases, position in the article, sentence length, occurrence count of dictionary words, and occurrence of proper names), which yields the likelihood that the sentence appears in the summary.
Summary of the invention
However, the conventional methods above have some problems. For instance, when sentences containing cue words are assumed to be the desired sentences, a single document usually turns out to contain many sentences that include cue words but are not desired (hereinafter referred to as noise sentences). As a result, the conventional methods often fail to find the desired sentences properly.
In addition, the inventors have found that in many cases it is desirable to extract from a document sentences that have a special meaning or serve a special role. For example, for a patent document, it is desirable to automatically extract the sentences that state the technical problem the invention is to solve. As another example, for a product description, it is desirable to extract the sentences about the product's advantages. As yet another example, for a contract, it is desirable to extract the clauses that are unfavourable to party B.
According to one aspect of the present invention, there is provided a method of extracting sentences with a predetermined special meaning from a document, which may comprise the steps of: obtaining a sentence structure pattern of sentences with the predetermined special meaning; obtaining cue words, wherein a sentence containing a cue word is more likely to be a sentence with the predetermined special meaning than a sentence not containing it; combining the sentence structure pattern and the cue words to obtain sentence structure pattern-cue word combinations that conform to the grammatical structure of a sentence; determining the score of each sentence based on the sentence structure pattern-cue word combinations contained in the sentences of the document; and extracting the sentences with the predetermined special meaning from the document based on the scores.
According to another aspect of the present invention, there is provided a device for extracting sentences with a predetermined special meaning from a document, which may comprise: a sentence structure pattern obtaining component for obtaining a sentence structure pattern of sentences with the predetermined special meaning; a cue word obtaining component for obtaining cue words, wherein a sentence containing a cue word is more likely to be a sentence with the predetermined special meaning than a sentence not containing it; a sentence structure pattern-cue word combining component for combining the sentence structure pattern and the cue words to obtain sentence structure pattern-cue word combinations that conform to the grammatical structure of a sentence; a sentence score determining component for determining the score of each sentence based on the sentence structure pattern-cue word combinations contained in the sentences of the document; and a sentence extraction component for extracting the sentences with the predetermined special meaning from the document based on the scores.
According to yet another aspect of the present invention, there is provided a method of extracting sentences with a predetermined special meaning from a document, which may comprise the steps of: obtaining a sentence structure pattern of sentences with the predetermined special meaning; obtaining cue words, wherein a sentence containing a cue word is more likely to be a sentence with the predetermined special meaning than a sentence not containing it; combining the sentence structure pattern and the cue words to obtain sentence structure pattern-cue word combinations that conform to the grammatical structure of a sentence; and extracting the sentences with the predetermined special meaning from the document based on the sentence structure pattern-cue word combinations contained in the sentences of the document.
With the method and device of the present invention for extracting sentences with a predetermined special meaning from a document, the interference caused by noise sentences can be reduced, and sentences with the predetermined special meaning can be extracted more accurately and efficiently.
Brief description of the drawings
Fig. 1 is an overall flowchart of a method of extracting sentences with a predetermined special meaning from a document according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method of extracting sentences with a predetermined special meaning from a document according to another embodiment of the present invention;
Fig. 3 is a flowchart of a method of extracting sentences with a predetermined special meaning from a document according to yet another embodiment of the present invention;
Fig. 4 is a schematic block diagram of a device for extracting sentences with a predetermined special meaning from a document according to an embodiment of the present invention; and
Fig. 5 shows an exemplary computer system in which the present invention can be practised according to an embodiment of the present invention.
Detailed description of the embodiments
To enable those skilled in the art to better understand the present invention, the invention is described in further detail below with reference to the drawings and specific embodiments.
For ease of understanding and description, the general idea of the invention is set out first. As mentioned above, relying on cue phrases alone may yield many noise sentences, that is, sentences that contain a cue phrase but are not wanted. Many sentences with a special meaning or special role, however, usually have some specific sentence structure pattern. Therefore, if both the sentence structure pattern and the cue phrases are taken into account when extracting sentences with a special meaning, a more satisfactory extraction result can be expected.
In this application, a phrase may refer to a single word or to an expression composed of several words, and a word refers to one word in Chinese or one word in English.
In addition, to avoid obscuring the main points of the invention, well-known features and structures are not described in this application. For example, sentence extraction usually begins with sentence splitting and word segmentation of the document, and the importance of a word is assessed using word features such as word frequency, inverse document frequency, word position, word length and part of speech. Many known techniques exist for sentence splitting and word segmentation, such as the ICTCLAS word segmentation tool, but these are not the aspects the present invention focuses on and are therefore not described in detail. It should be noted that this does not mean the invention cannot include these known features or structures; on the contrary, such word segmentation and word feature selection techniques may all be used with the invention.
For ease of understanding and description, the following explanation generally takes as its example the extraction, from patent documents, of sentences that describe the technical problem to be solved. It should be emphasized, however, that the invention is not limited to extracting such sentences; it can be applied to extracting any sentence with a special meaning from a document, for example sentences about product advantages from a product description, or clauses unfavourable to party B from a contract.
Fig. 1 shows an overall flowchart of a method of extracting sentences with a predetermined special meaning from a document according to an embodiment of the present invention.
As shown in Fig. 1, the method 100 of extracting sentences with a predetermined special meaning from a document according to an embodiment of the present invention may comprise: a sentence structure pattern obtaining step S110, a cue word obtaining step S120, a sentence structure pattern-cue word combining step S130, a sentence score determining step S140, and a sentence extraction step S150. Each step is described in detail below.
In step S110, the sentence structure pattern of sentences with the predetermined special meaning is obtained.
The sentence structure pattern of a sentence with the predetermined special meaning is a pattern such that a word sequence matching it is more likely to be a sentence with the predetermined special meaning than a word sequence that does not match it. For example, a sentence structure pattern can be characterized in the following respects: it contains two or more phrases; the phrases are separated by a predetermined number of punctuation marks or words; and each phrase has the part of speech that corresponds to its role in the sentence structure. For example, a phrase serving as an adverbial may be an adverbial phrase; a phrase serving as the subject may be a noun or pronoun phrase; a phrase serving as the core of the predicate may be a verb phrase; and so on.
A sentence structure pattern differs from a phrase in that it lets one glimpse or infer the framework of a sentence and the aspect the sentence is about to describe, and it is generally more complex. A phrase, by contrast, is a sentence constituent at the level between word and sentence; it has a relatively fixed meaning, but the framework of the sentence generally cannot be inferred from it.
Taking as an example the extraction of sentences that describe the technical problem solved in a patent document, typical sentence structure patterns include: "accordingly, the object of this method" (hereinafter sentence structure pattern SP1), "as a result, the problem of the paper" (hereinafter sentence structure pattern SP2), and "therefore ... what is needed for{4,20}invention" (hereinafter sentence structure pattern SP3), where {4,20} indicates that the number of intervening characters is 4 to 20.
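As an illustration only, the three example patterns could be written as regular expressions. The sketch below is a minimal one under stated assumptions: SP3 is read as "therefore ... what is needed for{4,20}invention", the {4,20} gap is counted in characters, the "..." is treated as any run of characters, and the names `SENTENCE_PATTERNS` and `matching_patterns` are hypothetical.

```python
import re

# Hypothetical regex renderings of the example patterns SP1-SP3.
# Assumption: the {4,20} gap counts arbitrary characters, not words.
SENTENCE_PATTERNS = {
    "SP1": re.compile(r"accordingly,\s*the object of this method", re.IGNORECASE),
    "SP2": re.compile(r"as a result,\s*the problem of the paper", re.IGNORECASE),
    "SP3": re.compile(r"therefore.*?what is needed for.{4,20}invention", re.IGNORECASE),
}

def matching_patterns(sentence):
    """Return the names of all sentence structure patterns found in the sentence."""
    return [name for name, rx in SENTENCE_PATTERNS.items() if rx.search(sentence)]
```

A sentence that matches none of the patterns simply yields an empty list, so the function can be used as a first filter before cue words are considered.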
The sentence structure pattern of sentences with the predetermined special meaning can be obtained, for example, by automatic learning from a training document set, or it can be defined manually by an experienced expert in the relevant field. In the case of automatic learning, the training document set may consist of a large number of training documents. For extracting technical-problem sentences from patent documents, the training document set may consist of a large number of patent documents, such as published patent applications, in which the technical-problem sentences have been labelled by manual confirmation. The sentence structure patterns of the technical-problem sentences can then be learned, for example by whole-sentence matching or partial-sentence matching, and the learned patterns can be stored.
The obtained sentence structure patterns may all be treated equally, without distinction. Alternatively, different weights may be set for the obtained sentence structure patterns, for example according to the frequency with which each pattern occurs in the training document set.
In step S120, cue words are obtained, wherein a sentence containing a cue word is more likely to be a sentence with the predetermined special meaning than a sentence not containing it.
Sentences with a special meaning usually contain certain cue words. For example, in the technical-problem sentences of patent documents, frequently occurring words include solve, provide, need, increase, decrease, optimize, high and poorer. These words can be extracted as cue words.
Likewise, the cue words can be obtained by automatic learning from a training document set or determined manually by an experienced expert.
Similarly, the different cue words may be treated equally without distinction, or different weights may be set for different cue words.
It should also be noted that the sentence structure patterns and cue words for sentences with the predetermined special meaning may be obtained from an external source. In this case they may be obtained over a network from another computing device or entered by a user; alternatively, information identified in advance may be stored in a removable storage medium such as a flash memory and then read from that medium. The method or means of obtaining them does not limit the present invention.
In step S130, the sentence structure patterns and the cue words are combined to obtain sentence structure pattern-cue word combinations that conform to the grammatical structure of a sentence.
For example, sentence structure pattern SP1, "accordingly, the object of this method", can be combined with the cue words solve, provide, need, increase, decrease and optimize, but is not suitable for combination with high or poorer.
Likewise, a large number of training documents can be used to determine which sentence structure patterns combine with which cue words, and which sentence structure patterns never or rarely combine with which cue words. The combinations can of course also be defined manually by an expert in the relevant field on the basis of experience.
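One way to picture the learning of admissible combinations from training documents is a simple co-occurrence count. The sketch below is an assumption, not the method the patent specifies: it supposes each labelled target sentence has already been reduced to a (pattern, cue) pair, and the function name `learn_combinations`, the `min_count` cut-off and the sample observations are all hypothetical.

```python
from collections import Counter

def learn_combinations(observations, min_count=2):
    """Keep (pattern, cue) pairs seen at least min_count times in labelled target sentences."""
    counts = Counter(observations)
    return {pair for pair, n in counts.items() if n >= min_count}

# Hypothetical observations: one (pattern, cue) pair per labelled training sentence.
observations = [
    ("SP1", "provide"), ("SP1", "provide"),
    ("SP1", "solve"), ("SP1", "solve"),
    ("SP1", "high"),  # seen only once, so treated as a rare combination
]
allowed = learn_combinations(observations)
```

Pairs that fall below the cut-off are discarded, which mirrors the idea that a pattern "never or rarely" combines with certain cue words.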
Similarly, the different sentence structure pattern-cue word combinations may be treated equally without distinction, or different weights may be set for different combinations. How the weights of the combinations are set or learned is described in detail later with reference to Fig. 3.
In step S140, the score of each sentence is determined based on the sentence structure pattern-cue word combinations contained in the sentences of the document.
After the sentence structure pattern-cue word combinations have been obtained through steps S110, S120 and S130, the score of each sentence in any given document (hereinafter referred to as the test document) can be calculated according to whether the sentence contains any of the obtained combinations.
For example, suppose a sentence in the document is "accordingly, the object of this method is to provide an improved inkjet printing system having a specialized orifice plate". This sentence contains the combination of "accordingly, the object of this method" (pattern SP1) and "provide" (cue word). Assuming that all sentence structure pattern-cue word combinations have the same weight of 1, the score of this sentence is 1.
Suppose another sentence in the document is "A description will be given below, with reference to the drawings, of embodiments of the present invention". Because this sentence contains no sentence structure pattern-cue word combination, its score can be set, for example, to 0.
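The two worked examples above can be reproduced with a few lines of code. This is a minimal sketch under stated assumptions: patterns are matched as lowercase substrings, cue words as whole tokens, all weights equal 1, and the score is the sum of the weights of the combinations found. The names `COMBINATIONS` and `sentence_score` are hypothetical.

```python
# Hypothetical table of pattern-cue combinations, all with weight 1 as in the example.
COMBINATIONS = {("accordingly, the object of this method", "provide"): 1.0}

def sentence_score(sentence, combinations):
    """Sum the weights of every pattern-cue combination contained in the sentence."""
    text = sentence.lower()
    words = set(text.replace(",", " ").replace(".", " ").split())
    return sum(
        weight
        for (pattern, cue), weight in combinations.items()
        if pattern in text and cue in words
    )

s1 = ("accordingly, the object of this method is to provide an improved "
      "inkjet printing system having a specialized orifice plate")
s2 = ("A description will be given below, with reference to the drawings, "
      "of embodiments of the present invention")
```

With the table above, `sentence_score(s1, COMBINATIONS)` gives 1.0 and `sentence_score(s2, COMBINATIONS)` gives 0, matching the scores in the text.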
In the score calculation above, the sentence score is computed simply according to whether the sentence contains a sentence structure pattern-cue word combination. This is only an example given for ease of understanding; other methods of calculating the sentence score are possible. For example, a weight corresponding to the sentence structure pattern may be assigned to a sentence that contains a sentence structure pattern but no cue word, and a weight corresponding to the cue word may be assigned to a sentence that contains a cue word but no sentence structure pattern.
In addition, some cue words may be provided with a list of synonyms, near-synonyms, or phrases that can serve as synonymous substitutes, together with a corresponding scaling factor such as 0.9. When a sentence is matched or searched against the sentence structure pattern-cue word combinations and no matching combination is found, the list can be consulted to check whether the sentence contains a combination of the sentence structure pattern with such a synonym, near-synonym or substitute phrase. If so, a corresponding score can be obtained, for example the score for an exact pattern-cue word match multiplied by 0.9.
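The synonym fallback can be sketched as follows. Everything here is illustrative: the synonym list for "provide" is invented, the 0.9 factor is the example value from the text, and the function name `score_with_synonyms` is hypothetical.

```python
SYNONYMS = {"provide": ["supply", "offer"]}  # hypothetical synonym list
SYNONYM_FACTOR = 0.9                         # example scaling factor from the text

def score_with_synonyms(sentence, pattern, cue, weight=1.0):
    """Score one pattern-cue combination, falling back to synonyms of the cue word."""
    text = sentence.lower()
    if pattern not in text:
        return 0.0
    words = set(text.replace(",", " ").replace(".", " ").split())
    if cue in words:
        return weight                   # exact pattern-cue match
    if any(s in words for s in SYNONYMS.get(cue, [])):
        return weight * SYNONYM_FACTOR  # pattern plus a synonym of the cue word
    return 0.0

SP1_TEXT = "accordingly, the object of this method"
```

The fallback is only tried when the exact cue word is absent, so an exact match is never down-weighted.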
In step S150, the sentences with the predetermined special meaning are extracted from the document based on the sentence scores.
For example, the sentences whose score exceeds a predetermined threshold, or the sentences ranked highest by score, can be extracted as the sentences with the predetermined special meaning.
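Both selection rules are easy to state in code. The sketch below assumes the scores are already available as (sentence, score) pairs; the function names and the sample data are hypothetical.

```python
def extract_by_threshold(scored, threshold):
    """Return the sentences whose score exceeds the threshold, in document order."""
    return [sentence for sentence, score in scored if score > threshold]

def extract_top_k(scored, k):
    """Return the k highest-scoring sentences, best first."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in ranked[:k]]

# Hypothetical (sentence, score) pairs.
scored = [("sentence a", 1.0), ("sentence b", 0.0), ("sentence c", 0.9)]
```

A threshold keeps a variable number of sentences per document, whereas top-k always returns a fixed number; which is preferable depends on how the extracted sentences are used downstream.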
The extracted sentences with the special meaning can be output to an output device such as a display or a printer, or to another electronic device for further processing.
The method of extracting sentences with a predetermined special meaning from a document according to an embodiment of the present invention has been described above with reference to Fig. 1. It should be noted, however, that this embodiment is only an example and should not be taken as limiting the invention. Many alternatives and modifications are possible without departing from the scope of protection of the invention.
For example, the step of determining the sentence score is not essential. Instead, a classification algorithm or learning algorithm can be used to extract the sentences with the predetermined special meaning directly from the document, based on the sentence structure pattern-cue word combinations contained in its sentences. In the simplest case, one merely checks whether a sentence contains a sentence structure pattern-cue word combination; if it does, the sentence is extracted as a sentence with the special meaning, and no score is explicitly calculated.
Alternatively, a decision tree can be used for classification. Each sentence structure pattern-cue word combination can then serve as the decision feature or variable of a node of the tree. For example, one node tests whether a sentence contains sentence structure pattern-cue word combination A, another node tests whether it contains combination B, the tree branches according to the results, and the classification of the sentence is finally obtained at a leaf node. The decision tree is constructed by training on a training document set. In this example, too, no score is determined when a test sentence is judged; instead, the decision tree is traversed according to the sentence structure pattern-cue word combinations the test sentence contains, and the sentence is assigned to the class of the leaf node it reaches.
As another example, when a Bayes classifier is used, the prior probabilities for the various cases can be obtained from statistics on the training document set, which gives the probability that a sentence is a special-meaning sentence given the presence of each sentence structure pattern-cue word combination. The probability that a test sentence is a special-meaning sentence is calculated on this basis, and the sentence is then classified. Here, too, no sentence score needs to be determined.
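To make the Bayes classifier variant concrete, here is a small Bernoulli naive Bayes sketch. It is one possible reading rather than the patent's specification: it assumes each training sentence is represented by the set of pattern-cue combinations it contains plus a True/False label, that both labels occur in the training data, and that Laplace smoothing is acceptable. All names are hypothetical.

```python
import math

def train_naive_bayes(samples, features):
    """samples: list of (set_of_combinations_present, label), label True = special-meaning."""
    model = {}
    for label in (True, False):
        rows = [present for present, y in samples if y == label]
        prior = len(rows) / len(samples)
        # Laplace-smoothed probability that each combination is present given the label.
        likelihood = {c: (sum(c in r for r in rows) + 1) / (len(rows) + 2) for c in features}
        model[label] = (prior, likelihood)
    return model

def probability_special(model, present, features):
    """Posterior probability that a sentence with these combinations is special-meaning."""
    log_post = {}
    for label, (prior, likelihood) in model.items():
        lp = math.log(prior)
        for c in features:
            p = likelihood[c]
            lp += math.log(p if c in present else 1 - p)
        log_post[label] = lp
    top = max(log_post.values())
    unnorm = {label: math.exp(lp - top) for label, lp in log_post.items()}
    return unnorm[True] / (unnorm[True] + unnorm[False])

# Hypothetical training data with a single combination as the only feature.
FEATURES = ["SP1-provide"]
SAMPLES = [({"SP1-provide"}, True), ({"SP1-provide"}, True), (set(), False), (set(), False)]
MODEL = train_naive_bayes(SAMPLES, FEATURES)
```

A test sentence would then be classified as special-meaning when the returned probability exceeds 0.5, without any separate score being computed.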
The decision tree and the Bayes classifier have been used above to illustrate training and testing with a learning algorithm. These are only examples; other learning algorithms, such as logistic regression classification and rule-based methods, may also be used with the invention. The use of logistic regression classification to calculate the weights of the sentence structure pattern-cue word combinations and to classify test sentences is described in detail below with reference to Fig. 3.
In addition, the cue words above are only cue words with a positive effect, that is, a sentence containing the cue word is considered more likely to be a sentence with the special meaning than a sentence not containing it. This is only an example. Cue words with a negative effect can also be introduced, and a penalty factor can be set for sentences containing such a cue word, for example to reduce their score; alternatively, sentences containing a negative cue word can simply be excluded from the sentences with the special meaning.
In addition, the examples above consider only the sentence structure patterns of sentences with the predetermined special meaning. Alternatively or additionally, the sentence structure patterns of noise sentences can be obtained, a noise sentence being a sentence that contains a cue word but is not a sentence with the predetermined special meaning; it is then judged whether each sentence in the document matches a noise sentence structure pattern, and the sentences judged to be noise sentences are removed from the document. For example, when extracting sentences that describe the technical problem to be solved from patent documents, one noise sentence pattern may be "invention ... problem."
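The noise-sentence filter can be sketched with a regular expression. This is an illustration only: the "..." in the example pattern "invention ... problem." is read here as any run of characters, and the names `NOISE_PATTERNS` and `drop_noise` are hypothetical.

```python
import re

# Hypothetical regex for the example noise pattern "invention ... problem."
NOISE_PATTERNS = [re.compile(r"invention.*problem\.", re.IGNORECASE)]

def drop_noise(sentences):
    """Remove every sentence that matches a noise sentence structure pattern."""
    return [s for s in sentences if not any(rx.search(s) for rx in NOISE_PATTERNS)]

sentences = [
    "The invention relates to a known problem.",
    "Accordingly, the object of this method is to provide a printer.",
]
```

Applying the filter before scoring keeps noise sentences from ever competing with the target sentences.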
In addition, the sentence structure patterns of certain fixed-form sentences that do not have the special meaning can also be considered; a sentence with such a pattern generally does not have the desired special meaning. It is then checked whether a sentence matches such a pattern, and if it does, the sentence is excluded from the sentences with the special meaning, or a penalty factor is set for it.
In addition, the sentence structure patterns of sentences without the special meaning can be combined with negative cue words. It is then checked whether a sentence matches such a combination, and if it does, the sentence is excluded from the sentences with the special meaning, or a penalty factor is set for it.
It should also be noted that the term document here (whether training document or test document) is a broad concept; it may be a complete document in the ordinary sense or a part of a document.
Fig. 2 is a flowchart of a method 200 of extracting sentences with a predetermined special meaning from a document according to another embodiment of the present invention.
Steps S210, S220 and S250 shown in Fig. 2 are substantially the same as steps S110, S120 and S150 shown in Fig. 1, and their description is omitted here.
The method 200 shown in Fig. 2 differs from the method 100 shown in Fig. 1 in that it introduces cue word clusters: the method no longer works at the level of individual cue words but at the level of cue word clusters, or groups of cue words. The reason is that in some cases there may be a great many cue words, so that the number of sentence structure pattern-cue word combinations grows sharply, especially when there are also many sentence structure patterns. Working in units of cue word clusters then greatly reduces the complexity and computational cost of the problem and saves resources.
Specifically, in step S221, the cue words obtained in step S220 are clustered to obtain a number of cue word clusters.
Clustering is an unsupervised machine learning technique for grouping individuals or samples into classes, where each individual can be regarded as a point in a feature space. The basic idea is to group points that are close together and densely packed in the feature space into one class or cluster.
In the cue word clustering here, each word is a sample, and the similarity between words can be regarded as the distance between them. Accordingly, various existing clustering algorithms can be applied to the present invention, for example the clustering algorithm described in Zhiyuan Liu, Peng Li, Yabin Zheng and Maosong Sun, "Clustering to Find Exemplar Terms for Keyphrase Extraction", in the proceedings of the natural language processing conference EMNLP 2009, pages 257-266.
The number k of clusters finally obtained may be predetermined, for example as the number of keywords specified by the user or the system, or it may be left undetermined and decided by the final result of the clustering algorithm.
The objective function of the clustering may be that the cue words in the same cluster have the same semantics, or the same grammatical role in a sentence and the same part of speech. The objective function may also take into account factors such as the distance between clusters and/or the number of members in each cluster. The clustering method may be a semantics-based clustering method, a grammar-based clustering method, a combination of the two, and so on.
The similarity between words may be determined in advance and stored in a word similarity database, or it may be calculated on the fly from the document being processed. The similarity between words can be calculated with the mutual information method, with statistical methods such as the log-likelihood ratio or the chi-squared test, or with knowledge-based methods that use a dictionary (for example WordNet or HowNet).
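As one concrete instance of a mutual-information style similarity, the sketch below computes pointwise mutual information (PMI) from sentence-level co-occurrence. This is an assumed variant chosen for brevity; the text does not say which form of mutual information is meant, and the function name `pmi_table` and the toy sentences are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(sentences):
    """Build a PMI function from sentence-level co-occurrence counts."""
    n = len(sentences)
    word_count = Counter()
    pair_count = Counter()
    for s in sentences:
        words = set(s.lower().split())
        word_count.update(words)
        pair_count.update(frozenset(p) for p in combinations(sorted(words), 2))

    def pmi(a, b):
        joint = pair_count[frozenset((a, b))]
        if joint == 0:
            return float("-inf")  # the two words never co-occur
        return math.log2(joint * n / (word_count[a] * word_count[b]))

    return pmi

# Hypothetical toy corpus.
pmi = pmi_table(["increase speed", "increase speed", "decrease cost", "decrease cost"])
```

A higher PMI means the two words co-occur more often than chance would predict, so the negated or inverted value can serve as the distance fed to a clustering algorithm.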
A simple example of a clustering process is described below. For the cue words in the example above, namely solve, provide, need, increase, decrease, optimize, high and poorer, grouping by part of speech (verb versus adjective) yields 2 clusters: "solve, provide, need, increase, decrease, optimize" and "high, poorer" (hereinafter cluster C3).
Then, according to semantics, for example whether a word expresses solving or improving, the cue words solve, provide, need, increase, decrease and optimize can be further divided into 2 clusters: "solve, provide, need" (hereinafter C1) and "increase, decrease, optimize" (hereinafter C2). In total, 3 cue word clusters C1, C2 and C3 are thus obtained.
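The two-stage grouping in this example can be mimicked with lookup tables. The sketch below is purely illustrative: the part-of-speech and sense labels are assigned by hand here, whereas a real system would derive them from a tagger and a similarity measure, and the names `POS`, `SENSE` and `cluster_cues` are hypothetical.

```python
from collections import defaultdict

# Hypothetical hand-assigned labels for the eight example cue words.
POS = {"solve": "verb", "provide": "verb", "need": "verb", "increase": "verb",
       "decrease": "verb", "optimize": "verb", "high": "adj", "poorer": "adj"}
SENSE = {"solve": "resolve", "provide": "resolve", "need": "resolve",
         "increase": "improve", "decrease": "improve", "optimize": "improve"}

def cluster_cues(cues):
    """Group cue words first by part of speech, then by coarse sense."""
    clusters = defaultdict(list)
    for cue in cues:
        clusters[(POS[cue], SENSE.get(cue, ""))].append(cue)
    return list(clusters.values())

CUES = ["solve", "provide", "need", "increase", "decrease", "optimize", "high", "poorer"]
```

Calling `cluster_cues(CUES)` yields three groups that correspond to C1, C2 and C3 in the text.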
The cue words, the cue word clusters and the number of cue word clusters above are only examples; different cue words, cue word clusters and numbers of clusters can be designed as required.
A major benefit of introducing cue word clusters is that all words in a cluster are considered to have the same status, role, weight and so on. These factors therefore need not be considered for each individual cue word, which reduces the processing workload.
Step S230 differs from step S130 shown in Fig. 1 in that it does not combine sentence structure patterns with cue words but combines sentence structure patterns with cue word clusters, to obtain sentence structure pattern-cue word cluster combinations that conform to the grammatical structure of a sentence.
For example, for the typical sentence structure patterns mentioned above, "accordingly, the object of this method" (SP1), "as a result, the problem of the paper" (SP2) and "therefore ... what is needed for{4,20}invention" (SP3), and the cue word clusters "solve, provide, need" (C1), "increase, decrease, optimize" (C2) and "high, poorer" (C3), the following meaningful sentence structure pattern-cue word cluster combinations can be obtained: SP1-C1, SP1-C2, SP2-C3, SP3-C2 and SP3-C3.
When weights are considered, the weight of each individual sentence structure pattern-cue word combination is no longer considered separately; only the weight of each sentence structure pattern-cue word cluster combination is considered. This further reduces the processing workload.
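To see how much the cluster level saves, the five pattern-cluster combinations listed above can be expanded back into individual pattern-cue word pairs. The sketch below is a counting illustration only; the names `CLUSTERS`, `PATTERN_CLUSTER` and `expand` are hypothetical.

```python
# Cue word clusters C1-C3 and the five pattern-cluster combinations from the text.
CLUSTERS = {
    "C1": {"solve", "provide", "need"},
    "C2": {"increase", "decrease", "optimize"},
    "C3": {"high", "poorer"},
}
PATTERN_CLUSTER = {("SP1", "C1"), ("SP1", "C2"), ("SP2", "C3"), ("SP3", "C2"), ("SP3", "C3")}

def expand(pattern_cluster, clusters):
    """Expand pattern-cluster combinations into individual (pattern, cue word) pairs."""
    return {(p, cue) for p, c in pattern_cluster for cue in clusters[c]}

pairs = expand(PATTERN_CLUSTER, CLUSTERS)
```

In this example, 5 cluster-level weights stand in for the 13 word-level weights that would otherwise have to be set or learned.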
At step S240, based on test document, whether comprise the combination of sentence structure pattern-cue bunch, determine the mark of sentence.Thereby at step S250, the mark based on sentence extracts the sentence acquiring a special sense from document.
Similarly, the extracting method of above-mentioned Special Significance sentence is only example.Can in sentence extracting method, consider further noise sentence structure pattern and/or play negates the cue of effect.
Fig. 3 is a flowchart of a method 300 for extracting sentences with a predetermined special significance from a document according to another embodiment of the present invention.
Steps S310, S320, S321, S330, and S350 shown in Fig. 3 are substantially the same as steps S210, S220, S221, S230, and S250 shown in Fig. 2, and their detailed description is omitted here.
Method 300 shown in Fig. 3 differs from method 200 shown in Fig. 2 in that it has an additional step S331 for determining the weight of each combined sentence structure pattern-cue word cluster. Step S340 of determining the sentence score may differ correspondingly.
The weights of the sentence structure pattern-cue word clusters can be calculated by using a classification method that, through training, classifies the sentences in a training document set into sentences with the predetermined special significance and sentences without it. The classification method may be at least one of, or a combination of, a logistic regression classification method, a Bayesian classification method, a rule-based method, and a function-based method.
An example of determining the weights of the sentence structure pattern-cue word cluster combinations by the logistic regression classification method is given below.
Suppose the score of a sentence is denoted by the variable z, and the sentence structure pattern-cue word cluster combinations (k in total) are denoted by x1, x2, ..., xk. When a linear logistic regression classification method is used, the sentence score z can be expressed by the following linear logistic regression formula (1).
z = β0 + β1*x1 + β2*x2 + … + βk*xk    (1)
Here β1, β2, ..., βk are the coefficients of the sentence structure pattern-cue word cluster combinations x1, x2, ..., xk, that is, the weights of the respective combinations, and β0 is the constant term.
When the sentence structure pattern-cue word cluster combinations are SP1-C1, SP1-C2, and SP2-C3, k=3 and formula (1) becomes formula (2):
z = β0 + β1*x1 + β2*x2 + β3*x3    (2)
Using a training document set, for each sentence the corresponding score is set according to whether it is a sentence with the predetermined special significance, and the values of x1, x2, x3 are determined according to whether the sentence contains the corresponding sentence structure pattern-cue word cluster combination (for example, the value is 1 if the combination is present and 0 if it is not). The coefficients β0, β1, β2, β3 can then be solved for; β1, β2, β3 are the weights of the combinations SP1-C1, SP1-C2, and SP2-C3.
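One way the coefficients could be estimated is sketched below with plain batch gradient descent on the logistic loss. The toy training data, the learning rate, the number of epochs, and the function names are all assumptions made for illustration; the text does not specify a fitting procedure, and a library implementation of logistic regression could be used instead.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=2000):
    """Fit beta[0] (constant term) and beta[1..k] (combination weights).

    features: list of 0/1 vectors [x1, ..., xk], one per training sentence.
    labels:   1 if the sentence has the predetermined special significance, else 0.
    """
    k = len(features[0])
    n = len(features)
    beta = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for x, y in zip(features, labels):
            z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
            error = sigmoid(z) - y
            grad[0] += error
            for j, xi in enumerate(x, start=1):
                grad[j] += error * xi
        beta = [b - learning_rate * g / n for b, g in zip(beta, grad)]
    return beta


# Toy data (hypothetical): columns are x1 = SP1-C1, x2 = SP1-C2, x3 = SP2-C3.
TRAIN_X = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0], [0, 0, 0]]
TRAIN_Y = [1, 1, 1, 0, 0, 0, 0]
```

In this toy set SP1-C1 occurs only in positive sentences and SP2-C3 only in a negative one, so the fitted weight for SP1-C1 comes out positive and the weight for SP2-C3 negative.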
Given the weights of the sentence structure pattern-cue word cluster combinations SP1-C1, SP1-C2, and SP2-C3, each sentence in the test document is matched against all sentence structure pattern-cue word cluster patterns, and the weights of the matched patterns are accumulated linearly to obtain the score of the sentence.
For example, suppose sentence S is "accordingly, the object of this method is to provide an improved inkjet printing system having a specialized orifice plate". It matches the SP1-C1 pattern, so the score of the sentence is
Score(S) = β0 + β1.
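The linear accumulation of weights and the subsequent extraction can be sketched as follows. The numeric weights, the constant term, and the use of a fixed threshold to decide which sentences are extracted are assumptions for illustration only; the text states merely that extraction is based on the score.

```python
def sentence_score(matched, weights, constant):
    """Score(S) = beta0 plus the weights of every matched combination."""
    return constant + sum(weights[name] for name in matched)


def extract_sentences(scored_sentences, threshold):
    """Keep sentences whose score exceeds the (assumed) threshold."""
    return [sentence for sentence, score in scored_sentences if score > threshold]


# Hypothetical values standing in for trained coefficients.
BETA0 = -1.0
WEIGHTS = {"SP1-C1": 2.5, "SP1-C2": 0.5, "SP2-C3": -0.5}
```

For a sentence that matches only SP1-C1, `sentence_score(["SP1-C1"], WEIGHTS, BETA0)` evaluates β0 + β1 as in the example above; a sentence that matches no combination receives just the constant term.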
In the above example, the logistic regression classification method is used to determine the weights of the sentence structure pattern-cue word cluster combinations. Alternatively, the weights may be determined by at least one of, or a combination of, for example, a Bayesian classification method, a rule-based method, and a function-based method.
Fig. 4 is a schematic block diagram of a device 400 for extracting sentences with a predetermined special significance from a document according to an embodiment of the present invention.
The device 400 for extracting sentences with a predetermined special significance from a document may comprise: a sentence structure pattern obtaining component 410 for obtaining sentence structure patterns of sentences with the predetermined special significance; a cue word obtaining component 420 for obtaining cue words, wherein a sentence containing a cue word is more likely to be a sentence with the predetermined special significance than a sentence not containing it; a sentence structure pattern-cue word combining component 430 for combining the sentence structure patterns and the cue words to obtain combined sentence structure pattern-cue words that conform to sentence grammatical structure; a sentence score determining component 440 for determining the score of a sentence based on the sentence structure pattern-cue words contained in the sentence in the document; and a sentence extracting component 450 for extracting sentences with the predetermined special significance from the document based on the sentence scores.
The sentence structure patterns of sentences with the predetermined special significance can be obtained by automatic learning from a training document set, or by manual definition.
A sentence structure pattern of a sentence with the predetermined special significance can mean that a character string matching the structure pattern is more likely to be a sentence with the predetermined special significance than a character string that does not match any structure pattern.
The device 400 may further comprise a component for determining the weight of each combined sentence structure pattern-cue word, in which case the sentence score determining component determines the score of a sentence based on the sentence structure pattern-cue words contained in each sentence in the document and the weights of the corresponding sentence structure pattern-cue words.
The device 400 may further comprise a component for clustering the obtained cue words to obtain cue word clusters. In that case the sentence structure pattern-cue word combining component 430 combines the sentence structure patterns with the cue word clusters. The device 400 may further comprise a component for determining the weight of each combined sentence structure pattern-cue word cluster. The sentence score determining component 440 can then determine the score of a sentence based on the sentence structure pattern-cue word clusters contained in each sentence in the document and the weights of the corresponding sentence structure pattern-cue word clusters.
The component for determining the weight of each combined sentence structure pattern-cue word cluster can calculate the weights by using a classification method to classify the sentences of a training document set into sentences with the predetermined special significance and sentences without it. The classification method may be at least one of, or a combination of, a logistic regression classification method, a Bayesian classification method, a rule-based method, and a function-based method.
The device 400 may further comprise: a component for obtaining sentence structure patterns of noise sentences, a noise sentence being a sentence that contains a cue word but is not a sentence with the predetermined special significance; a component for judging whether a sentence in the document matches a sentence structure pattern of a noise sentence; and a component for deleting from the document the sentences judged to be noise sentences.
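The noise-sentence handling could look roughly like the sketch below, in which noise sentence structure patterns are regular expressions and any sentence matching one of them is dropped before scoring. The single example pattern and the function name are hypothetical; the text does not give concrete noise patterns.

```python
import re

# Hypothetical noise pattern: a sentence that contains a cue word such as
# "provide" but does not state the object of the invention.
NOISE_PATTERNS = [re.compile(r"^it is not necessary to\b", re.I)]


def remove_noise_sentences(sentences, noise_patterns=NOISE_PATTERNS):
    """Return the sentences that match none of the noise sentence structure patterns."""
    return [
        s for s in sentences
        if not any(pattern.search(s) for pattern in noise_patterns)
    ]
```

Filtering before scoring keeps cue words that appear in noise sentences from raising the score of a sentence that lacks the predetermined special significance.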
Fig. 5 is a schematic diagram of an exemplary computer system 700 in which the present invention can be practiced according to an embodiment of the present invention.
With reference to Fig. 5, an example of a hardware configuration that implements the above-described device is described. A CPU (central processing unit) 701 performs various kinds of processing according to programs stored in a ROM (read-only memory) 702 or a storage section 708. For example, the CPU executes a program implementing the method, described in the above embodiments, of extracting sentences with a predetermined special significance from a document. A RAM (random access memory) 703 stores, as appropriate, the programs executed by the CPU 701, data, and the like. The CPU 701, the ROM 702, and the RAM 703 are interconnected by a bus 704.
The CPU 701 is connected to an input/output interface 705 via the bus 704. An input section 706 including a keyboard, a mouse, a microphone, etc. and an output section 707 including a display, a speaker, etc. are connected to the input/output interface 705. The CPU 701 performs various kinds of processing according to instructions input from the input section 706, and outputs the results of the processing to the output section 707.
The storage section 708 connected to the input/output interface 705 comprises, for example, a hard disk, and stores the programs executed by the CPU 701 and various data. A communication section 709 communicates with external devices through a network such as the Internet or a local area network.
A drive 710 connected to the input/output interface 705 drives a removable medium 711 such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, and obtains the programs, data, etc. recorded on it. The obtained programs and data are transferred to the storage section 708 when needed and stored there.
The basic principle of the present invention has been described above in conjunction with specific embodiments. It should be noted, however, that those of ordinary skill in the art will understand that all or any of the steps or components of the method and device of the present invention can be implemented in hardware, firmware, software, or a combination thereof, in any computing device (including a processor, a storage medium, etc.) or network of computing devices. Those of ordinary skill in the art can achieve this using their basic programming skills after reading the description of the present invention.
Therefore, the object of the present invention can also be achieved by running a program or a set of programs on any computing device. The computing device may be a well-known general-purpose device. The object of the present invention can thus also be achieved merely by providing a program product containing program code that implements the method or device. That is, such a program product also constitutes the present invention, and a storage medium storing such a program product also constitutes the present invention. Obviously, the storage medium may be any known storage medium or any storage medium developed in the future.
It should also be pointed out that in the device and method of the present invention, each component or each step can obviously be decomposed and/or recombined. Such decomposition and/or recombination should be regarded as equivalent solutions of the present invention. Moreover, the steps of the above series of processing can naturally be performed chronologically in the order described, but need not necessarily be performed in that order; the order of execution may be changed. For example, there is no strict order between the step of correcting identification information based on historical identification information and the step of correcting identification information based on the relationships between objects.
The above embodiments do not limit the scope of protection of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations, and substitutions can occur. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.
Claims (10)
1. A method for extracting sentences with a predetermined special significance from a document, comprising the steps of:
obtaining a sentence structure pattern of a sentence with the predetermined special significance;
obtaining a cue word, wherein a sentence containing the cue word is more likely to be a sentence with the predetermined special significance than a sentence not containing the cue word;
combining the sentence structure pattern and the cue word to obtain a combined sentence structure pattern-cue word that conforms to sentence grammatical structure;
determining the score of a sentence based on the sentence structure pattern-cue word contained in the sentence in the document, so that both the sentence structure pattern and the cue word are taken into account when extracting sentences with the special significance; and
extracting sentences with the predetermined special significance from the document based on the scores of the sentences.
2. The method of claim 1, wherein the sentence structure pattern of the sentence with the predetermined special significance is obtained by automatic learning from a training document set, or is obtained by manual definition.
3. The method of claim 1, wherein the sentence structure pattern of the sentence with the predetermined special significance means that a character string matching the structure pattern is more likely to be a sentence with the predetermined special significance than a character string not matching the structure pattern.
4. The method of claim 1, further comprising:
determining the weight of each combined sentence structure pattern-cue word;
wherein determining the score of a sentence comprises: determining the score of the sentence based on the sentence structure pattern-cue words contained in each sentence in the document and the weights of the corresponding sentence structure pattern-cue words.
5. The method of claim 4, wherein:
the obtained cue words are clustered to obtain cue word clusters;
the sentence structure patterns and the cue word clusters are combined;
the weight of each combined sentence structure pattern-cue word cluster is determined; and
the score of a sentence is determined based on the sentence structure pattern-cue word clusters contained in each sentence in the document and the weights of the corresponding sentence structure pattern-cue word clusters.
6. The method of claim 1, wherein the weights of the sentence structure pattern-cue words are calculated by using a classification method to classify the sentences in a training document set into sentences with the predetermined special significance and sentences without the predetermined special significance.
7. The method of claim 5, wherein the weights of the sentence structure pattern-cue word clusters are calculated by using a classification method to classify the sentences in a training document set into sentences with the predetermined special significance and sentences without the predetermined special significance.
8. The method of claim 1, further comprising:
obtaining a sentence structure pattern of a noise sentence, a noise sentence being a sentence that contains a cue word but is not a sentence with the predetermined special significance;
judging whether a sentence in the document matches the sentence structure pattern of a noise sentence; and
deleting from the document the sentences judged to be noise sentences.
9. A device for extracting sentences with a predetermined special significance from a document, comprising:
a sentence structure pattern obtaining component for obtaining a sentence structure pattern of a sentence with the predetermined special significance;
a cue word obtaining component for obtaining a cue word, wherein a sentence containing the cue word is more likely to be a sentence with the predetermined special significance than a sentence not containing the cue word;
a sentence structure pattern-cue word combining component for combining the sentence structure pattern and the cue word to obtain a combined sentence structure pattern-cue word that conforms to sentence grammatical structure;
a sentence score determining component for determining the score of a sentence based on the sentence structure pattern-cue word contained in the sentence in the document, so that both the sentence structure pattern and the cue word are taken into account when extracting sentences with the special significance; and
a sentence extracting component for extracting sentences with the predetermined special significance from the document based on the scores of the sentences.
10. A method for extracting sentences with a predetermined special significance from a document, comprising the steps of:
obtaining a sentence structure pattern of a sentence with the predetermined special significance;
obtaining a cue word, wherein a sentence containing the cue word is more likely to be a sentence with the predetermined special significance than a sentence not containing the cue word;
combining the sentence structure pattern and the cue word to obtain a combined sentence structure pattern-cue word that conforms to sentence grammatical structure; and
extracting sentences with the predetermined special significance from the document based on the sentence structure pattern-cue word contained in the sentences in the document, so that both the sentence structure pattern and the cue word are taken into account when extracting sentences with the special significance.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201010268675.6A CN102385574B (en) | 2010-09-01 | 2010-09-01 | Method and device for extracting sentences from document |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102385574A CN102385574A (en) | 2012-03-21 |
CN102385574B true CN102385574B (en) | 2014-08-20 |
Family
ID=45824995
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201010268675.6A Expired - Fee Related CN102385574B (en) | 2010-09-01 | 2010-09-01 | Method and device for extracting sentences from document |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102385574B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108959312B (en) | 2017-05-23 | 2021-01-29 | 华为技术有限公司 | Method, device and terminal for generating multi-document abstract |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5924108A (en) * | 1996-03-29 | 1999-07-13 | Microsoft Corporation | Document summarizer for word processors |
CN1470047A (en) * | 2000-11-20 | 2004-01-21 | Hewlett-Packard Company | Method of vector analysis for a document |
US7051024B2 (en) * | 1999-04-08 | 2006-05-23 | Microsoft Corporation | Document summarizer for word processors |
CN101382962A (en) * | 2008-10-29 | 2009-03-11 | 西北工业大学 | Superficial layer analyzing and auto document summary method based on abstraction degree of concept |
CN101398814A (en) * | 2007-09-26 | 2009-04-01 | 北京大学 | Method and system for simultaneously abstracting document summarization and key words |
Also Published As
Publication number | Publication date |
---|---|
CN102385574A (en) | 2012-03-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20140820 Termination date: 20200901 |