CN108268431A - Method and apparatus for paragraph vectorization - Google Patents

Method and apparatus for paragraph vectorization

Info

Publication number
CN108268431A
CN108268431A
Authority
CN
China
Prior art keywords
paragraph
feature
word
conversion
knowledge base
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611260591.1A
Other languages
Chinese (zh)
Other versions
CN108268431B (en)
Inventor
石鹏 (Shi Peng)
姜珂 (Jiang Ke)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201611260591.1A priority Critical patent/CN108268431B/en
Priority to PCT/CN2017/112593 priority patent/WO2018121145A1/en
Publication of CN108268431A publication Critical patent/CN108268431A/en
Application granted granted Critical
Publication of CN108268431B publication Critical patent/CN108268431B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/12 - Use of codes for handling textual entities
    • G06F40/151 - Transformation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/12 - Use of codes for handling textual entities
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing

Abstract

The invention discloses a method and apparatus for paragraph vectorization. The method includes: constructing a feature set containing multiple feature words; substituting words in a paragraph to be processed on the basis of a preset knowledge base to obtain a converted paragraph; and taking the words in the converted paragraph that belong to the feature set as the features of the converted paragraph, and vectorizing the converted paragraph. The invention solves the technical problem in the prior art that, when paragraphs are vectorized, the sentence vectors obtained by computing distances from words and sentence context and then clustering cannot reflect the content-structure features of normative texts.

Description

Method and apparatus for paragraph vectorization
Technical field
The present invention relates to the field of natural language processing, and in particular to a method and apparatus for paragraph vectorization.
Background technology
The vectorization of natural language is a difficult task in NLP (Natural Language Processing). It is the foundation on which the various natural-language models are applied, and the quality of the vectorization directly affects the final accuracy. Although many companies use various vectorization techniques, and open-source platforms offer vectorization tools such as word2vector and sentence2vector, it is difficult for a single, unified abstraction to extract exactly the feature points that a given kind of document and a given demand actually require. For example, when parsing legal normative texts, the small paragraphs of a text need to be grouped into large paragraphs according to their content; each small paragraph must therefore be vectorized so that its segmentation information points can be extracted. Existing vectorization techniques, however, are essentially methods that compute distances from words and sentence context and then obtain sentence vectors by clustering. Texts of relatively narrow scope, such as legal documents, have a neat structure and standardized language, so the content structure described in the text and the key descriptive content need to be vectorized and classified; the vectors produced by existing vectorization techniques cannot reflect the content-structure features of such texts. Moreover, legal documents are rigorous documents: they contain many keywords whose contexts are similar but whose actual meanings differ greatly, and existing vectorization techniques cannot distinguish these keywords.
For the above problem in the prior art, namely that the sentence vectors obtained by computing distances from words and sentence context and then clustering cannot reflect the content-structure features of normative texts when paragraphs are vectorized, no effective solution has yet been proposed.
Summary of the invention
Embodiments of the present invention provide a method and apparatus for paragraph vectorization, so as at least to solve the technical problem in the prior art that the sentence vectors obtained by computing distances from words and sentence context and then clustering cannot reflect the content-structure features of normative texts when paragraphs are vectorized.
According to one aspect of the embodiments of the present invention, a method of paragraph vectorization is provided, including: constructing a feature set containing multiple feature words; substituting words in a paragraph to be processed on the basis of a preset knowledge base to obtain a converted paragraph; and taking the words in the converted paragraph that belong to the feature set as the features of the converted paragraph, and vectorizing the converted paragraph.
According to another aspect of the embodiments of the present invention, an apparatus for paragraph vectorization is further provided, including: a construction module for constructing a feature set containing multiple feature words; a conversion module for substituting words in a paragraph to be processed on the basis of a preset knowledge base to obtain a converted paragraph; and a vectorization module for taking the words in the converted paragraph that belong to the feature set as the features of the converted paragraph and vectorizing the converted paragraph.
In the embodiments of the present invention, a feature set containing multiple feature words is constructed in advance; words in the paragraph to be processed are then substituted on the basis of a preset knowledge base to obtain a converted paragraph; finally, the words in the converted paragraph that belong to the feature set are taken as the features of the converted paragraph, and the converted paragraph is vectorized, achieving the purpose of vectorizing the paragraph. When features are selected for the paragraph to be processed, only the words of the converted paragraph that belong to the pre-constructed feature set are selected, so the selected features are the ones that best embody the structural features of the paragraph. The resulting vector can therefore reflect the content-structure features of the paragraph, which solves the technical problem in the prior art that the sentence vectors obtained by computing distances from words and sentence context and then clustering cannot reflect the content-structure features of normative texts.
Description of the drawings
The accompanying drawings described here are provided for a further understanding of the present invention and form a part of this application. The illustrative embodiments of the present invention and their description are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a flowchart of a method of paragraph vectorization according to Embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of a legal normative text according to Embodiment 1 of the present invention;
Fig. 3 is a structural diagram of an apparatus for paragraph vectorization according to Embodiment 2 of the present invention;
Fig. 4 is a structural diagram of an optional apparatus for paragraph vectorization according to Embodiment 2 of the present invention;
Fig. 5 is a structural diagram of another optional apparatus for paragraph vectorization according to Embodiment 2 of the present invention; and
Fig. 6 is a structural diagram of yet another optional apparatus for paragraph vectorization according to Embodiment 2 of the present invention.
Specific embodiment
To give those skilled in the art a better understanding of the solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative work shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in an order other than the one illustrated or described here. In addition, the terms "comprising" and "having" and any variants of them are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product, or device.
Embodiment 1
According to an embodiment of the present invention, a method embodiment of a method of paragraph vectorization is provided. It should be noted that the steps shown in the flowchart of the accompanying drawings can be performed in a computer system such as a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, the steps shown or described may in some cases be performed in an order different from the one given here.
Fig. 1 is the method for paragraph vectorization according to embodiments of the present invention, as shown in Figure 1, this method comprises the following steps:
Step S102: construct a feature set containing multiple feature words.
Specifically, a feature word is a word that can, to some degree, characterize a text. Since the present invention is mainly directed at the vectorization of paragraphs, the feature words in the feature set of the present invention are mainly words that can, to some degree, characterize a paragraph. The present invention does not limit the number of feature words that the feature set contains, and the feature set can be constructed either by directly selecting multiple words as feature words, or by collecting a large number of labeled paragraph samples and summarizing the features of those labeled samples.
Step S104: substitute words in the paragraph to be processed on the basis of a preset knowledge base to obtain a converted paragraph.
Specifically, when a paragraph, i.e. the paragraph to be processed, needs to be vectorized, each word in the paragraph to be processed is first given an abstracting transformation; that is, all or some of the words in the paragraph are transformed into another representation, for example converted to a dictionary word, or represented only by its part of speech, or replaced by the category of the word. The conversion is performed on the basis of a preset knowledge base; optionally, the preset knowledge base can provide a large conversion vocabulary. After every word in the paragraph to be processed has been converted, that is, after the whole paragraph has been converted, the converted paragraph is obtained.
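As a concrete illustration of this substitution step, the following Python sketch replaces each word of a paragraph with its knowledge-base representation. The knowledge-base entries and the marker conventions (# for dictionary words, @ for named entities, % for class numbers, as described later in this embodiment) are toy assumptions for illustration, not the patent's actual data:

```python
# Toy stand-ins for the three knowledge bases; entries are illustrative only.
DICT_TERMS = {"cause_of_action": "#cause_of_action"}             # specialized-vocabulary dictionary
NAMED_ENTITIES = {"Zhang San": "@nh", "Li Si": "@nh"}            # named-entity recognition (person -> @nh)
SYNONYM_CLASSES = {"request": "%Dk17B23", "demand": "%Dk17B23"}  # Cilin-style word-class numbers

def abstract_paragraph(words):
    """Replace every word that the knowledge bases cover; keep the rest as-is."""
    result = []
    for w in words:
        result.append(
            DICT_TERMS.get(w) or NAMED_ENTITIES.get(w) or SYNONYM_CLASSES.get(w) or w
        )
    return result

print(abstract_paragraph(["Zhang San", "request", "trial"]))
# -> ['@nh', '%Dk17B23', 'trial']
```

Note how "Zhang San" and "Li Si" collapse to the same token "@nh": different names with the same role become one feature, which is exactly what lowers the complexity of the converted paragraph.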
Step S106: take the words in the converted paragraph that belong to the feature set as the features of the converted paragraph, and vectorize the converted paragraph.
Specifically, a converted paragraph may contain many features. If the prior art were used to extract features from the converted paragraph, many meaningless words, or words that cannot characterize the features of the paragraph, would be extracted. When the present invention performs feature extraction on the converted paragraph, it extracts only the words in the converted paragraph that belong to the feature set, that is, only the words that can characterize the features of the paragraph, and takes them as the features of the converted paragraph; the converted paragraph can then be vectorized.
Optionally, once the features of the converted paragraph have been determined, the converted paragraph can be vectorized using a vectorization approach from the prior art, namely by counting the word frequency of each feature word. There are many mature word-frequency statistics methods, such as hashing or trie-tree variants.
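The word-frequency vectorization described above can be sketched as a bag-of-feature-words count over a fixed feature order; the feature names below are hypothetical:

```python
from collections import Counter

def vectorize(paragraph_words, feature_words):
    """Bag-of-feature-words: count only the words that belong to the feature
    set, in a fixed feature order, and ignore everything else."""
    counts = Counter(w for w in paragraph_words if w in set(feature_words))
    return [counts[f] for f in feature_words]

features = ["@nh", "#cause_of_action", "%Dk17B23"]  # hypothetical feature set
print(vectorize(["@nh", "court", "%Dk17B23", "@nh"], features))
# -> [2, 0, 1]   ("court" is not a feature word, so it is dropped)
```

The fixed ordering of `feature_words` is what makes vectors from different paragraphs comparable component by component.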
In the embodiments of the present invention, a feature set containing multiple feature words is constructed in advance; words in the paragraph to be processed are then substituted on the basis of a preset knowledge base to obtain a converted paragraph; finally, the words in the converted paragraph that belong to the feature set are taken as the features of the converted paragraph, and the converted paragraph is vectorized, achieving the purpose of vectorizing the paragraph. When features are selected for the paragraph to be processed, only the words of the converted paragraph that belong to the pre-constructed feature set are selected, so the selected features are the ones that best embody the structural features of the paragraph. The resulting vector can therefore reflect the content-structure features of the paragraph; moreover, during feature selection, regular features that the human eye cannot perceive can be gathered, and the complexity is low. This solves the technical problem in the prior art that the sentence vectors obtained by computing distances from words and sentence context and then clustering cannot reflect the content-structure features of normative texts.
In an optional embodiment, step S102 includes:
Step S202: substitute the words in a paragraph collection containing multiple paragraphs on the basis of the preset knowledge base to obtain a converted paragraph collection.
Step S204: determine the features of the converted paragraph collection.
Step S206: choose a preset number of features from the features of the converted paragraph collection to constitute the feature set.
Specifically, if the feature set is constructed by collecting a large number of labeled paragraph samples and summarizing their features, a large number of labeled paragraph samples must first be collected, that is, a paragraph collection containing multiple paragraphs is constructed. All or some of the words in each paragraph of the collection are then substituted on the basis of the preset knowledge base, yielding the abstracted, converted paragraph collection. The features of the converted paragraph collection must then be determined. Because the feature statistics contain many meaningless words and words that cannot characterize the features of a paragraph, and because the number of words that can characterize those features may also be large, only the more critical words can be kept; a preset number of features therefore has to be selected from the large number of counted features, and these finally constitute the feature set.
In an optional embodiment, choosing a preset number of features from the features of the converted paragraph collection in step S206 includes:
Step S302: calculate the information gain of each feature among the features of the converted paragraph collection.
Step S304: choose a preset number of features from the features of the converted paragraph collection in descending order of information gain.
Specifically, the preset number of features can be chosen by calculating the gain of each feature's information entropy. After the features of the converted paragraph collection have been determined, the information-entropy gain of each feature is calculated, and the preset number of features with the larger gains is chosen. It should be noted here that the preset number can be set arbitrarily; the present invention does not specifically limit its concrete value.
Specifically, entropy is a widely used measure in information theory that can characterize the purity of an arbitrary collection of samples. For a target attribute A and a sample collection S in which A takes c different values, the classification entropy of S relative to these c states is:

Entropy(S) = -Σ (i = 1 … c) p_i · log2(p_i)

where p_i is the proportion of S belonging to class i. It should be noted that the base of the logarithm remains 2; if the target attribute has c possible values, the entropy can be at most log2(c).
With entropy as the standard for measuring the purity of a collection of training samples, a measure of the effectiveness of an attribute in classifying the training data can now be defined. This standard is called the information gain. Simply put, the information gain of an attribute is the expected reduction in entropy caused by partitioning the samples according to this attribute. More precisely, the information gain Gain(S, A) of an attribute A relative to a sample collection S is defined as:

Gain(S, A) = Entropy(S) - Σ (v ∈ Values(A)) (|S_v| / |S|) · Entropy(S_v)

where Values(A) is the set of all possible values of attribute A, and S_v is the subset of S for which attribute A has the value v. The first term of this formula is the entropy of the original collection S, and the second term is the expected value of the entropy after S is partitioned by A: the weighted sum of the entropies of the subsets, with the weight of each subset S_v being the proportion |S_v|/|S| of samples it contains. Gain(S, A) is therefore the expected reduction in entropy caused by knowing the value of attribute A; in other words, it is the information about the target function value that is obtained from the value of attribute A. When the target value of an arbitrary member of S is encoded, the value of Gain(S, A) is the number of bits that can be saved once the value of the attribute is known.
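The definitions above can be sketched directly in Python. Treating each counted feature as a binary present/absent attribute of a labeled paragraph sample is an assumption for illustration, as are the toy inputs:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the label classes of S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(samples, labels, feature):
    """Gain(S, A) for the binary attribute 'feature is present in the sample'."""
    gain = entropy(labels)
    n = len(labels)
    for present in (True, False):
        subset = [lab for s, lab in zip(samples, labels) if (feature in s) == present]
        if subset:  # subtract the weighted entropy of each non-empty partition
            gain -= len(subset) / n * entropy(subset)
    return gain

def select_features(gains, n):
    """Keep the n features with the largest information gain (step S304)."""
    return [f for f, _ in sorted(gains.items(), key=lambda kv: -kv[1])[:n]]

# A feature that perfectly separates the two classes has gain equal to
# the entropy of the labels (here 1.0 bit).
samples = [{"x"}, {"x"}, {"y"}, {"y"}]
labels = [0, 0, 1, 1]
print(info_gain(samples, labels, "x"))  # -> 1.0
```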
In an optional embodiment, before the words in the paragraph collection containing multiple paragraphs are substituted on the basis of the preset knowledge base in step S202, the method includes: Step S402: cluster the words with identical meanings in the paragraph collection using a first-order dependency grammar.
Specifically, an identical root in a paragraph may have different meanings depending on the context in which it appears, and different words may have the same meaning. Therefore, before the words in the paragraph collection containing multiple paragraphs are substituted on the basis of the preset knowledge base, the words with identical meanings can be gathered using dependency grammar, specifically a first-order dependency grammar. By using dependency grammar, the present invention can solve the technical problem of existing vectorization techniques, whose vectorization quality suffers because they cannot distinguish the many keywords whose contexts are similar but whose actual meanings differ greatly.
In an optional embodiment, the knowledge base includes a Cilin word-category knowledge base (Cilin, literally "word forest", being a Chinese synonym lexicon), a specialized-vocabulary dictionary knowledge base, and a named-entity recognition knowledge base.
Specifically, the Cilin word-category knowledge base is a knowledge base that classifies a large number of roots according to their meaning. Optionally, each class of words with the same meaning can be assigned a class number. Part of the Cilin word-category knowledge base is shown in the following table:
Specifically, the specialized-vocabulary dictionary knowledge base can be a knowledge base formed from the vocabulary and dictionaries of a certain professional field. For the legal profession, for example, the specialized-vocabulary dictionary knowledge base may include vocabularies or dictionaries such as cause of action, property-preservation measures, party role, statute abbreviation, court, nationality, administrative omission, classification within administrative law, administrative act, administrative conduct, marital status, technical case cause, technical case keyword, role, amount detail, cause of civil action, and ethnicity.
Specifically, the named-entity recognition knowledge base can be a knowledge base for named-entity recognition, where a named entity is a person name, place name, organization name, or any other entity identified by a name. On the basis of the named-entity recognition knowledge base, person names (NH), place names (NS), organization names (NI), and so on can be identified.
In an optional embodiment, after a paragraph is converted on the basis of the above three knowledge bases, the converted paragraph is, for example: @nh=>%Dk17B23 @nh=>#cause_of_civil_action @nh=>#case_cause #cause_of_civil_action=>%Dk17B23 #case_cause=>%Dk17B23 #administrative_act=>%Dk17B23 %Dk17B23=>@nh. Optionally, in the embodiments of the present invention, the conversion rules can be preset before the paragraph is abstracted. For example, it can be stipulated that "=>" represents a modification relation: if A and B each represent a replaced word, "A=>B" means that the word A modifies the word B. A dictionary word substituted from the specialized-vocabulary dictionary knowledge base begins with "#"; a named entity recognized by the named-entity recognition knowledge base begins with "@"; a word replaced by its class number from the Cilin word-category knowledge base begins with "%"; and the individual features are separated by spaces. Under these preset conversion rules, any person name, whether "Zhang San", "Li Si", or "Wang Wu", is represented in the converted paragraph as "@nh". The converted paragraph therefore has relatively low complexity, and the number of features is reduced when feature statistics are computed, which saves statistics cost and improves statistical efficiency; moreover, the features of the converted paragraph are more salient and better show the content-structure features of the paragraph.
In an optional embodiment, if the features of 10,000 paragraphs are summarized, 150,000 distinct features are counted. Because the 150,000 distinct features contain many idiosyncratic words, that is, meaningless words, or words that cannot characterize (or cannot characterize well) the structural features of a paragraph, a preset number of features must be selected from the 150,000 features. The preset number can be set to 3,000, and the 3,000 features with the largest information-entropy gains can be selected. The computed information-entropy gains of some of the features in the converted paragraph collection are shown in the following table:
In an optional embodiment, the present invention is suitable for dividing the many small paragraphs of a normative text into large paragraphs according to their content. Fig. 2 shows a legal normative text in which paragraph "4: Plaintiff: Tang **", paragraph "5: Authorized agent: Han *, Li **", and paragraph "6: Defendant: Wang *" can in fact be merged into a single litigation-participants paragraph. By vectorizing these three paragraphs with the paragraph-vectorization method of the present invention, their information points are obtained from the vectorization results, so that the three paragraphs can be merged into one large paragraph.
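The patent does not specify the merging criterion; one plausible sketch is to merge adjacent small paragraphs whose feature vectors are similar, using cosine similarity. The 3-feature vectors below are invented for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two paragraph vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented 3-feature vectors for three consecutive small paragraphs.
v_plaintiff = [2, 1, 0]   # "Plaintiff: ..."
v_agent = [2, 1, 1]       # "Authorized agent: ..."
v_facts = [0, 0, 3]       # start of an unrelated section

# Merge adjacent paragraphs that are more similar to each other than
# to the paragraph that follows them.
should_merge = cosine(v_plaintiff, v_agent) > cosine(v_agent, v_facts)
print(should_merge)  # -> True
```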
Embodiment 2
According to an embodiment of the present invention, an apparatus embodiment of an apparatus for paragraph vectorization is provided. Fig. 3 shows the apparatus for paragraph vectorization according to this embodiment of the present invention; as shown in Fig. 3, the apparatus includes a construction module 101, a conversion module 103, and a vectorization module 105.
The construction module 101 is used to construct a feature set containing multiple feature words; the conversion module 103 is used to substitute words in the paragraph to be processed on the basis of a preset knowledge base to obtain a converted paragraph; and the vectorization module 105 is used to take the words in the converted paragraph that belong to the feature set as the features of the converted paragraph and to vectorize the converted paragraph.
In the embodiments of the present invention, the construction module 101 constructs in advance a feature set containing multiple feature words; the conversion module 103 then substitutes words in the paragraph to be processed on the basis of a preset knowledge base to obtain a converted paragraph; finally, the vectorization module 105 takes the words in the converted paragraph that belong to the feature set as the features of the converted paragraph and vectorizes the converted paragraph, achieving the purpose of vectorizing the paragraph. When features are selected for the paragraph to be processed, only the words of the converted paragraph that belong to the pre-constructed feature set are selected, so the selected features are the ones that best embody the structural features of the paragraph. The resulting vector can therefore reflect the content-structure features of the paragraph; during feature selection, regular features that the human eye cannot perceive can be gathered, and the complexity is low. This solves the technical problem in the prior art that the sentence vectors obtained by computing distances from words and sentence context and then clustering cannot reflect the content-structure features of normative texts.
It should be noted here that the above construction module 101, conversion module 103, and vectorization module 105 correspond to steps S102 to S106 in Embodiment 1. The examples and application scenarios realized by these modules and the corresponding steps are the same, but are not limited to what is disclosed in Embodiment 1 above. It should be noted that, as a part of the apparatus, the above modules can run in a computer system such as a set of computer-executable instructions.
In an optional embodiment, as shown in Fig. 4, the construction module 101 includes a substitution module 201, a determination module 203, and a selection module 205. The substitution module 201 is used to substitute the words in a paragraph collection containing multiple paragraphs on the basis of the preset knowledge base to obtain a converted paragraph collection; the determination module 203 is used to determine the features of the converted paragraph collection; and the selection module 205 is used to choose a preset number of features from the features of the converted paragraph collection to constitute the feature set.
It should be noted here that the above substitution module 201, determination module 203, and selection module 205 correspond to steps S202 to S206 in Embodiment 1. The examples and application scenarios realized by these modules and the corresponding steps are the same, but are not limited to what is disclosed in Embodiment 1 above. It should be noted that, as a part of the apparatus, the above modules can run in a computer system such as a set of computer-executable instructions.
In an optional embodiment, as shown in Fig. 5, the selection module 205 includes a calculation module 301 and a selection submodule 303. The calculation module 301 is used to calculate the information gain of each feature among the features of the converted paragraph collection; the selection submodule 303 is used to choose a preset number of features from the features of the converted paragraph collection in descending order of information gain.
It should be noted here that the above calculation module 301 and selection submodule 303 correspond to steps S302 to S304 in Embodiment 1. The examples and application scenarios realized by these modules and the corresponding steps are the same, but are not limited to what is disclosed in Embodiment 1 above. It should be noted that, as a part of the apparatus, the above modules can run in a computer system such as a set of computer-executable instructions.
In an optional embodiment, as shown in Fig. 6, the construction module 101 further includes a clustering module 401, configured to cluster words of identical meaning in the paragraph set using first-order dependency grammar before the alternative module 201 substitutes the words in the paragraph set comprising multiple paragraphs based on the preset knowledge base.
It should be noted here that the clustering module 401 corresponds to step S402 in Embodiment 1. The examples and application scenarios realized by this module and the corresponding step are the same, but are not limited to the content disclosed in Embodiment 1 above. It should also be noted that, as part of the apparatus, this module may be executed in a computer system such as a set of computer-executable instructions.
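The patent does not spell out how first-order dependency grammar drives the clustering. One plausible reading, sketched below purely under that assumption, is to represent each word by its immediate (head, relation) dependency contexts and greedily merge words whose context sets overlap strongly. Both the `contexts` input format and the 0.5 threshold are illustrative choices, not taken from the patent.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 for two empty sets)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_by_dependency_context(contexts, threshold=0.5):
    """Greedy single-link clustering: merge a word into the first cluster
    containing some word whose first-order dependency contexts (sets of
    (head, relation) pairs) overlap with it at or above the threshold."""
    clusters = []
    for word, ctx in contexts.items():
        for cluster in clusters:
            if any(jaccard(ctx, contexts[w]) >= threshold for w in cluster):
                cluster.add(word)
                break
        else:
            clusters.append({word})
    return clusters
```

Words that attach to the same heads under the same relations (e.g. both serving as the object of "drive") end up in one cluster, which is one way "identical meaning" could be operationalized before substitution.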
In an optional embodiment, the knowledge base includes a word-forest word-category knowledge base, a specialized vocabulary dictionary knowledge base, and a named entity recognition knowledge base.
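A toy illustration of how the three knowledge-base layers named above might drive substitution, and of how the converted paragraph is then vectorized against the feature set. All dictionary entries here are hypothetical, and the precedence order among the layers is an assumption for the sketch, not something the patent specifies.

```python
# Hypothetical instances of the three knowledge-base layers.
WORD_FOREST = {"automobile": "car", "auto": "car"}            # word-forest word categories
DOMAIN_DICT = {"LSTM": "neural_network"}                      # specialized vocabulary
NER_BASE = {"Beijing": "<LOCATION>", "Alice": "<PERSON>"}     # named entities

def substitute(word):
    """Apply the three layers in turn; the first layer that knows the word wins."""
    for kb in (NER_BASE, DOMAIN_DICT, WORD_FOREST):
        if word in kb:
            return kb[word]
    return word

def vectorize(paragraph, feature_words):
    """Binary vector over the feature set: 1 iff the feature word occurs
    in the converted paragraph (the vectorization step of claim 1)."""
    converted = {substitute(w) for w in paragraph}
    return [1 if f in converted else 0 for f in feature_words]
```

For example, "Alice drives an automobile" converts to a set containing "<PERSON>" and "car", so only those feature positions are set.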
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be realized in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units may be a division of logical functions, and other division modes are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. Further, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in each embodiment of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be realized in the form of hardware or in the form of a software functional unit.
If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (10)

  1. A method of paragraph vectorization, characterized by comprising:
    constructing a feature set comprising multiple feature words;
    substituting words in a paragraph to be processed based on a preset knowledge base to obtain a converted paragraph;
    taking the words in the converted paragraph that belong to the feature set as features of the converted paragraph, and vectorizing the converted paragraph.
  2. The method according to claim 1, characterized in that constructing the feature set comprising multiple feature words comprises:
    substituting words in a paragraph set comprising multiple paragraphs based on the preset knowledge base to obtain a converted paragraph set;
    determining features of the converted paragraph set;
    selecting a preset number of features from the features of the converted paragraph set to form the feature set.
  3. The method according to claim 2, characterized in that selecting the preset number of features from the features of the converted paragraph set comprises:
    calculating the information gain of each feature of the converted paragraph set;
    selecting the preset number of features from the features of the converted paragraph set in descending order of information gain.
  4. The method according to claim 2, characterized in that, before substituting the words in the paragraph set comprising multiple paragraphs based on the preset knowledge base, the method comprises:
    clustering words of identical meaning in the paragraph set using first-order dependency grammar.
  5. The method according to any one of claims 1-4, characterized in that the knowledge base comprises a word-forest word-category knowledge base, a specialized vocabulary dictionary knowledge base, and a named entity recognition knowledge base.
  6. A device for paragraph vectorization, characterized by comprising:
    a construction module, configured to construct a feature set comprising multiple feature words;
    a conversion module, configured to substitute words in a paragraph to be processed based on a preset knowledge base to obtain a converted paragraph;
    a vectorization module, configured to take the words in the converted paragraph that belong to the feature set as features of the converted paragraph and vectorize the converted paragraph.
  7. The device according to claim 6, characterized in that the construction module comprises:
    an alternative module, configured to substitute words in a paragraph set comprising multiple paragraphs based on the preset knowledge base to obtain a converted paragraph set;
    a determining module, configured to determine features of the converted paragraph set;
    a selection module, configured to select a preset number of features from the features of the converted paragraph set to form the feature set.
  8. The device according to claim 7, characterized in that the selection module comprises:
    a computing module, configured to calculate the information gain of each feature of the converted paragraph set;
    a selection submodule, configured to select the preset number of features from the features of the converted paragraph set in descending order of information gain.
  9. The device according to claim 7, characterized in that the construction module further comprises:
    a clustering module, configured to cluster words of identical meaning in the paragraph set using first-order dependency grammar before the alternative module substitutes the words in the paragraph set comprising multiple paragraphs based on the preset knowledge base.
  10. The device according to any one of claims 6-9, characterized in that the knowledge base comprises a word-forest word-category knowledge base, a specialized vocabulary dictionary knowledge base, and a named entity recognition knowledge base.
CN201611260591.1A 2016-12-30 2016-12-30 The method and apparatus of paragraph vectorization Active CN108268431B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201611260591.1A CN108268431B (en) 2016-12-30 2016-12-30 The method and apparatus of paragraph vectorization
PCT/CN2017/112593 WO2018121145A1 (en) 2016-12-30 2017-11-23 Method and device for vectorizing paragraph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611260591.1A CN108268431B (en) 2016-12-30 2016-12-30 The method and apparatus of paragraph vectorization

Publications (2)

Publication Number Publication Date
CN108268431A true CN108268431A (en) 2018-07-10
CN108268431B CN108268431B (en) 2019-12-03

Family

ID=62707839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611260591.1A Active CN108268431B (en) 2016-12-30 2016-12-30 The method and apparatus of paragraph vectorization

Country Status (2)

Country Link
CN (1) CN108268431B (en)
WO (1) WO2018121145A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111538832A (en) * 2019-02-02 2020-08-14 富士通株式会社 Apparatus and method for event annotation of document and recording medium

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116702723A (en) * 2018-12-25 2023-09-05 创新先进技术有限公司 Training method, device and equipment for contract paragraph annotation model
CN110472231B (en) * 2019-07-11 2023-05-12 创新先进技术有限公司 Method and device for identifying legal document case
CN110674635B (en) * 2019-09-27 2023-04-25 北京妙笔智能科技有限公司 Method and device for dividing text paragraphs
CN111611342B (en) * 2020-04-09 2023-04-18 中南大学 Method and device for obtaining lexical item and paragraph association weight
CN117688138B (en) * 2024-02-02 2024-04-09 中船凌久高科(武汉)有限公司 Long text similarity comparison method based on paragraph division

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101034409A (en) * 2007-03-29 2007-09-12 浙江大学 Search method for human motion based on data drive and decision tree analysis
CN102081631A (en) * 2009-11-30 2011-06-01 国际商业机器公司 Answer support system and method
CN103500195A (en) * 2013-09-18 2014-01-08 小米科技有限责任公司 Updating method, device, system and equipment for classifier
CN104199972A (en) * 2013-09-22 2014-12-10 中科嘉速(北京)并行软件有限公司 Named entity relation extraction and construction method based on deep learning
CN104281610A (en) * 2013-07-08 2015-01-14 腾讯科技(深圳)有限公司 Method and device for filtering microblogs
CN105512104A (en) * 2015-12-02 2016-04-20 上海智臻智能网络科技股份有限公司 Dictionary dimension reducing method and device and information classifying method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101470728B (en) * 2007-12-25 2011-06-08 北京大学 Method and device for automatically abstracting text of Chinese news web page
US20140337355A1 (en) * 2013-05-13 2014-11-13 Gnoetics, Inc. Indexed Natural Language Processing
CN104239340B (en) * 2013-06-19 2018-03-16 北京搜狗信息服务有限公司 Search result screening technique and device
CN105117397B (en) * 2015-06-18 2018-08-28 浙江大学 A kind of medical files semantic association search method based on ontology
CN106202010B (en) * 2016-07-12 2019-11-26 重庆兆光科技股份有限公司 Method and apparatus based on deep neural network building Law Text syntax tree

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101034409A (en) * 2007-03-29 2007-09-12 浙江大学 Search method for human motion based on data drive and decision tree analysis
CN102081631A (en) * 2009-11-30 2011-06-01 国际商业机器公司 Answer support system and method
CN104281610A (en) * 2013-07-08 2015-01-14 腾讯科技(深圳)有限公司 Method and device for filtering microblogs
CN103500195A (en) * 2013-09-18 2014-01-08 小米科技有限责任公司 Updating method, device, system and equipment for classifier
CN104199972A (en) * 2013-09-22 2014-12-10 中科嘉速(北京)并行软件有限公司 Named entity relation extraction and construction method based on deep learning
CN105512104A (en) * 2015-12-02 2016-04-20 上海智臻智能网络科技股份有限公司 Dictionary dimension reducing method and device and information classifying method and device

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111538832A (en) * 2019-02-02 2020-08-14 富士通株式会社 Apparatus and method for event annotation of document and recording medium

Also Published As

Publication number Publication date
CN108268431B (en) 2019-12-03
WO2018121145A1 (en) 2018-07-05

Similar Documents

Publication Publication Date Title
CN108268431B (en) The method and apparatus of paragraph vectorization
CN106021364B (en) Foundation, image searching method and the device of picture searching dependency prediction model
CN110097085B (en) Lyric text generation method, training method, device, server and storage medium
CN106326288B (en) Image search method and device
CN110021439A (en) Medical data classification method, device and computer equipment based on machine learning
CN107229610A (en) The analysis method and device of a kind of affection data
CN108021679A (en) A kind of power equipments defect file classification method of parallelization
CN110334110A (en) Natural language classification method, device, computer equipment and storage medium
CN109740660A (en) Image processing method and device
CN112464638A (en) Text clustering method based on improved spectral clustering algorithm
CN110502632A (en) Contract terms reviewing method, device, computer equipment and storage medium based on clustering algorithm
CN110472040A (en) Extracting method and device, storage medium, the computer equipment of evaluation information
CN109299264A (en) File classification method, device, computer equipment and storage medium
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
CN108733644A (en) A kind of text emotion analysis method, computer readable storage medium and terminal device
US20230123941A1 (en) Multiscale Quantization for Fast Similarity Search
CN110097096A (en) A kind of file classification method based on TF-IDF matrix and capsule network
CN110969172A (en) Text classification method and related equipment
CN110110800A (en) Automatic image marking method, device, equipment and computer readable storage medium
CN112527958A (en) User behavior tendency identification method, device, equipment and storage medium
CN109902157A (en) A kind of training sample validation checking method and device
CN109543029A (en) File classification method, device, medium and equipment based on convolutional neural networks
Yadav et al. A comparative study of deep learning methods for hate speech and offensive language detection in textual data
CN105159927B (en) Method and device for selecting subject term of target text and terminal
JP4143234B2 (en) Document classification apparatus, document classification method, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant