CN108268431B - The method and apparatus of paragraph vectorization - Google Patents
- Publication number
- CN108268431B CN108268431B CN201611260591.1A CN201611260591A CN108268431B CN 108268431 B CN108268431 B CN 108268431B CN 201611260591 A CN201611260591 A CN 201611260591A CN 108268431 B CN108268431 B CN 108268431B
- Authority
- CN
- China
- Prior art keywords
- paragraph
- feature
- word
- conversion
- knowledge base
- Prior art date
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
(all under G — Physics; G06 — Computing; Calculating or Counting; G06F — Electric digital data processing)
- G06F40/237 — Handling natural language data; Natural language analysis; Lexical tools
- G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
- G06F40/151 — Text processing; Use of codes for handling textual entities; Transformation
- G06F16/35 — Information retrieval of unstructured textual data; Clustering; Classification
- G06F40/12 — Text processing; Use of codes for handling textual entities
- G06F40/205 — Natural language analysis; Parsing
Abstract
The invention discloses a method and apparatus for paragraph vectorization. The method comprises: constructing a feature set containing multiple feature words; substituting words in a paragraph to be processed based on a preset knowledge base to obtain a converted paragraph; and taking the words in the converted paragraph that belong to the feature set as the features of the converted paragraph, and vectorizing the converted paragraph. The invention solves the technical problem in the prior art that, when a paragraph is vectorized, distances are computed from word and sentence context and the sentence vectors then obtained by methods such as clustering cannot reflect the content-structure features of normative text.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a method and apparatus for paragraph vectorization.
Background technique
Vectorization of natural language is a difficult task in NLP (Natural Language Processing). It is the foundation of various natural-language models, and the quality of the vectorization directly affects the final accuracy. Although many companies apply various vectorization techniques, and open-source platforms offer vectorization tools such as word2vector and sentence2vector, it is difficult for a single unified, abstract method to extract the feature points that a particular demand actually requires across different document characteristics and different needs. For example, when parsing legally normative text, the small paragraphs of the text must be merged into large paragraphs according to content, so each small paragraph must be vectorized in order to extract the segmentation information points. However, existing vectorization techniques essentially compute distances from word and sentence context and then compute sentence vectors by methods such as clustering. Because texts of a relatively standardized scope, such as legal documents, have a neat structure and normative language, the content structure and the key descriptive content of the text must be vectorized and classified, and the vectors produced by existing vectorization techniques cannot reflect the content-structure features of such text. Moreover, legal documents are rigorous documents containing many keywords whose contexts are similar but whose actual meanings differ greatly, and existing vectorization techniques cannot distinguish these keywords.
For the above problem in the prior art, namely that when a paragraph is vectorized, distances are computed from word and sentence context and the sentence vectors then obtained by methods such as clustering cannot reflect the content-structure features of normative text, no effective solution has yet been proposed.
Summary of the invention
Embodiments of the present invention provide a method and apparatus for paragraph vectorization, at least to solve the technical problem in the prior art that, when a paragraph is vectorized, distances are computed from word and sentence context and the sentence vectors then obtained by methods such as clustering cannot reflect the content-structure features of normative text.
According to one aspect of the embodiments of the present invention, a method of paragraph vectorization is provided, comprising: constructing a feature set containing multiple feature words; substituting words in a paragraph to be processed based on a preset knowledge base to obtain a converted paragraph; and taking the words in the converted paragraph that belong to the feature set as the features of the converted paragraph, and vectorizing the converted paragraph.
According to another aspect of the embodiments of the present invention, an apparatus for paragraph vectorization is also provided, comprising: a construction module for constructing a feature set containing multiple feature words; a conversion module for substituting words in a paragraph to be processed based on a preset knowledge base to obtain a converted paragraph; and a vectorization module for taking the words in the converted paragraph that belong to the feature set as the features of the converted paragraph and vectorizing the converted paragraph.
In the embodiments of the present invention, a feature set containing multiple feature words is constructed in advance; words in the paragraph to be processed are then substituted based on a preset knowledge base to obtain a converted paragraph; finally, the words in the converted paragraph that belong to the feature set are taken as the features of the converted paragraph, and the converted paragraph is vectorized. This achieves the goal of vectorizing a paragraph. In the embodiments of the present invention, feature selection for the paragraph to be processed selects those words of the converted paragraph that belong to the prebuilt feature set, so the features remaining after selection are the ones that best embody the structural characteristics of the paragraph. The technical effect is that the final vector reflects the content structure of the paragraph, which solves the technical problem in the prior art that, when a paragraph is vectorized, distances are computed from word and sentence context and the sentence vectors then obtained by methods such as clustering cannot reflect the content-structure features of normative text.
Brief description of the drawings
The drawings described here are provided for further understanding of the present invention and constitute part of this application. The illustrative embodiments of the present invention and their description serve to explain the invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a flowchart of a method of paragraph vectorization according to Embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of a legally normative text according to Embodiment 1 of the present invention;
Fig. 3 is a structural diagram of an apparatus for paragraph vectorization according to Embodiment 2 of the present invention;
Fig. 4 is a structural diagram of an optional apparatus for paragraph vectorization according to Embodiment 2 of the present invention;
Fig. 5 is a structural diagram of another optional apparatus for paragraph vectorization according to Embodiment 2 of the present invention; and
Fig. 6 is a structural diagram of yet another optional apparatus for paragraph vectorization according to Embodiment 2 of the present invention.
Detailed description of the embodiments
To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely in conjunction with the accompanying drawings. Evidently, the described embodiments are only a part of the embodiments of the present invention rather than all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative work shall fall within the scope protected by the present invention.
It should be noted that the terms "first", "second", and the like in the description, claims, and drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in an order other than those illustrated or described here. Furthermore, the terms "comprise" and "have", and any variants of them, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such a process, method, product, or device.
Embodiment 1
According to an embodiment of the present invention, a method embodiment of a method of paragraph vectorization is provided. It should be noted that the steps shown in the flowchart of the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in an order different from the one here.
Fig. 1 shows the method of paragraph vectorization according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step S102: construct a feature set containing multiple feature words.
Specifically, a feature word is a word that can characterize a text to some degree. Since the present invention is aimed primarily at paragraph vectorization, the feature words in the feature set of the present invention are mainly words that can characterize a paragraph to some degree. The present invention places no limit on the number of feature words contained in the feature set, and the feature set can be constructed either by directly selecting multiple vocabulary items as feature words, or by collecting a large number of annotated paragraph samples and summarizing the features of those annotated samples.
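The two construction routes just described, direct selection of vocabulary items or summarization from annotated samples, can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation; all words and names are invented examples.

```python
# Sketch of Step S102: build a feature set either from a hand-picked
# vocabulary, from annotated paragraph samples, or from both.
def build_feature_set(direct_words=None, annotated_paragraphs=None):
    """Return the set of feature words characterizing paragraphs."""
    features = set(direct_words or [])
    # Every token appearing in an annotated sample is a candidate feature.
    for tokens in (annotated_paragraphs or []):
        features.update(tokens)
    return features

feature_set = build_feature_set(
    direct_words=["plaintiff", "defendant", "court"],
    annotated_paragraphs=[["plaintiff", "agent"], ["defendant", "court"]],
)
print(sorted(feature_set))  # ['agent', 'court', 'defendant', 'plaintiff']
```

In practice the candidate set produced this way is far too large, which is why the later steps prune it by information gain.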
Step S104: substitute words in the paragraph to be processed based on a preset knowledge base to obtain a converted paragraph.
Specifically, when a paragraph, i.e., the paragraph to be processed, needs to be vectorized, each word in the paragraph to be processed must first undergo an abstracting transformation; that is, all or some of the words in the paragraph to be processed are transformed into other representations, for example converted to dictionary words, represented only by part of speech, or substituted by the category of the word. The conversion is performed based on a preset knowledge base; optionally, the preset knowledge base can provide a large conversion vocabulary. After each word in the paragraph to be processed has been converted, i.e., the whole paragraph has been converted, the converted paragraph is obtained.
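A minimal sketch of this substitution step, assuming the knowledge base is a simple word-to-representation mapping. The mapping entries and category codes below are invented for illustration and are not taken from the patent's knowledge bases.

```python
# Sketch of Step S104: replace each word the knowledge base knows with its
# abstract representation (dictionary word, entity tag, or category code).
knowledge_base = {
    "Zhang San": "@nh",         # named entity -> person-name tag (assumed)
    "plaintiff": "#plaintiff",  # specialized dictionary word (assumed)
    "sued": "%Dk17B23",         # word-forest category number (assumed code)
}

def convert_paragraph(tokens, kb):
    """Substitute every known token; keep unknown tokens unchanged."""
    return [kb.get(tok, tok) for tok in tokens]

converted = convert_paragraph(["plaintiff", "Zhang San", "sued"], knowledge_base)
print(converted)  # ['#plaintiff', '@nh', '%Dk17B23']
```

Once every word of the paragraph has been mapped this way, the resulting token list is the converted paragraph used by the following steps.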
Step S106: take the words in the converted paragraph that belong to the feature set as the features of the converted paragraph, and vectorize the converted paragraph.
Specifically, the converted paragraph will contain many features. If the prior art were used to extract features from the converted paragraph, many meaningless words, or words that cannot characterize the paragraph's features, would be extracted. When the present invention extracts features from the converted paragraph, only those words of the converted paragraph that belong to the feature words in the feature set are extracted; that is, only words capable of characterizing the paragraph's features are extracted as the features of the converted paragraph. The converted paragraph can then be vectorized.
Optionally, after the features of the converted paragraph have been determined, the converted paragraph can be vectorized using a vectorization method of the prior art by counting the word frequency of the feature words. Many mature methods exist for word-frequency counting, such as hashing or a transformed trie tree.
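The frequency-based vectorization can be sketched as follows: filter the converted paragraph down to feature words, then emit one count per position of an ordered feature list. The feature tokens are illustrative.

```python
from collections import Counter

# Sketch of Step S106: vectorize a converted paragraph by the frequency
# of the feature words it contains, in a fixed feature order.
def vectorize(converted_tokens, feature_words):
    """Return a frequency vector aligned with the ordered feature list."""
    wanted = set(feature_words)
    counts = Counter(t for t in converted_tokens if t in wanted)
    return [counts[f] for f in feature_words]

features = ["@nh", "#plaintiff", "%Dk17B23"]
vec = vectorize(["#plaintiff", "@nh", "%Dk17B23", "@nh", "filler"], features)
print(vec)  # [2, 1, 1]
```

A hash table (as here, via `Counter`) or a trie could equally hold the counts; the choice only affects counting cost, not the resulting vector.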
In the embodiments of the present invention, a feature set containing multiple feature words is constructed in advance; words in the paragraph to be processed are then substituted based on a preset knowledge base to obtain a converted paragraph; finally, the words in the converted paragraph that belong to the feature set are taken as the features of the converted paragraph, and the converted paragraph is vectorized. This achieves the goal of vectorizing a paragraph. In the embodiments of the present invention, feature selection for the paragraph to be processed selects those words of the converted paragraph that belong to the prebuilt feature set, so the features remaining after selection are the ones that best embody the structural characteristics of the paragraph. The technical effect is that the final vector reflects the content structure of the paragraph; moreover, during feature selection, regular features imperceptible to the human eye can be gathered, and the complexity is low. This solves the technical problem in the prior art that, when a paragraph is vectorized, distances are computed from word and sentence context and the sentence vectors then obtained by methods such as clustering cannot reflect the content-structure features of normative text.
In an optional embodiment, step S102 comprises:
Step S202: substitute the words in a paragraph set containing multiple paragraphs based on the preset knowledge base to obtain a converted paragraph set.
Step S204: determine the features of the converted paragraph set.
Step S206: choose a preset quantity of features from the features of the converted paragraph set to constitute the feature set.
Specifically, if the feature set is constructed by collecting a large number of annotated paragraph samples and summarizing the features of those samples, a large number of annotated paragraph samples must first be collected, i.e., a paragraph set containing multiple paragraphs is built. Then all or some of the words in each paragraph of the paragraph set are substituted based on the preset knowledge base, yielding the converted paragraph set after the abstracting conversion. Next, the features of the converted paragraph set must be determined. Because the feature statistics contain many meaningless words and words that cannot characterize a paragraph's features, and because the number of words that can characterize a paragraph's features may also be large, only the more critical words can be kept. It is therefore necessary to select a preset quantity of features from the large number of counted features, which finally constitute the feature set.
In an optional embodiment, choosing a preset quantity of features from the features of the converted paragraph set in step S206 comprises:
Step S302: calculate the information gain of each feature among the features of the converted paragraph set.
Step S304: choose a preset quantity of features from the features of the converted paragraph set in descending order of information gain.
Specifically, the preset quantity of features can be chosen by calculating the information-entropy gain of each feature. After the features of the converted paragraph set have been determined, the entropy gain of each feature can be calculated, and the preset quantity of features with the larger entropy gains are chosen. It should be noted here that the preset quantity can be set as desired; the present invention does not specifically limit its value.
Specifically, entropy is a widely used measure in information theory that characterizes the purity of an arbitrary sample set. For a target attribute A that can take c different values, the classification entropy of a sample set S relative to these c states is:

Entropy(S) = Σ (i = 1..c) −p_i · log2(p_i)

where p_i is the proportion of S belonging to class i. Note that the base of the logarithm remains 2; if the target attribute has c possible values, the entropy can be at most log2(c).
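The classification entropy just described can be computed directly from the class proportions; a short sketch (illustrative, not the patent's code):

```python
import math

# Entropy(S) = sum over classes of -p_i * log2(p_i), the purity measure
# of a sample set S with respect to a c-valued target attribute.
def entropy(class_proportions):
    return sum(-p * math.log2(p) for p in class_proportions if p > 0)

# A two-class set split 50/50 attains the maximum entropy log2(2) = 1.
print(entropy([0.5, 0.5]))  # 1.0
```

A pure set (a single class with proportion 1) has entropy 0, and a uniform split over c classes attains the maximum log2(c).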
With entropy as the standard for measuring the purity of a set of training examples, a measure can be defined of how effectively a given attribute classifies the training data. This standard is called "information gain". Simply put, the information gain of an attribute is, more precisely, the expected reduction in entropy caused by partitioning the samples according to this attribute. The information gain Gain(S, A) of an attribute A relative to a sample set S is defined as:

Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) (|S_v| / |S|) · Entropy(S_v)

where Values(A) is the set of all possible values of attribute A, and S_v is the subset of S for which attribute A has value v. The first term of the formula is the entropy of the original set S, and the second term is the expected entropy after S is partitioned by A, i.e., the weighted sum of the entropies of the subsets, each weighted by the proportion |S_v|/|S| of the original samples it contains. Gain(S, A) is therefore the expected reduction in entropy from knowing the value of attribute A; in other words, Gain(S, A) is the information obtained about the target-function value from knowing the value of attribute A. When encoding the target value of an arbitrary member of S, Gain(S, A) is the number of bits that can be saved once the attribute's value is known.
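The definition of Gain(S, A) translates directly into code: total entropy minus the size-weighted entropy of each subset induced by the attribute's values. The samples below are invented (attribute_value, class_label) pairs for illustration.

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy of a label multiset, base-2."""
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

# Gain(S, A) = Entropy(S) - sum_v (|Sv|/|S|) * Entropy(Sv)
def information_gain(samples):
    labels = [lab for _, lab in samples]
    total = entropy_of(labels)
    by_value = {}
    for value, lab in samples:
        by_value.setdefault(value, []).append(lab)
    expected = sum(len(subset) / len(samples) * entropy_of(subset)
                   for subset in by_value.values())
    return total - expected

# An attribute that perfectly separates the two classes gains the full entropy.
samples = [("yes", 1), ("yes", 1), ("no", 0), ("no", 0)]
print(information_gain(samples))  # 1.0
```

An attribute whose value tells us nothing about the class (every subset has the same class mix as S) has gain 0; one that determines the class completely has gain equal to Entropy(S).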
In an optional embodiment, before the words in the paragraph set containing multiple paragraphs are substituted based on the preset knowledge base in step S202, the method comprises: Step S402: cluster the words of identical meaning in the paragraph set using first-order dependency grammar.
Specifically, the same word root may have different meanings depending on the context in which it appears, and different words may have the same meaning. Therefore, before the words in the paragraph set containing multiple paragraphs are substituted based on the preset knowledge base, dependency grammar can be used to gather words of identical meaning; specifically, first-order dependency grammar can be used to gather them. By using dependency grammar, the present invention solves the technical problem of existing vectorization techniques that many keywords with similar contexts but greatly different actual meanings are left undistinguished, which degrades the quality of the vectorization.
In an optional embodiment, the knowledge base comprises a word-forest word domain knowledge base, a specialized vocabulary dictionary knowledge base, and a named-entity recognition knowledge base.
Specifically, the word-forest word domain knowledge base is a knowledge base formed by classifying a large number of word roots according to meaning. Optionally, a category number can be assigned to each class of words of identical meaning. The following table shows part of the content of the word-forest word domain knowledge base:
Specifically, the specialized vocabulary dictionary knowledge base can be a knowledge base composed of the vocabulary and dictionaries of a particular professional domain. For the legal profession, for example, the specialized vocabulary dictionary knowledge base may include vocabularies or dictionaries such as cause of action, property preservation measures, party roles, legal abbreviations, courts, ethnicity, administrative omissions, categories in administrative law, administrative acts, marital status, technical case causes, technical case keywords, roles, amount details, causes of civil action, and nationality.
Specifically, the named-entity recognition knowledge base can be a knowledge base capable of named-entity recognition, where a named entity is a person name, place name, organization name, or other entity identified by a name. Based on the named-entity recognition knowledge base, person names (NH), place names (NS), organization names (NI), and so on can be identified.
In an optional embodiment, after a paragraph is converted based on the above three knowledge bases, the converted paragraph is: @nh and=>%Dk17B23 @nh=>#cause-of-civil-action @nh=>#cause-of-action #cause-of-civil-action=>%Dk17B23 #cause-of-action=>%Dk17B23 #administrative-act=>%Dk17B23 %Dk17B23=>@nh. Optionally, in the embodiment of the present invention the conversion rules can be preset before the abstracting conversion of the paragraph. For example, it can be specified that "=>" denotes a modification relation: if A and B denote two substituted words, "A=>B" means that word A modifies word B. Words replaced by dictionary words of the specialized vocabulary dictionary knowledge base begin with "#"; words replaced after named-entity recognition using the named-entity recognition knowledge base begin with "@"; words replaced by category numbers of the word-forest word domain knowledge base begin with "%"; and the features can be separated by spaces. Under these preset conversion rules, any person name, whether "Zhang San", "Li Si", or "Wang Wu", is represented by "@nh" in the converted paragraph. The converted paragraph therefore has lower complexity; the number of features is reduced when feature statistics are computed, so counting cost is saved and counting efficiency improved; and the features of the converted paragraph are more obvious and better reveal the content-structure features of the paragraph.
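The marker conventions just described can be sketched as a small rule-based replacer. All knowledge-base entries and category codes below are invented for illustration; only the "#", "@", "%", and "A=>B" conventions come from the text.

```python
# Sketch of the abstraction rules: '@' prefixes named entities, '#' prefixes
# specialized dictionary words, '%' prefixes word-forest category numbers,
# and "A=>B" records that word A modifies word B.
NER = {"Zhang San": "nh", "Li Si": "nh", "Wang Wu": "nh"}  # person names (assumed)
DICT = {"cause of action": "cause_of_action"}              # dictionary entries (assumed)
FOREST = {"request": "Dk17B23"}                            # category codes (assumed)

def abstract_token(word):
    if word in NER:
        return "@" + NER[word]
    if word in DICT:
        return "#" + DICT[word]
    if word in FOREST:
        return "%" + FOREST[word]
    return word

def abstract_pair(modifier, head):
    """Render a single dependency as 'A=>B' after abstracting both words."""
    return abstract_token(modifier) + "=>" + abstract_token(head)

# "Zhang San", "Li Si" and "Wang Wu" all collapse to the same '@nh' token.
print(abstract_pair("Zhang San", "request"))  # @nh=>%Dk17B23
print(abstract_pair("Li Si", "request"))      # @nh=>%Dk17B23
```

Because all person names map to one token, the feature space shrinks sharply, which is exactly the complexity reduction the text claims for the converted paragraph.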
In an optional embodiment, if feature summarization is performed on 10,000 paragraphs and 150,000 different features are counted, then because those 150,000 different features contain many idiosyncratic words, i.e., meaningless words, or words that cannot characterize, or cannot well characterize, the paragraph structure features, a preset quantity of features must be selected from the 150,000 features. The preset quantity can be set to 3,000, and the information-entropy gain can be used to select the 3,000 features with the largest gains. The calculated entropy gains of some of the features of the converted paragraph set can be as shown in the following table:
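Trimming the counted features down to the preset quantity is then a simple top-k selection by gain; a minimal sketch with made-up gain values:

```python
# Sketch of Step S304: keep the preset quantity of features with the
# largest information gain (3,000 in the text; 2 in this toy example).
def select_top_features(gain_by_feature, preset_quantity):
    ranked = sorted(gain_by_feature, key=gain_by_feature.get, reverse=True)
    return ranked[:preset_quantity]

gains = {"@nh": 0.42, "#cause_of_action": 0.91, "%Dk17B23": 0.17, "filler": 0.01}
print(select_top_features(gains, 2))  # ['#cause_of_action', '@nh']
```

The surviving features then constitute the feature set used in Step S106 to vectorize each converted paragraph.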
In an optional embodiment, the present invention is suitable for dividing the many small paragraphs of a normative text into large paragraphs according to content. As shown in Fig. 2, Fig. 2 is a legally normative text in which the small paragraphs "4: plaintiff: Tang **", "5: entrusted agents: ** and Li **", and "6: defendant: Wang *" can in fact be grouped as a litigation-participant paragraph. Using the paragraph-vectorization method of the present invention, the above three small paragraphs can be vectorized, the information points of the three small paragraphs obtained from the vectorization result, and the three small paragraphs thereby grouped into one large paragraph.
Embodiment 2
According to an embodiment of the present invention, a product embodiment of an apparatus for paragraph vectorization is provided. Fig. 3 shows the apparatus for paragraph vectorization according to an embodiment of the present invention. As shown in Fig. 3, the apparatus comprises a construction module 101, a conversion module 103, and a vectorization module 105.
The construction module 101 is used for constructing a feature set containing multiple feature words; the conversion module 103 is used for substituting words in a paragraph to be processed based on a preset knowledge base to obtain a converted paragraph; and the vectorization module 105 is used for taking the words in the converted paragraph that belong to the feature set as the features of the converted paragraph and vectorizing the converted paragraph.
In the embodiments of the present invention, the construction module 101 constructs in advance a feature set containing multiple feature words; the conversion module 103 then substitutes words in the paragraph to be processed based on a preset knowledge base to obtain a converted paragraph; finally, the vectorization module 105 takes the words in the converted paragraph that belong to the feature set as the features of the converted paragraph and vectorizes the converted paragraph, achieving the goal of vectorizing a paragraph. In the embodiments of the present invention, feature selection for the paragraph to be processed selects those words of the converted paragraph that belong to the prebuilt feature set, so the features remaining after selection are the ones that best embody the structural characteristics of the paragraph. This achieves the technical effect that the final vector reflects the content structure of the paragraph; moreover, during feature selection, regular features imperceptible to the human eye can be gathered, and the complexity is low. This solves the technical problem in the prior art that, when a paragraph is vectorized, distances are computed from word and sentence context and the sentence vectors then obtained by methods such as clustering cannot reflect the content-structure features of normative text.
It should be noted here that the above construction module 101, conversion module 103, and vectorization module 105 correspond to steps S102 to S106 in Embodiment 1. The examples and application scenarios implemented by these modules and the corresponding steps are the same, but are not limited to the content disclosed in Embodiment 1 above. It should be noted that these modules, as part of the apparatus, can execute in a computer system such as a set of computer-executable instructions.
In an optional embodiment, as shown in Fig. 4, the construction module 101 comprises a substitution module 201, a determination module 203, and a selection module 205. The substitution module 201 is used for substituting words in a paragraph set containing multiple paragraphs based on the preset knowledge base to obtain a converted paragraph set; the determination module 203 is used for determining the features of the converted paragraph set; and the selection module 205 is used for choosing a preset quantity of features from the features of the converted paragraph set to constitute the feature set.
It should be noted here that the above substitution module 201, determination module 203, and selection module 205 correspond to steps S202 to S206 in Embodiment 1. The examples and application scenarios implemented by these modules and the corresponding steps are the same, but are not limited to the content disclosed in Embodiment 1 above. It should be noted that these modules, as part of the apparatus, can execute in a computer system such as a set of computer-executable instructions.
In an optional embodiment, as shown in Fig. 5, the selection module 205 comprises a calculation module 301 and a selection submodule 303. The calculation module 301 is used for calculating the information gain of each feature among the features of the converted paragraph set; the selection submodule 303 is used for choosing a preset quantity of features from the features of the converted paragraph set in descending order of information gain.
It should be noted here that the above calculation module 301 and selection submodule 303 correspond to steps S302 to S304 in Embodiment 1. The examples and application scenarios implemented by these modules and the corresponding steps are the same, but are not limited to the content disclosed in Embodiment 1 above. It should be noted that these modules, as part of the apparatus, can execute in a computer system such as a set of computer-executable instructions.
In an optional embodiment, as shown in Fig. 6, the construction module 101 further comprises a clustering module 401 for clustering words of identical meaning in the paragraph set using first-order dependency grammar before the substitution module 201 substitutes the words in the paragraph set containing multiple paragraphs based on the preset knowledge base.
It should be noted here that the above clustering module 401 corresponds to step S402 in Embodiment 1. The examples and application scenarios implemented by this module and the corresponding step are the same, but are not limited to the content disclosed in Embodiment 1 above. It should be noted that this module, as part of the apparatus, can execute in a computer system such as a set of computer-executable instructions.
In an optional embodiment, the knowledge base comprises a word-forest word domain knowledge base, a specialized vocabulary dictionary knowledge base, and a named-entity recognition knowledge base.
The serial numbers of the above embodiments of the invention are for description only and do not represent the relative merits of the embodiments.
In the above embodiments of the invention, the description of each embodiment has its own emphasis. For a part not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Moreover, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (8)
1. A method of paragraph vectorization, characterized by comprising:
constructing a feature set comprising multiple feature words;
substituting words in a paragraph to be processed based on a preset knowledge base to obtain a converted paragraph; and
taking the words in the converted paragraph that belong to the feature set as the features of the converted paragraph, and vectorizing the converted paragraph;
wherein substituting the words in the paragraph to be processed based on the preset knowledge base to obtain the converted paragraph comprises: converting the representation of all or some of the words in the paragraph to be processed based on the preset knowledge base to obtain the converted paragraph;
and wherein constructing the feature set comprising multiple feature words comprises: substituting words in a paragraph set comprising multiple paragraphs based on the preset knowledge base to obtain a converted paragraph set; determining the features of the converted paragraph set; and selecting a preset number of features from the features of the converted paragraph set to constitute the feature set.
2. The method according to claim 1, wherein selecting the preset number of features from the features of the converted paragraph set comprises:
calculating the information gain of each feature among the features of the converted paragraph set; and
selecting the preset number of features from the features of the converted paragraph set in descending order of information gain.
3. The method according to claim 1, wherein, before substituting the words in the paragraph set comprising multiple paragraphs based on the preset knowledge base, the method comprises:
clustering words with the same meaning in the paragraph set using first-order dependency grammar.
4. The method according to any one of claims 1 to 3, characterized in that the knowledge base comprises a word-category knowledge base such as a synonym thesaurus, a specialized-vocabulary dictionary knowledge base, and a named-entity recognition knowledge base.
5. An apparatus for paragraph vectorization, characterized by comprising:
a construction module, configured to construct a feature set comprising multiple feature words;
a conversion module, configured to substitute words in a paragraph to be processed based on a preset knowledge base to obtain a converted paragraph; and
a vectorization module, configured to take the words in the converted paragraph that belong to the feature set as the features of the converted paragraph and vectorize the converted paragraph;
wherein the conversion module is configured to substitute the words in the paragraph to be processed based on the preset knowledge base to obtain the converted paragraph through the following step: converting the representation of all or some of the words in the paragraph to be processed based on the preset knowledge base to obtain the converted paragraph;
and wherein the construction module comprises: a substitution module, configured to substitute words in a paragraph set comprising multiple paragraphs based on the preset knowledge base to obtain a converted paragraph set; a determination module, configured to determine the features of the converted paragraph set; and a selection module, configured to select a preset number of features from the features of the converted paragraph set to constitute the feature set.
6. The apparatus according to claim 5, characterized in that the selection module comprises:
a computing module, configured to calculate the information gain of each feature among the features of the converted paragraph set; and
a selection submodule, configured to select the preset number of features from the features of the converted paragraph set in descending order of information gain.
7. The apparatus according to claim 5, characterized in that the construction module further comprises:
a clustering module, configured to cluster words with the same meaning in the paragraph set using first-order dependency grammar before the substitution module substitutes the words in the paragraph set comprising multiple paragraphs based on the preset knowledge base.
8. The apparatus according to any one of claims 5 to 7, characterized in that the knowledge base comprises a word-category knowledge base such as a synonym thesaurus, a specialized-vocabulary dictionary knowledge base, and a named-entity recognition knowledge base.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611260591.1A CN108268431B (en) | 2016-12-30 | 2016-12-30 | The method and apparatus of paragraph vectorization |
PCT/CN2017/112593 WO2018121145A1 (en) | 2016-12-30 | 2017-11-23 | Method and device for vectorizing paragraph |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611260591.1A CN108268431B (en) | 2016-12-30 | 2016-12-30 | The method and apparatus of paragraph vectorization |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108268431A CN108268431A (en) | 2018-07-10 |
CN108268431B true CN108268431B (en) | 2019-12-03 |
Family
ID=62707839
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611260591.1A Active CN108268431B (en) | 2016-12-30 | 2016-12-30 | The method and apparatus of paragraph vectorization |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108268431B (en) |
WO (1) | WO2018121145A1 (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116702723A (en) * | 2018-12-25 | 2023-09-05 | 创新先进技术有限公司 | Training method, device and equipment for contract paragraph annotation model |
CN111538832A (en) * | 2019-02-02 | 2020-08-14 | 富士通株式会社 | Apparatus and method for event annotation of document and recording medium |
CN110472231B (en) * | 2019-07-11 | 2023-05-12 | 创新先进技术有限公司 | Method and device for identifying legal document case |
CN110674635B (en) * | 2019-09-27 | 2023-04-25 | 北京妙笔智能科技有限公司 | Method and device for dividing text paragraphs |
CN111611342B (en) * | 2020-04-09 | 2023-04-18 | 中南大学 | Method and device for obtaining lexical item and paragraph association weight |
CN117688138B (en) * | 2024-02-02 | 2024-04-09 | 中船凌久高科(武汉)有限公司 | Long text similarity comparison method based on paragraph division |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101034409A (en) * | 2007-03-29 | 2007-09-12 | 浙江大学 | Search method for human motion based on data drive and decision tree analysis |
CN102081631A (en) * | 2009-11-30 | 2011-06-01 | 国际商业机器公司 | Answer support system and method |
CN103500195A (en) * | 2013-09-18 | 2014-01-08 | 小米科技有限责任公司 | Updating method, device, system and equipment for classifier |
CN104199972A (en) * | 2013-09-22 | 2014-12-10 | 中科嘉速(北京)并行软件有限公司 | Named entity relation extraction and construction method based on deep learning |
CN104281610A (en) * | 2013-07-08 | 2015-01-14 | 腾讯科技(深圳)有限公司 | Method and device for filtering microblogs |
CN105512104A (en) * | 2015-12-02 | 2016-04-20 | 上海智臻智能网络科技股份有限公司 | Dictionary dimension reducing method and device and information classifying method and device |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101470728B (en) * | 2007-12-25 | 2011-06-08 | 北京大学 | Method and device for automatically abstracting text of Chinese news web page |
US20140337355A1 (en) * | 2013-05-13 | 2014-11-13 | Gnoetics, Inc. | Indexed Natural Language Processing |
CN104239340B (en) * | 2013-06-19 | 2018-03-16 | 北京搜狗信息服务有限公司 | Search result screening technique and device |
CN105117397B (en) * | 2015-06-18 | 2018-08-28 | 浙江大学 | A kind of medical files semantic association search method based on ontology |
CN106202010B (en) * | 2016-07-12 | 2019-11-26 | 重庆兆光科技股份有限公司 | Method and apparatus based on deep neural network building Law Text syntax tree |
- 2016-12-30: CN application CN201611260591.1A filed, granted as patent CN108268431B (status: Active)
- 2017-11-23: WO application PCT/CN2017/112593 filed, published as WO2018121145A1 (status: Application Filing)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101034409A (en) * | 2007-03-29 | 2007-09-12 | 浙江大学 | Search method for human motion based on data drive and decision tree analysis |
CN102081631A (en) * | 2009-11-30 | 2011-06-01 | 国际商业机器公司 | Answer support system and method |
CN104281610A (en) * | 2013-07-08 | 2015-01-14 | 腾讯科技(深圳)有限公司 | Method and device for filtering microblogs |
CN103500195A (en) * | 2013-09-18 | 2014-01-08 | 小米科技有限责任公司 | Updating method, device, system and equipment for classifier |
CN104199972A (en) * | 2013-09-22 | 2014-12-10 | 中科嘉速(北京)并行软件有限公司 | Named entity relation extraction and construction method based on deep learning |
CN105512104A (en) * | 2015-12-02 | 2016-04-20 | 上海智臻智能网络科技股份有限公司 | Dictionary dimension reducing method and device and information classifying method and device |
Also Published As
Publication number | Publication date |
---|---|
WO2018121145A1 (en) | 2018-07-05 |
CN108268431A (en) | 2018-07-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108268431B (en) | The method and apparatus of paragraph vectorization | |
CN106021364B (en) | Foundation, image searching method and the device of picture searching dependency prediction model | |
CN110008335A (en) | The method and device of natural language processing | |
CN110334110A (en) | Natural language classification method, device, computer equipment and storage medium | |
CN109740660A (en) | Image processing method and device | |
CN110502632A (en) | Contract terms reviewing method, device, computer equipment and storage medium based on clustering algorithm | |
CN109800307A (en) | Analysis method, device, computer equipment and the storage medium of product evaluation | |
CN106708940A (en) | Method and device used for processing pictures | |
CN110472043A (en) | A kind of clustering method and device for comment text | |
WO2022121163A1 (en) | User behavior tendency identification method, apparatus, and device, and storage medium | |
EP3377983A1 (en) | Generating feature embeddings from a co-occurrence matrix | |
CN109271516A (en) | Entity type classification method and system in a kind of knowledge mapping | |
CN110110800A (en) | Automatic image marking method, device, equipment and computer readable storage medium | |
CN110969172A (en) | Text classification method and related equipment | |
CN110472040A (en) | Extracting method and device, storage medium, the computer equipment of evaluation information | |
US20230123941A1 (en) | Multiscale Quantization for Fast Similarity Search | |
CN110019822A (en) | A kind of few sample relationship classification method and system | |
CN105159927B (en) | Method and device for selecting subject term of target text and terminal | |
CN111177386A (en) | Proposal classification method and system | |
JP4143234B2 (en) | Document classification apparatus, document classification method, and storage medium | |
CN109960730A (en) | A kind of short text classification method, device and equipment based on feature extension | |
CN110837553B (en) | Method for searching mail and related products | |
Burkhardt et al. | Nkululeko: A tool for rapid speaker characteristics detection | |
CN107092679A (en) | A kind of feature term vector preparation method, file classification method and device | |
KR20210057996A (en) | Multi-task learning classifier learning apparatus and the method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing Applicant after: Beijing Guoshuang Technology Co.,Ltd. Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
|
CB02 | Change of applicant information | ||
GR01 | Patent grant | ||
GR01 | Patent grant |