CN108268431B - The method and apparatus of paragraph vectorization - Google Patents
- Publication number
- CN108268431B CN108268431B CN201611260591.1A CN201611260591A CN108268431B CN 108268431 B CN108268431 B CN 108268431B CN 201611260591 A CN201611260591 A CN 201611260591A CN 108268431 B CN108268431 B CN 108268431B
- Authority
- CN
- China
- Prior art keywords
- paragraph
- feature
- word
- conversion
- knowledge base
- Prior art date
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
(all under G — Physics; G06 — Computing; Calculating or Counting; G06F — Electric digital data processing)
- G06F40/237 — Handling natural language data; Natural language analysis; Lexical tools
- G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
- G06F40/151 — Text processing; Use of codes for handling textual entities; Transformation
- G06F16/35 — Information retrieval of unstructured textual data; Clustering; Classification
- G06F40/12 — Text processing; Use of codes for handling textual entities
- G06F40/205 — Natural language analysis; Parsing
Abstract
The invention discloses a method and apparatus for paragraph vectorization. The method comprises: constructing a feature set containing multiple feature words; substituting words in a paragraph to be processed based on a preset knowledge base to obtain a converted paragraph; and taking the words in the converted paragraph that belong to the feature set as the features of the converted paragraph, and vectorizing the converted paragraph. The invention solves the technical problem in the prior art that, when a paragraph is vectorized, distances are computed from word and sentence context and the sentence vectors then obtained by methods such as clustering cannot reflect the content-structure features of normative text.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a method and apparatus for paragraph vectorization.
Background technique
Vectorization of natural language is a difficult task in NLP (Natural Language Processing). It is the foundation of various natural-language models, and the quality of the vectorization directly affects the final accuracy. Although many companies apply various vectorization techniques, and open-source platforms offer vectorization tools such as word2vector and sentence2vector, it is difficult for a single unified, abstract method to extract the feature points that a particular demand actually requires across different document characteristics and different needs. For example, when parsing legally normative text, the small paragraphs of the text must be merged into large paragraphs according to content, so each small paragraph must be vectorized in order to extract the segmentation information points. However, existing vectorization techniques essentially compute distances from word and sentence context and then compute sentence vectors by methods such as clustering. Because texts of a relatively standardized scope, such as legal documents, have a neat structure and normative language, the content structure and the key descriptive content of the text must be vectorized and classified, and the vectors produced by existing vectorization techniques cannot reflect the content-structure features of such text. Moreover, legal documents are rigorous documents containing many keywords whose contexts are similar but whose actual meanings differ greatly, and existing vectorization techniques cannot distinguish these keywords.
For the above problem in the prior art, namely that when a paragraph is vectorized, distances are computed from word and sentence context and the sentence vectors then obtained by methods such as clustering cannot reflect the content-structure features of normative text, no effective solution has yet been proposed.
Summary of the invention
Embodiments of the present invention provide a method and apparatus for paragraph vectorization, at least to solve the technical problem in the prior art that, when a paragraph is vectorized, distances are computed from word and sentence context and the sentence vectors then obtained by methods such as clustering cannot reflect the content-structure features of normative text.
According to one aspect of the embodiments of the present invention, a method of paragraph vectorization is provided, comprising: constructing a feature set containing multiple feature words; substituting words in a paragraph to be processed based on a preset knowledge base to obtain a converted paragraph; and taking the words in the converted paragraph that belong to the feature set as the features of the converted paragraph, and vectorizing the converted paragraph.
According to another aspect of the embodiments of the present invention, an apparatus for paragraph vectorization is also provided, comprising: a construction module for constructing a feature set containing multiple feature words; a conversion module for substituting words in a paragraph to be processed based on a preset knowledge base to obtain a converted paragraph; and a vectorization module for taking the words in the converted paragraph that belong to the feature set as the features of the converted paragraph and vectorizing the converted paragraph.
In the embodiments of the present invention, a feature set containing multiple feature words is constructed in advance; words in the paragraph to be processed are then substituted based on a preset knowledge base to obtain a converted paragraph; finally, the words in the converted paragraph that belong to the feature set are taken as the features of the converted paragraph, and the converted paragraph is vectorized. This achieves the goal of vectorizing a paragraph. In the embodiments of the present invention, feature selection for the paragraph to be processed selects those words of the converted paragraph that belong to the prebuilt feature set, so the features remaining after selection are the ones that best embody the structural characteristics of the paragraph. The technical effect is that the final vector reflects the content structure of the paragraph, which solves the technical problem in the prior art that, when a paragraph is vectorized, distances are computed from word and sentence context and the sentence vectors then obtained by methods such as clustering cannot reflect the content-structure features of normative text.
Brief description of the drawings
The drawings described here are provided for further understanding of the present invention and constitute part of this application. The illustrative embodiments of the present invention and their description serve to explain the invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a flowchart of a method of paragraph vectorization according to Embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of a legally normative text according to Embodiment 1 of the present invention;
Fig. 3 is a structural diagram of an apparatus for paragraph vectorization according to Embodiment 2 of the present invention;
Fig. 4 is a structural diagram of an optional apparatus for paragraph vectorization according to Embodiment 2 of the present invention;
Fig. 5 is a structural diagram of another optional apparatus for paragraph vectorization according to Embodiment 2 of the present invention; and
Fig. 6 is a structural diagram of yet another optional apparatus for paragraph vectorization according to Embodiment 2 of the present invention.
Detailed description of the embodiments
To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely in conjunction with the accompanying drawings. Evidently, the described embodiments are only a part of the embodiments of the present invention rather than all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative work shall fall within the scope protected by the present invention.
It should be noted that the terms "first", "second", and the like in the description, claims, and drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in an order other than those illustrated or described here. Furthermore, the terms "comprise" and "have", and any variants of them, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such a process, method, product, or device.
Embodiment 1
According to an embodiment of the present invention, a method embodiment of a method of paragraph vectorization is provided. It should be noted that the steps shown in the flowchart of the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in an order different from the one here.
Fig. 1 shows the method of paragraph vectorization according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step S102: construct a feature set containing multiple feature words.
Specifically, a feature word is a word that can characterize a text to some degree. Since the present invention is aimed primarily at paragraph vectorization, the feature words in the feature set of the present invention are mainly words that can characterize a paragraph to some degree. The present invention places no limit on the number of feature words contained in the feature set, and the feature set can be constructed either by directly selecting multiple vocabulary items as feature words, or by collecting a large number of annotated paragraph samples and summarizing the features of those annotated samples.
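The two construction routes just described, direct selection of vocabulary items or summarization from annotated samples, can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation; all words and names are invented examples.

```python
# Sketch of Step S102: build a feature set either from a hand-picked
# vocabulary, from annotated paragraph samples, or from both.
def build_feature_set(direct_words=None, annotated_paragraphs=None):
    """Return the set of feature words characterizing paragraphs."""
    features = set(direct_words or [])
    # Every token appearing in an annotated sample is a candidate feature.
    for tokens in (annotated_paragraphs or []):
        features.update(tokens)
    return features

feature_set = build_feature_set(
    direct_words=["plaintiff", "defendant", "court"],
    annotated_paragraphs=[["plaintiff", "agent"], ["defendant", "court"]],
)
print(sorted(feature_set))  # ['agent', 'court', 'defendant', 'plaintiff']
```

In practice the candidate set produced this way is far too large, which is why the later steps prune it by information gain.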
Step S104: substitute words in the paragraph to be processed based on a preset knowledge base to obtain a converted paragraph.
Specifically, when a paragraph, i.e., the paragraph to be processed, needs to be vectorized, each word in the paragraph to be processed must first undergo an abstracting transformation; that is, all or some of the words in the paragraph to be processed are transformed into other representations, for example converted to dictionary words, represented only by part of speech, or substituted by the category of the word. The conversion is performed based on a preset knowledge base; optionally, the preset knowledge base can provide a large conversion vocabulary. After each word in the paragraph to be processed has been converted, i.e., the whole paragraph has been converted, the converted paragraph is obtained.
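A minimal sketch of this substitution step, assuming the knowledge base is a simple word-to-representation mapping. The mapping entries and category codes below are invented for illustration and are not taken from the patent's knowledge bases.

```python
# Sketch of Step S104: replace each word the knowledge base knows with its
# abstract representation (dictionary word, entity tag, or category code).
knowledge_base = {
    "Zhang San": "@nh",         # named entity -> person-name tag (assumed)
    "plaintiff": "#plaintiff",  # specialized dictionary word (assumed)
    "sued": "%Dk17B23",         # word-forest category number (assumed code)
}

def convert_paragraph(tokens, kb):
    """Substitute every known token; keep unknown tokens unchanged."""
    return [kb.get(tok, tok) for tok in tokens]

converted = convert_paragraph(["plaintiff", "Zhang San", "sued"], knowledge_base)
print(converted)  # ['#plaintiff', '@nh', '%Dk17B23']
```

Once every word of the paragraph has been mapped this way, the resulting token list is the converted paragraph used by the following steps.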
Step S106: take the words in the converted paragraph that belong to the feature set as the features of the converted paragraph, and vectorize the converted paragraph.
Specifically, the converted paragraph will contain many features. If the prior art were used to extract features from the converted paragraph, many meaningless words, or words that cannot characterize the paragraph's features, would be extracted. When the present invention extracts features from the converted paragraph, only those words of the converted paragraph that belong to the feature words in the feature set are extracted; that is, only words capable of characterizing the paragraph's features are extracted as the features of the converted paragraph. The converted paragraph can then be vectorized.
Optionally, after the features of the converted paragraph have been determined, the converted paragraph can be vectorized using a vectorization method of the prior art by counting the word frequency of the feature words. Many mature methods exist for word-frequency counting, such as hashing or a transformed trie tree.
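The frequency-based vectorization can be sketched as follows: filter the converted paragraph down to feature words, then emit one count per position of an ordered feature list. The feature tokens are illustrative.

```python
from collections import Counter

# Sketch of Step S106: vectorize a converted paragraph by the frequency
# of the feature words it contains, in a fixed feature order.
def vectorize(converted_tokens, feature_words):
    """Return a frequency vector aligned with the ordered feature list."""
    wanted = set(feature_words)
    counts = Counter(t for t in converted_tokens if t in wanted)
    return [counts[f] for f in feature_words]

features = ["@nh", "#plaintiff", "%Dk17B23"]
vec = vectorize(["#plaintiff", "@nh", "%Dk17B23", "@nh", "filler"], features)
print(vec)  # [2, 1, 1]
```

A hash table (as here, via `Counter`) or a trie could equally hold the counts; the choice only affects counting cost, not the resulting vector.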
In the embodiments of the present invention, a feature set containing multiple feature words is constructed in advance; words in the paragraph to be processed are then substituted based on a preset knowledge base to obtain a converted paragraph; finally, the words in the converted paragraph that belong to the feature set are taken as the features of the converted paragraph, and the converted paragraph is vectorized. This achieves the goal of vectorizing a paragraph. In the embodiments of the present invention, feature selection for the paragraph to be processed selects those words of the converted paragraph that belong to the prebuilt feature set, so the features remaining after selection are the ones that best embody the structural characteristics of the paragraph. The technical effect is that the final vector reflects the content structure of the paragraph; moreover, during feature selection, regular features imperceptible to the human eye can be gathered, and the complexity is low. This solves the technical problem in the prior art that, when a paragraph is vectorized, distances are computed from word and sentence context and the sentence vectors then obtained by methods such as clustering cannot reflect the content-structure features of normative text.
In an optional embodiment, step S102 comprises:
Step S202: substitute the words in a paragraph set containing multiple paragraphs based on the preset knowledge base to obtain a converted paragraph set.
Step S204: determine the features of the converted paragraph set.
Step S206: choose a preset quantity of features from the features of the converted paragraph set to constitute the feature set.
Specifically, if the feature set is constructed by collecting a large number of annotated paragraph samples and summarizing the features of those samples, a large number of annotated paragraph samples must first be collected, i.e., a paragraph set containing multiple paragraphs is built. Then all or some of the words in each paragraph of the paragraph set are substituted based on the preset knowledge base, yielding the converted paragraph set after the abstracting conversion. Next, the features of the converted paragraph set must be determined. Because the feature statistics contain many meaningless words and words that cannot characterize a paragraph's features, and because the number of words that can characterize a paragraph's features may also be large, only the more critical words can be kept. It is therefore necessary to select a preset quantity of features from the large number of counted features, which finally constitute the feature set.
In an optional embodiment, choosing a preset quantity of features from the features of the converted paragraph set in step S206 comprises:
Step S302: calculate the information gain of each feature among the features of the converted paragraph set.
Step S304: choose a preset quantity of features from the features of the converted paragraph set in descending order of information gain.
Specifically, the preset quantity of features can be chosen by calculating the information-entropy gain of each feature. After the features of the converted paragraph set have been determined, the entropy gain of each feature can be calculated, and the preset quantity of features with the larger entropy gains are chosen. It should be noted here that the preset quantity can be set as desired; the present invention does not specifically limit its value.
Specifically, entropy is a widely used measure in information theory that characterizes the purity of an arbitrary sample set. For a target attribute A that can take c different values, the classification entropy of a sample set S relative to these c states is:

Entropy(S) = Σ (i = 1..c) −p_i · log2(p_i)

where p_i is the proportion of S belonging to class i. Note that the base of the logarithm remains 2; if the target attribute has c possible values, the entropy can be at most log2(c).
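The classification entropy just described can be computed directly from the class proportions; a short sketch (illustrative, not the patent's code):

```python
import math

# Entropy(S) = sum over classes of -p_i * log2(p_i), the purity measure
# of a sample set S with respect to a c-valued target attribute.
def entropy(class_proportions):
    return sum(-p * math.log2(p) for p in class_proportions if p > 0)

# A two-class set split 50/50 attains the maximum entropy log2(2) = 1.
print(entropy([0.5, 0.5]))  # 1.0
```

A pure set (a single class with proportion 1) has entropy 0, and a uniform split over c classes attains the maximum log2(c).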
With entropy as the standard for measuring the purity of a set of training examples, a measure can be defined of how effectively a given attribute classifies the training data. This standard is called "information gain". Simply put, the information gain of an attribute is, more precisely, the expected reduction in entropy caused by partitioning the samples according to this attribute. The information gain Gain(S, A) of an attribute A relative to a sample set S is defined as:

Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) (|S_v| / |S|) · Entropy(S_v)

where Values(A) is the set of all possible values of attribute A, and S_v is the subset of S for which attribute A has value v. The first term of the formula is the entropy of the original set S, and the second term is the expected entropy after S is partitioned by A, i.e., the weighted sum of the entropies of the subsets, each weighted by the proportion |S_v|/|S| of the original samples it contains. Gain(S, A) is therefore the expected reduction in entropy from knowing the value of attribute A; in other words, Gain(S, A) is the information obtained about the target-function value from knowing the value of attribute A. When encoding the target value of an arbitrary member of S, Gain(S, A) is the number of bits that can be saved once the attribute's value is known.
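The definition of Gain(S, A) translates directly into code: total entropy minus the size-weighted entropy of each subset induced by the attribute's values. The samples below are invented (attribute_value, class_label) pairs for illustration.

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy of a label multiset, base-2."""
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

# Gain(S, A) = Entropy(S) - sum_v (|Sv|/|S|) * Entropy(Sv)
def information_gain(samples):
    labels = [lab for _, lab in samples]
    total = entropy_of(labels)
    by_value = {}
    for value, lab in samples:
        by_value.setdefault(value, []).append(lab)
    expected = sum(len(subset) / len(samples) * entropy_of(subset)
                   for subset in by_value.values())
    return total - expected

# An attribute that perfectly separates the two classes gains the full entropy.
samples = [("yes", 1), ("yes", 1), ("no", 0), ("no", 0)]
print(information_gain(samples))  # 1.0
```

An attribute whose value tells us nothing about the class (every subset has the same class mix as S) has gain 0; one that determines the class completely has gain equal to Entropy(S).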
In an optional embodiment, before the words in the paragraph set containing multiple paragraphs are substituted based on the preset knowledge base in step S202, the method comprises: Step S402: cluster the words of identical meaning in the paragraph set using first-order dependency grammar.
Specifically, the same word root may have different meanings depending on the context in which it appears, and different words may have the same meaning. Therefore, before the words in the paragraph set containing multiple paragraphs are substituted based on the preset knowledge base, dependency grammar can be used to gather words of identical meaning; specifically, first-order dependency grammar can be used to gather them. By using dependency grammar, the present invention solves the technical problem of existing vectorization techniques that many keywords with similar contexts but greatly different actual meanings are left undistinguished, which degrades the quality of the vectorization.
In an optional embodiment, the knowledge base comprises a word-forest word domain knowledge base, a specialized vocabulary dictionary knowledge base, and a named-entity recognition knowledge base.
Specifically, the word-forest word domain knowledge base is a knowledge base formed by classifying a large number of word roots according to meaning. Optionally, a category number can be assigned to each class of words of identical meaning. The following table shows part of the content of the word-forest word domain knowledge base:
Specifically, the specialized vocabulary dictionary knowledge base can be a knowledge base composed of the vocabulary and dictionaries of a particular professional domain. For the legal profession, for example, the specialized vocabulary dictionary knowledge base may include vocabularies or dictionaries such as cause of action, property preservation measures, party roles, legal abbreviations, courts, ethnicity, administrative omissions, categories in administrative law, administrative acts, marital status, technical case causes, technical case keywords, roles, amount details, causes of civil action, and nationality.
Specifically, the named-entity recognition knowledge base can be a knowledge base capable of named-entity recognition, where a named entity is a person name, place name, organization name, or other entity identified by a name. Based on the named-entity recognition knowledge base, person names (NH), place names (NS), organization names (NI), and so on can be identified.
In an optional embodiment, after a paragraph is converted based on the above three knowledge bases, the converted paragraph is: @nh and=>%Dk17B23 @nh=>#cause-of-civil-action @nh=>#cause-of-action #cause-of-civil-action=>%Dk17B23 #cause-of-action=>%Dk17B23 #administrative-act=>%Dk17B23 %Dk17B23=>@nh. Optionally, in the embodiment of the present invention the conversion rules can be preset before the abstracting conversion of the paragraph. For example, it can be specified that "=>" denotes a modification relation: if A and B denote two substituted words, "A=>B" means that word A modifies word B. Words replaced by dictionary words of the specialized vocabulary dictionary knowledge base begin with "#"; words replaced after named-entity recognition using the named-entity recognition knowledge base begin with "@"; words replaced by category numbers of the word-forest word domain knowledge base begin with "%"; and the features can be separated by spaces. Under these preset conversion rules, any person name, whether "Zhang San", "Li Si", or "Wang Wu", is represented by "@nh" in the converted paragraph. The converted paragraph therefore has lower complexity; the number of features is reduced when feature statistics are computed, so counting cost is saved and counting efficiency improved; and the features of the converted paragraph are more obvious and better reveal the content-structure features of the paragraph.
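The marker conventions just described can be sketched as a small rule-based replacer. All knowledge-base entries and category codes below are invented for illustration; only the "#", "@", "%", and "A=>B" conventions come from the text.

```python
# Sketch of the abstraction rules: '@' prefixes named entities, '#' prefixes
# specialized dictionary words, '%' prefixes word-forest category numbers,
# and "A=>B" records that word A modifies word B.
NER = {"Zhang San": "nh", "Li Si": "nh", "Wang Wu": "nh"}  # person names (assumed)
DICT = {"cause of action": "cause_of_action"}              # dictionary entries (assumed)
FOREST = {"request": "Dk17B23"}                            # category codes (assumed)

def abstract_token(word):
    if word in NER:
        return "@" + NER[word]
    if word in DICT:
        return "#" + DICT[word]
    if word in FOREST:
        return "%" + FOREST[word]
    return word

def abstract_pair(modifier, head):
    """Render a single dependency as 'A=>B' after abstracting both words."""
    return abstract_token(modifier) + "=>" + abstract_token(head)

# "Zhang San", "Li Si" and "Wang Wu" all collapse to the same '@nh' token.
print(abstract_pair("Zhang San", "request"))  # @nh=>%Dk17B23
print(abstract_pair("Li Si", "request"))      # @nh=>%Dk17B23
```

Because all person names map to one token, the feature space shrinks sharply, which is exactly the complexity reduction the text claims for the converted paragraph.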
In an optional embodiment, if feature summarization is performed on 10,000 paragraphs and 150,000 different features are counted, then because those 150,000 different features contain many idiosyncratic words, i.e., meaningless words, or words that cannot characterize, or cannot well characterize, the paragraph structure features, a preset quantity of features must be selected from the 150,000 features. The preset quantity can be set to 3,000, and the information-entropy gain can be used to select the 3,000 features with the largest gains. The calculated entropy gains of some of the features of the converted paragraph set can be as shown in the following table:
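Trimming the counted features down to the preset quantity is then a simple top-k selection by gain; a minimal sketch with made-up gain values:

```python
# Sketch of Step S304: keep the preset quantity of features with the
# largest information gain (3,000 in the text; 2 in this toy example).
def select_top_features(gain_by_feature, preset_quantity):
    ranked = sorted(gain_by_feature, key=gain_by_feature.get, reverse=True)
    return ranked[:preset_quantity]

gains = {"@nh": 0.42, "#cause_of_action": 0.91, "%Dk17B23": 0.17, "filler": 0.01}
print(select_top_features(gains, 2))  # ['#cause_of_action', '@nh']
```

The surviving features then constitute the feature set used in Step S106 to vectorize each converted paragraph.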
In an optional embodiment, the present invention is suitable for dividing the many small paragraphs of a normative text into large paragraphs according to content. As shown in Fig. 2, Fig. 2 is a legally normative text in which the small paragraphs "4: plaintiff: Tang **", "5: entrusted agents: ** and Li **", and "6: defendant: Wang *" can in fact be grouped as a litigation-participant paragraph. Using the paragraph-vectorization method of the present invention, the above three small paragraphs can be vectorized, the information points of the three small paragraphs obtained from the vectorization result, and the three small paragraphs thereby grouped into one large paragraph.
Embodiment 2
According to an embodiment of the present invention, a product embodiment of an apparatus for paragraph vectorization is provided. Fig. 3 shows the apparatus for paragraph vectorization according to an embodiment of the present invention. As shown in Fig. 3, the apparatus comprises a construction module 101, a conversion module 103, and a vectorization module 105.
The construction module 101 is used for constructing a feature set containing multiple feature words; the conversion module 103 is used for substituting words in a paragraph to be processed based on a preset knowledge base to obtain a converted paragraph; and the vectorization module 105 is used for taking the words in the converted paragraph that belong to the feature set as the features of the converted paragraph and vectorizing the converted paragraph.
In the embodiments of the present invention, the construction module 101 constructs in advance a feature set containing multiple feature words; the conversion module 103 then substitutes words in the paragraph to be processed based on a preset knowledge base to obtain a converted paragraph; finally, the vectorization module 105 takes the words in the converted paragraph that belong to the feature set as the features of the converted paragraph and vectorizes the converted paragraph, achieving the goal of vectorizing a paragraph. In the embodiments of the present invention, feature selection for the paragraph to be processed selects those words of the converted paragraph that belong to the prebuilt feature set, so the features remaining after selection are the ones that best embody the structural characteristics of the paragraph. This achieves the technical effect that the final vector reflects the content structure of the paragraph; moreover, during feature selection, regular features imperceptible to the human eye can be gathered, and the complexity is low. This solves the technical problem in the prior art that, when a paragraph is vectorized, distances are computed from word and sentence context and the sentence vectors then obtained by methods such as clustering cannot reflect the content-structure features of normative text.
It should be noted here that the above construction module 101, conversion module 103, and vectorization module 105 correspond to steps S102 to S106 in Embodiment 1. The examples and application scenarios implemented by these modules and the corresponding steps are the same, but are not limited to the content disclosed in Embodiment 1 above. It should be noted that these modules, as part of the apparatus, can execute in a computer system such as a set of computer-executable instructions.
In an optional embodiment, as shown in Fig. 4, the construction module 101 comprises a substitution module 201, a determination module 203, and a selection module 205. The substitution module 201 is used for substituting words in a paragraph set containing multiple paragraphs based on the preset knowledge base to obtain a converted paragraph set; the determination module 203 is used for determining the features of the converted paragraph set; and the selection module 205 is used for choosing a preset quantity of features from the features of the converted paragraph set to constitute the feature set.
It should be noted here that the above substitution module 201, determination module 203, and selection module 205 correspond to steps S202 to S206 in Embodiment 1. The examples and application scenarios implemented by these modules and the corresponding steps are the same, but are not limited to the content disclosed in Embodiment 1 above. It should be noted that these modules, as part of the apparatus, can execute in a computer system such as a set of computer-executable instructions.
In an optional embodiment, as shown in Fig. 5, the selection module 205 comprises a calculation module 301 and a selection submodule 303. The calculation module 301 is used for calculating the information gain of each feature among the features of the converted paragraph set; the selection submodule 303 is used for choosing a preset quantity of features from the features of the converted paragraph set in descending order of information gain.
It should be noted here that the above calculation module 301 and selection submodule 303 correspond to steps S302 to S304 in Embodiment 1. The examples and application scenarios implemented by these modules and the corresponding steps are the same, but are not limited to the content disclosed in Embodiment 1 above. It should be noted that these modules, as part of the apparatus, can execute in a computer system such as a set of computer-executable instructions.
In an optional embodiment, as shown in Fig. 6, the construction module 101 further comprises a clustering module 401 for clustering words of identical meaning in the paragraph set using first-order dependency grammar before the substitution module 201 substitutes the words in the paragraph set containing multiple paragraphs based on the preset knowledge base.
It should be noted here that the above clustering module 401 corresponds to step S402 in Embodiment 1. The examples and application scenarios implemented by this module and the corresponding step are the same, but are not limited to the content disclosed in Embodiment 1 above. It should be noted that this module, as part of the apparatus, can execute in a computer system such as a set of computer-executable instructions.
In an optional embodiment, the knowledge base comprises a word-forest word domain knowledge base, a specialized vocabulary dictionary knowledge base, and a named-entity recognition knowledge base.
The serial numbers of the above embodiments of the invention are for description only and do not represent the relative merits of the embodiments.
In the above embodiments of the invention, the description of each embodiment has its own emphasis. For a part not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Moreover, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (8)
1. A method of paragraph vectorization, characterized by comprising:
constructing a feature set comprising multiple feature words;
substituting words in a paragraph to be processed based on a preset knowledge base to obtain a converted paragraph; and
taking the words in the converted paragraph that belong to the feature set as the features of the converted paragraph, and vectorizing the converted paragraph;
wherein substituting the words in the paragraph to be processed based on the preset knowledge base to obtain the converted paragraph comprises: converting the representation of all or some of the words in the paragraph to be processed based on the preset knowledge base to obtain the converted paragraph;
and wherein constructing the feature set comprising multiple feature words comprises: substituting words in a paragraph set comprising multiple paragraphs based on the preset knowledge base to obtain a converted paragraph set; determining the features of the converted paragraph set; and selecting a preset number of features from the features of the converted paragraph set to constitute the feature set.
2. The method according to claim 1, wherein selecting the preset number of features from the features of the converted paragraph set comprises:
calculating the information gain of each feature among the features of the converted paragraph set; and
selecting the preset number of features from the features of the converted paragraph set in descending order of information gain.
3. The method according to claim 1, wherein, before substituting the words in the paragraph set comprising multiple paragraphs based on the preset knowledge base, the method comprises:
clustering words with the same meaning in the paragraph set using first-order dependency grammar.
4. The method according to any one of claims 1 to 3, characterized in that the knowledge base comprises a word-category knowledge base such as a synonym thesaurus, a specialized-vocabulary dictionary knowledge base, and a named-entity recognition knowledge base.
5. An apparatus for paragraph vectorization, characterized by comprising:
a construction module, configured to construct a feature set comprising multiple feature words;
a conversion module, configured to substitute words in a paragraph to be processed based on a preset knowledge base to obtain a converted paragraph; and
a vectorization module, configured to take the words in the converted paragraph that belong to the feature set as the features of the converted paragraph and vectorize the converted paragraph;
wherein the conversion module is configured to substitute the words in the paragraph to be processed based on the preset knowledge base to obtain the converted paragraph through the following step: converting the representation of all or some of the words in the paragraph to be processed based on the preset knowledge base to obtain the converted paragraph;
and wherein the construction module comprises: a substitution module, configured to substitute words in a paragraph set comprising multiple paragraphs based on the preset knowledge base to obtain a converted paragraph set; a determination module, configured to determine the features of the converted paragraph set; and a selection module, configured to select a preset number of features from the features of the converted paragraph set to constitute the feature set.
6. The apparatus according to claim 5, characterized in that the selection module comprises:
a computing module, configured to calculate the information gain of each feature among the features of the converted paragraph set; and
a selection submodule, configured to select the preset number of features from the features of the converted paragraph set in descending order of information gain.
7. The apparatus according to claim 5, characterized in that the construction module further comprises:
a clustering module, configured to cluster words with the same meaning in the paragraph set using first-order dependency grammar before the substitution module substitutes the words in the paragraph set comprising multiple paragraphs based on the preset knowledge base.
8. The apparatus according to any one of claims 5 to 7, characterized in that the knowledge base comprises a word-category knowledge base such as a synonym thesaurus, a specialized-vocabulary dictionary knowledge base, and a named-entity recognition knowledge base.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611260591.1A CN108268431B (en) | 2016-12-30 | 2016-12-30 | The method and apparatus of paragraph vectorization |
PCT/CN2017/112593 WO2018121145A1 (en) | 2016-12-30 | 2017-11-23 | Method and device for vectorizing paragraph |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611260591.1A CN108268431B (en) | 2016-12-30 | 2016-12-30 | The method and apparatus of paragraph vectorization |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108268431A CN108268431A (en) | 2018-07-10 |
CN108268431B true CN108268431B (en) | 2019-12-03 |
Family
ID=62707839
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611260591.1A Active CN108268431B (en) | 2016-12-30 | 2016-12-30 | The method and apparatus of paragraph vectorization |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108268431B (en) |
WO (1) | WO2018121145A1 (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116702723A (en) * | 2018-12-25 | 2023-09-05 | 创新先进技术有限公司 | Training method, device and equipment for contract paragraph annotation model |
CN111538832A (en) * | 2019-02-02 | 2020-08-14 | 富士通株式会社 | Apparatus and method for event annotation of document and recording medium |
CN110472231B (en) * | 2019-07-11 | 2023-05-12 | 创新先进技术有限公司 | Method and device for identifying legal document case |
CN110674635B (en) * | 2019-09-27 | 2023-04-25 | 北京妙笔智能科技有限公司 | Method and device for dividing text paragraphs |
CN111611342B (en) * | 2020-04-09 | 2023-04-18 | 中南大学 | Method and device for obtaining lexical item and paragraph association weight |
CN117688138B (en) * | 2024-02-02 | 2024-04-09 | 中船凌久高科(武汉)有限公司 | Long text similarity comparison method based on paragraph division |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101034409A (en) * | 2007-03-29 | 2007-09-12 | 浙江大学 | Search method for human motion based on data drive and decision tree analysis |
CN102081631A (en) * | 2009-11-30 | 2011-06-01 | 国际商业机器公司 | Answer support system and method |
CN103500195A (en) * | 2013-09-18 | 2014-01-08 | 小米科技有限责任公司 | Updating method, device, system and equipment for classifier |
CN104199972A (en) * | 2013-09-22 | 2014-12-10 | 中科嘉速(北京)并行软件有限公司 | Named entity relation extraction and construction method based on deep learning |
CN104281610A (en) * | 2013-07-08 | 2015-01-14 | 腾讯科技(深圳)有限公司 | Method and device for filtering microblogs |
CN105512104A (en) * | 2015-12-02 | 2016-04-20 | 上海智臻智能网络科技股份有限公司 | Dictionary dimension reducing method and device and information classifying method and device |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101470728B (en) * | 2007-12-25 | 2011-06-08 | 北京大学 | Method and device for automatically abstracting text of Chinese news web page |
US20140337355A1 (en) * | 2013-05-13 | 2014-11-13 | Gnoetics, Inc. | Indexed Natural Language Processing |
CN104239340B (en) * | 2013-06-19 | 2018-03-16 | 北京搜狗信息服务有限公司 | Search result screening technique and device |
CN105117397B (en) * | 2015-06-18 | 2018-08-28 | 浙江大学 | A kind of medical files semantic association search method based on ontology |
CN106202010B (en) * | 2016-07-12 | 2019-11-26 | 重庆兆光科技股份有限公司 | Method and apparatus based on deep neural network building Law Text syntax tree |
- 2016-12-30: CN application CN201611260591.1A filed, granted as patent CN108268431B (status: Active)
- 2017-11-23: WO application PCT/CN2017/112593 filed, published as WO2018121145A1 (status: Application Filing)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101034409A (en) * | 2007-03-29 | 2007-09-12 | 浙江大学 | Search method for human motion based on data drive and decision tree analysis |
CN102081631A (en) * | 2009-11-30 | 2011-06-01 | 国际商业机器公司 | Answer support system and method |
CN104281610A (en) * | 2013-07-08 | 2015-01-14 | 腾讯科技(深圳)有限公司 | Method and device for filtering microblogs |
CN103500195A (en) * | 2013-09-18 | 2014-01-08 | 小米科技有限责任公司 | Updating method, device, system and equipment for classifier |
CN104199972A (en) * | 2013-09-22 | 2014-12-10 | 中科嘉速(北京)并行软件有限公司 | Named entity relation extraction and construction method based on deep learning |
CN105512104A (en) * | 2015-12-02 | 2016-04-20 | 上海智臻智能网络科技股份有限公司 | Dictionary dimension reducing method and device and information classifying method and device |
Also Published As
Publication number | Publication date |
---|---|
WO2018121145A1 (en) | 2018-07-05 |
CN108268431A (en) | 2018-07-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108268431B (en) | The method and apparatus of paragraph vectorization | |
CN106021364B (en) | Foundation, image searching method and the device of picture searching dependency prediction model | |
CN110008335A (en) | The method and device of natural language processing | |
CN110334110A (en) | Natural language classification method, device, computer equipment and storage medium | |
CN109740660A (en) | Image processing method and device | |
CN110502632A (en) | Contract terms reviewing method, device, computer equipment and storage medium based on clustering algorithm | |
CN109800307A (en) | Analysis method, device, computer equipment and the storage medium of product evaluation | |
CN106708940A (en) | Method and device used for processing pictures | |
CN110472043A (en) | A kind of clustering method and device for comment text | |
WO2022121163A1 (en) | User behavior tendency identification method, apparatus, and device, and storage medium | |
EP3377983A1 (en) | Generating feature embeddings from a co-occurrence matrix | |
CN109271516A (en) | Entity type classification method and system in a kind of knowledge mapping | |
CN110110800A (en) | Automatic image marking method, device, equipment and computer readable storage medium | |
CN110969172A (en) | Text classification method and related equipment | |
CN110472040A (en) | Extracting method and device, storage medium, the computer equipment of evaluation information | |
US20230123941A1 (en) | Multiscale Quantization for Fast Similarity Search | |
CN110019822A (en) | A kind of few sample relationship classification method and system | |
CN105159927B (en) | Method and device for selecting subject term of target text and terminal | |
CN111177386A (en) | Proposal classification method and system | |
JP4143234B2 (en) | Document classification apparatus, document classification method, and storage medium | |
CN109960730A (en) | A kind of short text classification method, device and equipment based on feature extension | |
CN110837553B (en) | Method for searching mail and related products | |
Burkhardt et al. | Nkululeko: A tool for rapid speaker characteristics detection | |
CN107092679A (en) | A kind of feature term vector preparation method, file classification method and device | |
KR20210057996A (en) | Multi-task learning classifier learning apparatus and the method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing Applicant after: Beijing Guoshuang Technology Co.,Ltd. Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
|
CB02 | Change of applicant information | ||
GR01 | Patent grant | ||
GR01 | Patent grant |