CN108572961A - Text vectorization method and device - Google Patents

Text vectorization method and device

Info

Publication number
CN108572961A
CN108572961A
Authority
CN
China
Prior art keywords
sample
double
text
double character
occurrence frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710134611.9A
Other languages
Chinese (zh)
Inventor
刘家兵
刘永波
吴春龙
张少松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd filed Critical Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN201710134611.9A priority Critical patent/CN108572961A/en
Publication of CN108572961A publication Critical patent/CN108572961A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention discloses a text vectorization method and device, relating to the field of text vectorization. The method includes: obtaining a text to be processed, determining the application type of the text, and obtaining a sample of the text; extracting all single-character elements of the sample to obtain the single-character set of the sample; extracting double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample; merging the single-character set and the double-character set to obtain a vocabulary; and constructing the text vector of the text according to the vocabulary. The invention removes Chinese word segmentation, avoiding the errors that word segmentation introduces for colloquial sentences such as public-opinion (public sentiment) text and the subsequent error cascading, and tolerates typos in such colloquial sentences well.

Description

Text vectorization method and device
Technical field
The present invention relates to the field of text vectorization and, in particular, to a text vectorization method and device.
Background technology
For various machine learning algorithms, the input is a vector and the output may be a continuous or a discrete value. Text classification and clustering are very important applications of machine learning, and text vectorization is the first step of either; it directly determines the quality of the final machine-learning result.
Existing text vectorization techniques are as follows:
TF-IDF (term frequency-inverse document frequency) is a common weighting technique for information retrieval and data mining. The dimensionality of the sentence vector is the size of the vocabulary, and the value of each dimension is the weight of the corresponding word as computed by the TF-IDF method. TF is the number of times a word occurs in a single sentence; since sentence lengths differ, it needs to be normalized. IDF is the inverse document frequency, computed as log(total number of sentences in the corpus / (number of sentences containing the word + 1)). The final value of each dimension of the sentence vector is TF*IDF.
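The TF-IDF scheme just described can be sketched as follows. This is an illustration of the background technique, not part of the patent; the function and variable names are our own, TF is normalized by sentence length, and IDF uses the log(total / (containing + 1)) formula above.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """sentences: list of token lists. Returns (vocab, one TF-IDF vector per sentence)."""
    vocab = sorted({tok for s in sentences for tok in s})
    n = len(sentences)
    # IDF with the +1 in the denominator, as in the formula above
    idf = {t: math.log(n / (sum(1 for s in sentences if t in s) + 1)) for t in vocab}
    vectors = []
    for s in sentences:
        counts = Counter(s)
        length = len(s) or 1  # normalize TF by sentence length
        vectors.append([(counts[t] / length) * idf[t] for t in vocab])
    return vocab, vectors
```

Note that the vector dimensionality equals the vocabulary size, which is exactly the scaling problem criticized below.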
The main techniques used by Word2Vec are the Continuous Bag-of-Words Model (CBOW) and the Continuous Skip-gram Model. CBOW predicts the probability of the current word from its context, while Skip-gram predicts the context from the current word; the core of both is a neural network. The resulting word vectors have relatively low dimensionality (100-400 is typical), and word similarity can be computed easily from the angle between vectors. However, it is difficult to obtain a vector characterizing sentence semantics from word vectors, hence Doc2Vec, which converts a sentence directly into a vector. Apart from an additional sentence vector, Doc2Vec does not differ much from Word2Vec.
However, the prior art has the following problems:
1. The prior art is largely based on Chinese word segmentation. Segmentation works well on sentences in written form, but handles colloquial sentences such as public-opinion text poorly, introducing a fair number of errors. Because of error cascading, this significantly affects the final machine-learning result. In addition, relying on Chinese word segmentation gives poor tolerance of typos in colloquial sentences such as public-opinion text.
2. The vector dimensionality finally produced by TF-IDF-style text vectorization methods is high, ranging from tens of thousands to hundreds of thousands, which consumes considerable resources and lengthens training; dimensionality reduction would lose part of the information. Although the final vector dimensionality of Word2Vec or Doc2Vec is not high, they need a powerful corpus to support them, and training consumes time and resources.
3. The TF-IDF vectorization scheme damages semantics: the sentence-to-vector relationship is many-to-one, and the colliding sentences may belong to different classes, so semantics cannot be restored. Word2Vec and Doc2Vec are statistics-based, so similar sentence structures yield similar vectors, and semantics likewise cannot be restored.
Summary of the invention
In view of the drawbacks of the prior art, the present invention provides a text vectorization method that removes Chinese word segmentation, thereby avoiding the errors that word segmentation introduces for colloquial sentences such as public-opinion text and the subsequent error cascading, and that tolerates typos in such colloquial sentences well.
According to a first aspect of the present invention, a text vectorization method is proposed. The method includes:
obtaining a text to be processed, determining the application type of the text, and obtaining a sample of the text;
extracting all single-character elements of the sample to obtain the single-character set of the sample;
extracting double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample;
merging the single-character set and the double-character set to obtain a vocabulary; and
constructing the text vector of the text according to the vocabulary.
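Under the assumption that "single-character elements" are individual characters and "double-character elements" are adjacent character pairs, the claimed steps can be sketched as below. This is a minimal illustration with hypothetical helper names, omitting the frequency filtering and type-specific extraction described later.

```python
def char_unigrams(sentences):
    """Step 2: all single-character elements of the sample (punctuation included)."""
    return {ch for s in sentences for ch in s}

def char_bigrams(sentences):
    """Step 3: all double-character elements, i.e. adjacent character pairs."""
    return {s[i:i + 2] for s in sentences for i in range(len(s) - 1)}

def build_vocabulary(sentences):
    """Step 4: merge the single-character and double-character sets."""
    return sorted(char_unigrams(sentences) | char_bigrams(sentences))

def vectorize(sentence, vocab):
    """Step 5: one dimension per vocabulary entry, 1 if present in the sentence."""
    return [1 if v in sentence else 0 for v in vocab]
```

Because no word segmenter is involved, a typo changes only the few vocabulary entries it touches, which is the fault-tolerance argument made above.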
Optionally, after extracting all single-character elements of the sample, the method further includes:
counting the occurrence frequency of each single-character element in the sample; and
removing the single-character elements with the highest and the lowest occurrence frequency in the sample, to obtain the single-character set of the sample.
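A sketch of this optional filtering step. The exact removal thresholds are not specified in the claim; this version, as an assumption, drops only the characters tied for the single highest and single lowest count.

```python
from collections import Counter

def filtered_unigram_set(sentences):
    """Count every character across the sample, then remove the characters with the
    highest and the lowest occurrence frequency; the rest form the single-character set."""
    counts = Counter(ch for s in sentences for ch in s)
    hi, lo = max(counts.values()), min(counts.values())
    return {ch for ch, c in counts.items() if lo < c < hi}
```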
Optionally, extracting double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample includes:
when the application type is binary classification, defining the class with the smaller number of double characters in the sample as the positive sample; and
extracting all double-character elements in the positive sample to obtain the double-character set of the sample.
Optionally, after extracting all double-character elements in the positive sample, the method further includes:
counting the occurrence frequency of each double-character element; and
removing the double-character elements with the highest and the lowest occurrence frequency in the positive sample, to obtain the double-character set of the sample.
Optionally, extracting double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample further includes:
when the application type is multiclass classification, extracting all double-character elements of each class in the sample separately, to obtain the double-character set of the sample.
Optionally, after extracting all double-character elements of each class in the sample, the method further includes:
counting the occurrence frequency of each double-character element of each class in the sample;
removing, for each class of the sample, the double-character elements with the highest and the lowest occurrence frequency, to obtain the double-character set of each class; and
merging the double-character sets of all classes of the sample, to obtain the double-character set of the sample.
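The multiclass variant just described (per-class counting, removal of each class's most- and least-frequent double characters, then a union across classes) can be sketched as follows; helper and parameter names are illustrative.

```python
from collections import Counter

def per_class_bigram_union(samples_by_class):
    """samples_by_class maps a class label to its list of sentences."""
    merged = set()
    for sentences in samples_by_class.values():
        counts = Counter(s[i:i + 2] for s in sentences for i in range(len(s) - 1))
        if not counts:
            continue
        hi, lo = max(counts.values()), min(counts.values())
        # keep each class's characteristic double characters, drop the extremes
        merged |= {b for b, c in counts.items() if lo < c < hi}
    return merged
```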
Optionally, extracting double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample further includes:
when the application type is text clustering, extracting all double-character elements in the sample to obtain the double-character set of the sample.
Optionally, after extracting all double-character elements in the sample, the method further includes:
counting the occurrence frequency of each double-character element; and
removing the double-character elements with the highest and the lowest occurrence frequency in the sample, to obtain the double-character set of the sample.
According to a second aspect of the present invention, a text vectorization device is proposed. The device includes:
an acquiring unit, configured to obtain the text on a user device, determine the application type of the text, and obtain a sample of the text;
a first extraction unit, configured to extract all single-character elements of the sample to obtain the single-character set of the sample;
a second extraction unit, configured to extract double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample;
a merging unit, configured to merge the single-character set and the double-character set to obtain a vocabulary; and
a construction unit, configured to construct the text vector of the text according to the vocabulary.
Optionally, the first extraction unit is further configured to:
count the occurrence frequency of each single-character element in the sample; and
remove the single-character elements with the highest and the lowest occurrence frequency in the sample, to obtain the single-character set of the sample.
Optionally, the second extraction unit is specifically configured to:
when the application type is binary classification, define the class with the smaller number of double characters in the sample as the positive sample; and
extract all double-character elements in the positive sample to obtain the double-character set of the sample.
Optionally, the second extraction unit is further configured to:
count the occurrence frequency of each double-character element; and
remove the double-character elements with the highest and the lowest occurrence frequency in the positive sample, to obtain the double-character set of the sample.
Optionally, the second extraction unit is further configured to:
when the application type is multiclass classification, extract all double-character elements of each class in the sample separately, to obtain the double-character set of the sample.
Optionally, the second extraction unit is further configured to:
count the occurrence frequency of each double-character element of each class in the sample;
remove, for each class of the sample, the double-character elements with the highest and the lowest occurrence frequency, to obtain the double-character set of each class; and
merge the double-character sets of all classes of the sample, to obtain the double-character set of the sample.
Optionally, the second extraction unit is further configured to:
when the application type is text clustering, extract all double-character elements in the sample to obtain the double-character set of the sample.
Optionally, the second extraction unit is further configured to:
count the occurrence frequency of each double-character element; and
remove the double-character elements with the highest and the lowest occurrence frequency in the sample, to obtain the double-character set of the sample.
Through the above technical solution, a pending public-opinion text is obtained, its application type is determined, and a sample of the public-opinion text is obtained; all single-character elements of the sample are extracted to obtain the single-character set of the sample; double-character elements are extracted according to the application type of the sample to obtain the double-character set of the sample; the single-character set and the double-character set are then merged to obtain a vocabulary; finally, the text vector of the public-opinion text is constructed according to the vocabulary. This avoids the errors that Chinese word segmentation introduces for colloquial sentences such as public-opinion text and the subsequent error cascading, and tolerates typos in such colloquial sentences well.
Description of the drawings
To explain the embodiments of the disclosure or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below are only some embodiments of the disclosure; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a flowchart of the text vectorization method provided by an embodiment of the disclosure;
Fig. 2 is a flowchart of the text vectorization method provided by an embodiment of the disclosure;
Fig. 3 is a structural schematic diagram of the text vectorization device provided by an embodiment of the disclosure.
Detailed description of the embodiments
The technical solutions in the embodiments of the disclosure are described clearly and completely below with reference to the drawings in the embodiments of the disclosure. Evidently, the described embodiments are only part of the embodiments of the disclosure, not all of them. Based on the embodiments of the disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the disclosure.
Some terms used in the embodiments of the disclosure are explained below.
The user equipment (UE) referred to in the embodiments of the disclosure is equipment such as a mobile terminal or a personal computer (PC) used by the user, for example a smartphone, personal digital assistant (PDA), tablet computer, laptop, carputer (in-vehicle computer), handheld device, smart glasses, smartwatch, wearable device, virtual-reality display device, or augmented-reality display device (such as Google Glass, Oculus Rift, HoloLens, Gear VR).
Fig. 1 is a flowchart of the text vectorization method provided by an embodiment of the disclosure. As shown in Fig. 1, the text vectorization method provided by an embodiment of the disclosure includes:
In step S101, the pending text is obtained, its application type is determined, and the sample of the text is obtained.
Here the pending text is obtained on a user device, which may be a mobile terminal, a PC, or similar equipment held by the user for accessing the operating service. The text is a public-opinion text, i.e. colloquial sentences expressing public opinion, and there are multiple such sentences. Public opinion refers to the sum of the attitudes, convictions, opinions, and emotions expressed by the general public about various phenomena and problems in society. In a concrete application, the application type of the public-opinion text is determined by manual labeling, and the sample of the public-opinion text is obtained.
Then, in step S102, all single-character elements of the sample are extracted to obtain the single-character set of the sample.
Here the sample of the public-opinion text is traversed and all single-character elements of its colloquial sentences, including punctuation, are kept; under normal circumstances the size of the single-character set of the sample is below 3000.
Next, in step S103, double-character elements of the sample are extracted according to the application type of the sample, to obtain the double-character set of the sample.
After the single-character set of the sample of the public-opinion text is obtained, the order between characters must also be considered so that sentences can be restored from text vectors. Different application types use different ways of capturing this semantic order.
Specifically, the step includes: when the application type is binary classification, defining the class with the smaller number of double characters in the sample as the positive sample, and extracting all double-character elements in the positive sample to obtain the double-character set of the sample; or, when the application type is multiclass classification, extracting all double-character elements of each class in the sample separately to obtain the double-character set of the sample; or, when the application type is text clustering, extracting all double-character elements in the sample to obtain the double-character set of the sample. Specifically, in the binary case the sample is divided in advance into positive and negative samples by manual labeling, and text vectorization then processes only the positive sample; in the multiclass case, each class needs to be processed separately.
Binary classification means that the sample of the public-opinion text is roughly divided into two major classes by manual labeling, for classification of public-opinion text. For example, the sample can be divided into two classes by the positive or negative emotion expressed by each colloquial sentence: sentences expressing positive emotion form the first class, and sentences expressing negative emotion form the second. Multiclass classification means that the sample of the public-opinion text is subdivided into multiple classes by manual labeling, for classification of public-opinion text. For example, the sample may be subdivided into five major classes (news, complaints, bugs, consulting, and suggestions), or into ten groups (promotional harassment, charging before boarding, inability to refund the ticket, driver cheating, information theft, duplicate payment, refusal to pay the fare, dangerous driving, fees for cancelled trips, and software crashes). Text clustering means directly extracting all double-character elements in the sample of the public-opinion text, for clustering of public-opinion text. The application type is the type of machine-learning task to be applied to the text after vectorization is complete: binary classification ultimately divides the pending text into two classes, multiclass classification divides it into multiple different classes, and text clustering clusters it. The text vectorization method differs for different machine-learning classification modes. In addition, for binary and multiclass classification, a preliminary pre-classification is performed by manual labeling before text vectorization and machine-learning classification; pre-classifying the sample lets machine learning adopt this classification scheme, after which it can classify pending text accordingly.
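The three application-type branches just described can be summarized in one dispatcher. This is a sketch: the label strings, the helper names, and the use of "fewest distinct double characters" to pick the positive class are assumptions based on the description.

```python
def bigrams(sentences):
    return {s[i:i + 2] for s in sentences for i in range(len(s) - 1)}

def extract_bigram_set(samples_by_class, app_type):
    """samples_by_class maps class label -> sentences; app_type selects the branch."""
    if app_type == "binary":
        # positive sample: the class with the smaller number of double characters
        positive = min(samples_by_class.values(), key=lambda sents: len(bigrams(sents)))
        return bigrams(positive)
    if app_type == "multiclass":
        merged = set()
        for sentences in samples_by_class.values():
            merged |= bigrams(sentences)  # per-class extraction, then union
        return merged
    # text clustering: all double characters from the whole sample
    return bigrams([s for sents in samples_by_class.values() for s in sents])
```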
Preferably, after all double-character elements in the positive sample are extracted, the method further includes: counting the occurrence frequency of each double-character element; and removing the double-character elements with the highest and the lowest occurrence frequency in the positive sample, to obtain the double-character set of the sample. The most frequent double-character elements can be considered unnecessary for semantic restoration, and the least frequent ones can be considered non-characteristic of the positive sample, with low discriminating power. Both can therefore be removed, and the remaining double-character elements of the positive sample form the double-character set of the sample of the public-opinion text. In concrete applications, for binary classification, the dimensionality (the number of double-character elements in the double-character set) is generally controlled within 1500. The resulting text vector is thus well suited as machine-learning input, the dimensionality of the generated text vector can be controlled, and good classification results are reached.
Preferably, after all double-character elements of each class in the sample are extracted, the method further includes: counting the occurrence frequency of each double-character element of each class in the sample; removing, for each class, the double-character elements with the highest and the lowest occurrence frequency, to obtain the double-character set of each class; and merging the double-character sets of all classes, to obtain the double-character set of the sample. For multiclass classification, occurrence frequencies must first be counted per class; as in binary classification, the most- and least-frequent double-character elements of each class are removed, yielding each class's characteristic double-character elements, which discriminate well between classes. The union of the per-class double-character sets is then taken to obtain the final double-character set of the sample. Controlling the dimensionality within 1000-7000 generally gives good classification results. In this way, the key double-character elements of each class each serve as one dimension of the vector; the single-character set is shared by all classes, while the double-character sets differ per class, highlighting class-specific dimensions. This helps improve the classification accuracy for public-opinion text while keeping the dimensionality of the generated text vector under control, reaching good classification results.
Preferably, after all double-character elements in the sample are extracted, the method further includes: counting the occurrence frequency of each double-character element; and removing the double-character elements with the highest and the lowest occurrence frequency in the sample, to obtain the double-character set of the sample. For text clustering, occurrence frequencies are counted over the sample of the public-opinion text; after the most- and least-frequent double-character elements are removed, the double-character set of the sample is obtained, with the dimensionality controlled between 1000 and 7000 to reach the expected clustering effect. The resulting text vector is thus well suited as machine-learning input, the dimensionality of the generated text vector can be controlled, and good clustering results are reached.
Then, in step S104, the single-character set and the double-character set are merged to obtain the vocabulary.
Finally, in step S105, the text vector of the text is constructed according to the vocabulary.
Here each single-character or double-character element serves as one dimension of the vector corresponding to each colloquial sentence in the public-opinion text: the dimension is 1 if the element occurs in the colloquial sentence and 0 otherwise, so that a vector corresponding to each colloquial sentence in the public-opinion text is constructed. The dimensionality of the vector is the total number of single-character and double-character elements. For binary classification the vector dimensionality is below 4500; for multiclass classification and text clustering it is also controlled within 10000, which is well suited as machine-learning input.
In this embodiment, a pending public-opinion text is taken, its application type is determined, and the sample of the public-opinion text is obtained; all single-character elements of the sample are extracted to obtain the single-character set of the sample; double-character elements are extracted according to the application type of the sample to obtain the double-character set of the sample; the two sets are then merged to obtain a vocabulary; finally, the text vector of the public-opinion text is constructed according to the vocabulary. This avoids the errors that Chinese word segmentation introduces for colloquial sentences such as public-opinion text and the subsequent error cascading, and tolerates typos in such colloquial sentences well.
Fig. 2 is a flowchart of the text vectorization method provided by an embodiment of the disclosure. As shown in Fig. 2, the text vectorization method provided by an embodiment of the disclosure includes:
In step S201, the public-opinion text on the user device is obtained, its application type is determined, and the sample of the public-opinion text is obtained.
For example, a sample of public-opinion text is as follows: {"the Didi driver is fine": 1, "the driver's attitude is fine": 1, "the Didi driver refused the ride": 0, "this orange is delicious": 0, "come get the certificate soon": 0}. This sample contains 5 colloquial sentences expressing public opinion, and the application type of the public-opinion text is determined by manual labeling; the concrete application type is binary classification. Colloquial sentences expressing that the Didi driver's attitude is good are labeled 1, and sentences expressing that the driver's attitude is bad, together with the other colloquial sentences, are labeled 0.
Then, in step S202, all single-character elements of the sample are extracted, and the occurrence frequency of each single-character element in the sample is counted; the single-character elements with the highest and the lowest occurrence frequency in the sample are then removed, to obtain the single-character set of the sample.
Here the sample of the public-opinion text is traversed and all single-character elements of its colloquial sentences, including punctuation, are kept; the single-character set of a sample is normally below 3000. After removing the least frequent, rare characters (whose influence factor is below one ten-thousandth) and the most frequent characters (whose discriminating power is low), the obtained single-character dimensionality is generally within 1500.
For the example in step S201, all single-character elements are extracted and the occurrence frequency of each is counted, with the result:
{"滴": 4, "的": 4, "司": 3, "机": 3, "很": 3, "好": 3, "服": 1, "务": 1, "态": 1, "度": 1, "拒": 1, "载": 1, "这": 1, "个": 1, "橙": 1, "子": 1, "吃": 1, "快": 1, "来": 1, "领": 1, "证": 1, "": 1}.
After the single-character elements with the highest occurrence frequency and the lowest occurrence frequency in the sample are removed, the result is:
{"司": 3, "机": 3, "很": 3, "好": 3}.
The dimensionality of the single-character set is 4, with the per-dimension features 司, 机, 很, 好.
And then, in step S203, double character elements of the sample are extracted according to the application type of the sample, are obtained To double character sets of the sample.
Since the exemplary application type in step S201 is binary classification, therefore the positive sample of negligible amounts need to be only accorded with from double word Double character elements are extracted in this can obtain double character sets of sample of public sentiment text.Extract all double words symbol in positive sample Element, and count per the occurrence frequency corresponding to double word symbol element, obtained concrete outcome is as follows:
{ " drop drop ":1, " drop ":1, " department ":1, " driver ":2, " machine is very ":1, " fine ":2, " machine ":1, " Clothes ":1, " service ":1, " business state ":1, " attitude ":1, " degree is very ":1}.
After removing the double-character elements with the minimum occurrence frequency, the result obtained is:
{"driver": 2, "fine": 2}.
Here, the double-character set has 2 dimensions, and the features of the dimensions are "driver" and "fine".
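The double-character extraction for the binary-classification case can be sketched in the same style, matching the worked example above, which prunes only the minimum-frequency pairs (claim 4 additionally removes the maximum-frequency pairs). The function name and toy positive sample are assumptions for illustration:

```python
from collections import Counter

def build_double_char_set(positive_sentences):
    """Extract every adjacent character pair (double-character element)
    from the positive sample, then drop the pairs whose occurrence
    frequency is the minimum, as in the worked example above."""
    counts = Counter(
        s[i:i + 2] for s in positive_sentences for i in range(len(s) - 1)
    )
    min_f = min(counts.values())
    return {pair: f for pair, f in counts.items() if f > min_f}

# Hypothetical positive sample: "ab" and "bc" occur twice, the rest once.
positive = ["abcd", "abce"]
print(build_double_char_set(positive))  # {'ab': 2, 'bc': 2}
```

The two surviving pairs play the same role as "driver" and "fine" in the 2-dimensional example above.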
Then, in step S204, the single-character set and the double-character set are merged to obtain the vocabulary.
For example, from the single-character set {"department", "machine", "very", "good"} obtained in step S202 and the double-character set {"driver", "fine"} obtained in step S203, the vocabulary {"department", "machine", "very", "good", "driver", "fine"} is obtained. The vocabulary has 6 dimensions, and the features of the dimensions are respectively "department", "machine", "very", "good", "driver", and "fine".
Finally, in step S205, the text vector of the public opinion text is built according to the vocabulary.
For the example in step S201, the vector corresponding to each colloquial sentence in the public opinion text can be obtained according to the vocabulary in step S204, respectively:
{[1,1,1,1,1,1]: 1, [1,1,1,1,1,1]: 1, [1,1,0,0,1,0]: 0, [0,0,1,1,0,1]: 0, [0,0,0,0,0,0]: 0}.
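These binary vectors follow a simple presence/absence rule over the vocabulary, which can be sketched in one line. The vocabulary and sentences below are Latin-letter stand-ins for the Chinese example, not the patent's actual data:

```python
def vectorize(sentence, vocabulary):
    """Binary text vector: component i is 1 if vocabulary feature i
    (a single- or double-character element) occurs in the sentence,
    and 0 otherwise."""
    return [1 if feature in sentence else 0 for feature in vocabulary]

# Assumed 6-dimensional vocabulary mirroring the merged example above:
# four single characters followed by two double characters.
vocab = ["s", "m", "v", "g", "sm", "vg"]
print(vectorize("ddsmvg", vocab))  # [1, 1, 1, 1, 1, 1]
print(vectorize("hello", vocab))   # [0, 0, 0, 0, 0, 0]
```

A sentence containing every vocabulary feature maps to an all-ones vector, as in the first two entries above, while a sentence sharing nothing with the vocabulary maps to the all-zeros vector.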
For the sample of the public opinion text, colloquial sentences of different classes do not correspond to identical vectors, which avoids the problem of sentences of different classes mapping to the same vector, and the positive and negative samples are clearly distinguishable in the obtained vectors. The vectors are therefore well suited to machine learning training and can achieve a good text classification effect.
It should be noted that, for simplicity of description, the method embodiments are expressed as a series of action combinations; however, those skilled in the art should understand that the embodiments of the present disclosure are not limited by the described action sequence, because according to the embodiments of the present disclosure, certain steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present disclosure.
Fig. 3 is a structural schematic diagram of the text vectorization device provided by an embodiment of the present disclosure. As shown in Fig. 3, the text vectorization device provided by the embodiment of the present disclosure includes:
an acquiring unit 301, configured to acquire the text in the user equipment, determine the application type of the text, and obtain the sample of the text;
a first extraction unit 302, configured to extract all single-character elements of the sample to obtain the single-character set of the sample;
a second extraction unit 303, configured to extract the double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample;
a combining unit 304, configured to merge the single-character set and the double-character set to obtain the vocabulary; and
a construction unit 305, configured to build the text vector of the text according to the vocabulary.
In an optional embodiment of the present disclosure, the first extraction unit 302 is further configured to:
count the occurrence frequency corresponding to each single-character element in the sample; and
remove the single-character elements with the maximum occurrence frequency and the single-character elements with the minimum occurrence frequency from the sample, to obtain the single-character set of the sample.
In an optional embodiment of the present disclosure, the second extraction unit 303 is specifically configured to:
in the case where the application type is binary classification, define the class with the smaller number of double characters in the sample as the positive sample; and
extract all double-character elements in the positive sample to obtain the double-character set of the sample.
In an optional embodiment of the present disclosure, the second extraction unit 303 is further configured to:
count the occurrence frequency corresponding to each double-character element; and
remove the double-character elements with the maximum occurrence frequency and the double-character elements with the minimum occurrence frequency from the positive sample, to obtain the double-character set of the sample.
In an optional embodiment of the present disclosure, the second extraction unit 303 is further configured to:
in the case where the application type is multivariate classification, extract all double-character elements of each class in the sample respectively, to obtain the double-character sets of the sample.
In an optional embodiment of the present disclosure, the second extraction unit 303 is further configured to:
count the occurrence frequency corresponding to each double-character element of each class in the sample;
remove the double-character elements with the maximum occurrence frequency and the double-character elements with the minimum occurrence frequency from each class of the sample, to obtain the double-character set of each class of the sample; and
merge the double-character sets of the classes of the sample, to obtain the double-character set of the sample.
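For the multivariate-classification branch just described (per-class extraction, per-class frequency pruning, then a union of the per-class sets), a minimal sketch follows; the class labels and sentences are invented for illustration:

```python
from collections import Counter

def build_multiclass_double_char_set(samples_by_class):
    """For each class: extract all double-character elements, remove the
    maximum- and minimum-frequency ones, then merge the per-class sets."""
    merged = {}
    for sentences in samples_by_class.values():
        counts = Counter(
            s[i:i + 2] for s in sentences for i in range(len(s) - 1)
        )
        max_f, min_f = max(counts.values()), min(counts.values())
        # Keep only the pairs strictly between the frequency extremes,
        # then fold this class's surviving pairs into the merged set.
        merged.update(
            {pair: f for pair, f in counts.items() if min_f < f < max_f}
        )
    return merged

samples = {"pos": ["ababc", "ababd"], "neg": ["efefg", "efefh"]}
print(build_multiclass_double_char_set(samples))  # {'ba': 2, 'fe': 2}
```

Each class contributes its own surviving pairs, and the union forms the double-character set of the whole sample.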
In an optional embodiment of the present disclosure, the second extraction unit 303 is further configured to:
in the case where the application type is text clustering, extract all double-character elements in the sample to obtain the double-character set of the sample.
In an optional embodiment of the present disclosure, the second extraction unit 303 is further configured to:
count the occurrence frequency corresponding to each double-character element; and
remove the double-character elements with the maximum occurrence frequency and the double-character elements with the minimum occurrence frequency from the sample, to obtain the double-character set of the sample.
It should be noted that further details of the text vectorization device provided by the present disclosure have been described in detail in the text vectorization method provided by the present disclosure, and are not repeated herein.
It should be noted that the components of the system of the present disclosure are logically partitioned according to the functions to be realized; however, the present disclosure is not limited thereto, and the components may be repartitioned or combined as needed. For example, several components may be combined into a single component, or some components may be further divided into more subcomponents.
The component embodiments of the present disclosure may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to realize some or all of the functions of some or all of the components of the system according to the embodiments of the present disclosure. The present disclosure may also be implemented as a device or program (for example, a computer program and a computer program product) for executing part or all of the method described herein. Such a program implementing the present disclosure may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present disclosure, and those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs between parentheses shall not be construed as limiting the claims. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present disclosure may be implemented by means of hardware comprising several different elements and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any ordering; these words may be interpreted as names.
The above embodiments are only intended to illustrate the present disclosure, not to limit it. Those of ordinary skill in the relevant technical field may also make various changes and modifications without departing from the spirit and scope of the present disclosure; therefore, all equivalent technical solutions also belong to the scope of the present disclosure, and the scope of patent protection of the present disclosure shall be defined by the claims.

Claims (16)

1. A text vectorization method, characterized in that the method comprises:
obtaining a text to be processed, determining an application type of the text, and obtaining a sample of the text;
extracting all single-character elements of the sample to obtain a single-character set of the sample;
extracting double-character elements of the sample according to the application type of the sample to obtain a double-character set of the sample;
merging the single-character set and the double-character set to obtain a vocabulary; and
building a text vector of the text according to the vocabulary.
2. The text vectorization method according to claim 1, characterized in that, after extracting all single-character elements of the sample, the method further comprises:
counting the occurrence frequency corresponding to each single-character element in the sample; and
removing the single-character elements with the maximum occurrence frequency and the single-character elements with the minimum occurrence frequency from the sample, to obtain the single-character set of the sample.
3. The text vectorization method according to claim 1, characterized in that extracting the double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample comprises:
in the case where the application type is binary classification, defining the class with the smaller number of double characters in the sample as the positive sample; and
extracting all double-character elements in the positive sample to obtain the double-character set of the sample.
4. The text vectorization method according to claim 3, characterized in that, after extracting all double-character elements in the positive sample, the method further comprises:
counting the occurrence frequency corresponding to each double-character element; and
removing the double-character elements with the maximum occurrence frequency and the double-character elements with the minimum occurrence frequency from the positive sample, to obtain the double-character set of the sample.
5. The text vectorization method according to claim 1, characterized in that extracting the double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample further comprises:
in the case where the application type is multivariate classification, extracting all double-character elements of each class in the sample respectively, to obtain the double-character sets of the sample.
6. The text vectorization method according to claim 5, characterized in that, after extracting all double-character elements of each class in the sample respectively, the method further comprises:
counting the occurrence frequency corresponding to each double-character element of each class in the sample;
removing the double-character elements with the maximum occurrence frequency and the double-character elements with the minimum occurrence frequency from each class of the sample, to obtain the double-character set of each class of the sample; and
merging the double-character sets of the classes of the sample, to obtain the double-character set of the sample.
7. The text vectorization method according to claim 1, characterized in that extracting the double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample further comprises:
in the case where the application type is text clustering, extracting all double-character elements in the sample to obtain the double-character set of the sample.
8. The text vectorization method according to claim 7, characterized in that, after extracting all double-character elements in the sample, the method further comprises:
counting the occurrence frequency corresponding to each double-character element; and
removing the double-character elements with the maximum occurrence frequency and the double-character elements with the minimum occurrence frequency from the sample, to obtain the double-character set of the sample.
9. A text vectorization device, characterized in that the device comprises:
an acquiring unit, configured to acquire the text in the user equipment, determine the application type of the text, and obtain a sample of the text;
a first extraction unit, configured to extract all single-character elements of the sample to obtain a single-character set of the sample;
a second extraction unit, configured to extract double-character elements of the sample according to the application type of the sample to obtain a double-character set of the sample;
a combining unit, configured to merge the single-character set and the double-character set to obtain a vocabulary; and
a construction unit, configured to build a text vector of the text according to the vocabulary.
10. The text vectorization device according to claim 9, characterized in that the first extraction unit is further configured to:
count the occurrence frequency corresponding to each single-character element in the sample; and
remove the single-character elements with the maximum occurrence frequency and the single-character elements with the minimum occurrence frequency from the sample, to obtain the single-character set of the sample.
11. The text vectorization device according to claim 9, characterized in that the second extraction unit is specifically configured to:
in the case where the application type is binary classification, define the class with the smaller number of double characters in the sample as the positive sample; and
extract all double-character elements in the positive sample to obtain the double-character set of the sample.
12. The text vectorization device according to claim 11, characterized in that the second extraction unit is further configured to:
count the occurrence frequency corresponding to each double-character element; and
remove the double-character elements with the maximum occurrence frequency and the double-character elements with the minimum occurrence frequency from the positive sample, to obtain the double-character set of the sample.
13. The text vectorization device according to claim 9, characterized in that the second extraction unit is further configured to:
in the case where the application type is multivariate classification, extract all double-character elements of each class in the sample respectively, to obtain the double-character sets of the sample.
14. The text vectorization device according to claim 13, characterized in that the second extraction unit is further configured to:
count the occurrence frequency corresponding to each double-character element of each class in the sample;
remove the double-character elements with the maximum occurrence frequency and the double-character elements with the minimum occurrence frequency from each class of the sample, to obtain the double-character set of each class of the sample; and
merge the double-character sets of the classes of the sample, to obtain the double-character set of the sample.
15. The text vectorization device according to claim 9, characterized in that the second extraction unit is further configured to:
in the case where the application type is text clustering, extract all double-character elements in the sample to obtain the double-character set of the sample.
16. The text vectorization device according to claim 15, characterized in that the second extraction unit is further configured to:
count the occurrence frequency corresponding to each double-character element; and
remove the double-character elements with the maximum occurrence frequency and the double-character elements with the minimum occurrence frequency from the sample, to obtain the double-character set of the sample.
CN201710134611.9A 2017-03-08 2017-03-08 A kind of the vectorization method and device of text Pending CN108572961A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710134611.9A CN108572961A (en) 2017-03-08 2017-03-08 A kind of the vectorization method and device of text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710134611.9A CN108572961A (en) 2017-03-08 2017-03-08 A kind of the vectorization method and device of text

Publications (1)

Publication Number Publication Date
CN108572961A true CN108572961A (en) 2018-09-25

Family

ID=63576883

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710134611.9A Pending CN108572961A (en) 2017-03-08 2017-03-08 A kind of the vectorization method and device of text

Country Status (1)

Country Link
CN (1) CN108572961A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1158460A (en) * 1996-12-31 1997-09-03 复旦大学 Multiple languages automatic classifying and searching method
CN1558367A (en) * 2004-01-16 2004-12-29 清华大学 Feature dimension reduction method for automatic classification of Chinese text
CN103020167A (en) * 2012-11-26 2013-04-03 南京大学 Chinese text classification method for computer
CN103186845A (en) * 2011-12-29 2013-07-03 盈世信息科技(北京)有限公司 Junk mail filtering method
CN103544246A (en) * 2013-10-10 2014-01-29 清华大学 Method and system for constructing multi-emotion dictionary for internet
CN106294350A (en) * 2015-05-13 2017-01-04 阿里巴巴集团控股有限公司 A kind of text polymerization and device


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472241A (en) * 2019-07-29 2019-11-19 平安科技(深圳)有限公司 Generate the method and relevant device of de-redundancy information sentence vector
CN110472241B (en) * 2019-07-29 2023-11-10 平安科技(深圳)有限公司 Method for generating redundancy-removed information sentence vector and related equipment
CN110705260A (en) * 2019-09-24 2020-01-17 北京工商大学 Text vector generation method based on unsupervised graph neural network structure
CN110705260B (en) * 2019-09-24 2023-04-18 北京工商大学 Text vector generation method based on unsupervised graph neural network structure

Similar Documents

Publication Publication Date Title
CN108804512B (en) Text classification model generation device and method and computer readable storage medium
CN107609121A (en) Newsletter archive sorting technique based on LDA and word2vec algorithms
CN111310476B (en) Public opinion monitoring method and system using aspect-based emotion analysis method
CN109460551B (en) Signature information extraction method and device
CN111680145A (en) Knowledge representation learning method, device, equipment and storage medium
CN108009148A (en) Text emotion classification method for expressing based on deep learning
CN105912716A (en) Short text classification method and apparatus
CN110321553A (en) Short text subject identifying method, device and computer readable storage medium
US9785705B1 (en) Generating and applying data extraction templates
CN110929025A (en) Junk text recognition method and device, computing equipment and readable storage medium
CN112330455A (en) Method, device, equipment and storage medium for pushing information
US20220269354A1 (en) Artificial intelligence-based system and method for dynamically predicting and suggesting emojis for messages
CN107145516A (en) A kind of Text Clustering Method and system
CN111339295A (en) Method, apparatus, electronic device and computer readable medium for presenting information
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
Sheshikala et al. Natural language processing and machine learning classifier used for detecting the author of the sentence
CN110232127A (en) File classification method and device
CN113051480A (en) Resource pushing method and device, electronic equipment and storage medium
CN110990587B (en) Enterprise relation discovery method and system based on topic model
CN113051380A (en) Information generation method and device, electronic equipment and storage medium
Mani et al. Email spam detection using gated recurrent neural network
Al Mostakim et al. Bangla content categorization using text based supervised learning methods
CN108572961A (en) A kind of the vectorization method and device of text
CN110019821A (en) Text category training method and recognition methods, relevant apparatus and storage medium
CN110704611B (en) Illegal text recognition method and device based on feature de-interleaving

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180925