CN108572961A - Text vectorization method and device - Google Patents

Text vectorization method and device

Info

Publication number
CN108572961A
CN108572961A
Authority
CN
China
Prior art keywords
sample
double
text
double character
occurrence frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710134611.9A
Other languages
Chinese (zh)
Inventor
刘家兵
刘永波
吴春龙
张少松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd filed Critical Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN201710134611.9A priority Critical patent/CN108572961A/en
Publication of CN108572961A publication Critical patent/CN108572961A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention discloses a text vectorization method and device, relating to the field of text vectorization. The method includes: obtaining a text to be processed, determining the application type of the text, and obtaining a sample of the text; extracting all single-character elements of the sample to obtain the single-character set of the sample; extracting double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample; merging the single-character set and the double-character set to obtain a vocabulary; and constructing the text vector of the text according to the vocabulary. The invention removes Chinese word segmentation, avoiding the errors that word segmentation introduces for colloquial sentences such as public-opinion (public sentiment) text and the subsequent error cascading, and tolerates typos in such colloquial sentences well.

Description

Text vectorization method and device
Technical field
The present invention relates to the field of text vectorization and, in particular, to a text vectorization method and device.
Background technology
For various machine learning algorithms, the input is a vector and the output may be a continuous or a discrete value. Text classification and clustering are very important applications of machine learning, and text vectorization is the first step of either; it directly determines the quality of the final machine-learning result.
Existing text vectorization techniques are as follows:
TF-IDF (term frequency-inverse document frequency) is a common weighting technique for information retrieval and data mining. The dimensionality of the sentence vector is the size of the vocabulary, and the value of each dimension is the weight of the corresponding word as computed by the TF-IDF method. TF is the number of times a word occurs in a single sentence; since sentence lengths differ, it needs to be normalized. IDF is the inverse document frequency, computed as log(total number of sentences in the corpus / (number of sentences containing the word + 1)). The final value of each dimension of the sentence vector is TF*IDF.
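The TF-IDF scheme just described can be sketched as follows. This is an illustration of the background technique, not part of the patent; the function and variable names are our own, TF is normalized by sentence length, and IDF uses the log(total / (containing + 1)) formula above.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """sentences: list of token lists. Returns (vocab, one TF-IDF vector per sentence)."""
    vocab = sorted({tok for s in sentences for tok in s})
    n = len(sentences)
    # IDF with the +1 in the denominator, as in the formula above
    idf = {t: math.log(n / (sum(1 for s in sentences if t in s) + 1)) for t in vocab}
    vectors = []
    for s in sentences:
        counts = Counter(s)
        length = len(s) or 1  # normalize TF by sentence length
        vectors.append([(counts[t] / length) * idf[t] for t in vocab])
    return vocab, vectors
```

Note that the vector dimensionality equals the vocabulary size, which is exactly the scaling problem criticized below.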
The main techniques used by Word2Vec are the Continuous Bag-of-Words Model (CBOW) and the Continuous Skip-gram Model. CBOW predicts the probability of the current word from its context, while Skip-gram predicts the context from the current word; the core of both is a neural network. The resulting word vectors have relatively low dimensionality (100-400 is typical), and word similarity can be computed easily from the angle between vectors. However, it is difficult to obtain a vector characterizing sentence semantics from word vectors, hence Doc2Vec, which converts a sentence directly into a vector. Apart from an additional sentence vector, Doc2Vec does not differ much from Word2Vec.
However, the prior art has the following problems:
1. The prior art is largely based on Chinese word segmentation. Segmentation works well on sentences in written form, but handles colloquial sentences such as public-opinion text poorly, introducing a fair number of errors. Because of error cascading, this significantly affects the final machine-learning result. In addition, relying on Chinese word segmentation gives poor tolerance of typos in colloquial sentences such as public-opinion text.
2. The vector dimensionality finally produced by TF-IDF-style text vectorization methods is high, ranging from tens of thousands to hundreds of thousands, which consumes considerable resources and lengthens training; dimensionality reduction would lose part of the information. Although the final vector dimensionality of Word2Vec or Doc2Vec is not high, they need a powerful corpus to support them, and training consumes time and resources.
3. The TF-IDF vectorization scheme damages semantics: the sentence-to-vector relationship is many-to-one, and the colliding sentences may belong to different classes, so semantics cannot be restored. Word2Vec and Doc2Vec are statistics-based, so similar sentence structures yield similar vectors, and semantics likewise cannot be restored.
Summary of the invention
In view of the drawbacks of the prior art, the present invention provides a text vectorization method that removes Chinese word segmentation, thereby avoiding the errors that word segmentation introduces for colloquial sentences such as public-opinion text and the subsequent error cascading, and that tolerates typos in such colloquial sentences well.
According to a first aspect of the present invention, a text vectorization method is proposed. The method includes:
obtaining a text to be processed, determining the application type of the text, and obtaining a sample of the text;
extracting all single-character elements of the sample to obtain the single-character set of the sample;
extracting double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample;
merging the single-character set and the double-character set to obtain a vocabulary; and
constructing the text vector of the text according to the vocabulary.
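Under the assumption that "single-character elements" are individual characters and "double-character elements" are adjacent character pairs, the claimed steps can be sketched as below. This is a minimal illustration with hypothetical helper names, omitting the frequency filtering and type-specific extraction described later.

```python
def char_unigrams(sentences):
    """Step 2: all single-character elements of the sample (punctuation included)."""
    return {ch for s in sentences for ch in s}

def char_bigrams(sentences):
    """Step 3: all double-character elements, i.e. adjacent character pairs."""
    return {s[i:i + 2] for s in sentences for i in range(len(s) - 1)}

def build_vocabulary(sentences):
    """Step 4: merge the single-character and double-character sets."""
    return sorted(char_unigrams(sentences) | char_bigrams(sentences))

def vectorize(sentence, vocab):
    """Step 5: one dimension per vocabulary entry, 1 if present in the sentence."""
    return [1 if v in sentence else 0 for v in vocab]
```

Because no word segmenter is involved, a typo changes only the few vocabulary entries it touches, which is the fault-tolerance argument made above.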
Optionally, after extracting all single-character elements of the sample, the method further includes:
counting the occurrence frequency of each single-character element in the sample; and
removing the single-character elements with the highest and the lowest occurrence frequency in the sample, to obtain the single-character set of the sample.
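A sketch of this optional filtering step. The exact removal thresholds are not specified in the claim; this version, as an assumption, drops only the characters tied for the single highest and single lowest count.

```python
from collections import Counter

def filtered_unigram_set(sentences):
    """Count every character across the sample, then remove the characters with the
    highest and the lowest occurrence frequency; the rest form the single-character set."""
    counts = Counter(ch for s in sentences for ch in s)
    hi, lo = max(counts.values()), min(counts.values())
    return {ch for ch, c in counts.items() if lo < c < hi}
```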
Optionally, extracting double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample includes:
when the application type is binary classification, defining the class with the smaller number of double characters in the sample as the positive sample; and
extracting all double-character elements in the positive sample to obtain the double-character set of the sample.
Optionally, after extracting all double-character elements in the positive sample, the method further includes:
counting the occurrence frequency of each double-character element; and
removing the double-character elements with the highest and the lowest occurrence frequency in the positive sample, to obtain the double-character set of the sample.
Optionally, extracting double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample further includes:
when the application type is multiclass classification, extracting all double-character elements of each class in the sample separately, to obtain the double-character set of the sample.
Optionally, after extracting all double-character elements of each class in the sample, the method further includes:
counting the occurrence frequency of each double-character element of each class in the sample;
removing, for each class of the sample, the double-character elements with the highest and the lowest occurrence frequency, to obtain the double-character set of each class; and
merging the double-character sets of all classes of the sample, to obtain the double-character set of the sample.
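The multiclass variant just described (per-class counting, removal of each class's most- and least-frequent double characters, then a union across classes) can be sketched as follows; helper and parameter names are illustrative.

```python
from collections import Counter

def per_class_bigram_union(samples_by_class):
    """samples_by_class maps a class label to its list of sentences."""
    merged = set()
    for sentences in samples_by_class.values():
        counts = Counter(s[i:i + 2] for s in sentences for i in range(len(s) - 1))
        if not counts:
            continue
        hi, lo = max(counts.values()), min(counts.values())
        # keep each class's characteristic double characters, drop the extremes
        merged |= {b for b, c in counts.items() if lo < c < hi}
    return merged
```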
Optionally, extracting double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample further includes:
when the application type is text clustering, extracting all double-character elements in the sample to obtain the double-character set of the sample.
Optionally, after extracting all double-character elements in the sample, the method further includes:
counting the occurrence frequency of each double-character element; and
removing the double-character elements with the highest and the lowest occurrence frequency in the sample, to obtain the double-character set of the sample.
According to a second aspect of the present invention, a text vectorization device is proposed. The device includes:
an acquiring unit, configured to obtain the text on a user device, determine the application type of the text, and obtain a sample of the text;
a first extraction unit, configured to extract all single-character elements of the sample to obtain the single-character set of the sample;
a second extraction unit, configured to extract double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample;
a merging unit, configured to merge the single-character set and the double-character set to obtain a vocabulary; and
a construction unit, configured to construct the text vector of the text according to the vocabulary.
Optionally, the first extraction unit is further configured to:
count the occurrence frequency of each single-character element in the sample; and
remove the single-character elements with the highest and the lowest occurrence frequency in the sample, to obtain the single-character set of the sample.
Optionally, the second extraction unit is specifically configured to:
when the application type is binary classification, define the class with the smaller number of double characters in the sample as the positive sample; and
extract all double-character elements in the positive sample to obtain the double-character set of the sample.
Optionally, the second extraction unit is further configured to:
count the occurrence frequency of each double-character element; and
remove the double-character elements with the highest and the lowest occurrence frequency in the positive sample, to obtain the double-character set of the sample.
Optionally, the second extraction unit is further configured to:
when the application type is multiclass classification, extract all double-character elements of each class in the sample separately, to obtain the double-character set of the sample.
Optionally, the second extraction unit is further configured to:
count the occurrence frequency of each double-character element of each class in the sample;
remove, for each class of the sample, the double-character elements with the highest and the lowest occurrence frequency, to obtain the double-character set of each class; and
merge the double-character sets of all classes of the sample, to obtain the double-character set of the sample.
Optionally, the second extraction unit is further configured to:
when the application type is text clustering, extract all double-character elements in the sample to obtain the double-character set of the sample.
Optionally, the second extraction unit is further configured to:
count the occurrence frequency of each double-character element; and
remove the double-character elements with the highest and the lowest occurrence frequency in the sample, to obtain the double-character set of the sample.
Through the above technical solution, a pending public-opinion text is obtained, its application type is determined, and a sample of the public-opinion text is obtained; all single-character elements of the sample are extracted to obtain the single-character set of the sample; double-character elements are extracted according to the application type of the sample to obtain the double-character set of the sample; the single-character set and the double-character set are then merged to obtain a vocabulary; finally, the text vector of the public-opinion text is constructed according to the vocabulary. This avoids the errors that Chinese word segmentation introduces for colloquial sentences such as public-opinion text and the subsequent error cascading, and tolerates typos in such colloquial sentences well.
Description of the drawings
To explain the embodiments of the disclosure or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below are only some embodiments of the disclosure; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a flowchart of the text vectorization method provided by an embodiment of the disclosure;
Fig. 2 is a flowchart of the text vectorization method provided by an embodiment of the disclosure;
Fig. 3 is a structural schematic diagram of the text vectorization device provided by an embodiment of the disclosure.
Detailed description of the embodiments
The technical solutions in the embodiments of the disclosure are described clearly and completely below with reference to the drawings in the embodiments of the disclosure. Evidently, the described embodiments are only part of the embodiments of the disclosure, not all of them. Based on the embodiments of the disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the disclosure.
Some terms used in the embodiments of the disclosure are explained below.
The user equipment (UE) referred to in the embodiments of the disclosure is equipment such as a mobile terminal or a personal computer (PC) used by the user, for example a smartphone, personal digital assistant (PDA), tablet computer, laptop, carputer (in-vehicle computer), handheld device, smart glasses, smartwatch, wearable device, virtual-reality display device, or augmented-reality display device (such as Google Glass, Oculus Rift, HoloLens, Gear VR).
Fig. 1 is a flowchart of the text vectorization method provided by an embodiment of the disclosure. As shown in Fig. 1, the text vectorization method provided by an embodiment of the disclosure includes:
In step S101, the pending text is obtained, its application type is determined, and the sample of the text is obtained.
Here the pending text is obtained on a user device, which may be a mobile terminal, a PC, or similar equipment held by the user for accessing the operating service. The text is a public-opinion text, i.e. colloquial sentences expressing public opinion, and there are multiple such sentences. Public opinion refers to the sum of the attitudes, convictions, opinions, and emotions expressed by the general public about various phenomena and problems in society. In a concrete application, the application type of the public-opinion text is determined by manual labeling, and the sample of the public-opinion text is obtained.
Then, in step S102, all single-character elements of the sample are extracted to obtain the single-character set of the sample.
Here the sample of the public-opinion text is traversed and all single-character elements of its colloquial sentences, including punctuation, are kept; under normal circumstances the size of the single-character set of the sample is below 3000.
Next, in step S103, double-character elements of the sample are extracted according to the application type of the sample, to obtain the double-character set of the sample.
After the single-character set of the sample of the public-opinion text is obtained, the order between characters must also be considered so that sentences can be restored from text vectors. Different application types use different ways of capturing this semantic order.
Specifically, the step includes: when the application type is binary classification, defining the class with the smaller number of double characters in the sample as the positive sample, and extracting all double-character elements in the positive sample to obtain the double-character set of the sample; or, when the application type is multiclass classification, extracting all double-character elements of each class in the sample separately to obtain the double-character set of the sample; or, when the application type is text clustering, extracting all double-character elements in the sample to obtain the double-character set of the sample. Specifically, in the binary case the sample is divided in advance into positive and negative samples by manual labeling, and text vectorization then processes only the positive sample; in the multiclass case, each class needs to be processed separately.
Binary classification means that the sample of the public-opinion text is roughly divided into two major classes by manual labeling, for classification of public-opinion text. For example, the sample can be divided into two classes by the positive or negative emotion expressed by each colloquial sentence: sentences expressing positive emotion form the first class, and sentences expressing negative emotion form the second. Multiclass classification means that the sample of the public-opinion text is subdivided into multiple classes by manual labeling, for classification of public-opinion text. For example, the sample may be subdivided into five major classes (news, complaints, bugs, consulting, and suggestions), or into ten groups (promotional harassment, charging before boarding, inability to refund the ticket, driver cheating, information theft, duplicate payment, refusal to pay the fare, dangerous driving, fees for cancelled trips, and software crashes). Text clustering means directly extracting all double-character elements in the sample of the public-opinion text, for clustering of public-opinion text. The application type is the type of machine-learning task to be applied to the text after vectorization is complete: binary classification ultimately divides the pending text into two classes, multiclass classification divides it into multiple different classes, and text clustering clusters it. The text vectorization method differs for different machine-learning classification modes. In addition, for binary and multiclass classification, a preliminary pre-classification is performed by manual labeling before text vectorization and machine-learning classification; pre-classifying the sample lets machine learning adopt this classification scheme, after which it can classify pending text accordingly.
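The three application-type branches just described can be summarized in one dispatcher. This is a sketch: the label strings, the helper names, and the use of "fewest distinct double characters" to pick the positive class are assumptions based on the description.

```python
def bigrams(sentences):
    return {s[i:i + 2] for s in sentences for i in range(len(s) - 1)}

def extract_bigram_set(samples_by_class, app_type):
    """samples_by_class maps class label -> sentences; app_type selects the branch."""
    if app_type == "binary":
        # positive sample: the class with the smaller number of double characters
        positive = min(samples_by_class.values(), key=lambda sents: len(bigrams(sents)))
        return bigrams(positive)
    if app_type == "multiclass":
        merged = set()
        for sentences in samples_by_class.values():
            merged |= bigrams(sentences)  # per-class extraction, then union
        return merged
    # text clustering: all double characters from the whole sample
    return bigrams([s for sents in samples_by_class.values() for s in sents])
```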
Preferably, after all double-character elements in the positive sample are extracted, the method further includes: counting the occurrence frequency of each double-character element; and removing the double-character elements with the highest and the lowest occurrence frequency in the positive sample, to obtain the double-character set of the sample. The most frequent double-character elements can be considered unnecessary for semantic restoration, and the least frequent ones can be considered non-characteristic of the positive sample, with low discriminating power. Both can therefore be removed, and the remaining double-character elements of the positive sample form the double-character set of the sample of the public-opinion text. In concrete applications, for binary classification, the dimensionality (the number of double-character elements in the double-character set) is generally controlled within 1500. The resulting text vector is thus well suited as machine-learning input, the dimensionality of the generated text vector can be controlled, and good classification results are reached.
Preferably, after all double-character elements of each class in the sample are extracted, the method further includes: counting the occurrence frequency of each double-character element of each class in the sample; removing, for each class, the double-character elements with the highest and the lowest occurrence frequency, to obtain the double-character set of each class; and merging the double-character sets of all classes, to obtain the double-character set of the sample. For multiclass classification, occurrence frequencies must first be counted per class; as in binary classification, the most- and least-frequent double-character elements of each class are removed, yielding each class's characteristic double-character elements, which discriminate well between classes. The union of the per-class double-character sets is then taken to obtain the final double-character set of the sample. Controlling the dimensionality within 1000-7000 generally gives good classification results. In this way, the key double-character elements of each class each serve as one dimension of the vector; the single-character set is shared by all classes, while the double-character sets differ per class, highlighting class-specific dimensions. This helps improve the classification accuracy for public-opinion text while keeping the dimensionality of the generated text vector under control, reaching good classification results.
Preferably, after all double-character elements in the sample are extracted, the method further includes: counting the occurrence frequency of each double-character element; and removing the double-character elements with the highest and the lowest occurrence frequency in the sample, to obtain the double-character set of the sample. For text clustering, occurrence frequencies are counted over the sample of the public-opinion text; after the most- and least-frequent double-character elements are removed, the double-character set of the sample is obtained, with the dimensionality controlled between 1000 and 7000 to reach the expected clustering effect. The resulting text vector is thus well suited as machine-learning input, the dimensionality of the generated text vector can be controlled, and good clustering results are reached.
Then, in step S104, the single-character set and the double-character set are merged to obtain the vocabulary.
Finally, in step S105, the text vector of the text is constructed according to the vocabulary.
Here each single-character or double-character element serves as one dimension of the vector corresponding to each colloquial sentence in the public-opinion text: the dimension is 1 if the element occurs in the colloquial sentence and 0 otherwise, so that a vector corresponding to each colloquial sentence in the public-opinion text is constructed. The dimensionality of the vector is the total number of single-character and double-character elements. For binary classification the vector dimensionality is below 4500; for multiclass classification and text clustering it is also controlled within 10000, which is well suited as machine-learning input.
In this embodiment, a pending public-opinion text is taken, its application type is determined, and the sample of the public-opinion text is obtained; all single-character elements of the sample are extracted to obtain the single-character set of the sample; double-character elements are extracted according to the application type of the sample to obtain the double-character set of the sample; the two sets are then merged to obtain a vocabulary; finally, the text vector of the public-opinion text is constructed according to the vocabulary. This avoids the errors that Chinese word segmentation introduces for colloquial sentences such as public-opinion text and the subsequent error cascading, and tolerates typos in such colloquial sentences well.
Fig. 2 is a flowchart of the text vectorization method provided by an embodiment of the disclosure. As shown in Fig. 2, the text vectorization method provided by an embodiment of the disclosure includes:
In step S201, the public-opinion text on the user device is obtained, its application type is determined, and the sample of the public-opinion text is obtained.
For example, a sample of public-opinion text is as follows: {"the Didi driver is fine": 1, "the driver's attitude is fine": 1, "the Didi driver refused the ride": 0, "this orange is delicious": 0, "come get the certificate soon": 0}. This sample contains 5 colloquial sentences expressing public opinion, and the application type of the public-opinion text is determined by manual labeling; the concrete application type is binary classification. Colloquial sentences expressing that the Didi driver's attitude is good are labeled 1, and sentences expressing that the driver's attitude is bad, together with the other colloquial sentences, are labeled 0.
Then, in step S202, all single-character elements of the sample are extracted, and the occurrence frequency of each single-character element in the sample is counted; the single-character elements with the highest and the lowest occurrence frequency in the sample are then removed, to obtain the single-character set of the sample.
Here the sample of the public-opinion text is traversed and all single-character elements of its colloquial sentences, including punctuation, are kept; the single-character set of a sample is normally below 3000. After removing the least frequent, rare characters (whose influence factor is below one ten-thousandth) and the most frequent characters (whose discriminating power is low), the obtained single-character dimensionality is generally within 1500.
For the example in step S201, all single-character elements are extracted and the occurrence frequency of each is counted, with the result:
{"滴": 4, "的": 4, "司": 3, "机": 3, "很": 3, "好": 3, "服": 1, "务": 1, "态": 1, "度": 1, "拒": 1, "载": 1, "这": 1, "个": 1, "橙": 1, "子": 1, "吃": 1, "快": 1, "来": 1, "领": 1, "证": 1, "": 1}.
After the single-character elements with the highest occurrence frequency and the lowest occurrence frequency in the sample are removed, the result is:
{"司": 3, "机": 3, "很": 3, "好": 3}.
The dimensionality of the single-character set is 4, with the per-dimension features 司, 机, 很, 好.
And then, in step S203, double character elements of the sample are extracted according to the application type of the sample, are obtained To double character sets of the sample.
Since the exemplary application type in step S201 is binary classification, therefore the positive sample of negligible amounts need to be only accorded with from double word Double character elements are extracted in this can obtain double character sets of sample of public sentiment text.Extract all double words symbol in positive sample Element, and count per the occurrence frequency corresponding to double word symbol element, obtained concrete outcome is as follows:
{ " drop drop ":1, " drop ":1, " department ":1, " driver ":2, " machine is very ":1, " fine ":2, " machine ":1, " Clothes ":1, " service ":1, " business state ":1, " attitude ":1, " degree is very ":1}.
After removing the double-character elements with the minimum occurrence frequency, the result obtained is:
{"driver": 2, "fine": 2}.
Here, the double-character set has 2 dimensions, and the features of the dimensions are "driver" and "fine".
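The double-character extraction for the binary-classification case can be sketched in the same style, matching the worked example above, which prunes only the minimum-frequency pairs (claim 4 additionally removes the maximum-frequency pairs). The function name and toy positive sample are assumptions for illustration:

```python
from collections import Counter

def build_double_char_set(positive_sentences):
    """Extract every adjacent character pair (double-character element)
    from the positive sample, then drop the pairs whose occurrence
    frequency is the minimum, as in the worked example above."""
    counts = Counter(
        s[i:i + 2] for s in positive_sentences for i in range(len(s) - 1)
    )
    min_f = min(counts.values())
    return {pair: f for pair, f in counts.items() if f > min_f}

# Hypothetical positive sample: "ab" and "bc" occur twice, the rest once.
positive = ["abcd", "abce"]
print(build_double_char_set(positive))  # {'ab': 2, 'bc': 2}
```

The two surviving pairs play the same role as "driver" and "fine" in the 2-dimensional example above.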
Then, in step S204, the single-character set and the double-character set are merged to obtain the vocabulary.
For example, from the single-character set {"department", "machine", "very", "good"} obtained in step S202 and the double-character set {"driver", "fine"} obtained in step S203, the vocabulary {"department", "machine", "very", "good", "driver", "fine"} is obtained. The vocabulary has 6 dimensions, and the features of the dimensions are respectively "department", "machine", "very", "good", "driver", and "fine".
Finally, in step S205, the text vector of the public opinion text is built according to the vocabulary.
For the example in step S201, the vector corresponding to each colloquial sentence in the public opinion text can be obtained according to the vocabulary in step S204, respectively:
{[1,1,1,1,1,1]: 1, [1,1,1,1,1,1]: 1, [1,1,0,0,1,0]: 0, [0,0,1,1,0,1]: 0, [0,0,0,0,0,0]: 0}.
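These binary vectors follow a simple presence/absence rule over the vocabulary, which can be sketched in one line. The vocabulary and sentences below are Latin-letter stand-ins for the Chinese example, not the patent's actual data:

```python
def vectorize(sentence, vocabulary):
    """Binary text vector: component i is 1 if vocabulary feature i
    (a single- or double-character element) occurs in the sentence,
    and 0 otherwise."""
    return [1 if feature in sentence else 0 for feature in vocabulary]

# Assumed 6-dimensional vocabulary mirroring the merged example above:
# four single characters followed by two double characters.
vocab = ["s", "m", "v", "g", "sm", "vg"]
print(vectorize("ddsmvg", vocab))  # [1, 1, 1, 1, 1, 1]
print(vectorize("hello", vocab))   # [0, 0, 0, 0, 0, 0]
```

A sentence containing every vocabulary feature maps to an all-ones vector, as in the first two entries above, while a sentence sharing nothing with the vocabulary maps to the all-zeros vector.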
For the sample of the public opinion text, colloquial sentences of different classes do not correspond to identical vectors, which avoids the problem of sentences of different classes mapping to the same vector, and the positive and negative samples are clearly distinguishable in the obtained vectors. The vectors are therefore well suited to machine learning training and can achieve a good text classification effect.
It should be noted that, for simplicity of description, the method embodiments are expressed as a series of action combinations; however, those skilled in the art should understand that the embodiments of the present disclosure are not limited by the described action sequence, because according to the embodiments of the present disclosure, certain steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present disclosure.
Fig. 3 is a structural schematic diagram of the text vectorization device provided by an embodiment of the present disclosure. As shown in Fig. 3, the text vectorization device provided by the embodiment of the present disclosure includes:
an acquiring unit 301, configured to acquire the text in the user equipment, determine the application type of the text, and obtain the sample of the text;
a first extraction unit 302, configured to extract all single-character elements of the sample to obtain the single-character set of the sample;
a second extraction unit 303, configured to extract the double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample;
a combining unit 304, configured to merge the single-character set and the double-character set to obtain the vocabulary; and
a construction unit 305, configured to build the text vector of the text according to the vocabulary.
In an optional embodiment of the present disclosure, the first extraction unit 302 is further configured to:
count the occurrence frequency corresponding to each single-character element in the sample; and
remove the single-character elements with the maximum occurrence frequency and the single-character elements with the minimum occurrence frequency from the sample, to obtain the single-character set of the sample.
In an optional embodiment of the present disclosure, the second extraction unit 303 is specifically configured to:
in the case where the application type is binary classification, define the class with the smaller number of double characters in the sample as the positive sample; and
extract all double-character elements in the positive sample to obtain the double-character set of the sample.
In an optional embodiment of the present disclosure, the second extraction unit 303 is further configured to:
count the occurrence frequency corresponding to each double-character element; and
remove the double-character elements with the maximum occurrence frequency and the double-character elements with the minimum occurrence frequency from the positive sample, to obtain the double-character set of the sample.
In an optional embodiment of the present disclosure, the second extraction unit 303 is further configured to:
in the case where the application type is multivariate classification, extract all double-character elements of each class in the sample respectively, to obtain the double-character sets of the sample.
In an optional embodiment of the present disclosure, the second extraction unit 303 is further configured to:
count the occurrence frequency corresponding to each double-character element of each class in the sample;
remove the double-character elements with the maximum occurrence frequency and the double-character elements with the minimum occurrence frequency from each class of the sample, to obtain the double-character set of each class of the sample; and
merge the double-character sets of the classes of the sample, to obtain the double-character set of the sample.
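For the multivariate-classification branch just described (per-class extraction, per-class frequency pruning, then a union of the per-class sets), a minimal sketch follows; the class labels and sentences are invented for illustration:

```python
from collections import Counter

def build_multiclass_double_char_set(samples_by_class):
    """For each class: extract all double-character elements, remove the
    maximum- and minimum-frequency ones, then merge the per-class sets."""
    merged = {}
    for sentences in samples_by_class.values():
        counts = Counter(
            s[i:i + 2] for s in sentences for i in range(len(s) - 1)
        )
        max_f, min_f = max(counts.values()), min(counts.values())
        # Keep only the pairs strictly between the frequency extremes,
        # then fold this class's surviving pairs into the merged set.
        merged.update(
            {pair: f for pair, f in counts.items() if min_f < f < max_f}
        )
    return merged

samples = {"pos": ["ababc", "ababd"], "neg": ["efefg", "efefh"]}
print(build_multiclass_double_char_set(samples))  # {'ba': 2, 'fe': 2}
```

Each class contributes its own surviving pairs, and the union forms the double-character set of the whole sample.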
In an optional embodiment of the present disclosure, the second extraction unit 303 is further configured to:
in the case where the application type is text clustering, extract all double-character elements in the sample to obtain the double-character set of the sample.
In an optional embodiment of the present disclosure, the second extraction unit 303 is further configured to:
count the occurrence frequency corresponding to each double-character element; and
remove the double-character elements with the maximum occurrence frequency and the double-character elements with the minimum occurrence frequency from the sample, to obtain the double-character set of the sample.
It should be noted that further details of the text vectorization device provided by the present disclosure have been described in detail in the text vectorization method provided by the present disclosure, and are not repeated herein.
It should be noted that the components of the system of the present disclosure are logically partitioned according to the functions to be realized; however, the present disclosure is not limited thereto, and the components may be repartitioned or combined as needed. For example, several components may be combined into a single component, or some components may be further divided into more subcomponents.
The component embodiments of the present disclosure may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to realize some or all of the functions of some or all of the components of the system according to the embodiments of the present disclosure. The present disclosure may also be implemented as a device or program (for example, a computer program and a computer program product) for executing part or all of the method described herein. Such a program implementing the present disclosure may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present disclosure, and those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs between parentheses shall not be construed as limiting the claims. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present disclosure may be implemented by means of hardware comprising several different elements and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any ordering; these words may be interpreted as names.
The above embodiments are only intended to illustrate the present disclosure, not to limit it. Those of ordinary skill in the relevant technical field may also make various changes and modifications without departing from the spirit and scope of the present disclosure; therefore, all equivalent technical solutions also belong to the scope of the present disclosure, and the scope of patent protection of the present disclosure shall be defined by the claims.

Claims (16)

1. A text vectorization method, characterized in that the method comprises:
obtaining a text to be processed, determining an application type of the text, and obtaining a sample of the text;
extracting all single-character elements of the sample to obtain a single-character set of the sample;
extracting double-character elements of the sample according to the application type of the sample to obtain a double-character set of the sample;
merging the single-character set and the double-character set to obtain a vocabulary; and
building a text vector of the text according to the vocabulary.
2. The text vectorization method according to claim 1, characterized in that, after extracting all single-character elements of the sample, the method further comprises:
counting the occurrence frequency corresponding to each single-character element in the sample; and
removing the single-character elements with the maximum occurrence frequency and the single-character elements with the minimum occurrence frequency from the sample, to obtain the single-character set of the sample.
3. The text vectorization method according to claim 1, characterized in that extracting the double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample comprises:
in the case where the application type is binary classification, defining the class with the smaller number of double characters in the sample as the positive sample; and
extracting all double-character elements in the positive sample to obtain the double-character set of the sample.
4. The text vectorization method according to claim 3, characterized in that, after extracting all double-character elements in the positive sample, the method further comprises:
counting the occurrence frequency corresponding to each double-character element; and
removing the double-character elements with the maximum occurrence frequency and the double-character elements with the minimum occurrence frequency from the positive sample, to obtain the double-character set of the sample.
5. The text vectorization method according to claim 1, characterized in that extracting the double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample further comprises:
in the case where the application type is multivariate classification, extracting all double-character elements of each class in the sample respectively, to obtain the double-character sets of the sample.
6. The text vectorization method according to claim 5, characterized in that, after extracting all double-character elements of each class in the sample respectively, the method further comprises:
counting the occurrence frequency corresponding to each double-character element of each class in the sample;
removing the double-character elements with the maximum occurrence frequency and the double-character elements with the minimum occurrence frequency from each class of the sample, to obtain the double-character set of each class of the sample; and
merging the double-character sets of the classes of the sample, to obtain the double-character set of the sample.
7. The text vectorization method according to claim 1, characterized in that extracting the double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample further comprises:
in the case where the application type is text clustering, extracting all double-character elements in the sample to obtain the double-character set of the sample.
8. The text vectorization method according to claim 7, characterized in that, after extracting all double-character elements in the sample, the method further comprises:
counting the occurrence frequency corresponding to each double-character element; and
removing the double-character elements with the maximum occurrence frequency and the double-character elements with the minimum occurrence frequency from the sample, to obtain the double-character set of the sample.
9. A text vectorization device, characterized in that the device comprises:
an acquiring unit, configured to acquire the text in the user equipment, determine the application type of the text, and obtain a sample of the text;
a first extraction unit, configured to extract all single-character elements of the sample to obtain a single-character set of the sample;
a second extraction unit, configured to extract double-character elements of the sample according to the application type of the sample to obtain a double-character set of the sample;
a combining unit, configured to merge the single-character set and the double-character set to obtain a vocabulary; and
a construction unit, configured to build a text vector of the text according to the vocabulary.
10. The text vectorization device according to claim 9, characterized in that the first extraction unit is further configured to:
count the occurrence frequency corresponding to each single-character element in the sample; and
remove the single-character elements with the maximum occurrence frequency and the single-character elements with the minimum occurrence frequency from the sample, to obtain the single-character set of the sample.
11. The text vectorization device according to claim 9, characterized in that the second extraction unit is specifically configured to:
in the case where the application type is binary classification, define the class with the smaller number of double characters in the sample as the positive sample; and
extract all double-character elements in the positive sample to obtain the double-character set of the sample.
12. The text vectorization device according to claim 11, characterized in that the second extraction unit is further configured to:
count the occurrence frequency corresponding to each double-character element; and
remove the double-character elements with the maximum occurrence frequency and the double-character elements with the minimum occurrence frequency from the positive sample, to obtain the double-character set of the sample.
13. The text vectorization device according to claim 9, characterized in that the second extraction unit is further configured to:
in the case where the application type is multivariate classification, extract all double-character elements of each class in the sample respectively, to obtain the double-character sets of the sample.
14. The text vectorization device according to claim 13, characterized in that the second extraction unit is further configured to:
count the occurrence frequency corresponding to each double-character element of each class in the sample;
remove the double-character elements with the maximum occurrence frequency and the double-character elements with the minimum occurrence frequency from each class of the sample, to obtain the double-character set of each class of the sample; and
merge the double-character sets of the classes of the sample, to obtain the double-character set of the sample.
15. The text vectorization device according to claim 9, characterized in that the second extraction unit is further configured to:
in the case where the application type is text clustering, extract all double-character elements in the sample to obtain the double-character set of the sample.
16. The text vectorization device according to claim 15, characterized in that the second extraction unit is further configured to:
count the occurrence frequency corresponding to each double-character element; and
remove the double-character elements with the maximum occurrence frequency and the double-character elements with the minimum occurrence frequency from the sample, to obtain the double-character set of the sample.
CN201710134611.9A 2017-03-08 2017-03-08 A kind of the vectorization method and device of text Pending CN108572961A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710134611.9A CN108572961A (en) 2017-03-08 2017-03-08 A kind of the vectorization method and device of text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710134611.9A CN108572961A (en) 2017-03-08 2017-03-08 A kind of the vectorization method and device of text

Publications (1)

Publication Number Publication Date
CN108572961A true CN108572961A (en) 2018-09-25

Family

ID=63576883

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710134611.9A Pending CN108572961A (en) 2017-03-08 2017-03-08 A kind of the vectorization method and device of text

Country Status (1)

Country Link
CN (1) CN108572961A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1158460A (en) * 1996-12-31 1997-09-03 复旦大学 Multiple languages automatic classifying and searching method
CN1558367A (en) * 2004-01-16 2004-12-29 清华大学 Feature dimension reduction method for automatic classification of Chinese text
CN103020167A (en) * 2012-11-26 2013-04-03 南京大学 Chinese text classification method for computer
CN103186845A (en) * 2011-12-29 2013-07-03 盈世信息科技(北京)有限公司 Junk mail filtering method
CN103544246A (en) * 2013-10-10 2014-01-29 清华大学 Method and system for constructing multi-emotion dictionary for internet
CN106294350A (en) * 2015-05-13 2017-01-04 阿里巴巴集团控股有限公司 A kind of text polymerization and device


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472241A (en) * 2019-07-29 2019-11-19 平安科技(深圳)有限公司 Generate the method and relevant device of de-redundancy information sentence vector
CN110472241B (en) * 2019-07-29 2023-11-10 平安科技(深圳)有限公司 Method for generating redundancy-removed information sentence vector and related equipment
CN110705260A (en) * 2019-09-24 2020-01-17 北京工商大学 Text vector generation method based on unsupervised graph neural network structure
CN110705260B (en) * 2019-09-24 2023-04-18 北京工商大学 Text vector generation method based on unsupervised graph neural network structure

Similar Documents

Publication Publication Date Title
CN108804512B (en) Text classification model generation device and method and computer readable storage medium
CN107609121A (en) Newsletter archive sorting technique based on LDA and word2vec algorithms
CN111310476B (en) Public opinion monitoring method and system using aspect-based emotion analysis method
CN109460551B (en) Signature information extraction method and device
CN111680145A (en) Knowledge representation learning method, device, equipment and storage medium
CN108009148A (en) Text emotion classification method for expressing based on deep learning
CN105912716A (en) Short text classification method and apparatus
CN110321553A (en) Short text subject identifying method, device and computer readable storage medium
US9785705B1 (en) Generating and applying data extraction templates
CN110929025A (en) Junk text recognition method and device, computing equipment and readable storage medium
CN112330455A (en) Method, device, equipment and storage medium for pushing information
US20220269354A1 (en) Artificial intelligence-based system and method for dynamically predicting and suggesting emojis for messages
CN107145516A (en) A kind of Text Clustering Method and system
CN111339295A (en) Method, apparatus, electronic device and computer readable medium for presenting information
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
Sheshikala et al. Natural language processing and machine learning classifier used for detecting the author of the sentence
CN110232127A (en) File classification method and device
CN113051480A (en) Resource pushing method and device, electronic equipment and storage medium
CN110990587B (en) Enterprise relation discovery method and system based on topic model
CN113051380A (en) Information generation method and device, electronic equipment and storage medium
Mani et al. Email spam detection using gated recurrent neural network
Al Mostakim et al. Bangla content categorization using text based supervised learning methods
CN108572961A (en) A kind of the vectorization method and device of text
CN110019821A (en) Text category training method and recognition methods, relevant apparatus and storage medium
CN110704611B (en) Illegal text recognition method and device based on feature de-interleaving

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180925