CN108572961A - Text vectorization method and device - Google Patents
- Publication number: CN108572961A (application CN201710134611.9A)
- Authority: CN (China)
- Prior art keywords: sample, double, text, double character, occurrence frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
Abstract
The present invention discloses a text vectorization method and device, relating to the field of text vectorization. The method includes: obtaining a text to be processed, determining the application type of the text, and obtaining samples of the text; extracting all single-character elements of the samples to obtain the single-character set of the samples; extracting double-character elements of the samples according to the application type of the samples to obtain the double-character set of the samples; merging the single-character set and the double-character set to obtain a vocabulary; and constructing the text vector of the text according to the vocabulary. The invention removes Chinese word segmentation, avoiding the errors that segmentation introduces for colloquial sentences such as public-opinion text and the cascading errors that follow, and provides good tolerance of typos in such colloquial sentences.
Description
Technical field
The present invention relates to the field of text vectorization, and in particular to a text vectorization method and device.
Background
For various machine learning algorithms, the input is a vector and the output may be a continuous or a discrete value. Text classification and clustering are important applications of machine learning, and text vectorization is the first step of either; it directly determines the quality of the final machine learning result.
Existing text vectorization techniques are as follows:
TF-IDF (term frequency-inverse document frequency) is a common weighting technique in information retrieval and data mining. The dimension of the sentence vector equals the size of the vocabulary, and the value of each dimension is the weight of the corresponding word as computed by the TF-IDF method. TF is the number of times a word occurs in a single sentence; since sentence lengths differ, it must be normalized. IDF is the inverse document frequency, computed as log(total number of sentences in the corpus / (number of sentences containing the word + 1)). The TF-IDF value of each sentence dimension is then TF*IDF.
Word2Vec mainly uses two techniques: the Continuous Bag-of-Words Model (CBOW) and the Continuous Skip-gram Model. CBOW predicts the probability of the current word from its context, while Skip-gram predicts the context from the current word; the core of both is a neural-network algorithm. The resulting word vectors are relatively low-dimensional (100-400 dimensions is typical), and word similarity can easily be computed from vector angles. However, it is difficult to obtain a vector characterizing sentence semantics from word vectors alone, hence Doc2Vec, which converts a sentence directly into a vector. Apart from adding an extra sentence vector, Doc2Vec differs little from Word2Vec.
However, the prior art has the following problems:
1. The prior art is largely based on Chinese word segmentation. Segmentation works well on formally written sentences but performs poorly on colloquial sentences such as public-opinion text, introducing a considerable number of errors. Because of error cascading, these errors significantly affect the final machine learning results. In addition, reliance on Chinese word segmentation gives poor tolerance of typos in colloquial sentences such as public-opinion text.
2. The vector dimension ultimately produced by TF-IDF-style methods is high, ranging from tens of thousands to hundreds of thousands, which consumes considerable resources and lengthens training; dimensionality reduction, if applied, loses part of the information. Word2Vec and Doc2Vec produce low-dimensional vectors, but they require a large supporting corpus and their training consumes time and resources.
3. TF-IDF-style vectorization is semantically lossy: the mapping from sentences to vectors is many-to-one, and since the colliding sentences may belong to different classes, the semantics cannot be restored from the vector. Word2Vec and Doc2Vec are based on statistics, so sentences with similar structure receive similar vectors, and again the semantics cannot be restored.
Summary of the invention
In view of the drawbacks of the prior art, the present invention provides a text vectorization method that removes Chinese word segmentation, avoiding the errors that segmentation introduces for colloquial sentences such as public-opinion text and the cascading errors that follow, and tolerating typos in such colloquial sentences well.
According to a first aspect of the present invention, a text vectorization method is proposed, the method including:
obtaining a text to be processed, determining the application type of the text, and obtaining samples of the text;
extracting all single-character elements of the samples to obtain the single-character set of the samples;
extracting double-character elements of the samples according to the application type of the samples to obtain the double-character set of the samples;
merging the single-character set and the double-character set to obtain a vocabulary;
constructing the text vector of the text according to the vocabulary.
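Under the assumption of set-based character extraction (the function names below are illustrative, not from the patent), the steps of the first aspect can be sketched as:

```python
def build_vocabulary(samples):
    """Steps of the first aspect: the set of all single characters plus the
    set of all adjacent character pairs, merged into one vocabulary.
    Frequency-based pruning (the optional claims) is omitted here."""
    singles = {ch for sentence in samples for ch in sentence}
    doubles = {sentence[i:i + 2] for sentence in samples
               for i in range(len(sentence) - 1)}
    return sorted(singles) + sorted(doubles)

def vectorize(sentence, vocabulary):
    """One dimension per vocabulary element: 1 if the element occurs in
    the sentence, 0 otherwise."""
    return [1 if element in sentence else 0 for element in vocabulary]

vocab = build_vocabulary(["abca", "abd"])  # -> 4 singles + 4 pairs
vec = vectorize("abz", vocab)
```

No segmentation step appears anywhere in the sketch, which is the point of the method.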
Optionally, after the extraction of all single-character elements of the samples, the method further includes:
counting the occurrence frequency of each single-character element in the samples;
removing the single-character elements with the highest occurrence frequency and the single-character elements with the lowest occurrence frequency in the samples, to obtain the single-character set of the samples.
Optionally, extracting the double-character elements of the samples according to the application type of the samples, and obtaining the double-character set of the samples, includes:
when the application type is binary classification, defining the class with the smaller number of double characters in the samples as the positive sample;
extracting all double-character elements of the positive sample to obtain the double-character set of the samples.
Optionally, after the extraction of all double-character elements of the positive sample, the method further includes:
counting the occurrence frequency of each double-character element;
removing the double-character elements with the highest occurrence frequency and the double-character elements with the lowest occurrence frequency in the positive sample, to obtain the double-character set of the samples.
Optionally, extracting the double-character elements of the samples according to the application type of the samples, and obtaining the double-character set of the samples, further includes:
when the application type is multiclass classification, extracting all double-character elements of each class in the samples separately to obtain the double-character set of the samples.
Optionally, after the separate extraction of all double-character elements of each class in the samples, the method further includes:
counting the occurrence frequency of each double-character element of each class in the samples;
removing, for each class, the double-character elements with the highest and the lowest occurrence frequency, to obtain the double-character set of each class;
merging the double-character sets of the classes of the samples to obtain the double-character set of the samples.
Optionally, extracting the double-character elements of the samples according to the application type of the samples, and obtaining the double-character set of the samples, further includes:
when the application type is text clustering, extracting all double-character elements in the samples to obtain the double-character set of the samples.
Optionally, after the extraction of all double-character elements in the samples, the method further includes:
counting the occurrence frequency of each double-character element;
removing the double-character elements with the highest and the lowest occurrence frequency in the samples, to obtain the double-character set of the samples.
According to a second aspect of the present invention, a text vectorization device is proposed, the device including:
an acquiring unit, for obtaining the text in the user equipment, determining the application type of the text, and obtaining the samples of the text;
a first extraction unit, for extracting all single-character elements of the samples to obtain the single-character set of the samples;
a second extraction unit, for extracting double-character elements of the samples according to the application type of the samples to obtain the double-character set of the samples;
a merging unit, for merging the single-character set and the double-character set to obtain a vocabulary;
a construction unit, for constructing the text vector of the text according to the vocabulary.
Optionally, the first extraction unit is further configured to:
count the occurrence frequency of each single-character element in the samples;
remove the single-character elements with the highest and the lowest occurrence frequency in the samples, to obtain the single-character set of the samples.
Optionally, the second extraction unit is specifically configured to:
when the application type is binary classification, define the class with the smaller number of double characters in the samples as the positive sample;
extract all double-character elements of the positive sample to obtain the double-character set of the samples.
Optionally, the second extraction unit is further configured to:
count the occurrence frequency of each double-character element;
remove the double-character elements with the highest and the lowest occurrence frequency in the positive sample, to obtain the double-character set of the samples.
Optionally, the second extraction unit is further configured to:
when the application type is multiclass classification, extract all double-character elements of each class in the samples separately to obtain the double-character set of the samples.
Optionally, the second extraction unit is further configured to:
count the occurrence frequency of each double-character element of each class in the samples;
remove, for each class, the double-character elements with the highest and the lowest occurrence frequency, to obtain the double-character set of each class;
merge the double-character sets of the classes of the samples to obtain the double-character set of the samples.
Optionally, the second extraction unit is further configured to:
when the application type is text clustering, extract all double-character elements in the samples to obtain the double-character set of the samples.
Optionally, the second extraction unit is further configured to:
count the occurrence frequency of each double-character element;
remove the double-character elements with the highest and the lowest occurrence frequency in the samples, to obtain the double-character set of the samples.
Through the above technical solution, the public-opinion text to be processed is obtained, its application type is determined, and its samples are obtained; all single-character elements of the samples are extracted to form the single-character set; double-character elements are extracted according to the application type of the samples to form the double-character set; the single-character set and the double-character set are merged into a vocabulary; and finally the text vector of the public-opinion text is constructed from the vocabulary. This avoids the errors that Chinese word segmentation introduces for colloquial sentences such as public-opinion text and the cascading errors that follow, and tolerates typos in such colloquial sentences well.
Brief description of the drawings
In order to explain the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings in the following description show only some embodiments of the disclosure; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the text vectorization method provided by an embodiment of the disclosure;
Fig. 2 is a flowchart of the text vectorization method provided by an embodiment of the disclosure;
Fig. 3 is a schematic structural diagram of the text vectorization device provided by an embodiment of the disclosure.
Detailed description
The technical solutions in the embodiments of the present disclosure are described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the disclosure. Based on the embodiments of the disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the disclosure.
Some of the terms used in the embodiments of the present disclosure are explained below.
The user equipment (User Equipment, UE) referred to in the embodiments of the present disclosure is a device such as a mobile terminal or a personal computer (Personal Computer, PC): for example, a smartphone, a personal digital assistant (PDA), a tablet computer, a laptop, a carputer, a handheld device, smart glasses, a smartwatch, a wearable device, or a virtual or augmented display device (such as Google Glass, Oculus Rift, HoloLens, or Gear VR).
Fig. 1 is a flowchart of the text vectorization method provided by an embodiment of the disclosure. As shown in Fig. 1, the method includes:
In step S101, the text to be processed is obtained, the application type of the text is determined, and the samples of the text are obtained.
The text to be processed is obtained from the user equipment, which may be a mobile terminal, a PC, or a similar device. The text here is public-opinion text, that is, colloquial sentences expressing public opinion, and there are multiple such sentences. Public opinion refers to the sum of the beliefs, attitudes, opinions, and emotions expressed by the broader public about various phenomena and issues in society. In a specific application, the application type of the public-opinion text is determined by manual labeling, and the samples of the public-opinion text are obtained.
Then, in step S102, all single-character elements of the samples are extracted to obtain the single-character set of the samples.
Specifically, the samples of the public-opinion text are traversed and all single-character elements of the colloquial sentences, including punctuation, are kept; the size of the single-character set of the samples is normally below 3000.
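Combined with the optional frequency pruning described above, step S102 can be sketched as follows. Dropping every character tied at the maximum or minimum frequency is an assumption of this sketch; the patent does not specify how ties are handled:

```python
from collections import Counter

def single_char_set(samples):
    """Extract every single character (punctuation included), then drop the
    characters tied for the highest and the lowest occurrence frequency.
    Tie handling is an assumption of this sketch."""
    freq = Counter(ch for sentence in samples for ch in sentence)
    hi, lo = max(freq.values()), min(freq.values())
    return {ch for ch, f in freq.items() if f not in (hi, lo)}

chars = single_char_set(["aabbc", "abbcd", "bxb"])  # drops b (6x), d and x (1x)
```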
Next, in step S103, the double-character elements of the samples are extracted according to the application type of the samples, to obtain the double-character set of the samples.
After the single-character set of the samples has been obtained, the order between characters must also be taken into account so that sentences can be restored from the text vectors. Different application types use different ways of capturing this semantic order.
Specifically, this step includes: when the application type is binary classification, defining the class with the smaller number of double characters in the samples as the positive sample, and extracting all double-character elements of the positive sample to obtain the double-character set of the samples; or, when the application type is multiclass classification, extracting all double-character elements of each class in the samples separately to obtain the double-character set of the samples; or, when the application type is text clustering, extracting all double-character elements in the samples to obtain the double-character set of the samples. For binary classification, the samples are divided into positive and negative samples in advance by manual labeling, and vectorization then only processes the positive sample; for multiclass classification, each class must be processed separately.
Binary classification means that the samples of the public-opinion text are roughly divided into two classes by manual labeling, for classifying public-opinion text. For example, the samples can be divided into two classes according to whether a colloquial sentence expresses positive or negative emotion: sentences expressing positive emotion form the first class, and sentences expressing negative emotion form the second class. Multiclass classification means that the samples of the public-opinion text are subdivided into multiple classes by manual labeling, for classifying public-opinion text. For example, the samples can be subdivided into five classes according to the categories news, complaints, bugs, consultations, and suggestions, or into ten classes according to the categories promotional harassment, charging before boarding, inability to return a ticket, driver cheating, information theft, duplicate payment, refusal to pay the fare, dangerous driving, charging for a canceled trip, and software crash. Text clustering means that all double-character elements are extracted directly from the samples of the public-opinion text, for clustering public-opinion text. Here, the application type refers to the type of machine learning subsequently applied to the text after vectorization is complete: binary classification means machine learning finally divides the texts to be processed into two classes, multiclass classification means it divides them into multiple different classes, and text clustering means the texts are clustered. Different machine learning modes call for different text vectorization methods. In addition, for binary and multiclass classification, a preliminary pre-classification is performed by manual labeling before vectorization and machine learning; pre-classifying the samples lets machine learning learn this classification mode and then classify new texts accordingly.
Preferably, after all double-character elements of the positive sample have been extracted, the method further includes: counting the occurrence frequency of each double-character element; and removing the double-character elements with the highest and the lowest occurrence frequency in the positive sample, to obtain the double-character set of the samples. The most frequent double-character elements can be considered unnecessary for semantic restoration, while the least frequent ones are not characteristic double-character elements of the positive sample and discriminate poorly. Both can therefore be removed, and the double-character elements remaining in the positive sample form the double-character set of the samples. In a specific binary-classification application, the dimension (the number of double-character elements in the double-character set) is generally kept within 1500. The resulting text vector is thus well suited as machine learning input, the dimension of the generated text vector is controlled, and a good classification result is reached.
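The binary-classification branch can be sketched as follows. This follows the claim wording and drops both frequency extremes (the worked embodiment later in the text drops only the least frequent pairs); dropping every pair tied at an extreme is an assumption of this sketch:

```python
from collections import Counter

def positive_bigram_set(positive_samples):
    """Count every adjacent character pair of the positive sample, then
    drop the pairs tied for the highest and the lowest occurrence
    frequency, per the claim wording above."""
    freq = Counter(s[i:i + 2] for s in positive_samples
                   for i in range(len(s) - 1))
    hi, lo = max(freq.values()), min(freq.values())
    return {bg for bg, f in freq.items() if f not in (hi, lo)}

bigrams = positive_bigram_set(["ababa", "abcab"])  # drops ab (4x), bc/ca (1x)
```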
Preferably, after all double-character elements of each class in the samples have been extracted separately, the method further includes: counting the occurrence frequency of each double-character element of each class; removing, for each class, the double-character elements with the highest and the lowest occurrence frequency, to obtain the double-character set of each class; and merging the per-class double-character sets to obtain the double-character set of the samples. For multiclass classification, occurrence frequencies must first be counted per class; as with binary classification, removing the most and least frequent double-character elements of each class yields the characteristic double-character elements of that class, which discriminate well between classes. The union of the per-class double-character sets then gives the final double-character set of the samples. Keeping the dimension in the range 1000-7000 generally gives a good classification effect. In this way, the key double-character elements of each class each serve as a dimension of the vector; the single-character set is shared by all classes, while the double-character sets differ per class, highlighting class-specific dimensions. This helps improve classification accuracy for public-opinion text while keeping the dimension of the generated text vector under control, reaching a good classification result.
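The multiclass branch — per-class pruning followed by a union — can be sketched as follows, under the same tie-handling assumption as above:

```python
from collections import Counter

def multiclass_bigram_set(samples_by_class):
    """For each class, count its adjacent character pairs and drop the
    pairs tied for the highest and the lowest frequency; then take the
    union of the per-class sets, as described above."""
    merged = set()
    for samples in samples_by_class.values():
        freq = Counter(s[i:i + 2] for s in samples
                       for i in range(len(s) - 1))
        hi, lo = max(freq.values()), min(freq.values())
        merged |= {bg for bg, f in freq.items() if f not in (hi, lo)}
    return merged

bigrams = multiclass_bigram_set({"A": ["ababa", "abcab"],
                                 "B": ["xyxyx", "xyzxy"]})
```

Because pruning happens per class, a pair that is uninformative in one class can still survive through another class, which is what gives each class its own characteristic dimensions.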
Preferably, after all double-character elements in the samples have been extracted, the method further includes: counting the occurrence frequency of each double-character element; and removing the double-character elements with the highest and the lowest occurrence frequency in the samples, to obtain the double-character set of the samples. For text clustering, occurrence frequencies are counted over the samples of the public-opinion text, and after removing the most and least frequent double-character elements, the double-character set of the samples is obtained; keeping the dimension between 1000 and 7000 achieves the expected clustering effect. The resulting text vector is thus well suited as machine learning input, the dimension of the generated text vector is controlled, and a good clustering result is reached.
Then, in step S104, the single-character set and the double-character set are merged to obtain the vocabulary.
Finally, in step S105, the text vector of the text is constructed according to the vocabulary.
Each single-character element and each double-character element serves as one dimension of the vector corresponding to each colloquial sentence in the public-opinion text: the dimension is 1 if the element occurs in the sentence and 0 otherwise. A vector corresponding to each colloquial sentence in the public-opinion text is thus constructed, and its dimension is the number of single-character elements plus the number of double-character elements. For binary classification the vector dimension stays below 4500, and for multiclass classification and text clustering it is also kept within 10000, making the vector well suited as machine learning input.
In this embodiment, the public-opinion text to be processed is obtained, its application type is determined, and its samples are obtained; all single-character elements of the samples are extracted to form the single-character set; double-character elements are extracted according to the application type of the samples to form the double-character set; the two sets are then merged into a vocabulary; and finally the text vector of the public-opinion text is constructed from the vocabulary. This avoids the errors that Chinese word segmentation introduces for colloquial sentences such as public-opinion text and the cascading errors that follow, and tolerates typos in such colloquial sentences well.
Fig. 2 is a flowchart of the text vectorization method provided by an embodiment of the disclosure. As shown in Fig. 2, the method includes:
In step S201, the public-opinion text in the user equipment is obtained, the application type of the public-opinion text is determined, and the samples of the public-opinion text are obtained.
For example, the samples of the public-opinion text are as follows (the sentences are English glosses of the original Chinese):
{"the driver of drop-drop is fine": 1, "the attitude of the driver is fine": 1, "the driver of drop-drop refuses to take passengers": 0, "this orange tastes fine": 0, "come get the certificate soon": 0}.
These samples contain 5 colloquial sentences expressing public opinion. The application type, determined by manual labeling, is binary classification: sentences showing a good drop-drop driver attitude are labeled 1, while sentences showing a bad drop-drop driver attitude and all other sentences are labeled 0.
Then, in step S202, all single-character elements of the samples are extracted, the occurrence frequency of each single-character element in the samples is counted, and the single-character elements with the highest and the lowest occurrence frequency are removed, to obtain the single-character set of the samples.
The samples of the public-opinion text are traversed and all single-character elements of the colloquial sentences, including punctuation, are kept; the size of the single-character set is normally below 3000. After removing the rarest characters (those whose influence factor is below one ten-thousandth) and the most frequent characters (which discriminate poorly), the single-character dimension obtained is generally within 1500.
For the example in step S201, all single-character elements are extracted and the occurrence frequency of each single-character element is counted (each key glosses one Chinese character), giving:
{"drop": 4, "": 4, "department": 3, "machine": 3, "very": 3, "good": 3, "clothes": 1, "business": 1, "state": 1, "degree": 1, "refuse": 1, "load": 1, "this": 1, "a": 1, "orange": 1, "son": 1, "eat": 1, "fast": 1, "come": 1, "lead": 1, "certificate": 1, "": 1}.
After removing the single-character elements with the highest and the lowest occurrence frequency in the samples, the result is:
{"department": 3, "machine": 3, "very": 3, "good": 3}.
The dimension of the single-character set is 4, and the features of the dimensions are "department", "machine", "very", and "good" respectively.
Next, in step S203, the double-character elements of the samples are extracted according to the application type of the samples, to obtain the double-character set of the samples.
Since the application type in the example of step S201 is binary classification, double-character elements need only be extracted from the positive sample, which has the smaller number of double characters, to obtain the double-character set of the samples of the public-opinion text. All double-character elements of the positive sample are extracted and the occurrence frequency of each is counted (each key glosses a pair of adjacent Chinese characters), with the following result:
{"drop drop": 1, "drop of": 1, "of department": 1, "driver": 2, "machine very": 1, "fine": 2, "machine clothes": 1, "service": 1, "business state": 1, "attitude": 1, "degree very": 1}.
After removing the double-character elements with the lowest occurrence frequency, the result is:
{"driver": 2, "fine": 2}.
The dimension of the double-character set is 2, and the features of the dimensions are "driver" and "fine".
Then, in step S204, the single-character set and the double-character set are merged to obtain the vocabulary.
For example, merging the single-character set {"department", "machine", "very", "good"} obtained in step S202 with the double-character set {"driver", "fine"} obtained in step S203 gives the vocabulary {"department", "machine", "very", "good", "driver", "fine"}. The dimension of the vocabulary is 6, and the features of the dimensions are "department", "machine", "very", "good", "driver", and "fine" respectively.
Finally, in step S205, the text vector of the public-opinion text is built according to the vocabulary.
For the example in step S201, the vector corresponding to each colloquial sentence in the public-opinion text can be obtained from the vocabulary of step S204; the vectors, each mapped to its class label, are respectively:
{[1,1,1,1,1,1]: 1, [1,1,1,1,1,1]: 1, [1,1,0,0,1,0]: 0, [0,0,1,1,0,1]: 0, [0,0,0,0,0,0]: 0}.
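Step S205 can be sketched as a binary presence vector over the vocabulary; substring containment covers both single characters and adjacent character pairs. The sentences below are the same hypothetical stand-ins as in the earlier sketches, not the patent's actual data.

```python
def vectorize(sentence, vocab):
    # 1 if the feature (single character or adjacent character pair)
    # occurs in the sentence, 0 otherwise.
    return [1 if feature in sentence else 0 for feature in vocab]

vocab = ['a', 'b', 'c', 'd', 'ab', 'cd']
vectors = [vectorize(s, vocab) for s in ["abcd", "ab", "cd", "xyz"]]
# [[1,1,1,1,1,1], [1,1,0,0,1,0], [0,0,1,1,0,1], [0,0,0,0,0,0]]
```

The four resulting vectors reproduce the distinct vector patterns of the worked example above.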
For the public-opinion text sample, colloquial sentences of different classes do not correspond to identical vectors, which avoids the problem of different classes sharing the same vector, and the resulting vectors discriminate clearly between positive and negative samples. The sample is therefore well suited to machine-learning training and can achieve a good text-classification effect.
As to the method embodiments, for simplicity of description they are all expressed as a series of combined actions, but those skilled in the art should understand that the embodiments of the present disclosure are not limited by the described order of actions, because according to the embodiments of the present disclosure certain steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present disclosure.
Fig. 3 is a structural schematic diagram of the text vectorization device provided by an embodiment of the present disclosure. As shown in Fig. 3, the text vectorization device provided by the embodiment of the present disclosure includes:
an acquiring unit 301, configured to obtain the text on user equipment, determine the application type of the text, and obtain the sample of the text;
a first extraction unit 302, configured to extract all single-character elements of the sample to obtain the single-character set of the sample;
a second extraction unit 303, configured to extract the double-character elements of the sample according to the application type of the sample to obtain the double-character set of the sample;
a merging unit 304, configured to merge the single-character set and the double-character set to obtain the vocabulary;
a construction unit 305, configured to build the text vector of the text according to the vocabulary.
In an optional embodiment of the present disclosure, the first extraction unit 302 is further configured to:
count the occurrence frequency of each single-character element in the sample respectively;
remove the single-character element with the highest occurrence frequency in the sample and the single-character element with the lowest occurrence frequency in the sample respectively, to obtain the single-character set of the sample.
In an optional embodiment of the present disclosure, the second extraction unit 303 is specifically configured to:
in the case that the application type is binary classification, define the class with the smaller number of double-character elements in the sample as the positive sample;
extract all double-character elements in the positive sample to obtain the double-character set of the sample.
In an optional embodiment of the present disclosure, the second extraction unit 303 is further configured to:
count the occurrence frequency of each double-character element respectively;
remove the double-character element with the highest occurrence frequency in the positive sample and the double-character element with the lowest occurrence frequency in the positive sample respectively, to obtain the double-character set of the sample.
In an optional embodiment of the present disclosure, the second extraction unit 303 is further configured to:
in the case that the application type is multivariate classification, extract all double-character elements of each class in the sample respectively, to obtain the double-character set of the sample.
In an optional embodiment of the present disclosure, the second extraction unit 303 is further configured to:
count the occurrence frequency of each double-character element of each class in the sample respectively;
remove the double-character element with the highest occurrence frequency and the double-character element with the lowest occurrence frequency in each class of the sample respectively, to obtain a double-character set of each class of the sample;
merge the double-character sets of all classes of the sample, to obtain the double-character set of the sample.
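The multivariate-classification branch described above can be sketched as follows; the class labels and sentences are hypothetical, and "removing the highest- and lowest-frequency elements" is assumed to mean dropping every element whose frequency equals a per-class extreme (falling back to the full per-class set when nothing lies strictly between).

```python
from collections import Counter

def per_class_bigram_sets(classes):
    # classes: hypothetical mapping from class label to its sentences.
    merged = {}
    for label, sentences in classes.items():
        # Extract all double-character elements of this class.
        counts = Counter(s[i:i + 2] for s in sentences
                         for i in range(len(s) - 1))
        # Remove the per-class highest- and lowest-frequency elements.
        hi, lo = max(counts.values()), min(counts.values())
        kept = {b: n for b, n in counts.items() if lo < n < hi}
        if not kept:
            kept = dict(counts)
        # Merge this class's double-character set into the sample's set.
        for b, n in kept.items():
            merged[b] = merged.get(b, 0) + n
    return merged
```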
In an optional embodiment of the present disclosure, the second extraction unit 303 is further configured to:
in the case that the application type is text clustering, extract all double-character elements in the sample to obtain the double-character set of the sample.
In an optional embodiment of the present disclosure, the second extraction unit 303 is further configured to:
count the occurrence frequency of each double-character element respectively;
remove the double-character element with the highest occurrence frequency in the sample and the double-character element with the lowest occurrence frequency in the sample respectively, to obtain the double-character set of the sample.
It should be noted that further details of the text vectorization device provided by the present disclosure are described in detail in the text vectorization method provided by the present disclosure and are not repeated here.
It should be noted that the components of the system of the present disclosure are logically divided according to the functions to be realized, but the present disclosure is not limited to this: the components may be re-divided or combined as needed. For example, several components may be combined into a single component, or some components may be further divided into more sub-components.
The component embodiments of the present disclosure may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that in practice a microprocessor or a digital signal processor (DSP) may be used to realize some or all of the functions of some or all of the components of the system according to the embodiments of the present disclosure. The present disclosure may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for executing part or all of the method described herein. Such a program realizing the present disclosure may be stored on a computer-readable medium, or may take the form of one or more signals. Such a signal may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present disclosure, and those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present disclosure may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any ordering; these words may be interpreted as names.
The above embodiments are only intended to illustrate the present disclosure and are not a limitation of it; those of ordinary skill in the relevant technical field can also make various changes and modifications without departing from the spirit and scope of the present disclosure. Therefore, all equivalent technical solutions also belong to the scope of the present disclosure, whose scope of patent protection shall be defined by the claims.
Claims (16)
1. A text vectorization method, characterized in that the method comprises:
obtaining a text to be processed, determining an application type of the text, and obtaining a sample of the text;
extracting all single-character elements of the sample to obtain a single-character set of the sample;
extracting double-character elements of the sample according to the application type of the sample to obtain a double-character set of the sample;
merging the single-character set and the double-character set to obtain a vocabulary;
building a text vector of the text according to the vocabulary.
2. The text vectorization method according to claim 1, characterized in that after the extracting all single-character elements of the sample, the method further comprises:
counting an occurrence frequency of each single-character element in the sample respectively;
removing the single-character element with the highest occurrence frequency in the sample and the single-character element with the lowest occurrence frequency in the sample respectively, to obtain the single-character set of the sample.
3. The text vectorization method according to claim 1, characterized in that the extracting double-character elements of the sample according to the application type of the sample to obtain a double-character set of the sample comprises:
in the case that the application type is binary classification, defining the class with the smaller number of double-character elements in the sample as a positive sample;
extracting all double-character elements in the positive sample to obtain the double-character set of the sample.
4. The text vectorization method according to claim 3, characterized in that after the extracting all double-character elements in the positive sample, the method further comprises:
counting an occurrence frequency of each double-character element respectively;
removing the double-character element with the highest occurrence frequency in the positive sample and the double-character element with the lowest occurrence frequency in the positive sample respectively, to obtain the double-character set of the sample.
5. The text vectorization method according to claim 1, characterized in that the extracting double-character elements of the sample according to the application type of the sample to obtain a double-character set of the sample further comprises:
in the case that the application type is multivariate classification, extracting all double-character elements of each class in the sample respectively, to obtain the double-character set of the sample.
6. The text vectorization method according to claim 5, characterized in that after the extracting all double-character elements of each class in the sample respectively, the method further comprises:
counting an occurrence frequency of each double-character element of each class in the sample respectively;
removing the double-character element with the highest occurrence frequency and the double-character element with the lowest occurrence frequency in each class of the sample respectively, to obtain a double-character set of each class of the sample;
merging the double-character sets of all classes of the sample, to obtain the double-character set of the sample.
7. The text vectorization method according to claim 1, characterized in that the extracting double-character elements of the sample according to the application type of the sample to obtain a double-character set of the sample further comprises:
in the case that the application type is text clustering, extracting all double-character elements in the sample to obtain the double-character set of the sample.
8. The text vectorization method according to claim 7, characterized in that after the extracting all double-character elements in the sample, the method further comprises:
counting an occurrence frequency of each double-character element respectively;
removing the double-character element with the highest occurrence frequency in the sample and the double-character element with the lowest occurrence frequency in the sample respectively, to obtain the double-character set of the sample.
9. A text vectorization device, characterized in that the device comprises:
an acquiring unit, configured to obtain a text on user equipment, determine an application type of the text, and obtain a sample of the text;
a first extraction unit, configured to extract all single-character elements of the sample to obtain a single-character set of the sample;
a second extraction unit, configured to extract double-character elements of the sample according to the application type of the sample to obtain a double-character set of the sample;
a merging unit, configured to merge the single-character set and the double-character set to obtain a vocabulary;
a construction unit, configured to build a text vector of the text according to the vocabulary.
10. The text vectorization device according to claim 9, characterized in that the first extraction unit is further configured to:
count an occurrence frequency of each single-character element in the sample respectively;
remove the single-character element with the highest occurrence frequency in the sample and the single-character element with the lowest occurrence frequency in the sample respectively, to obtain the single-character set of the sample.
11. The text vectorization device according to claim 9, characterized in that the second extraction unit is specifically configured to:
in the case that the application type is binary classification, define the class with the smaller number of double-character elements in the sample as a positive sample;
extract all double-character elements in the positive sample to obtain the double-character set of the sample.
12. The text vectorization device according to claim 11, characterized in that the second extraction unit is further configured to:
count an occurrence frequency of each double-character element respectively;
remove the double-character element with the highest occurrence frequency in the positive sample and the double-character element with the lowest occurrence frequency in the positive sample respectively, to obtain the double-character set of the sample.
13. The text vectorization device according to claim 9, characterized in that the second extraction unit is further configured to:
in the case that the application type is multivariate classification, extract all double-character elements of each class in the sample respectively, to obtain the double-character set of the sample.
14. The text vectorization device according to claim 13, characterized in that the second extraction unit is further configured to:
count an occurrence frequency of each double-character element of each class in the sample respectively;
remove the double-character element with the highest occurrence frequency and the double-character element with the lowest occurrence frequency in each class of the sample respectively, to obtain a double-character set of each class of the sample;
merge the double-character sets of all classes of the sample, to obtain the double-character set of the sample.
15. The text vectorization device according to claim 9, characterized in that the second extraction unit is further configured to:
in the case that the application type is text clustering, extract all double-character elements in the sample to obtain the double-character set of the sample.
16. The text vectorization device according to claim 15, characterized in that the second extraction unit is further configured to:
count an occurrence frequency of each double-character element respectively;
remove the double-character element with the highest occurrence frequency in the sample and the double-character element with the lowest occurrence frequency in the sample respectively, to obtain the double-character set of the sample.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710134611.9A CN108572961A (en) | 2017-03-08 | 2017-03-08 | A kind of the vectorization method and device of text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108572961A true CN108572961A (en) | 2018-09-25 |
Family
ID=63576883
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710134611.9A Pending CN108572961A (en) | 2017-03-08 | 2017-03-08 | A kind of the vectorization method and device of text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108572961A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1158460A (en) * | 1996-12-31 | 1997-09-03 | 复旦大学 | Multiple languages automatic classifying and searching method |
CN1558367A (en) * | 2004-01-16 | 2004-12-29 | 清华大学 | Feature dimension reduction method for automatic classification of Chinese text |
CN103020167A (en) * | 2012-11-26 | 2013-04-03 | 南京大学 | Chinese text classification method for computer |
CN103186845A (en) * | 2011-12-29 | 2013-07-03 | 盈世信息科技(北京)有限公司 | Junk mail filtering method |
CN103544246A (en) * | 2013-10-10 | 2014-01-29 | 清华大学 | Method and system for constructing multi-emotion dictionary for internet |
CN106294350A (en) * | 2015-05-13 | 2017-01-04 | 阿里巴巴集团控股有限公司 | A kind of text polymerization and device |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110472241A (en) * | 2019-07-29 | 2019-11-19 | 平安科技(深圳)有限公司 | Generate the method and relevant device of de-redundancy information sentence vector |
CN110472241B (en) * | 2019-07-29 | 2023-11-10 | 平安科技(深圳)有限公司 | Method for generating redundancy-removed information sentence vector and related equipment |
CN110705260A (en) * | 2019-09-24 | 2020-01-17 | 北京工商大学 | Text vector generation method based on unsupervised graph neural network structure |
CN110705260B (en) * | 2019-09-24 | 2023-04-18 | 北京工商大学 | Text vector generation method based on unsupervised graph neural network structure |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180925 |