Summary of the invention
One or more embodiments of this specification describe a method and an apparatus for determining text categories, so as to solve one or more of the above problems.
According to a first aspect, a method of determining text categories is provided, comprising: obtaining a first text to be processed; performing word segmentation on the first text to obtain at least one candidate word; determining a synthesized word vector corresponding to each candidate word, wherein the synthesized word vector of a first word among the at least one candidate word is obtained in the following manner: extracting at least one extension word of the first word based on attribute description information associated with the first word in a preset knowledge graph; obtaining, according to a pre-trained word vector model, an individual word vector corresponding to each of the first word and the extension words; and merging the individual word vectors to generate a synthesized word vector of a predefined size for the first word; inputting each synthesized word vector into a pre-trained prediction model, and determining the text category of the first text according to the output result of the prediction model.
In one embodiment, the attribute description information associated with the first word includes at least one of the following: a specific description word for the first word, a superordinate description word of the first word, and an associated word.
In one embodiment, the text categories include: risk text or non-risk text.
In a further embodiment, the prediction model is trained in the following manner:
obtaining multiple texts as samples, wherein each text corresponds to a group of synthesized word vectors determined based on the candidate words in that text, and to a pre-labeled text label, the text label being risk text or non-risk text;
inputting each group of synthesized word vectors into the selected model in turn, and adjusting the model parameters through training according to the corresponding text labels.
In one embodiment, the prediction model is one of a fully connected neural network, a decision tree, or a recurrent neural network.
According to a second aspect, a computer-implemented word vector generation method is provided, the method comprising: obtaining a first word; extracting at least one extension word of the first word based on attribute description information associated with the first word in a preset knowledge graph; obtaining, according to a pre-trained word vector model, an individual word vector corresponding to each of the first word and the extension words; and merging the individual word vectors to generate a synthesized word vector of a predefined size for the first word.
In one embodiment, the attribute description information associated with the first word includes at least one of the following: a specific description word for the first word, a superordinate description word of the first word, and an associated word.
In one embodiment, the word vector model is word2vec.
In one embodiment, obtaining, according to the pre-trained word vector model, an individual word vector for each of the first word and the extension words comprises:
obtaining a unique number string represented by the one-hot encoding of the first word or the extension word;
inputting the unique number string into the pre-trained word vector model, and determining the individual word vector corresponding to the first word or the extension word according to the output result of the word vector model.
In one embodiment, the individual word vectors are merged by at least one of the following:
performing max pooling on a matrix formed by arranging the individual word vectors;
averaging the elements of the same dimension across the individual word vectors;
superimposing (summing) the individual word vectors.
According to a third aspect, an apparatus for determining text categories is provided, comprising:
a receiving unit, configured to obtain a first text to be processed;
a preprocessing unit, configured to perform word segmentation on the first text to obtain at least one candidate word;
a determination unit, configured to determine a synthesized word vector corresponding to each candidate word, wherein the at least one candidate word includes a first word, and the synthesized word vector of the first word is obtained in the following manner: extracting at least one extension word of the first word based on attribute description information associated with the first word in a preset knowledge graph; obtaining, according to a pre-trained word vector model, an individual word vector corresponding to each of the first word and the extension words; and merging the individual word vectors to generate a synthesized word vector of a predefined size for the first word;
a prediction unit, configured to input each synthesized word vector into a pre-trained prediction model, and determine the text category of the first text according to the output result of the prediction model.
According to a fourth aspect, a word vector generating apparatus is provided, the apparatus comprising:
an acquiring unit, configured to obtain a first word;
an expansion unit, configured to extract at least one extension word of the first word based on attribute description information associated with the first word in a preset knowledge graph;
a word processing unit, configured to obtain, according to a pre-trained word vector model, an individual word vector corresponding to each of the first word and the extension words;
a merging unit, configured to merge the individual word vectors to generate a synthesized word vector of a predefined size for the first word.
According to a fifth aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method of the first aspect or the second aspect.
According to a sixth aspect, a computing device is provided, comprising a memory and a processor, wherein executable code is stored in the memory, and when the processor executes the executable code, the method of the first aspect or the second aspect is implemented.
With the method and apparatus for determining text categories provided by the embodiments of this specification, when a word vector is generated, the first word is first expanded into extension words based on the attribute description information associated with it in a preset knowledge graph, and the individual word vectors of the first word and its extension words are merged into a synthesized word vector of the first word. In the process of determining text categories, for each word in the text to be processed, this synthesized word vector, obtained through word expansion and vector merging, is used. Because of the richness of information contained in a knowledge graph, multiple kinds of attribute information of each word can be fully utilized, so that the generated word vectors are more effective and the accuracy of text classification is improved.
Specific embodiments
The solutions provided in this specification are described below with reference to the accompanying drawings.
Fig. 1 shows an exemplary architecture of an embodiment of this specification. In this exemplary architecture, a terminal and a server communicate over a network. The terminal may be a smartphone, a laptop, a desktop computer, or the like, and various client applications may be installed on it. The server may be a background server that provides support for the various client applications. A user can interact with the server side through a client application running on the terminal device.
Specifically, in one application scenario, the client application may be, for example, a chat tool application (such as QQ), a social platform application (such as Weibo), or a financial platform application, among others. A user may publish or transmit text information through the client application. The terminal on which the client application runs, or the computing platform of the background server that provides support for the client application, can classify the text that the user publishes or transmits. Classification here is meant in a broad sense. Specifically, classifying a text may mean determining the field the text belongs to, such as the economic field, the political field, or the medical field, or it may mean determining the riskiness of the text, such as risk text or risk-free text, without limitation.
In the above scenario, in the process of determining a text category, the text to be processed is often first segmented to determine the words it contains. Then, a word vector is determined for each word, and the word vectors of the words are used to determine the category of the text. The embodiments of this specification mainly improve the word vector determination process, using more of the information about each word to improve the accuracy of text classification.
The word vector generation process is described first below.
Fig. 2 is a flow diagram of a word vector generation method according to one embodiment. The word vector generation flow shown in Fig. 2 is suitable for an electronic device with a certain computing capability. Taking any word as the first word, as shown in Fig. 2, the process of generating a word vector for this first word may include: step 201, obtaining the first word; step 202, extracting at least one extension word of the first word based on attribute description information associated with the first word in a preset knowledge graph; step 203, obtaining, according to a pre-trained word vector model, an individual word vector for each of the first word and the extension words; and step 204, merging the individual word vectors to generate a synthesized word vector of a predefined size for the first word.
First, in step 201, the first word is obtained. It can be understood that the first word may be a word stored locally in advance, in which case it can be obtained directly. The first word may also be obtained through operations such as text segmentation or keyword extraction; in that case, obtaining the first word may involve segmenting a text, removing stop words, extracting keywords, and so on. If the text has been segmented and its keywords extracted in advance, a text may also correspond to a word set, and the first word may be a word obtained in order from that word set. In short, this step places no limitation on how the word is obtained.
Then, in step 202, at least one extension word of the first word is extracted based on attribute description information associated with the first word in a preset knowledge graph. Through this step, the first word is expanded.
A knowledge graph (Knowledge Graph), also known in library and information science as a knowledge domain map or knowledge domain visualization, can use visualization techniques to describe knowledge resources and their carriers, and to mine, analyze, construct, draw, and display knowledge and the interconnections, development processes, and structural relationships among its elements. Here, a knowledge resource can be a word itself, or a thing denoted by the word.
The knowledge resources included in a knowledge graph can be represented by words. A word can stand for the word itself, or for the specific thing the word denotes. Further, a specific knowledge resource can be described through a complex network structure. These descriptions may refer to the word itself, or to the specific thing the word denotes. The description information may be specific descriptions of the knowledge resource, superordinate information about the knowledge resource, or descriptions of the knowledge resource through related resources; all of this description information can be collectively referred to as the attribute description information of the knowledge resource. In the knowledge graph, the attribute description information of a knowledge resource is associated with that knowledge resource. For example, in a visualized knowledge graph, the various other knowledge resources connected to a given knowledge resource can all serve as attribute description information of that knowledge resource.
As an example, Fig. 3 gives a schematic knowledge graph of one word. As shown in Fig. 3, suppose the current knowledge resource is the word "millet". In its corresponding knowledge graph, each attribute of the word "millet" can be described. A description of "millet" may refer to the word itself, such as "noun" or "pronoun", or to the specific things the word "millet" denotes, such as a person named Millet or the food millet.
For the first word, its attribute description information may include specific description information about the first word or the thing it denotes, superordinate description information of the first word, associated description information, and so on. This attribute description information can be embodied in the knowledge graph as words: a word associated with the first word can serve as attribute description information of the first word. Taking Fig. 3 as an example, such direct associations are indicated by connecting lines. In Fig. 3, suppose the first word is "millet". The words connected to "millet", such as person name, company name, plant, Poales, grain, crop, herbaceous plant, and the five cereals, can serve as superordinate description words (superordinate description information) of "millet". The word "herba setariae viridis" connected to "millet", as a similar plant, an associated plant, a nickname, and so on, can be an associated word (associated description information) of "millet". In a knowledge graph with more levels, there can also be more specific description words for "millet", such as yellow (color), round grain (shape), or Shanxi (one of its places of origin), which can serve as specific description words (specific description information) describing the particular features of the plant "millet". It is worth noting that the division of attribute information into categories here is not unique; for example, "person name" and "company name" in Fig. 3 could also serve as specific description information of "millet". The superordinate description words, associated words, and specific description words for "millet" listed above can all serve as attribute description information associated with the first word "millet".
A complete knowledge graph may include many knowledge resources. For example, in Fig. 3, besides "millet", "crop" can also correspond to multiple other words such as "corn", "wheat", "rice", and "sweet potato", and these words can in turn correspond to other description information; for example, "rice" can also correspond to "grain", "white", "long grain", and so on. A knowledge graph can depict the complex relationships among these knowledge resources. It can be built from knowledge information from various channels, such as scientific encyclopedias and web pages, and can be established and stored in advance.
As can be seen that knowledge mapping is just as a complicated network, each kind of a knowledge resource (such as vocabulary)
Property description information can also be distinguished according to the associated layers grade with the knowledge resource.Such as in the knowledge mapping shown in Fig. 3, know
Know resource be the first vocabulary " millet ", be directly linked level on attribute description information may include " name ", " company ",
" virtual portrait ", " grass family ", " crops ", " herbaceous plant ", " grain ", " herba setariae viridis " etc., in respiratory sensation level
Attribute description information may include " plant ", " grass ", " food ", " mechanism " etc..In general, the attribute for being directly linked level is retouched
It is more significant to the first vocabulary to state information.Therefore, in some embodiments, retouched from the attribute that the first vocabulary is directly linked level
It states and extracts extension vocabulary in information.But it is not excluded in some embodiments, it can be from the category in the first vocabulary respiratory sensation level
Property description information in extract extension vocabulary.
In a specific embodiment, the words in the attribute description information of the first word may be used directly as extension words, or some of them may be selected as extension words, for example by selecting only the specific description words from the attribute description information as extension words.
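The extraction of extension words described above can be sketched as a simple lookup over attribute-description categories. This is an illustrative sketch only: the dictionary structure and category names below are hypothetical stand-ins for a real preset knowledge graph.

```python
# a minimal sketch of extracting extension words from a preset knowledge graph;
# the graph structure and attribute categories here are hypothetical stand-ins
knowledge_graph = {
    "millet": {
        "superordinate": ["crop", "grain", "herbaceous plant"],
        "associated": ["herba setariae viridis"],
        "specific": ["yellow", "round grain"],
    }
}

def extract_extensions(word, categories=("superordinate", "associated", "specific")):
    """Return extension words: the words in the selected attribute-description
    categories associated with `word` in the knowledge graph."""
    attrs = knowledge_graph.get(word, {})
    return [w for cat in categories for w in attrs.get(cat, [])]

print(extract_extensions("millet"))                            # all attribute words
print(extract_extensions("millet", categories=("specific",)))  # only specific descriptions
```

The `categories` parameter reflects the choice mentioned above of using all attribute words or only a selected subset (e.g., only specific description words) as extension words.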
Next, in step 203, an individual word vector is obtained for each of the first word and the extension words according to a pre-trained word vector model. It can be understood that word vector techniques can convert words in natural language into dense vectors, such that similar words have similar vector representations; this conversion helps mine the features of words and sentences in text. The method of generating word vectors may be based on statistical methods (co-occurrence matrices, SVD decomposition), or on neural-network language models of various structures, such as word2vec (word embeddings) or GloVe (Global Vectors for Word Representation). The embodiments of this specification place no limitation on the specific word vector model.
Usually, a word can first be expressed as a one-hot representation; that is, each word is assigned a unique number string, which distinguishes it from other words. For example, "banana" is represented as [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 ...] and "apple" is represented as [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 ...]. That is, in a corpus, each word (such as banana or apple) corresponds to a vector in which exactly one value is 1 and all others are 0; this vector corresponds to the word's unique number string. If the vector is regarded as a binary representation, each word can also correspond to a decimal or hexadecimal number, and two different words always correspond to different number strings. This number string can also be called a number ID; for example, the number ID can be the decimal representation of the binary vector above, so the unique number string [0 0 ... 0 1 0 0 0] can correspond to the number ID "8". Since one-hot vectors are mutually independent, possible associations between words cannot be determined directly from the unique number strings. Moreover, the vector dimension depends on the number of words in the corpus: the more words in the corpus, the larger the dimension. A word vector model can map words with similar meanings or stronger relevance to nearby positions in a low-dimensional vector space.
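The one-hot representation and its number ID can be sketched as follows. The four-word vocabulary is a made-up example; reading the one-hot vector as a binary number (leftmost bit most significant) yields the decimal number ID described above.

```python
# sketch of the one-hot representation and number ID: each word gets a vector
# with a single 1, which can equivalently be read as a binary number and
# stored as a decimal "number ID"
vocab = ["apple", "banana", "orange", "rice"]

def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def number_id(word):
    # read the one-hot vector as a binary number, leftmost bit most significant
    return int("".join(map(str, one_hot(word))), 2)

print(one_hot("banana"))    # [0, 1, 0, 0]
print(number_id("banana"))  # binary 0100 -> 4
```

With this toy vocabulary, "apple" maps to [1 0 0 0], i.e., number ID 8, matching the form of the example above where the vector ending in 1 0 0 0 corresponds to ID "8".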
Taking word2vec as an example, the input layer may be the unique number string corresponding to a word's one-hot representation, and the output layer gives the word vector corresponding to the word, where each element of the output layer corresponds to one word dimension of the word vector. For example, if the output layer corresponds to the words [apple, banana, orange, rice, ...], the value at each element can indicate the degree of correlation between the word at the input layer and the word corresponding to that element. During model training, the word vector of a sample word can be represented by its degrees of correlation with the words corresponding to the elements; for a sample word, these degrees of correlation can be determined statistically from the contextual relationships of words in the corpus: the higher the probability that two words occur together in context (e.g., adjacently), the stronger their correlation. The value at each element can lie between 0 and 1. The unique number string in the one-hot representation of the sample word is connected to a hidden layer with fewer nodes. The weights connecting the input layer and the hidden layer become the new word vector. The activation function of this hidden layer may, for example, be a linear weighted sum over the nodes of the layer (nonlinear activation functions such as sigmoid or tanh are not used). The nodes of the hidden layer can then be fed to a softmax (normalized exponential function) output layer. During training, for the words that occur in the corpus, the weights (model parameters) of the neural network are continuously adjusted so that, for each word at the input layer, the words with higher correlation are output with higher probability at the output layer.
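The architecture just described (one-hot input, a linear hidden layer whose input weights become the word vectors, and a softmax output trained on co-occurring word pairs) can be sketched as a tiny skip-gram trainer. This is an illustrative toy with a made-up four-sentence corpus, not a production word2vec implementation; real use would rely on a trained library model.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = [["i", "eat", "banana"], ["i", "eat", "apple"],
          ["buy", "banana"], ["buy", "apple"]]
vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                 # vocabulary size, embedding dimension

W_in = rng.normal(0, 0.1, (V, D))    # input->hidden weights: rows are word vectors
W_out = rng.normal(0, 0.1, (D, V))   # hidden->output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# skip-gram training pairs: (center word, context word) within a window of 1
pairs = [(s[i], s[j]) for s in corpus for i in range(len(s))
         for j in range(max(0, i - 1), min(len(s), i + 2)) if i != j]

lr = 0.05
for _ in range(200):
    for c, t in pairs:
        h = W_in[idx[c]]                # hidden layer: linear, no nonlinearity
        p = softmax(h @ W_out)          # softmax output over the vocabulary
        g = p.copy(); g[idx[t]] -= 1.0  # gradient of cross-entropy loss
        W_in[idx[c]] -= lr * (W_out @ g)
        W_out -= lr * np.outer(h, g)

def vec(w):                             # the trained row of W_in is the word vector
    return W_in[idx[w]]

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# words sharing contexts ("banana"/"apple") should drift toward similar vectors
print(cos(vec("banana"), vec("apple")))
```

This mirrors the description above: the linear hidden layer has fewer nodes than the vocabulary, the input-to-hidden weights serve as the learned word vectors, and training pushes the output probabilities of co-occurring words higher.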
In this way, for the first word obtained in step 201 and the at least one extension word extracted in step 202, an individual word vector can be obtained for each through the pre-trained word vector model. In some embodiments, the unique number strings (number IDs) of the first word and of each extension word can be obtained first and input into the pre-trained word vector model, so that the corresponding individual word vectors are obtained from the model's output.
Then, in step 204, the individual word vectors corresponding to the first word and the extension words are merged to generate a synthesized word vector of a predefined size for the first word. In this step, the individual word vectors can be merged into one synthesized word vector of a predetermined length, which then serves as the word vector of the first word.
It can be understood that when a word vector is determined for a single word by a word vector model, usually only context words are considered, and attribute description words are relatively rare. For example, the contexts of "banana" may mostly be "eat a banana", "pick up a banana", "buy a banana", and so on, with fewer descriptions such as "a yellow banana" or "a banana as a fruit". Therefore, although context words may be involved in generating an individual word vector, the information utilized is still limited. Since the synthesized word vector comprehensively considers both the word itself and its attribute description information in the knowledge graph, it can enrich the meaning of the first word.
There are many methods for merging individual word vectors, such as neural networks, linear regression, averaging, superposition, and max pooling. Taking max pooling as an example, the individual word vectors obtained in step 203 can first be arranged together into a matrix. Suppose each individual word vector has M dimensions and there are N individual word vectors; they then form an M x N matrix. A sliding window of fixed size is slid over the matrix, and at each position the maximum value within the window is taken. For example, with a window of size 1 x N, sliding one row at a time and taking the maximum each time, a synthesized word vector of M dimensions is obtained. If the predefined size of the synthesized word vector is M-1 dimensions, the window size can instead be 2 x N, sliding one row at a time and taking the maximum. If the predefined size is M/2 dimensions, the window size can be 2 x N, sliding two rows at a time and taking the maximum. In practice, the predefined size of the synthesized word vector may also be other dimensions; different window sizes and strides can be determined according to the predefined size, without limitation here.
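The max-pooling merge above can be sketched directly: stack the N individual M-dimensional vectors into an M x N matrix and slide a window over the rows. The three 4-dimensional vectors below are hypothetical, chosen only to show how window size and stride control the output dimension (M, M-1, and M/2, as in the text).

```python
import numpy as np

def max_pool_merge(word_vectors, window_rows=1, stride=1):
    """Merge N individual M-dim word vectors by max pooling: stack them into an
    M x N matrix, slide a (window_rows x N) window down the rows with the given
    stride, and take the maximum inside each window."""
    mat = np.stack(word_vectors, axis=1)         # shape (M, N)
    M = mat.shape[0]
    out = [mat[r:r + window_rows].max()          # max over the window_rows x N window
           for r in range(0, M - window_rows + 1, stride)]
    return np.array(out)

# three hypothetical 4-dim individual word vectors (M=4, N=3)
vecs = [np.array([0.1, 0.5, 0.2, 0.9]),
        np.array([0.3, 0.4, 0.8, 0.1]),
        np.array([0.2, 0.6, 0.7, 0.3])]

print(max_pool_merge(vecs))                           # 1 x N window -> M = 4 dims
print(max_pool_merge(vecs, window_rows=2))            # 2 x N, stride 1 -> M-1 = 3 dims
print(max_pool_merge(vecs, window_rows=2, stride=2))  # 2 x N, stride 2 -> M/2 = 2 dims
```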
It is worth noting that when the merging of the individual word vectors operates on numerical values (rather than on whole vectors), for example when the merging mode is linear regression or averaging, it can be a processing of the elements of the same dimension across the vectors being synthesized, such as averaging the first-dimension elements of all the individual word vectors. Superposition, in turn, can be vector addition, i.e., summing the elements of the same dimension.
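The element-wise merging modes just described (averaging and superposition over same-dimension elements) can be sketched over the same kind of hypothetical 4-dimensional vectors:

```python
import numpy as np

# element-wise merges: averaging takes the mean of each dimension across the
# individual word vectors; superposition simply sums the same-dimension elements
vecs = np.array([[0.1, 0.5, 0.2, 0.9],
                 [0.3, 0.4, 0.8, 0.1],
                 [0.2, 0.6, 0.7, 0.3]])

avg_merge = vecs.mean(axis=0)   # average of same-dimension elements
sum_merge = vecs.sum(axis=0)    # superposition: sum of same-dimension elements

print(avg_merge)
print(sum_merge)
```

Note that, unlike max pooling with a larger window, these element-wise modes always produce a synthesized vector of the same dimension M as the individual vectors.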
For the first word, the individual word vectors of the first word and of its extension words are merged in the manner described above, yielding a synthesized word vector of the predefined size. This synthesized word vector is the word vector determined for the first word by the method shown in Fig. 2.
It is worth noting that when the word vector model used in step 203 is the same, the dimensions of the individual word vectors are generally consistent; however, the number of extension words of the first word is not fixed, so concatenating (splicing) the individual word vectors is not recommended when merging them in this step. The possibility of concatenation is not excluded, however: for example, a constraint can be added that the number of extension words is a predetermined number, or other unified methods can be used to compress a vector of indefinite length to the predefined size.
To illustrate the word vector generation method of Fig. 2 more clearly, refer to the specific word vector generation example shown in Fig. 5. In the specific example of Fig. 5, the first word is "banana". After the first word "banana" is obtained, it is expanded based on the knowledge graph. Suppose the expansion yields the extension words "fruit" and "yellow"; together with the first word itself, there are then three words: "banana", "yellow", and "fruit". Next, the number IDs of these three words are obtained, for example "4" for "banana", "6" for "fruit", and "9" for "yellow". Each number ID is processed by the predetermined word vector model, yielding the corresponding individual word vectors. Further, as shown in Fig. 5, the three individual word vectors can be merged to generate one synthesized word vector. This synthesized word vector is the final word vector of the first word "banana" determined by the word vector generation method of the embodiments of this specification.
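The "banana" example can be sketched end to end. Everything here is a hypothetical stand-in: the tiny knowledge graph, the number-ID table, and the "pre-trained" vector lookup replace the real model outputs, and averaging stands in for whichever merging mode is chosen.

```python
import numpy as np

# hypothetical data standing in for the components described above: a tiny
# knowledge graph, a number-ID table, and a pre-trained word-vector lookup
knowledge_graph = {"banana": {"superordinate": ["fruit"], "specific": ["yellow"]}}
number_id = {"banana": 4, "fruit": 6, "yellow": 9}
pretrained = {4: np.array([0.9, 0.1, 0.4]),   # stand-ins for model outputs
              6: np.array([0.2, 0.8, 0.3]),
              9: np.array([0.5, 0.2, 0.7])}

def synthesized_vector(word):
    # step 202: expand the word via its knowledge-graph attribute descriptions
    attrs = knowledge_graph.get(word, {})
    extensions = [w for words in attrs.values() for w in words]
    # step 203: look up an individual word vector for the word and each extension
    vectors = [pretrained[number_id[w]] for w in [word] + extensions]
    # step 204: merge by averaging the same-dimension elements
    return np.mean(vectors, axis=0)

print(synthesized_vector("banana"))
```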
Through the flow shown in Fig. 2, in the process of generating a word vector for the first word, the word is first expanded using its various attribute description information from the knowledge graph, and the individual word vectors of the first word and its extension words are then merged into the word vector of the first word. Because of the richness of information contained in a knowledge graph, the multiple kinds of information about the first word can be fully utilized, making the generated word vector more effective.
The process of determining text categories using the word vector generation method of Fig. 2 is described in detail below.
Fig. 6 shows a method of determining text categories according to one embodiment. The executing entity of this method is an electronic device with a certain data processing capability, such as the server side shown in Fig. 1. This executing entity may or may not be the same as the executing entity of the word vector generation method of Fig. 2; there is no limitation in this respect. As shown in Fig. 6, the flow of determining text categories includes: step 601, obtaining a first text to be processed; step 602, performing word segmentation on the first text to obtain at least one candidate word; step 603, determining a synthesized word vector corresponding to each candidate word, wherein the at least one candidate word includes a first word, and the synthesized word vector of the first word is obtained in the following manner: extracting at least one extension word of the first word based on attribute description information associated with the first word in a preset knowledge graph; obtaining, according to a pre-trained word vector model, an individual word vector for each of the first word and the extension words; and merging the individual word vectors to generate a synthesized word vector of a predefined size for the first word; and step 604, inputting each synthesized word vector into a pre-trained prediction model, and determining the text category of the first text according to the output result of the prediction model.
First, in step 601, the first text to be processed is obtained. Here, the first text to be processed can be any text whose category needs to be determined, for example a chat message on a social platform, an article on a popular-science platform, or an item of news.
Then, in step 602, word segmentation is performed on the first text to obtain at least one candidate word. Word segmentation divides the characters in a text into individual words. The segmentation processing of a text may include tokenization and stop-word removal. Tokenization splits the text; for example, the text "I had a breakfast downstairs" can be split by a pre-trained dictionary into words such as "I", "downstairs", "had", "a", "breakfast". Stop words are usually words in a stop-word list; they often contribute little to the actual meaning of the text, such as function words and prepositions. In the previous example, stop words such as "a" can be removed, leaving words such as "downstairs", "had", "breakfast". These words can serve as the candidate words obtained after filtering the text.
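The tokenize-and-filter step can be sketched as follows. The stop-word list and the whitespace tokenizer are simplifications for illustration; real segmenters for unspaced languages use dictionary- or model-based splitting.

```python
# a minimal sketch of the segmentation-and-filtering step; the stop-word list
# and the whitespace tokenizer are hypothetical simplifications
STOP_WORDS = {"i", "a", "an", "the", "of", "at", "in"}

def candidate_words(text):
    tokens = text.lower().split()                      # tokenize (whitespace stand-in)
    return [t for t in tokens if t not in STOP_WORDS]  # remove stop words

print(candidate_words("I had a breakfast downstairs"))
```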
It can be understood that after tokenization and stop-word removal, only the effective words of the first text remain as candidate words, and using only the effective words in subsequent processing reduces the amount of data to be processed. When the text is long (e.g., exceeds a predetermined character-count threshold), after segmenting the text, only a predetermined number (e.g., 5) of keywords may be extracted as candidate words. Keywords can be extracted, for example, by TF-IDF (term frequency-inverse document frequency) or LDA (Latent Dirichlet Allocation, a document topic model), which are not described in detail here.
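The TF-IDF keyword extraction mentioned above can be sketched with a toy corpus. The three pre-segmented documents are made up for illustration; the scoring is the standard term frequency times inverse document frequency.

```python
import math
from collections import Counter

# a small illustrative corpus; documents are pre-segmented word lists
docs = [["banana", "fruit", "yellow", "banana"],
        ["apple", "fruit", "red"],
        ["stock", "market", "risk"]]

def tfidf_keywords(doc, corpus, k=2):
    """Score each word in `doc` by term frequency times inverse document
    frequency over `corpus`, and return the top-k words as keywords."""
    tf = Counter(doc)
    n = len(corpus)
    def idf(w):
        df = sum(1 for d in corpus if w in d)  # documents containing w
        return math.log(n / df)
    scores = {w: (tf[w] / len(doc)) * idf(w) for w in tf}
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

print(tfidf_keywords(docs[0], docs))
```

"fruit" appears in two of the three documents, so its IDF is low; the words distinctive to the first document ("banana", "yellow") score highest and are kept as candidate words.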
Then, in step 603, the synthesized word vector corresponding to each candidate word is determined. It is worth noting that the word vector used here for a candidate word is not a traditional individual word vector, but is generated by merging the individual word vectors of multiple words.
Specifically, the synthesis term vector of each candidate word can based on candidate word itself with its extend vocabulary word word to
Amount determines.Assuming that any vocabulary in candidate word obtained in step 602 is the first vocabulary, then when getting the vocabulary,
Synthesis term vector can obtain in the following manner: based on attribute associated with first vocabulary in preset knowledge mapping
Description information extracts at least one extension vocabulary of the first vocabulary;According to term vector model trained in advance, the first vocabulary is obtained
And the corresponding each word term vector of each extension vocabulary;Each word term vector is merged, thus raw for the first vocabulary
At the synthesis term vector of predefined size.As can be seen that above procedure is consistent with the process of term vector generation method that Fig. 2 is used.
Wherein, in this step 603, for each candidate word, it can use and generate synthesis term vector with the consistent process of Fig. 2 for it,
Details are not described herein.
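The per-word procedure of step 603 can be sketched as follows. The tiny knowledge graph and word-vector table below are hypothetical stand-ins; in the described method the extension words come from a real knowledge graph and the vectors from a trained model such as word2vec:

```python
# Hypothetical stand-ins for the preset knowledge graph and for the
# pre-trained word vector model's lookup table.
KG = {"apple": ["fruit", "red"]}
VEC = {"apple": [1.0, 0.0], "fruit": [0.0, 1.0], "red": [1.0, 1.0]}

def synthesis_vector(word):
    """Sketch of step 603 for one candidate word: extend the word
    via the knowledge graph, fetch each single-word vector, then
    merge them (here by per-dimension averaging) into one
    fixed-size synthesis word vector."""
    words = [word] + KG.get(word, [])
    vectors = [VEC[w] for w in words]
    dims = list(zip(*vectors))          # group same-dimension elements
    return [sum(d) / len(d) for d in dims]
```

A word with no entry in the knowledge graph simply keeps its own single-word vector, so the output size is fixed either way.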
In one embodiment, the synthesis word vector of each candidate word can be determined in real time in this step. That is, after the candidate words are obtained in step 602, step 603 takes each candidate word in turn and, for each one, executes the word vector generation method shown in Fig. 2 to generate its synthesis word vector.
In another embodiment, synthesis word vectors can be determined in advance for the words in a dictionary (or a predetermined word set) and stored in a database or in a predetermined storage area of the executing entity. In step 603, the synthesis word vector corresponding to each candidate word obtained in step 602 can then be read directly.
As the word vector generation method of Fig. 2 shows, each synthesis word vector determined in step 603 contains not only the lexical information of the candidate word itself but also its attribute description information, and therefore carries richer meaning.
Then, in step 604, the synthesis word vectors are input into a pre-trained prediction model, and the text category of the first text is determined from the output of the prediction model. The meaning of the text category here depends on the application scenario. For example, on a news platform the text categories may be domains such as agriculture, politics, and economy, while in the field of network risk control the text categories may be risk text and non-risk text, or even risk text, non-risk text, and undetermined-risk text, and so on.
The prediction model may be a fully connected neural network, a decision-tree model (such as GBDT), or a recurrent neural network (RNN, Recurrent Neural Network), without limitation here. It can be appreciated that in some scenarios, such as the long-text case mentioned above where a predetermined number of keywords are extracted as candidate words, the number of candidate words of every text is fixed, namely that predetermined number. The synthesis word vectors of the candidate words of each text then have a consistent total dimension, and if the synthesis word vectors are concatenated, the resulting dimension is also fixed, so a fully connected neural network model, a decision tree, or the like can be used as the prediction model. In other scenarios the text to be processed may be a sentence or a few sentences; all words remaining after word cutting then serve as candidate words, and the dimension of the concatenated synthesis word vectors is not fixed. In that case a recurrent neural network can be used as the prediction model, treating the synthesis word vectors of the candidate words in the text as a sequence. In some implementations, a long short-term memory model (Long Short-Term Memory, LSTM) under the RNN architecture can also be used as the prediction model.
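To illustrate why a recurrent model handles the variable-length case, the following toy recurrent classifier (random untrained weights and hypothetical sizes, a sketch rather than the actual prediction model) consumes a text's synthesis word vectors one at a time, so texts with different numbers of candidate words all map to the same fixed-size hidden state:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 8                          # synthesis-vector size, hidden size
Wx = rng.normal(size=(H, D)) * 0.1   # input-to-hidden weights
Wh = rng.normal(size=(H, H)) * 0.1   # hidden-to-hidden weights
w_out = rng.normal(size=H) * 0.1     # hidden-to-output weights

def rnn_predict(seq):
    """Consume a sequence of synthesis word vectors of any length
    and emit a probability in (0, 1)."""
    h = np.zeros(H)
    for x in seq:
        h = np.tanh(Wx @ x + Wh @ h)
    return 1.0 / (1.0 + np.exp(-(w_out @ h)))
```

An LSTM replaces the plain tanh update with gated cell-state updates, but is used the same way: the per-word vectors enter as a sequence.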
Taking risk-related text categories as an example, the training process of the above prediction model is described below. First, multiple texts are obtained as samples, each labeled in advance with a text label such as "risk text" or "non-risk text". Each text is processed as described for steps 601 to 603, so that for any sample text, at least one candidate word of that text and the synthesis word vector corresponding to each candidate word are obtained. For convenience of description, the synthesis word vectors corresponding to one text are called one group of word vectors. The groups of synthesis word vectors of the texts are input in turn into the selected model, and the model parameters are adjusted according to the corresponding text labels so that the value of the prediction loss function tends to decrease, thereby training the above prediction model.
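This training loop can be sketched as follows, with logistic regression standing in for the selected model and each group of synthesis word vectors reduced to its mean as a fixed-size feature (both simplifying assumptions):

```python
import numpy as np

def train(groups, labels, lr=0.5, epochs=200):
    """Minimal sketch of the training loop above: each sample is one
    group of synthesis word vectors (averaged into one feature),
    labels are 1 = risk text, 0 = non-risk text, and the parameters
    are adjusted so that the log loss decreases."""
    X = np.stack([np.mean(grp, axis=0) for grp in groups])
    y = np.asarray(labels, dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted risk probability
        g = p - y                               # gradient of the log loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b
```

After training, a sample resembling the risk-labeled group scores above 0.5 and one resembling the non-risk group scores below it.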
During model training, the pre-assigned sample labels can be represented numerically, for example "risk text" as 1 and "non-risk text" as 0. In some implementations, the output of the prediction model is then a specific category result of 0 or 1, in which case, when the prediction model is used to determine the text category of the first text, the category can be determined directly from the output of the prediction model. In other cases, the output of the prediction model is a value between 0 and 1 indicating the probability that the text is a risk text. In this case, when the prediction model is used to determine the text category of the first text, the output of the prediction model can further be compared against preset probability thresholds. For example, when the output exceeds a first threshold (e.g. 0.7), the text category of the first text is risk text; when the output is below a second threshold (e.g. 0.3), the text category of the first text is non-risk text; and when the output falls between the first and second thresholds, the first text is an undetermined-risk text whose category can be further determined manually.
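The two-threshold rule above can be written directly (0.7 and 0.3 are the example thresholds from the text):

```python
def classify(prob, hi=0.7, lo=0.3):
    """Map the prediction model's probability output to a text
    category using the two preset thresholds described above."""
    if prob > hi:
        return "risk"
    if prob < lo:
        return "non-risk"
    return "undetermined"       # handed over for manual review
```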
In other scenarios, the categories of texts can likewise be identified by various numbers, for example "3" for the agriculture category, "1" for the politics category, and so on.
Fig. 7 shows a specific example of applying the method for determining text categories shown in Fig. 6. In this example, the text to be processed is "bananas are delicious"; segmentation yields the candidate words "banana", "very", "delicious". For each candidate word, a synthesis word vector is determined by the new word vector determination method shown in Figs. 2 and 5. The synthesis word vectors corresponding to the candidate words are then input into a pre-trained deep learning network (such as the CNN or RNN shown in Fig. 7), yielding the category ID "5" of the text. A category ID may correspond to a specific category, such as household encyclopedia.
Reviewing the above process: in determining text categories, because the attribute description information in the knowledge graph is used during the word vector generation of the candidate words in the text, the candidate words are described more comprehensively and richly, from more dimensions, which can improve the accuracy of text classification. When applied to text risk prediction, the effectiveness of risk prediction can be improved. In experiments, applying the synthesis word vectors of words determined by the method shown in Fig. 2 to text classification improved classification accuracy by more than 20% over traditional single-word vector methods on a text classification model trained on a small data set.
According to an embodiment of another aspect, a word vector generating apparatus is also provided. Fig. 8 shows a schematic block diagram of a word vector generating apparatus according to one embodiment. As shown in Fig. 8, the word vector generating apparatus 800 includes: an acquiring unit 81 configured to obtain a first vocabulary; an expanding unit 82 configured to extract at least one extension word of the first vocabulary based on the attribute description information associated with the first vocabulary in a preset knowledge graph; a word processing unit 83 configured to obtain, using a pre-trained word vector model, the single-word vectors corresponding to the first vocabulary and to each extension word; and a combining unit 84 configured to merge the single-word vectors to generate a synthesis word vector of predefined size for the first vocabulary.
According to one embodiment, the attribute description information associated with the first vocabulary includes at least one of the following: a specific description word for the first vocabulary, a superordinate description word of the first vocabulary, and an associated word.
According to one embodiment, the word vector model is word2vec.
According to one embodiment, the word processing unit 83 is further configured to: obtain the unique number string represented by the one-hot encoding of the first vocabulary or of an extension word; and input the unique number string into the pre-trained word vector model, the single-word vector corresponding to the first vocabulary or the extension word being determined from the output of the word vector model.
According to one embodiment, the combining unit 84 merges the single-word vectors in at least one of the following ways:
performing max pooling on the matrix formed by arranging the single-word vectors;
averaging the elements of the same dimension across the single-word vectors;
superposing the single-word vectors.
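The three merging options of combining unit 84 can be sketched as follows, assuming each single-word vector is a list of equal length:

```python
def merge(vectors, mode):
    """Merge single-word vectors by max pooling, per-dimension
    averaging, or superposition, as listed above."""
    dims = list(zip(*vectors))          # each column is one dimension
    if mode == "max":                   # max pooling over each column
        return [max(d) for d in dims]
    if mode == "mean":                  # average same-dimension elements
        return [sum(d) / len(d) for d in dims]
    if mode == "sum":                   # superpose the vectors
        return [sum(d) for d in dims]
    raise ValueError(mode)
```

All three reductions yield a vector of the same size as the inputs, which is what gives the synthesis word vector its predefined size regardless of how many extension words a vocabulary has.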
It should be noted that the apparatus 800 shown in Fig. 8 is an apparatus embodiment corresponding to the method embodiment shown in Fig. 2; the corresponding descriptions in the method embodiment of Fig. 2 apply equally to the apparatus 800 and are not repeated here.
According to an embodiment of yet another aspect, an apparatus for determining text categories is also provided. Fig. 9 shows a schematic block diagram of an apparatus for determining text categories according to one embodiment. As shown in Fig. 9, the apparatus 900 for determining text categories includes: a receiving unit 91 configured to obtain a first text to be processed; a pre-processing unit 92 configured to perform word cutting on the first text to obtain at least one candidate word; a determination unit 93 configured to determine the synthesis word vector corresponding to each candidate word, wherein the at least one candidate word includes a first vocabulary whose synthesis word vector is obtained as follows: based on the attribute description information associated with the first vocabulary in a preset knowledge graph, extract at least one extension word of the first vocabulary; using a pre-trained word vector model, obtain the single-word vectors corresponding to the first vocabulary and to each extension word; and merge these single-word vectors to generate a synthesis word vector of predefined size for the first vocabulary; and a prediction unit 94 configured to input the synthesis word vectors into a pre-trained prediction model and determine the text category of the first text from the output of the prediction model.
According to one embodiment, the attribute description information associated with the first vocabulary includes at least one of the following: a specific description word for the first vocabulary, a superordinate description word of the first vocabulary, and an associated word.
According to one embodiment, the text categories include risk text or non-risk text.
In a further embodiment, the prediction model is trained as follows:
obtaining multiple texts as samples, where each text corresponds to a group of synthesis word vectors determined from the candidate words in that text, and to a pre-assigned text label, the text labels including risk text and non-risk text;
inputting the groups of synthesis word vectors in turn into the selected model, and adjusting the model parameters according to the corresponding text labels, so that after training, the value of the prediction loss function corresponding to the current sample is reduced compared with before training.
According to one embodiment, the prediction model may be one of a fully connected neural network, a decision tree, or a recurrent neural network.
It should be noted that the apparatus 900 shown in Fig. 9 is an apparatus embodiment corresponding to the method embodiment shown in Fig. 6; the corresponding descriptions in the method embodiment of Fig. 6 apply equally to the apparatus 900 and are not repeated here.
According to an embodiment of another aspect, a computer-readable storage medium is also provided, on which a computer program is stored; when the computer program is executed in a computer, it causes the computer to perform the method described with reference to Fig. 2 or Fig. 6.
According to an embodiment of yet another aspect, a computing device is also provided, including a memory and a processor; executable code is stored in the memory, and when the processor executes the executable code, the method described with reference to Fig. 2 or Fig. 6 is implemented.
Those skilled in the art will appreciate that, in the one or more examples above, the functions described in this invention can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium, or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments above further describe the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the foregoing are merely specific embodiments of the present invention and are not intended to limit its protection scope; any modification, equivalent substitution, or improvement made on the basis of the technical solution of the present invention shall fall within the protection scope of the present invention.