Summary of the invention
One or more embodiments of this specification describe a method and an apparatus for determining text categories, so as to solve one or more of the above problems.
According to a first aspect, a method of determining text categories is provided, comprising: obtaining a first text to be processed; performing word segmentation on the first text to obtain at least one candidate word; determining a synthesized word vector corresponding to each candidate word, wherein the synthesized word vector of a first word among the at least one candidate word is obtained in the following manner: extracting at least one extension word of the first word based on attribute description information associated with the first word in a preset knowledge graph; obtaining, according to a pre-trained word vector model, an individual word vector corresponding to each of the first word and the extension words; and merging the individual word vectors to generate a synthesized word vector of a predefined size for the first word; inputting each synthesized word vector into a pre-trained prediction model, and determining the text category of the first text according to the output result of the prediction model.
In one embodiment, the attribute description information associated with the first word includes at least one of the following: a specific description word for the first word, a superordinate description word of the first word, and an associated word.
In one embodiment, the text categories include: risk text or non-risk text.
In a further embodiment, the prediction model is trained in the following manner:
obtaining multiple texts as samples, wherein each text corresponds to a group of synthesized word vectors determined based on the candidate words in that text, and to a pre-labeled text label, the text label being risk text or non-risk text;
inputting each group of synthesized word vectors into the selected model in turn, and adjusting the model parameters through training according to the corresponding text labels.
In one embodiment, the prediction model is one of a fully connected neural network, a decision tree, or a recurrent neural network.
According to a second aspect, a computer-implemented word vector generation method is provided, the method comprising: obtaining a first word; extracting at least one extension word of the first word based on attribute description information associated with the first word in a preset knowledge graph; obtaining, according to a pre-trained word vector model, an individual word vector corresponding to each of the first word and the extension words; and merging the individual word vectors to generate a synthesized word vector of a predefined size for the first word.
In one embodiment, the attribute description information associated with the first word includes at least one of the following: a specific description word for the first word, a superordinate description word of the first word, and an associated word.
In one embodiment, the word vector model is word2vec.
In one embodiment, obtaining, according to the pre-trained word vector model, an individual word vector for each of the first word and the extension words comprises:
obtaining a unique number string represented by the one-hot encoding of the first word or the extension word;
inputting the unique number string into the pre-trained word vector model, and determining the individual word vector corresponding to the first word or the extension word according to the output result of the word vector model.
In one embodiment, the individual word vectors are merged by at least one of the following:
performing max pooling on a matrix formed by arranging the individual word vectors;
averaging the elements of the same dimension across the individual word vectors;
superimposing (summing) the individual word vectors.
According to a third aspect, an apparatus for determining text categories is provided, comprising:
a receiving unit, configured to obtain a first text to be processed;
a preprocessing unit, configured to perform word segmentation on the first text to obtain at least one candidate word;
a determination unit, configured to determine a synthesized word vector corresponding to each candidate word, wherein the at least one candidate word includes a first word, and the synthesized word vector of the first word is obtained in the following manner: extracting at least one extension word of the first word based on attribute description information associated with the first word in a preset knowledge graph; obtaining, according to a pre-trained word vector model, an individual word vector corresponding to each of the first word and the extension words; and merging the individual word vectors to generate a synthesized word vector of a predefined size for the first word;
a prediction unit, configured to input each synthesized word vector into a pre-trained prediction model, and determine the text category of the first text according to the output result of the prediction model.
According to a fourth aspect, a word vector generating apparatus is provided, the apparatus comprising:
an acquiring unit, configured to obtain a first word;
an expansion unit, configured to extract at least one extension word of the first word based on attribute description information associated with the first word in a preset knowledge graph;
a word processing unit, configured to obtain, according to a pre-trained word vector model, an individual word vector corresponding to each of the first word and the extension words;
a merging unit, configured to merge the individual word vectors to generate a synthesized word vector of a predefined size for the first word.
According to a fifth aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method of the first aspect or the second aspect.
According to a sixth aspect, a computing device is provided, comprising a memory and a processor, wherein executable code is stored in the memory, and when the processor executes the executable code, the method of the first aspect or the second aspect is implemented.
With the method and apparatus for determining text categories provided by the embodiments of this specification, when a word vector is generated, the first word is first expanded into extension words based on the attribute description information associated with it in a preset knowledge graph, and the individual word vectors of the first word and its extension words are merged into a synthesized word vector of the first word. In the process of determining text categories, for each word in the text to be processed, this synthesized word vector, obtained through word expansion and vector merging, is used. Because of the richness of information contained in a knowledge graph, multiple kinds of attribute information of each word can be fully utilized, so that the generated word vectors are more effective and the accuracy of text classification is improved.
Specific embodiments
The solutions provided in this specification are described below with reference to the accompanying drawings.
Fig. 1 shows an exemplary architecture of an embodiment of this specification. In this exemplary architecture, a terminal and a server communicate over a network. The terminal may be a smartphone, a laptop, a desktop computer, or the like, and various client applications may be installed on it. The server may be a background server that provides support for the various client applications. A user can interact with the server side through a client application running on the terminal device.
Specifically, in one application scenario, the client application may be, for example, a chat tool application (such as QQ), a social platform application (such as Weibo), or a financial platform application, among others. A user may publish or transmit text information through the client application. The terminal on which the client application runs, or the computing platform of the background server that provides support for the client application, can classify the text that the user publishes or transmits. Classification here is meant in a broad sense. Specifically, classifying a text may mean determining the field the text belongs to, such as the economic field, the political field, or the medical field, or it may mean determining the riskiness of the text, such as risk text or risk-free text, without limitation.
In the above scenario, in the process of determining a text category, the text to be processed is often first segmented to determine the words it contains. Then, a word vector is determined for each word, and the word vectors of the words are used to determine the category of the text. The embodiments of this specification mainly improve the word vector determination process, using more of the information about each word to improve the accuracy of text classification.
The word vector generation process is described first below.
Fig. 2 is a flow diagram of a word vector generation method according to one embodiment. The word vector generation flow shown in Fig. 2 is suitable for an electronic device with a certain computing capability. Taking any word as the first word, as shown in Fig. 2, the process of generating a word vector for this first word may include: step 201, obtaining the first word; step 202, extracting at least one extension word of the first word based on attribute description information associated with the first word in a preset knowledge graph; step 203, obtaining, according to a pre-trained word vector model, an individual word vector for each of the first word and the extension words; and step 204, merging the individual word vectors to generate a synthesized word vector of a predefined size for the first word.
First, in step 201, the first word is obtained. It can be understood that the first word may be a word stored locally in advance, in which case it can be obtained directly. The first word may also be obtained through operations such as text segmentation or keyword extraction; in that case, obtaining the first word may involve segmenting a text, removing stop words, extracting keywords, and so on. If the text has been segmented and its keywords extracted in advance, a text may also correspond to a word set, and the first word may be a word obtained in order from that word set. In short, this step places no limitation on how the word is obtained.
Then, in step 202, at least one extension word of the first word is extracted based on attribute description information associated with the first word in a preset knowledge graph. Through this step, the first word is expanded.
A knowledge graph (Knowledge Graph), also known in library and information science as a knowledge domain map or knowledge domain visualization, can use visualization techniques to describe knowledge resources and their carriers, and to mine, analyze, construct, draw, and display knowledge and the interconnections, development processes, and structural relationships among its elements. Here, a knowledge resource can be a word itself, or a thing denoted by the word.
The knowledge resources included in a knowledge graph can be represented by words. A word can stand for the word itself, or for the specific thing the word denotes. Further, a specific knowledge resource can be described through a complex network structure. These descriptions may refer to the word itself, or to the specific thing the word denotes. The description information may be specific descriptions of the knowledge resource, superordinate information about the knowledge resource, or descriptions of the knowledge resource through related resources; all of this description information can be collectively referred to as the attribute description information of the knowledge resource. In the knowledge graph, the attribute description information of a knowledge resource is associated with that knowledge resource. For example, in a visualized knowledge graph, the various other knowledge resources connected to a given knowledge resource can all serve as attribute description information of that knowledge resource.
As an example, Fig. 3 gives a schematic knowledge graph of one word. As shown in Fig. 3, suppose the current knowledge resource is the word "millet". In its corresponding knowledge graph, each attribute of the word "millet" can be described. A description of "millet" may refer to the word itself, such as "noun" or "pronoun", or to the specific things the word "millet" denotes, such as a person named Millet or the food millet.
For the first word, its attribute description information may include specific description information about the first word or the thing it denotes, superordinate description information of the first word, associated description information, and so on. This attribute description information can be embodied in the knowledge graph as words: a word associated with the first word can serve as attribute description information of the first word. Taking Fig. 3 as an example, such direct associations are indicated by connecting lines. In Fig. 3, suppose the first word is "millet". The words connected to "millet", such as person name, company name, plant, Poales, grain, crop, herbaceous plant, and the five cereals, can serve as superordinate description words (superordinate description information) of "millet". The word "herba setariae viridis" connected to "millet", as a similar plant, an associated plant, a nickname, and so on, can be an associated word (associated description information) of "millet". In a knowledge graph with more levels, there can also be more specific description words for "millet", such as yellow (color), round grain (shape), or Shanxi (one of its places of origin), which can serve as specific description words (specific description information) describing the particular features of the plant "millet". It is worth noting that the division of attribute information into categories here is not unique; for example, "person name" and "company name" in Fig. 3 could also serve as specific description information of "millet". The superordinate description words, associated words, and specific description words for "millet" listed above can all serve as attribute description information associated with the first word "millet".
A complete knowledge graph may include many knowledge resources. For example, in Fig. 3, besides "millet", "crop" can also correspond to multiple other words such as "corn", "wheat", "rice", and "sweet potato", and these words can in turn correspond to other description information; for example, "rice" can also correspond to "grain", "white", "long grain", and so on. A knowledge graph can depict the complex relationships among these knowledge resources. It can be built from knowledge information from various channels, such as scientific encyclopedias and web pages, and can be established and stored in advance.
As can be seen that knowledge mapping is just as a complicated network, each kind of a knowledge resource (such as vocabulary)
Property description information can also be distinguished according to the associated layers grade with the knowledge resource.Such as in the knowledge mapping shown in Fig. 3, know
Know resource be the first vocabulary " millet ", be directly linked level on attribute description information may include " name ", " company ",
" virtual portrait ", " grass family ", " crops ", " herbaceous plant ", " grain ", " herba setariae viridis " etc., in respiratory sensation level
Attribute description information may include " plant ", " grass ", " food ", " mechanism " etc..In general, the attribute for being directly linked level is retouched
It is more significant to the first vocabulary to state information.Therefore, in some embodiments, retouched from the attribute that the first vocabulary is directly linked level
It states and extracts extension vocabulary in information.But it is not excluded in some embodiments, it can be from the category in the first vocabulary respiratory sensation level
Property description information in extract extension vocabulary.
In a specific embodiment, the words in the attribute description information of the first word may be used directly as extension words, or some of them may be selected as extension words, for example by selecting only the specific description words from the attribute description information as extension words.
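The extraction of extension words described above can be sketched as a simple lookup over attribute-description categories. This is an illustrative sketch only: the dictionary structure and category names below are hypothetical stand-ins for a real preset knowledge graph.

```python
# a minimal sketch of extracting extension words from a preset knowledge graph;
# the graph structure and attribute categories here are hypothetical stand-ins
knowledge_graph = {
    "millet": {
        "superordinate": ["crop", "grain", "herbaceous plant"],
        "associated": ["herba setariae viridis"],
        "specific": ["yellow", "round grain"],
    }
}

def extract_extensions(word, categories=("superordinate", "associated", "specific")):
    """Return extension words: the words in the selected attribute-description
    categories associated with `word` in the knowledge graph."""
    attrs = knowledge_graph.get(word, {})
    return [w for cat in categories for w in attrs.get(cat, [])]

print(extract_extensions("millet"))                            # all attribute words
print(extract_extensions("millet", categories=("specific",)))  # only specific descriptions
```

The `categories` parameter reflects the choice mentioned above of using all attribute words or only a selected subset (e.g., only specific description words) as extension words.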
Next, in step 203, an individual word vector is obtained for each of the first word and the extension words according to a pre-trained word vector model. It can be understood that word vector techniques can convert words in natural language into dense vectors, such that similar words have similar vector representations; this conversion helps mine the features of words and sentences in text. The method of generating word vectors may be based on statistical methods (co-occurrence matrices, SVD decomposition), or on neural-network language models of various structures, such as word2vec (word embeddings) or GloVe (Global Vectors for Word Representation). The embodiments of this specification place no limitation on the specific word vector model.
Usually, a word can first be expressed as a one-hot representation; that is, each word is assigned a unique number string, which distinguishes it from other words. For example, "banana" is represented as [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 ...] and "apple" is represented as [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 ...]. That is, in a corpus, each word (such as banana or apple) corresponds to a vector in which exactly one value is 1 and all others are 0; this vector corresponds to the word's unique number string. If the vector is regarded as a binary representation, each word can also correspond to a decimal or hexadecimal number, and two different words always correspond to different number strings. This number string can also be called a number ID; for example, the number ID can be the decimal representation of the binary vector above, so the unique number string [0 0 ... 0 1 0 0 0] can correspond to the number ID "8". Since one-hot vectors are mutually independent, possible associations between words cannot be determined directly from the unique number strings. Moreover, the vector dimension depends on the number of words in the corpus: the more words in the corpus, the larger the dimension. A word vector model can map words with similar meanings or stronger relevance to nearby positions in a low-dimensional vector space.
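The one-hot representation and its number ID can be sketched as follows. The four-word vocabulary is a made-up example; reading the one-hot vector as a binary number (leftmost bit most significant) yields the decimal number ID described above.

```python
# sketch of the one-hot representation and number ID: each word gets a vector
# with a single 1, which can equivalently be read as a binary number and
# stored as a decimal "number ID"
vocab = ["apple", "banana", "orange", "rice"]

def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def number_id(word):
    # read the one-hot vector as a binary number, leftmost bit most significant
    return int("".join(map(str, one_hot(word))), 2)

print(one_hot("banana"))    # [0, 1, 0, 0]
print(number_id("banana"))  # binary 0100 -> 4
```

With this toy vocabulary, "apple" maps to [1 0 0 0], i.e., number ID 8, matching the form of the example above where the vector ending in 1 0 0 0 corresponds to ID "8".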
Taking word2vec as an example, the input layer may be the unique number string corresponding to a word's one-hot representation, and the output layer gives the word vector corresponding to the word, where each element of the output layer corresponds to one word dimension of the word vector. For example, if the output layer corresponds to the words [apple, banana, orange, rice, ...], the value at each element can indicate the degree of correlation between the word at the input layer and the word corresponding to that element. During model training, the word vector of a sample word can be represented by its degrees of correlation with the words corresponding to the elements; for a sample word, these degrees of correlation can be determined statistically from the contextual relationships of words in the corpus: the higher the probability that two words occur together in context (e.g., adjacently), the stronger their correlation. The value at each element can lie between 0 and 1. The unique number string in the one-hot representation of the sample word is connected to a hidden layer with fewer nodes. The weights connecting the input layer and the hidden layer become the new word vector. The activation function of this hidden layer may, for example, be a linear weighted sum over the nodes of the layer (nonlinear activation functions such as sigmoid or tanh are not used). The nodes of the hidden layer can then be fed to a softmax (normalized exponential function) output layer. During training, for the words that occur in the corpus, the weights (model parameters) of the neural network are continuously adjusted so that, for each word at the input layer, the words with higher correlation are output with higher probability at the output layer.
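The architecture just described (one-hot input, a linear hidden layer whose input weights become the word vectors, and a softmax output trained on co-occurring word pairs) can be sketched as a tiny skip-gram trainer. This is an illustrative toy with a made-up four-sentence corpus, not a production word2vec implementation; real use would rely on a trained library model.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = [["i", "eat", "banana"], ["i", "eat", "apple"],
          ["buy", "banana"], ["buy", "apple"]]
vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                 # vocabulary size, embedding dimension

W_in = rng.normal(0, 0.1, (V, D))    # input->hidden weights: rows are word vectors
W_out = rng.normal(0, 0.1, (D, V))   # hidden->output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# skip-gram training pairs: (center word, context word) within a window of 1
pairs = [(s[i], s[j]) for s in corpus for i in range(len(s))
         for j in range(max(0, i - 1), min(len(s), i + 2)) if i != j]

lr = 0.05
for _ in range(200):
    for c, t in pairs:
        h = W_in[idx[c]]                # hidden layer: linear, no nonlinearity
        p = softmax(h @ W_out)          # softmax output over the vocabulary
        g = p.copy(); g[idx[t]] -= 1.0  # gradient of cross-entropy loss
        W_in[idx[c]] -= lr * (W_out @ g)
        W_out -= lr * np.outer(h, g)

def vec(w):                             # the trained row of W_in is the word vector
    return W_in[idx[w]]

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# words sharing contexts ("banana"/"apple") should drift toward similar vectors
print(cos(vec("banana"), vec("apple")))
```

This mirrors the description above: the linear hidden layer has fewer nodes than the vocabulary, the input-to-hidden weights serve as the learned word vectors, and training pushes the output probabilities of co-occurring words higher.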
In this way, for the first word obtained in step 201 and the at least one extension word extracted in step 202, an individual word vector can be obtained for each through the pre-trained word vector model. In some embodiments, the unique number strings (number IDs) of the first word and of each extension word can be obtained first and input into the pre-trained word vector model, so that the corresponding individual word vectors are obtained from the model's output.
Then, in step 204, the individual word vectors corresponding to the first word and the extension words are merged to generate a synthesized word vector of a predefined size for the first word. In this step, the individual word vectors can be merged into one synthesized word vector of a predetermined length, which then serves as the word vector of the first word.
It can be understood that when a word vector is determined for a single word by a word vector model, usually only context words are considered, and attribute description words are relatively rare. For example, the contexts of "banana" may mostly be "eat a banana", "pick up a banana", "buy a banana", and so on, with fewer descriptions such as "a yellow banana" or "a banana as a fruit". Therefore, although context words may be involved in generating an individual word vector, the information utilized is still limited. Since the synthesized word vector comprehensively considers both the word itself and its attribute description information in the knowledge graph, it can enrich the meaning of the first word.
There are many methods for merging individual word vectors, such as neural networks, linear regression, averaging, superposition, and max pooling. Taking max pooling as an example, the individual word vectors obtained in step 203 can first be arranged together into a matrix. Suppose each individual word vector has M dimensions and there are N individual word vectors; they then form an M x N matrix. A sliding window of fixed size is slid over the matrix, and at each position the maximum value within the window is taken. For example, with a window of size 1 x N, sliding one row at a time and taking the maximum each time, a synthesized word vector of M dimensions is obtained. If the predefined size of the synthesized word vector is M-1 dimensions, the window size can instead be 2 x N, sliding one row at a time and taking the maximum. If the predefined size is M/2 dimensions, the window size can be 2 x N, sliding two rows at a time and taking the maximum. In practice, the predefined size of the synthesized word vector may also be other dimensions; different window sizes and strides can be determined according to the predefined size, without limitation here.
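The max-pooling merge above can be sketched directly: stack the N individual M-dimensional vectors into an M x N matrix and slide a window over the rows. The three 4-dimensional vectors below are hypothetical, chosen only to show how window size and stride control the output dimension (M, M-1, and M/2, as in the text).

```python
import numpy as np

def max_pool_merge(word_vectors, window_rows=1, stride=1):
    """Merge N individual M-dim word vectors by max pooling: stack them into an
    M x N matrix, slide a (window_rows x N) window down the rows with the given
    stride, and take the maximum inside each window."""
    mat = np.stack(word_vectors, axis=1)         # shape (M, N)
    M = mat.shape[0]
    out = [mat[r:r + window_rows].max()          # max over the window_rows x N window
           for r in range(0, M - window_rows + 1, stride)]
    return np.array(out)

# three hypothetical 4-dim individual word vectors (M=4, N=3)
vecs = [np.array([0.1, 0.5, 0.2, 0.9]),
        np.array([0.3, 0.4, 0.8, 0.1]),
        np.array([0.2, 0.6, 0.7, 0.3])]

print(max_pool_merge(vecs))                           # 1 x N window -> M = 4 dims
print(max_pool_merge(vecs, window_rows=2))            # 2 x N, stride 1 -> M-1 = 3 dims
print(max_pool_merge(vecs, window_rows=2, stride=2))  # 2 x N, stride 2 -> M/2 = 2 dims
```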
It is worth noting that when the merging of the individual word vectors operates on numerical values (rather than on whole vectors), for example when the merging mode is linear regression or averaging, it can be a processing of the elements of the same dimension across the vectors being synthesized, such as averaging the first-dimension elements of all the individual word vectors. Superposition, in turn, can be vector addition, i.e., summing the elements of the same dimension.
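The element-wise merging modes just described (averaging and superposition over same-dimension elements) can be sketched over the same kind of hypothetical 4-dimensional vectors:

```python
import numpy as np

# element-wise merges: averaging takes the mean of each dimension across the
# individual word vectors; superposition simply sums the same-dimension elements
vecs = np.array([[0.1, 0.5, 0.2, 0.9],
                 [0.3, 0.4, 0.8, 0.1],
                 [0.2, 0.6, 0.7, 0.3]])

avg_merge = vecs.mean(axis=0)   # average of same-dimension elements
sum_merge = vecs.sum(axis=0)    # superposition: sum of same-dimension elements

print(avg_merge)
print(sum_merge)
```

Note that, unlike max pooling with a larger window, these element-wise modes always produce a synthesized vector of the same dimension M as the individual vectors.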
For the first word, the individual word vectors of the first word and of its extension words are merged in the manner described above, yielding a synthesized word vector of the predefined size. This synthesized word vector is the word vector determined for the first word by the method shown in Fig. 2.
It is worth noting that when the word vector model used in step 203 is the same, the dimensions of the individual word vectors are generally consistent; however, the number of extension words of the first word is not fixed, so concatenating (splicing) the individual word vectors is not recommended when merging them in this step. The possibility of concatenation is not excluded, however: for example, a constraint can be added that the number of extension words is a predetermined number, or other unified methods can be used to compress a vector of indefinite length to the predefined size.
To illustrate the word vector generation method of Fig. 2 more clearly, refer to the specific word vector generation example shown in Fig. 5. In the specific example of Fig. 5, the first word is "banana". After the first word "banana" is obtained, it is expanded based on the knowledge graph. Suppose the expansion yields the extension words "fruit" and "yellow"; together with the first word itself, there are then three words: "banana", "yellow", and "fruit". Next, the number IDs of these three words are obtained, for example "4" for "banana", "6" for "fruit", and "9" for "yellow". Each number ID is processed by the predetermined word vector model, yielding the corresponding individual word vectors. Further, as shown in Fig. 5, the three individual word vectors can be merged to generate one synthesized word vector. This synthesized word vector is the final word vector of the first word "banana" determined by the word vector generation method of the embodiments of this specification.
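The "banana" example can be sketched end to end. Everything here is a hypothetical stand-in: the tiny knowledge graph, the number-ID table, and the "pre-trained" vector lookup replace the real model outputs, and averaging stands in for whichever merging mode is chosen.

```python
import numpy as np

# hypothetical data standing in for the components described above: a tiny
# knowledge graph, a number-ID table, and a pre-trained word-vector lookup
knowledge_graph = {"banana": {"superordinate": ["fruit"], "specific": ["yellow"]}}
number_id = {"banana": 4, "fruit": 6, "yellow": 9}
pretrained = {4: np.array([0.9, 0.1, 0.4]),   # stand-ins for model outputs
              6: np.array([0.2, 0.8, 0.3]),
              9: np.array([0.5, 0.2, 0.7])}

def synthesized_vector(word):
    # step 202: expand the word via its knowledge-graph attribute descriptions
    attrs = knowledge_graph.get(word, {})
    extensions = [w for words in attrs.values() for w in words]
    # step 203: look up an individual word vector for the word and each extension
    vectors = [pretrained[number_id[w]] for w in [word] + extensions]
    # step 204: merge by averaging the same-dimension elements
    return np.mean(vectors, axis=0)

print(synthesized_vector("banana"))
```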
Through the flow shown in Fig. 2, in the process of generating a word vector for the first word, the word is first expanded using its various attribute description information from the knowledge graph, and the individual word vectors of the first word and its extension words are then merged into the word vector of the first word. Because of the richness of information contained in a knowledge graph, the multiple kinds of information about the first word can be fully utilized, making the generated word vector more effective.
The process of determining text categories using the word vector generation method of Fig. 2 is described in detail below.
Fig. 6 shows a method of determining text categories according to one embodiment. The executing entity of this method is an electronic device with a certain data processing capability, such as the server side shown in Fig. 1. This executing entity may or may not be the same as the executing entity of the word vector generation method of Fig. 2; there is no limitation in this respect. As shown in Fig. 6, the flow of determining text categories includes: step 601, obtaining a first text to be processed; step 602, performing word segmentation on the first text to obtain at least one candidate word; step 603, determining a synthesized word vector corresponding to each candidate word, wherein the at least one candidate word includes a first word, and the synthesized word vector of the first word is obtained in the following manner: extracting at least one extension word of the first word based on attribute description information associated with the first word in a preset knowledge graph; obtaining, according to a pre-trained word vector model, an individual word vector for each of the first word and the extension words; and merging the individual word vectors to generate a synthesized word vector of a predefined size for the first word; and step 604, inputting each synthesized word vector into a pre-trained prediction model, and determining the text category of the first text according to the output result of the prediction model.
First, in step 601, the first text to be processed is obtained. Here, the first text to be processed can be any text whose category needs to be determined, for example a chat message on a social platform, an article on a popular-science platform, or an item of news.
Then, in step 602, word segmentation is performed on the first text to obtain at least one candidate word. Word segmentation divides the characters in a text into individual words. The segmentation processing of a text may include tokenization and stop-word removal. Tokenization splits the text; for example, the text "I had a breakfast downstairs" can be split by a pre-trained dictionary into words such as "I", "downstairs", "had", "a", "breakfast". Stop words are usually words in a stop-word list; they often contribute little to the actual meaning of the text, such as function words and prepositions. In the previous example, stop words such as "a" can be removed, leaving words such as "downstairs", "had", "breakfast". These words can serve as the candidate words obtained after filtering the text.
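The tokenize-and-filter step can be sketched as follows. The stop-word list and the whitespace tokenizer are simplifications for illustration; real segmenters for unspaced languages use dictionary- or model-based splitting.

```python
# a minimal sketch of the segmentation-and-filtering step; the stop-word list
# and the whitespace tokenizer are hypothetical simplifications
STOP_WORDS = {"i", "a", "an", "the", "of", "at", "in"}

def candidate_words(text):
    tokens = text.lower().split()                      # tokenize (whitespace stand-in)
    return [t for t in tokens if t not in STOP_WORDS]  # remove stop words

print(candidate_words("I had a breakfast downstairs"))
```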
It can be understood that after tokenization and stop-word removal, only the effective words of the first text remain as candidate words, and using only the effective words in subsequent processing reduces the amount of data to be processed. When the text is long (e.g., exceeds a predetermined character-count threshold), after segmenting the text, only a predetermined number (e.g., 5) of keywords may be extracted as candidate words. Keywords can be extracted, for example, by TF-IDF (term frequency-inverse document frequency) or LDA (Latent Dirichlet Allocation, a document topic model), which are not described in detail here.
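The TF-IDF keyword extraction mentioned above can be sketched with a toy corpus. The three pre-segmented documents are made up for illustration; the scoring is the standard term frequency times inverse document frequency.

```python
import math
from collections import Counter

# a small illustrative corpus; documents are pre-segmented word lists
docs = [["banana", "fruit", "yellow", "banana"],
        ["apple", "fruit", "red"],
        ["stock", "market", "risk"]]

def tfidf_keywords(doc, corpus, k=2):
    """Score each word in `doc` by term frequency times inverse document
    frequency over `corpus`, and return the top-k words as keywords."""
    tf = Counter(doc)
    n = len(corpus)
    def idf(w):
        df = sum(1 for d in corpus if w in d)  # documents containing w
        return math.log(n / df)
    scores = {w: (tf[w] / len(doc)) * idf(w) for w in tf}
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

print(tfidf_keywords(docs[0], docs))
```

"fruit" appears in two of the three documents, so its IDF is low; the words distinctive to the first document ("banana", "yellow") score highest and are kept as candidate words.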
Then, in step 603, the synthesized word vector corresponding to each candidate word is determined. It is worth noting that the word vector used here for a candidate word is not a traditional individual word vector, but is generated by merging the individual word vectors of multiple words.
Specifically, the synthesis term vector of each candidate word can based on candidate word itself with its extend vocabulary word word to
Amount determines.Assuming that any vocabulary in candidate word obtained in step 602 is the first vocabulary, then when getting the vocabulary,
Synthesis term vector can obtain in the following manner: based on attribute associated with first vocabulary in preset knowledge mapping
Description information extracts at least one extension vocabulary of the first vocabulary;According to term vector model trained in advance, the first vocabulary is obtained
And the corresponding each word term vector of each extension vocabulary;Each word term vector is merged, thus raw for the first vocabulary
At the synthesis term vector of predefined size.As can be seen that above procedure is consistent with the process of term vector generation method that Fig. 2 is used.
Wherein, in this step 603, for each candidate word, it can use and generate synthesis term vector with the consistent process of Fig. 2 for it,
Details are not described herein.
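The per-word procedure of step 603 can be sketched as follows. The tiny knowledge graph and word-vector table below are hypothetical stand-ins; in the described method the extension words come from a real knowledge graph and the vectors from a trained model such as word2vec:

```python
# Hypothetical stand-ins for the preset knowledge graph and for the
# pre-trained word vector model's lookup table.
KG = {"apple": ["fruit", "red"]}
VEC = {"apple": [1.0, 0.0], "fruit": [0.0, 1.0], "red": [1.0, 1.0]}

def synthesis_vector(word):
    """Sketch of step 603 for one candidate word: extend the word
    via the knowledge graph, fetch each single-word vector, then
    merge them (here by per-dimension averaging) into one
    fixed-size synthesis word vector."""
    words = [word] + KG.get(word, [])
    vectors = [VEC[w] for w in words]
    dims = list(zip(*vectors))          # group same-dimension elements
    return [sum(d) / len(d) for d in dims]
```

A word with no entry in the knowledge graph simply keeps its own single-word vector, so the output size is fixed either way.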
In one embodiment, the synthesis word vector of each candidate word can be determined in real time in this step. That is, after the candidate words are obtained in step 602, step 603 takes each candidate word in turn and, for each one, executes the word vector generation method shown in Fig. 2 to generate its synthesis word vector.
In another embodiment, synthesis word vectors can be determined in advance for the words in a dictionary (or a predetermined word set) and stored in a database or in a predetermined storage area of the executing entity. In step 603, the synthesis word vector corresponding to each candidate word obtained in step 602 can then be read directly.
As the word vector generation method of Fig. 2 shows, each synthesis word vector determined in step 603 contains not only the lexical information of the candidate word itself but also its attribute description information, and therefore carries richer meaning.
Then, in step 604, the synthesis word vectors are input into a pre-trained prediction model, and the text category of the first text is determined from the output of the prediction model. The meaning of the text category here depends on the application scenario. For example, on a news platform the text categories may be domains such as agriculture, politics, and economy, while in the field of network risk control the text categories may be risk text and non-risk text, or even risk text, non-risk text, and undetermined-risk text, and so on.
The prediction model may be a fully connected neural network, a decision-tree model (such as GBDT), or a recurrent neural network (RNN, Recurrent Neural Network), without limitation here. It can be appreciated that in some scenarios, such as the long-text case mentioned above where a predetermined number of keywords are extracted as candidate words, the number of candidate words of every text is fixed, namely that predetermined number. The synthesis word vectors of the candidate words of each text then have a consistent total dimension, and if the synthesis word vectors are concatenated, the resulting dimension is also fixed, so a fully connected neural network model, a decision tree, or the like can be used as the prediction model. In other scenarios the text to be processed may be a sentence or a few sentences; all words remaining after word cutting then serve as candidate words, and the dimension of the concatenated synthesis word vectors is not fixed. In that case a recurrent neural network can be used as the prediction model, treating the synthesis word vectors of the candidate words in the text as a sequence. In some implementations, a long short-term memory model (Long Short-Term Memory, LSTM) under the RNN architecture can also be used as the prediction model.
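To illustrate why a recurrent model handles the variable-length case, the following toy recurrent classifier (random untrained weights and hypothetical sizes, a sketch rather than the actual prediction model) consumes a text's synthesis word vectors one at a time, so texts with different numbers of candidate words all map to the same fixed-size hidden state:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 8                          # synthesis-vector size, hidden size
Wx = rng.normal(size=(H, D)) * 0.1   # input-to-hidden weights
Wh = rng.normal(size=(H, H)) * 0.1   # hidden-to-hidden weights
w_out = rng.normal(size=H) * 0.1     # hidden-to-output weights

def rnn_predict(seq):
    """Consume a sequence of synthesis word vectors of any length
    and emit a probability in (0, 1)."""
    h = np.zeros(H)
    for x in seq:
        h = np.tanh(Wx @ x + Wh @ h)
    return 1.0 / (1.0 + np.exp(-(w_out @ h)))
```

An LSTM replaces the plain tanh update with gated cell-state updates, but is used the same way: the per-word vectors enter as a sequence.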
Taking risk-related text categories as an example, the training process of the above prediction model is described below. First, multiple texts are obtained as samples, each labeled in advance with a text label such as "risk text" or "non-risk text". Each text is processed as described for steps 601 to 603, so that for any sample text, at least one candidate word of that text and the synthesis word vector corresponding to each candidate word are obtained. For convenience of description, the synthesis word vectors corresponding to one text are called one group of word vectors. The groups of synthesis word vectors of the texts are input in turn into the selected model, and the model parameters are adjusted according to the corresponding text labels so that the value of the prediction loss function tends to decrease, thereby training the above prediction model.
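This training loop can be sketched as follows, with logistic regression standing in for the selected model and each group of synthesis word vectors reduced to its mean as a fixed-size feature (both simplifying assumptions):

```python
import numpy as np

def train(groups, labels, lr=0.5, epochs=200):
    """Minimal sketch of the training loop above: each sample is one
    group of synthesis word vectors (averaged into one feature),
    labels are 1 = risk text, 0 = non-risk text, and the parameters
    are adjusted so that the log loss decreases."""
    X = np.stack([np.mean(grp, axis=0) for grp in groups])
    y = np.asarray(labels, dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted risk probability
        g = p - y                               # gradient of the log loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b
```

After training, a sample resembling the risk-labeled group scores above 0.5 and one resembling the non-risk group scores below it.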
During model training, the pre-assigned sample labels can be represented numerically, for example "risk text" as 1 and "non-risk text" as 0. In some implementations, the output of the prediction model is then a specific category result of 0 or 1, in which case, when the prediction model is used to determine the text category of the first text, the category can be determined directly from the output of the prediction model. In other cases, the output of the prediction model is a value between 0 and 1 indicating the probability that the text is a risk text. In this case, when the prediction model is used to determine the text category of the first text, the output of the prediction model can further be compared against preset probability thresholds. For example, when the output exceeds a first threshold (e.g. 0.7), the text category of the first text is risk text; when the output is below a second threshold (e.g. 0.3), the text category of the first text is non-risk text; and when the output falls between the first and second thresholds, the first text is an undetermined-risk text whose category can be further determined manually.
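The two-threshold rule above can be written directly (0.7 and 0.3 are the example thresholds from the text):

```python
def classify(prob, hi=0.7, lo=0.3):
    """Map the prediction model's probability output to a text
    category using the two preset thresholds described above."""
    if prob > hi:
        return "risk"
    if prob < lo:
        return "non-risk"
    return "undetermined"       # handed over for manual review
```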
In other scenarios, the categories of texts can likewise be identified by various numbers, for example "3" for the agriculture category, "1" for the politics category, and so on.
Fig. 7 shows a specific example of applying the method for determining text categories shown in Fig. 6. In this example, the text to be processed is "bananas are delicious"; segmentation yields the candidate words "banana", "very", "delicious". For each candidate word, a synthesis word vector is determined by the new word vector determination method shown in Figs. 2 and 5. The synthesis word vectors corresponding to the candidate words are then input into a pre-trained deep learning network (such as the CNN or RNN shown in Fig. 7), yielding the category ID "5" of the text. A category ID may correspond to a specific category, such as household encyclopedia.
Reviewing the above process: in determining text categories, because the attribute description information in the knowledge graph is used during the word vector generation of the candidate words in the text, the candidate words are described more comprehensively and richly, from more dimensions, which can improve the accuracy of text classification. When applied to text risk prediction, the effectiveness of risk prediction can be improved. In experiments, applying the synthesis word vectors of words determined by the method shown in Fig. 2 to text classification improved classification accuracy by more than 20% over traditional single-word vector methods on a text classification model trained on a small data set.
According to an embodiment of another aspect, a word vector generating apparatus is also provided. Fig. 8 shows a schematic block diagram of a word vector generating apparatus according to one embodiment. As shown in Fig. 8, the word vector generating apparatus 800 includes: an acquiring unit 81 configured to obtain a first vocabulary; an expanding unit 82 configured to extract at least one extension word of the first vocabulary based on the attribute description information associated with the first vocabulary in a preset knowledge graph; a word processing unit 83 configured to obtain, using a pre-trained word vector model, the single-word vectors corresponding to the first vocabulary and to each extension word; and a combining unit 84 configured to merge the single-word vectors to generate a synthesis word vector of predefined size for the first vocabulary.
According to one embodiment, the attribute description information associated with the first vocabulary includes at least one of the following: a specific description word for the first vocabulary, a superordinate description word of the first vocabulary, and an associated word.
According to one embodiment, the word vector model is word2vec.
According to one embodiment, the word processing unit 83 is further configured to: obtain the unique number string represented by the one-hot encoding of the first vocabulary or of an extension word; and input the unique number string into the pre-trained word vector model, the single-word vector corresponding to the first vocabulary or the extension word being determined from the output of the word vector model.
According to one embodiment, the combining unit 84 merges the single-word vectors in at least one of the following ways:
performing max pooling on the matrix formed by arranging the single-word vectors;
averaging the elements of the same dimension across the single-word vectors;
superposing the single-word vectors.
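The three merging options of combining unit 84 can be sketched as follows, assuming each single-word vector is a list of equal length:

```python
def merge(vectors, mode):
    """Merge single-word vectors by max pooling, per-dimension
    averaging, or superposition, as listed above."""
    dims = list(zip(*vectors))          # each column is one dimension
    if mode == "max":                   # max pooling over each column
        return [max(d) for d in dims]
    if mode == "mean":                  # average same-dimension elements
        return [sum(d) / len(d) for d in dims]
    if mode == "sum":                   # superpose the vectors
        return [sum(d) for d in dims]
    raise ValueError(mode)
```

All three reductions yield a vector of the same size as the inputs, which is what gives the synthesis word vector its predefined size regardless of how many extension words a vocabulary has.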
It should be noted that the apparatus 800 shown in Fig. 8 is an apparatus embodiment corresponding to the method embodiment shown in Fig. 2; the corresponding descriptions in the method embodiment of Fig. 2 apply equally to the apparatus 800 and are not repeated here.
According to an embodiment of yet another aspect, an apparatus for determining text categories is also provided. Fig. 9 shows a schematic block diagram of an apparatus for determining text categories according to one embodiment. As shown in Fig. 9, the apparatus 900 for determining text categories includes: a receiving unit 91 configured to obtain a first text to be processed; a pre-processing unit 92 configured to perform word cutting on the first text to obtain at least one candidate word; a determination unit 93 configured to determine the synthesis word vector corresponding to each candidate word, wherein the at least one candidate word includes a first vocabulary whose synthesis word vector is obtained as follows: based on the attribute description information associated with the first vocabulary in a preset knowledge graph, extract at least one extension word of the first vocabulary; using a pre-trained word vector model, obtain the single-word vectors corresponding to the first vocabulary and to each extension word; and merge these single-word vectors to generate a synthesis word vector of predefined size for the first vocabulary; and a prediction unit 94 configured to input the synthesis word vectors into a pre-trained prediction model and determine the text category of the first text from the output of the prediction model.
According to one embodiment, the attribute description information associated with the first vocabulary includes at least one of the following: a specific description word for the first vocabulary, a superordinate description word of the first vocabulary, and an associated word.
According to one embodiment, the text categories include risk text or non-risk text.
In a further embodiment, the prediction model is trained as follows:
obtaining multiple texts as samples, where each text corresponds to a group of synthesis word vectors determined from the candidate words in that text, and to a pre-assigned text label, the text labels including risk text and non-risk text;
inputting the groups of synthesis word vectors in turn into the selected model, and adjusting the model parameters according to the corresponding text labels, so that after training, the value of the prediction loss function corresponding to the current sample is reduced compared with before training.
According to one embodiment, the prediction model may be one of a fully connected neural network, a decision tree, or a recurrent neural network.
It should be noted that the apparatus 900 shown in Fig. 9 is an apparatus embodiment corresponding to the method embodiment shown in Fig. 6; the corresponding descriptions in the method embodiment of Fig. 6 apply equally to the apparatus 900 and are not repeated here.
According to an embodiment of another aspect, a computer-readable storage medium is also provided, on which a computer program is stored; when the computer program is executed in a computer, it causes the computer to perform the method described with reference to Fig. 2 or Fig. 6.
According to an embodiment of yet another aspect, a computing device is also provided, including a memory and a processor; executable code is stored in the memory, and when the processor executes the executable code, the method described with reference to Fig. 2 or Fig. 6 is implemented.
Those skilled in the art will appreciate that, in the one or more examples above, the functions described in this invention can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium, or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments above further describe the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the foregoing are merely specific embodiments of the present invention and are not intended to limit its protection scope; any modification, equivalent substitution, or improvement made on the basis of the technical solution of the present invention shall fall within the protection scope of the present invention.