CN110516073A - Text classification method, apparatus, device and medium - Google Patents
Text classification method, apparatus, device and medium
- Publication number
- CN110516073A CN110516073A CN201910816831.9A CN201910816831A CN110516073A CN 110516073 A CN110516073 A CN 110516073A CN 201910816831 A CN201910816831 A CN 201910816831A CN 110516073 A CN110516073 A CN 110516073A
- Authority
- CN
- China
- Prior art keywords
- entity
- vector
- sequence
- text
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This application discloses a text classification method, apparatus, device, and medium in the field of natural language processing. In a specific implementation, a text to be classified is obtained; its word sequence is input into a word vector encoding model to determine the word vector sequence of the word sequence; its entity sequence is input into an entity vector model to determine the corresponding entity vector sequence, where the entity vector model determines entity vectors based on an entity vector encoding model trained on text from an entity knowledge graph database; and the text to be classified is classified according to the word vector sequence and the entity vector sequence. The embodiments of this application avoid the construction of feature engineering and training samples, reducing the difficulty of building a text classification model; by combining the word vector sequence and the entity vector sequence for classification, they improve the semantic sensitivity of the text classification model and thereby the accuracy of the classification results.
Description
Technical field
The embodiments of this application relate to computer data processing, in particular to the field of natural language processing, and specifically to a text classification method, apparatus, device, and medium.
Background technique
Text classification is one of the most fundamental tasks in machine learning and among the most common in application scenarios. Its goal is to automatically assign documents in textual form to one or more predefined categories.
Text classification based on word vector conversion is a commonly used technique at present. However, existing schemes depend heavily on feature engineering and on the construction of training samples, which requires considerable human cost; they are also insufficiently sensitive to semantics, making it difficult to meet text classification demands in complex scenarios.
Summary of the invention
The embodiments of this application provide a text classification method, apparatus, device, and medium, to reduce the difficulty of constructing a text classification model and to improve its semantic sensitivity.
In a first aspect, an embodiment of this application provides a text classification method, comprising:
obtaining a text to be classified;
inputting the word sequence of the text to be classified into a word vector encoding model to determine the word vector sequence of the word sequence;
inputting the entity sequence of the text to be classified into an entity vector model to determine the entity vector sequence corresponding to the entity sequence, wherein the entity vector model determines entity vectors based on an entity vector encoding model, and the entity vector encoding model is trained with text from an entity knowledge graph database as samples; and
classifying the text to be classified according to the word vector sequence and the entity vector sequence.
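The four claimed steps can be sketched end to end. This is a minimal illustrative sketch, not the patent's implementation: `classify_text` and its parameter names are hypothetical, whitespace splitting stands in for a real Chinese word segmenter, and both encoders are reduced to plain lookup tables.

```python
def classify_text(text, word_vector_encoder, entity_vector_model, classifier, graph_entities):
    # Step 1: split the text into a word sequence (whitespace stands in
    # for a real word segmenter).
    words = text.split()
    # Step 2: map each word to its word vector.
    word_vectors = [word_vector_encoder[w] for w in words]
    # Step 3: keep only words that are entities in the knowledge graph
    # and map them to entity vectors.
    entity_vectors = [entity_vector_model[w] for w in words if w in graph_entities]
    # Step 4: classify from both vector sequences.
    return classifier(word_vectors, entity_vectors)
```

The classifier is passed in as a callable so that any downstream model can consume the two vector sequences.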
An embodiment of the above application has the following advantages: it reduces the difficulty of constructing a text classification model while improving the model's semantic sensitivity. In this embodiment, a text to be classified is obtained, and its word sequence is input into a word vector encoding model to determine the word vector sequence; the entity sequence of the text is input into an entity vector model obtained from an entity vector encoding model trained on text from the entity knowledge graph database, to determine the corresponding entity vector sequence; and the text is classified according to the word vector sequence and the entity vector sequence. By using a word vector encoding model and an entity vector model to determine, respectively, the word vector sequence and entity vector sequence corresponding to the text, the above technical solution avoids the construction of feature engineering and training samples and reduces the difficulty of building a text classification model; by combining the two different dimensions of word vector sequence and entity vector sequence for classification, it improves the model's semantic sensitivity and thereby the accuracy of the classification results.
Optionally, before classifying the text to be classified according to the word vector sequence and the entity vector sequence, the method further comprises:
inputting the word vector sequence into a word vector attention mechanism model to determine the attention weight of each word vector; and
inputting the entity vector sequence into an entity vector attention mechanism model to determine the attention weight of each entity vector.
Correspondingly, classifying the text to be classified according to the word vector sequence and the entity vector sequence comprises: classifying the text to be classified according to the word vector sequence, the entity vector sequence, and the respective attention weights.
By introducing a word vector attention mechanism model and an entity vector attention mechanism model, this embodiment assigns attention weights to the word vector sequence and the entity vector sequence and classifies the text according to the assigned weights. This effectively balances the classification results under the word vector and entity vector dimensions, highlights the important information in the text, and maximizes the model contribution of the different vector sequences, further improving the semantic sensitivity of the text classification model and the accuracy of the classification results.
Optionally, classifying the text to be classified according to the word vector sequence, the entity vector sequence, and the respective attention weights comprises:
multiplying each word vector by its corresponding attention weight, and each entity vector by its corresponding attention weight;
splicing the weighted word vector sequence and entity vector sequence head to tail to form a complete vector sequence; and
inputting the complete vector sequence into a classifier, taking its output as the classification result of the text to be classified.
By weighting the word vector sequence and the entity vector sequence with their respective attention weights, this embodiment effectively balances their contributions; by splicing the weighted sequences, it realizes feature fusion and classifies the text based on the fused complete vector sequence. This perfects the classification mechanism and improves both the semantic sensitivity of the text classification model and the accuracy of the classification results.
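The optional weighting-and-splicing step can be sketched in a few lines, assuming the attention weights have already been computed. All names are illustrative stand-ins, not the patent's implementation, and the classifier is again passed in as a plain callable.

```python
def fuse_and_classify(word_vecs, word_weights, ent_vecs, ent_weights, classifier):
    # Multiply each vector by its corresponding attention weight.
    weighted_words = [[w * x for x in v] for v, w in zip(word_vecs, word_weights)]
    weighted_ents = [[w * x for x in v] for v, w in zip(ent_vecs, ent_weights)]
    # Splice the two weighted sequences head to tail into one
    # complete vector sequence and hand it to the classifier.
    complete_sequence = weighted_words + weighted_ents
    return classifier(complete_sequence)
```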
Optionally, the training process of the entity vector encoding model comprises:
using entity description texts in the entity knowledge graph database as training samples of entities; and
training the entity vector encoding model with the training samples of the entities.
By training the entity vector model with entity description texts from the knowledge graph database, this embodiment introduces description texts associated with the entities during model training. This enlarges the range of training samples and improves the reasonableness and validity of the encoding of different entities, making successful training of the entity vector encoding model possible even with few samples.
Optionally, training the entity vector encoding model with the training samples of the entities comprises:
training a first-level vector model of each entity according to the context training samples of the entity, to determine the first-level vector of each entity;
determining entity relationship groups from the entity knowledge graph database, or determining entity relationship groups according to entity co-occurrence in the original text, wherein an entity relationship group comprises at least two entities and the relationship between them; and
determining, according to the first-level vectors of the entities and the entity relationship groups, the entity relationship training samples of each entity, and inputting them into the corresponding second-level model of each entity for training, so as to update the first-level vector of each entity and obtain the final entity vectors;
wherein the entity vector model comprises the mapping, obtained after training, between each entity and its entity vector.
By performing a second round of training on the entity vector encoding model with entity relationship groups from the entity knowledge graph database, or with entity relationship groups determined from entity co-occurrence, this embodiment introduces entities from different sources during model training, enlarging the range of training samples and thereby improving the reasonableness and validity of the encoding of different entities.
Optionally, the first-level model comprises an NN model and a similarity function, and the second-level model is a skip-gram model.
By implementing the entity vector encoding model as an NN model and a skip-gram model, this embodiment makes a suitable model choice that avoids overfitting, thereby safeguarding the effective training of the entity vector encoding model.
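The patent names skip-gram as the second-level model but gives no code. Purely as an illustrative sketch, the (centre, context) training pairs that a skip-gram model is trained to predict can be generated from any sequence — words or entities — like this:

```python
def skipgram_pairs(sequence, window=2):
    # For each position, pair the centre item with every neighbour
    # inside the window: these (centre, context) pairs are what a
    # skip-gram model learns to predict.
    pairs = []
    for i, centre in enumerate(sequence):
        lo, hi = max(0, i - window), min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, sequence[j]))
    return pairs
```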
Optionally, using entity description texts in the entity knowledge graph database as training samples of entities comprises:
obtaining an original sentence;
identifying, based on the entity knowledge graph, at least one entity in the original sentence;
obtaining the original sentence annotated with positive-example entities as a positive training sample, wherein the entities in the positive training sample match entities in the entity knowledge graph;
determining negative training samples according to the positive-example entities, wherein the entities in the negative training samples do not match the entities in the entity knowledge graph; and
obtaining the description texts of the positive-example entities in the entity knowledge graph database and adding them to the positive training samples as the context training samples.
By identifying the entities in the original sentence through the entity knowledge graph and generating positive training samples from the identified entities and their description texts, this embodiment increases the feature dimensions in the training samples; by determining negative training samples from the positive-example entities and the entity knowledge graph, it perfects the generation mechanism of training samples and safeguards the effective training of the entity vector encoding model.
Optionally, determining negative training samples according to the positive training samples comprises:
determining, according to a positive-example entity, different entity points (whether their content is identical or different) from the entity knowledge graph as negative-example entities; and
obtaining the description texts of the negative-example entities in the entity knowledge graph database as negative training samples, i.e., as the context training samples.
By determining negative-example entities from the positive-example entities and generating negative training samples from the description texts of the negative-example entities, this embodiment perfects the generation mechanism of negative training samples and safeguards the effective training of the entity vector encoding model.
Optionally, the word vector encoding model is a word2vec model or a GloVe model, formed by unsupervised training on text samples.
By using a word2vec or GloVe model, this embodiment perfects the training mechanism of the word vector encoding model; a suitable model choice avoids the overfitting that an overly complex model would cause, thereby safeguarding the effective training of the word vector encoding model.
Optionally, obtaining the text to be classified comprises at least one of the following:
obtaining user comment texts from social media applications, and extracting key sentences from the user comment texts as the text to be classified;
obtaining user search queries from search engine applications as the text to be classified;
obtaining the titles or key sentences of news entries from information push applications as the text to be classified; and
obtaining key sentences of advertising information as the text to be classified.
By acquiring texts to be classified from different applications, this embodiment adapts the text classification method of this application to different scenarios, demonstrating the universality of the method.
In a second aspect, an embodiment of this application further provides a text classification apparatus, comprising:
a text acquisition module for obtaining the text to be classified;
a word vector sequence determination module for inputting the word sequence of the text to be classified into a word vector encoding model to determine the word vector sequence of the word sequence;
an entity vector sequence determination module for inputting the entity sequence of the text to be classified into an entity vector model to determine the entity vector sequence corresponding to the entity sequence, wherein the entity vector model determines entity vectors based on an entity vector encoding model trained with text from the entity knowledge graph database as samples; and
a classification module for classifying the text to be classified according to the word vector sequence and the entity vector sequence.
In a third aspect, an embodiment of this application further provides an electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the text classification method provided by the first-aspect embodiment.
Fourth aspect is stored with the non-instantaneous of computer instruction and computer-readable deposits the embodiment of the present application also provides a kind of
Storage media, the computer instruction is for making the computer execute a kind of text classification side provided by first aspect embodiment
Method.
Other effects of the above optional implementations are described below in conjunction with the specific embodiments.
Detailed description of the invention
The accompanying drawings are provided for a better understanding of this solution and do not limit this application. In the drawings:
Fig. 1 is a flowchart of a text classification method in Embodiment one of this application;
Fig. 2 is a flowchart of a text classification method in Embodiment two of this application;
Fig. 3 is a flowchart of a text classification method in Embodiment three of this application;
Fig. 4A is a flowchart of a text classification method in Embodiment four of this application;
Fig. 4B is an architecture diagram of an entity vector encoding model in Embodiment four of this application;
Fig. 4C is an overall architecture diagram of a text classification model in Embodiment four of this application;
Fig. 4D is a schematic diagram of a text classification result in Embodiment four of this application;
Fig. 5 is a structural diagram of a text classification apparatus in Embodiment five of this application;
Fig. 6 is a block diagram of an electronic device for implementing the text classification method of the embodiments of this application.
Specific embodiment
Exemplary embodiments of this application are explained below in conjunction with the accompanying drawings, including various details of the embodiments to aid understanding; these details should be regarded as merely exemplary. Those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this application. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
Embodiment one
Fig. 1 is a flowchart of a text classification method in Embodiment one of this application. This embodiment is applicable to classifying a text to be classified in order to determine its category. The method is performed by a text classification apparatus, which is implemented in software and/or hardware and specifically configured in an electronic device with certain data processing capability.
The text classification method shown in Fig. 1 comprises:
S101: obtain the text to be classified.
The text to be classified may be stored in advance locally on the electronic device, in other storage devices associated with the electronic device, or in the cloud, and acquired when needed; or it may be acquired in real time or periodically from the application software that generates it.
Illustratively, obtaining the text to be classified includes but is not limited to at least one of the following: obtaining user comment texts from social media applications and extracting key sentences from them as the text to be classified; obtaining user search queries from search engine applications as the text to be classified; obtaining the titles or key sentences of news entries from information push applications as the text to be classified; and obtaining key sentences of advertising information as the text to be classified.
It can be understood that acquiring texts to be classified from different applications adapts the text classification method of this application to different scenarios, demonstrating the universality of the method.
Optionally, extracting key sentences from a user comment text may comprise segmenting the comment text into words and counting the occurrence frequency of each segmentation result, then extracting key sentences according to those frequencies. For example, the sentences containing words whose frequency exceeds a set threshold may be taken as key sentences, and/or the sentences corresponding to a set number of the most frequent words. The frequency threshold or the set number is configured by technicians as needed or set from empirical values.
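The frequency-threshold variant can be sketched in a few lines. This is illustrative only: `tokenize` stands in for a real word segmenter, and `freq_threshold` is the configurable threshold the text mentions.

```python
from collections import Counter

def key_sentences(sentences, tokenize, freq_threshold):
    # Count how often each token appears across all sentences...
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    # ...then keep the sentences containing any token whose frequency
    # exceeds the configured threshold.
    return [s for s in sentences
            if any(counts[t] > freq_threshold for t in tokenize(s))]
```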
S102: input the word sequence of the text to be classified into a word vector encoding model to determine the word vector sequence of the word sequence.
Illustratively, the text to be classified may be segmented into words and the segmentation results combined into a word sequence; the word sequence is input into a pre-trained word vector encoding model, and the word vector sequence is determined from the output of the model.
The word vector encoding model maps each word in the word sequence to a corresponding word vector and combines the word vectors into a word vector sequence. Moreover, the vector distance between the word vectors of identical or similar words is small, while the vector distance between the word vectors of different or opposite words is large.
Optionally, the word vector encoding model may be a word2vec or GloVe model formed by unsupervised training on text samples, where the word2vec model may be a continuous bag-of-words (CBOW) model or a skip-gram model, among others.
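The property that similar words get nearby vectors can be illustrated with a toy lookup table and cosine similarity; the embedding values below are invented for illustration and do not come from a trained word2vec or GloVe model.

```python
import math

def cosine(u, v):
    # Cosine similarity: higher means the vectors point in closer directions.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy embedding table standing in for a trained word vector encoding model.
embeddings = {"king": [0.9, 0.1], "queen": [0.85, 0.15], "apple": [0.1, 0.9]}

def encode(words, table):
    # Map a word sequence to its word vector sequence by table lookup.
    return [table[w] for w in words]
```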
It can be understood that a reasonable choice of word vector encoding model avoids the overfitting that choosing an overly complex model would cause, thereby safeguarding the effective training of the word vector encoding model.
S103: input the entity sequence of the text to be classified into an entity vector model to determine the entity vector sequence corresponding to the entity sequence, wherein the entity vector model determines entity vectors based on an entity vector encoding model, and the entity vector encoding model is trained with text from the entity knowledge graph database as samples.
Illustratively, the text to be classified may be segmented into words, and the segmentation results filtered against the entity words contained in a preset entity knowledge graph database to obtain at least one entity word; the entity words are combined into an entity sequence and input into a previously obtained entity vector model, the entity vector model being determined from a pre-trained entity vector encoding model. The entity vector sequence is determined from the output of the entity vector model. The entity vector model may be the entity vector encoding model itself, or it may be a result determined from the entity vector encoding model, for example a mapping table between entities and entity vectors, the mapping being determined by the trained entity vector encoding model.
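The mapping-table variant of the entity vector model can be sketched as a precomputed lookup. Both function names and the toy encoder are hypothetical; the point is that, once the encoder is trained, it only needs to run offline to build the table, and online classification reduces to lookups.

```python
def build_entity_table(entities, encoder):
    # Precompute entity -> vector once, offline, from the trained encoder.
    return {e: encoder(e) for e in entities}

def encode_entity_sequence(entity_seq, table):
    # Online: map an entity sequence to its vector sequence by lookup only.
    return [table[e] for e in entity_seq]
```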
The entity vector model maps each entity word in the entity sequence to a corresponding entity vector and combines the entity vectors into an entity vector sequence. The distance between entity vectors likewise reflects the similarity between entity words. Since the entity vector model is trained on the entity words themselves together with the rich description texts associated with the entities in the entity knowledge graph database, the number of annotated samples required is small, the semantic relevance of the training samples is strong and rich, and the sensitivity to degrees of semantic similarity is high.
Optionally, the entity vector encoding model may be a combined model formed by a neural network (NN) model and a skip-gram model.
It can be understood that a reasonable choice of entity vector encoding model, together with the combined use of the chosen models, safeguards the effective training of the entity vector encoding model and thereby indirectly improves its accuracy.
S104: classify the text to be classified according to the word vector sequence and the entity vector sequence.
Optionally, classifying the text according to the word vector sequence and the entity vector sequence may proceed by directly splicing the two sequences to realize feature fusion, and inputting the fused vector sequence into a classifier to obtain the classification result of the text to be classified. That is, the classifier considers the word vector sequence and the entity vector sequence of the text simultaneously.
In this embodiment, a text to be classified is obtained, and its word sequence is input into a word vector encoding model to determine the word vector sequence; the entity sequence of the text is input into an entity vector model based on an entity vector encoding model trained on text from the entity knowledge graph database, to determine the corresponding entity vector sequence; and the text is classified according to the word vector sequence and the entity vector sequence. By using a word vector encoding model and an entity vector model to determine, respectively, the word vector sequence and entity vector sequence corresponding to the text, the above technical solution avoids the construction of feature engineering and training samples and reduces the difficulty of building a text classification model; by combining the two different dimensions of word vector sequence and entity vector sequence for classification, it improves the semantic sensitivity of the model and thereby the accuracy of the classification results.
Embodiment two
Fig. 2 is a flowchart of a text classification method in Embodiment two of this application. This embodiment optimizes and improves upon the technical solutions of the foregoing embodiments.
Specifically, before "classifying the text to be classified according to the word vector sequence and the entity vector sequence", the following is added: "inputting the word vector sequence into a word vector attention mechanism model to determine the attention weight of each word vector; and inputting the entity vector sequence into an entity vector attention mechanism model to determine the attention weight of each entity vector". Correspondingly, the operation "classifying the text to be classified according to the word vector sequence and the entity vector sequence" is refined into "classifying the text to be classified according to the word vector sequence, the entity vector sequence, and the respective attention weights", so as to effectively balance the classification results under the word vector and entity vector dimensions and maximize the model contribution of the different vector sequences.
The text classification method shown in Fig. 2 comprises:
S201: obtain the text to be classified.
S202: input the word sequence of the text to be classified into a word vector encoding model to determine the word vector sequence of the word sequence.
S203: input the entity sequence of the text to be classified into an entity vector model to determine the entity vector sequence corresponding to the entity sequence, wherein the entity vector model determines entity vectors based on an entity vector encoding model, and the entity vector encoding model is trained with text from the entity knowledge graph database as samples.
S204: input the word vector sequence into a word vector attention mechanism model to determine the attention weight of each word vector.
S205: input the entity vector sequence into an entity vector attention mechanism model to determine the attention weight of each entity vector.
S206: classify the text to be classified according to the word vector sequence, the entity vector sequence, and the respective attention weights.
The principle of applying an attention mechanism model to natural language processing is that, combining the context, it identifies the contribution of each word vector in the word vector sequence to the result, and that contribution serves as the attention weight.
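The patent does not specify how the attention weights are computed inside the attention mechanism model. A common choice, shown here purely as an assumption, is a softmax over per-vector relevance scores, which yields non-negative weights that sum to one:

```python
import math

def attention_weights(scores):
    # Softmax over per-vector relevance scores: each weight reflects
    # the vector's estimated contribution to the classification result.
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```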
In an optional implementation of this embodiment, classifying the text according to the word vector sequence, the entity vector sequence, and the respective attention weights may proceed by multiplying each word vector by its corresponding attention weight and each entity vector by its corresponding attention weight; splicing the weighted word vector sequence and entity vector sequence head to tail to form a complete vector sequence; and inputting the complete vector sequence into a classifier, taking its output as the classification result of the text to be classified.
It can be understood that by attention-weighting the word vector sequence and the entity vector sequence respectively, the above optional scheme realizes feature-level fusion, so that when the classifier classifies, the word vector sequence and entity vector sequence of different dimensions act synergistically, complementing and correcting each other, while the important information in the text is highlighted, thereby enhancing the comprehensiveness and reliability of the classification result.
By introducing a word vector attention mechanism model and an entity vector attention mechanism model, this embodiment assigns attention weights to the word vector sequence and the entity vector sequence and classifies the text according to the assigned weights, effectively balancing the classification results under the word vector and entity vector dimensions, highlighting the important information in the text, and maximizing the model contribution of the different vector sequences, further improving the semantic sensitivity of the text classification model and the accuracy of the classification results.
Embodiment three
Fig. 3 is a flowchart of a text classification method in Embodiment three of the present application. This embodiment optimizes and improves on the technical solutions of the above embodiments. The operation of "training the entity vector coding model" is described in detail and refined as: "taking the entity description text in the knowledge graph database as the training sample of the entity; and training the entity vector coding model using the training sample of the entity", so as to improve the model training mechanism of the entity vector coding model.
A text classification method as shown in Fig. 3 comprises:
S301: Take the entity description text in the knowledge graph database as the training sample of the entity.
The training samples may include positive training samples and negative training samples. First, an original statement is obtained and the entity words it contains are identified. For a given entity word, the various description corpora associated with the correct entity node in the knowledge graph serve as positive training samples, while the descriptions of the entity nodes other than the correct one may serve as negative training samples.
By adding the description text attached to an entity node, new feature information is introduced during model training, which helps avoid model under-fitting.
Illustratively, taking the entity description text in the knowledge graph database as the training sample of the entity may proceed as follows: obtain an original statement and recognize at least one entity in it based on the knowledge graph; obtain the original statement annotated with a positive entity label as a positive training sample, where the entity in the positive training sample matches an entity in the knowledge graph; and obtain the entity description text of the positive entity from the knowledge graph database and add it to the positive training sample.
The original statement may be web data, search log data, or the like. Original statements may be stored in advance locally on the electronic device, on other storage devices associated with the electronic device, or in the cloud, and retrieved when needed; alternatively, original statements such as web data or search log data may be crawled in real time as the device generates them.
It can be understood that, owing to the introduction of description text, the feature information contained in a positive training sample becomes more comprehensive. When model training is performed with such positive samples, description corpora can be collected abundantly and turned into training samples without manual annotation. Since the description corpora describe the entity word from multiple aspects, entity semantic recognition becomes more sensitive.
Illustratively, negative training samples may be determined and generated according to the positive entity. The entity in a negative training sample does not match the correct entity in the knowledge graph; that is, the negative entity is an entity in the knowledge graph that shares its name with the entity in the original statement but has a different description text, and/or any other entity in the knowledge graph apart from the correct entity of the original statement. For example, suppose the original statement is "The song sung by Liu Dehua is Wangqing Shui". Then "Liu Dehua" corresponds to the entity node Liu Dehua (singer) in the knowledge graph and is labeled as the positive example; this labeling may be done manually. If the knowledge graph also contains entity nodes such as Liu Dehua (professor) and Zhou Jielun (singer), these nodes are all negative examples: once the positive entity node is determined, every other entity node is a negative example.
Specifically, determining negative training samples according to the positive training sample may proceed as follows: according to the positive entity, determine entity nodes with the same or different names from the knowledge graph as negative entities; and obtain the entity description text of each negative entity from the knowledge graph database as a negative training sample. Sampling may be random or follow a set rule, depending on the demand for negative samples. The above positive and negative training samples may each serve as context training samples of the entity.
For example, for the original statement "The coach of badminton player Zhang San is Li Si", the positive entity node "Zhang San" is labeled in the knowledge graph. Although several people may be named Zhang San, the identifier of each entity node in the graph is unique. The entity node "Zhang San" may have many description texts with corresponding records, such as Zhang San's resume and related news. Correspondingly, a negative entity may be another entity node that is also named "Zhang San" but does not refer to this Zhang San, or any other entity node.
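The positive/negative split described above can be sketched as follows. The knowledge graph here is a hypothetical toy dictionary; all node identifiers and description texts are illustrative, not drawn from a real graph:

```python
# hypothetical toy knowledge graph: unique node id -> (surface name, description texts)
KG = {
    "zhangsan#athlete": ("Zhang San", ["resume of badminton player Zhang San",
                                       "news about Zhang San's championship"]),
    "zhangsan#actor":   ("Zhang San", ["filmography of actor Zhang San"]),
    "lisi#coach":       ("Li Si",     ["profile of coach Li Si"]),
}

def build_samples(positive_node_id):
    """Positive samples: the description texts of the labeled node.
    Negative samples: the descriptions of every other node, whether it
    shares the surface name (zhangsan#actor) or not (lisi#coach)."""
    positives = list(KG[positive_node_id][1])
    negatives = [text for node_id, (_, texts) in KG.items()
                 if node_id != positive_node_id for text in texts]
    return positives, negatives

pos, neg = build_samples("zhangsan#athlete")
print(len(pos), len(neg))  # 2 2
```

In practice the negative set could be subsampled randomly or by a set rule, as noted above.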
S302: Train the entity vector coding model using the training sample of the entity.
The entity vector coding model maps each entity word input to the model into vector form, obtaining the corresponding entity vector.
In an optional embodiment of the present application, training the entity vector coding model may consist of training it on the positive and negative training samples obtained in the above optional embodiments, so as to optimize each model parameter of the entity vector coding model. The hidden-layer weight parameters of the trained entity vector coding model serve as the vector of the entity. A separate entity vector coding model may be trained for each entity, so that the mapping between each entity and its entity vector is determined and saved in advance as the entity vector model.
In another optional embodiment of the present application, in order to further optimize the model output, two-stage training of the entity vector coding model may be performed based on context training samples and entity relation training samples. That is, the operation of training the entity vector coding model using the training sample of the entity specifically includes:
training the first-level vector model of each entity according to the context training samples of that entity, to determine the first-level vector of each entity;
determining entity relation groups from the knowledge graph database, and/or determining entity relation groups according to entity co-occurrence in the original text, where an entity relation group includes at least two entities and the relations between them;
determining the entity relation training samples of each entity according to the first-level vectors of the entities and the entity relation groups, and inputting them into the second-level model corresponding to each entity for training, so as to update the first-level vector of each entity and obtain the final entity vector;
where the entity vector model includes the mapping between each entity and its entity vector obtained after training.
The above steps are described in turn below.
The training process of the first-level model is as follows. The first-level model includes an NN (neural network) model and a similarity function, and the similarity function may be a sigmoid function. For the NN+sigmoid model, a separate model is trained for each entity: the positive and negative training samples serving as that entity's context training samples are input into the NN+sigmoid model to complete supervised training. The hidden-layer weight parameters of the trained NN+sigmoid model are the first-level vector of the entity.
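A minimal stand-in for this per-entity supervised training can be sketched as follows. To stay self-contained, the "NN" here is a single weight layer scored through a sigmoid (essentially logistic regression over hypothetical bag-of-words features), trained to output 1 on the entity's positive context samples and 0 on its negative ones; the learned weights play the role of the hidden-layer parameters, i.e. the first-level vector. The real model would be a larger network with an actual hidden layer:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_entity_vector(pos_feats, neg_feats, dim, epochs=200, lr=0.5):
    """SGD on log-loss: label 1 for positive context samples, 0 for negative.
    Returns the learned weights, used here as the entity's first-level vector."""
    random.seed(0)  # deterministic toy initialization
    w = [random.uniform(-0.1, 0.1) for _ in range(dim)]
    data = [(x, 1.0) for x in pos_feats] + [(x, 0.0) for x in neg_feats]
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            g = p - y  # gradient of log-loss with respect to the score
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w

# hypothetical features: positive samples activate dims 0-1, negatives dims 2-3
vec = train_entity_vector([[1, 1, 0, 0], [1, 0, 0, 0]],
                          [[0, 0, 1, 1], [0, 0, 0, 1]], dim=4)
print(vec[0] > 0 and vec[3] < 0)  # True
```

After training, the weight vector separates the entity's own description contexts from those of the negative entities.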
The acquisition of entity relation groups is as follows. For example, for the original statement "The coach of badminton player Zhang San is Li Si", the triples corresponding to edges incident on the entity "Zhang San" may be looked up in the knowledge graph and determined as entity relation groups; e.g., "Zhang San" may be one of the entities in a triple of the form "entity1 - relation - entity2", or the entity or attribute value in a triple of the form "entity - attribute - attribute value". Optionally, since "Li Si" co-occurs with "Zhang San" in the original statement and "Li Si" is likewise an entity in the knowledge graph, the triples containing both the entity "Zhang San" and the entity "Li Si" together with the relation of their connecting edge, such as "Zhang San - master/apprentice - Li Si" and "Li Si - master/apprentice - Zhang San", may also be taken as entity relation groups.
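The triple lookup just described can be sketched over a hypothetical in-memory triple store (the names and relations below are illustrative only):

```python
# hypothetical triple store: (head entity, relation, tail entity)
TRIPLES = [
    ("Zhang San", "coach", "Li Si"),
    ("Li Si", "student", "Zhang San"),
    ("Wang Wu", "teammate", "Zhao Liu"),
]

def relation_groups(entity, co_occurring=()):
    """Collect every triple whose edge touches the entity; if co-occurring
    entities from the original statement are given, keep only triples that
    also involve one of them (the optional co-occurrence filter above)."""
    groups = [t for t in TRIPLES if entity in (t[0], t[2])]
    if co_occurring:
        groups = [t for t in groups
                  if any(e in (t[0], t[2]) for e in co_occurring)]
    return groups

print(relation_groups("Zhang San", co_occurring={"Li Si"}))
# [('Zhang San', 'coach', 'Li Si'), ('Li Si', 'student', 'Zhang San')]
```

A real implementation would query the knowledge graph database rather than scan a list.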
The training process of the second-level model is as follows. The second-level model may be a skip-gram model. The entity relation groups obtained above reflect the context relations between entities. Taking the entity relation group "Liu Dehua, Liang Chaowei, leading role, Infernal Affairs" as an example, it includes three entities: Liu Dehua, Liang Chaowei, and Infernal Affairs. Pairwise combinations of the three entities form entity relation training samples, i.e., two-tuple samples such as "Liu Dehua, Infernal Affairs", "Liu Dehua, Liang Chaowei", and "Liang Chaowei, Infernal Affairs". For all samples of any one entity, e.g., all samples of Liu Dehua, the first-level vector of one entity in each two-tuple is used as the input of the skip-gram model and the first-level vector of the other entity as its output, so that the first-level vector of the entity "Liu Dehua" is updated through the other entities that share an edge with "Liu Dehua", forming a second-level vector as the final entity vector of "Liu Dehua".
In this two-stage training, since the positive and negative training samples are usually sentences containing the entity word in various contexts, the NN model describes the context of each entity in the training samples and preliminarily maps each entity to an entity vector; the skip-gram model then trains further according to the concrete entity-edge-entity correspondences in the entity relation groups. The goal is for the entity vector to reflect not only the context but also the relations to other closely associated entities. For example, entities appearing in the same context have entity vectors that lie closer together, and entities connected by close edge relations likewise have entity vectors that lie closer together.
S303: Obtain the text to be classified.
S304: Input the word sequence of the text to be classified into the word vector coding model to determine the word vector sequence of the word sequence.
S305: Input the entity sequence of the text to be classified into the entity vector model to determine the entity vector sequence corresponding to the entity sequence.
S306: Perform classification and identification on the text to be classified according to the word vector sequence and the entity vector sequence.
The above embodiment trains the entity vector coding model on the entity description text in the knowledge graph database, so that description texts associated with the entities are introduced during model training. This enlarges the range of training samples, improves the reasonableness and validity of the entity vector coding model when encoding different entities, and makes successful training of the entity vector coding model possible even with small samples.
Embodiment four
Fig. 4A is a flowchart of a text classification method in Embodiment four of the present application. This embodiment provides a preferred implementation on the basis of the technical solutions of the above embodiments, described in detail below with reference to the architecture diagram of the entity vector model shown in Fig. 4B and the overall architecture diagram of the text classification model shown in Fig. 4C.
A text classification method as shown in Fig. 4A comprises:
S410: a training sample preparation stage;
S420: a model training stage;
S430: a model service stage.
The training sample preparation stage comprises:
S411: Obtain an original statement.
S412: Identify at least one entity word in the original statement according to the entity words in the knowledge graph.
S413: Take the original statement with its positive label, together with the entity description text corresponding to the positive entity, as a positive training sample.
S414: Take the entity description text corresponding to a negative entity in the knowledge graph as a negative training sample.
S415: Determine entity relation groups from the knowledge graph; and/or determine entity relation groups from entity co-occurrence in the original statement.
Suppose the original statement is "In the Australian Open final, Li Na won the women's singles championship". The positive entity is "tennis player Li Na". The knowledge graph also contains "singer Li Na", which serves as a negative entity. The knowledge graph further contains the triple "Li Na - master/apprentice - Jiang Shan" corresponding to "tennis player Li Na", and this triple serves as an entity relation group. Other entity relation groups are obtained similarly.
The model training stage comprises:
S421: Train the NN model in the entity vector model, using the positive and negative training samples as input samples.
S422: Train the skip-gram model in the entity vector model, using as input samples the model outputs of the NN model for each positive and negative training sample (i.e., the entity vectors), converted according to the entity relation groups.
The model service stage comprises:
S431: Obtain the text to be classified.
S432: Segment the text to be classified and combine the segmentation results into a word sequence [w1, w2, ..., wn].
S433: Recognize the entity words in the text to be classified according to the knowledge graph and combine them into an entity sequence [e1, e2, ..., en].
S434: Input the word sequence into the trained word vector coding model to obtain the word vector sequence [h1, h2, ..., hn].
S435: Input the entity sequence into the trained entity vector model to obtain the entity vector sequence [k1, k2, ..., kn].
S436: Input the word vector sequence into the trained word-vector attention mechanism model Uw to determine the word-vector attention weights [a1, a2, ..., an].
S437: Input the entity vector sequence into the trained entity-vector attention mechanism model KGw to determine the entity-vector attention weights [b1, b2, ..., bn].
S438: Apply the word-vector attention weights to the word vectors in the word vector sequence by weighted summation to obtain the word vector sequence S1 to be classified.
S439: Apply the entity-vector attention weights to the entity vectors in the entity vector sequence by weighted summation to obtain the entity vector sequence S2 to be classified.
S4310: Splice and fuse the word vector sequence S1 and the entity vector sequence S2 to obtain a complete vector sequence.
S4311: Input the complete vector sequence into the softmax classifier to obtain the text category of the text to be classified.
The classifier may be a binary classification model, or a multi-class classifier formed by combining multiple binary classification models.
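The service stage S431-S4311 can be sketched end to end as below. Every component here is a hypothetical stub (toy encoders, uniform attention, a hand-picked weight matrix) standing in for the trained models; only the data flow matches the steps above:

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(text, segment, recognize, word_enc, ent_enc,
             word_attn, ent_attn, class_weights):
    words = segment(text)                     # S432: word sequence
    ents = recognize(text)                    # S433: entity sequence
    H = [word_enc(w) for w in words]          # S434: word vector sequence
    K = [ent_enc(e) for e in ents]            # S435: entity vector sequence
    A = word_attn(H)                          # S436: word attention weights
    B = ent_attn(K)                           # S437: entity attention weights
    s1 = [sum(a * h[i] for a, h in zip(A, H)) for i in range(len(H[0]))]  # S438
    s2 = [sum(b * k[i] for b, k in zip(B, K)) for i in range(len(K[0]))]  # S439
    full = s1 + s2                            # S4310: splice and fuse
    scores = [sum(w * x for w, x in zip(row, full)) for row in class_weights]
    return softmax(scores).index(max(softmax(scores)))   # S4311: category

# hypothetical stub components:
label = classify(
    "Li Na wins the final",
    segment=lambda t: t.split(),
    recognize=lambda t: ["Li Na"],
    word_enc=lambda w: [float(len(w)), 1.0],
    ent_enc=lambda e: [1.0, 0.0],
    word_attn=lambda H: [1.0 / len(H)] * len(H),
    ent_attn=lambda K: [1.0 / len(K)] * len(K),
    class_weights=[[1, 0, 0, 0], [0, 0, 1, 0]],  # 2 classes over dim-4 input
)
print(label)  # 0
```

In the described method, `segment` would be a Chinese word segmenter, the encoders and attention models would be the trained Uw/KGw components, and the classifier a trained softmax layer.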
Taking "In the Australian Open final, Li Na won the women's singles championship" as the text to be classified, the text classification result is shown in Fig. 4D.
For "Li Na" in the text to be classified, the probability of being tennis player Li Na is determined to be 0.95, the probability of being fencer Li Na 0.6, and the probability of being singer Li Na 0.09. It can be understood that, since fencers and tennis players both belong to the category of athletes, the probability of fencer Li Na is higher than that of singer Li Na.
In addition, if tennis player Li Na has a master-apprentice relation with famous coach Jiang Shan in the knowledge graph, then when entity vector coding is performed, the vector distance between the entity vectors of tennis player Li Na and famous coach Jiang Shan is small, while the vector distance between the entity vector of singer Li Na and that of famous coach Jiang Shan is large.
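This vector-distance property can be illustrated with cosine distance over hypothetical trained entity vectors (the values below are made up purely to show the comparison):

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity: smaller means the entities are more related."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm

# hypothetical entity vectors after two-stage training
tennis_li_na = [0.9, 0.1]
singer_li_na = [0.1, 0.9]
coach_jiang_shan = [0.8, 0.2]

print(cosine_distance(tennis_li_na, coach_jiang_shan) <
      cosine_distance(singer_li_na, coach_jiang_shan))  # True
```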
The above text classification method may be applied to public opinion monitoring on social media, classifying the emotional states of the audience; to search engine applications, classifying users' search content to meet their search needs; to information push applications, classifying the topics or key sentences of information entries; and to advertising applications, classifying the key sentences of advertising information, so as to deliver and push information accurately.
Embodiment five
Fig. 5 is a structural diagram of a text classification device in Embodiment five of the present application. This embodiment is applicable to performing classification and identification on a text to be classified in order to determine its category. The device is implemented in software and/or hardware and is specifically configured in an electronic apparatus having a certain data computation capability.
A text classification device 500 as shown in Fig. 5 comprises: a to-be-classified text obtaining module 501, a word vector sequence determining module 502, an entity vector sequence determining module 503, and a classification and identification module 504.
The to-be-classified text obtaining module 501 is configured to obtain the text to be classified;
the word vector sequence determining module 502 is configured to input the word sequence of the text to be classified into the word vector coding model to determine the word vector sequence of the word sequence;
the entity vector sequence determining module 503 is configured to input the entity sequence of the text to be classified into the entity vector model to determine the entity vector sequence corresponding to the entity sequence, where the entity vector model determines entity vectors based on the entity vector coding model, and the entity vector coding model is trained with the text in the knowledge graph database as samples;
the classification and identification module 504 is configured to perform classification and identification on the text to be classified according to the word vector sequence and the entity vector sequence.
In this embodiment, the to-be-classified text obtaining module obtains the text to be classified; the word vector sequence determining module inputs the word sequence of the text to be classified into the word vector coding model to determine the word vector sequence; the entity vector sequence determining module inputs the entity sequence of the text to be classified into the entity vector model trained on the text in the knowledge graph database to determine the corresponding entity vector sequence; and the classification and identification module performs classification and identification on the text to be classified according to the word vector sequence and the entity vector sequence. Through the use of the word vector coding model and the entity vector model, the above technical solution determines the word vector sequence and the entity vector sequence corresponding to the text to be classified respectively, avoids the construction of feature engineering and training samples, and reduces the difficulty of building the text classification model. By performing classification and identification comprehensively under the two different dimensions of the word vector sequence and the entity vector sequence, it improves the semantic sensitivity of the text classification model and hence the accuracy of the classification result of the text to be classified.
Further, the device also includes an attention distribution module, which specifically includes:
a word-vector attention weight determining unit, configured to input the word vector sequence into the word-vector attention mechanism model to determine the attention weight of each word vector, before classification and identification is performed on the text to be classified according to the word vector sequence and the entity vector sequence;
an entity-vector attention weight determining unit, configured to input the entity vector sequence into the entity-vector attention mechanism model to determine the attention weight of each entity vector.
Correspondingly, the classification and identification module 504 comprises:
a classification and identification unit, configured to perform classification and identification on the text to be classified according to the word vector sequence, the entity vector sequence, and their respective attention weights.
Further, the classification and identification unit is specifically configured to:
multiply each word vector by its corresponding attention weight, and multiply each entity vector by its corresponding attention weight;
concatenate the attention-weighted word vector sequence and entity vector sequence end to end to form a complete vector sequence;
input the complete vector sequence into the classifier, taking the output as the classification result of the text to be classified.
Further, the device also includes an entity vector model training module, which specifically includes:
a training sample construction unit, configured to take the entity description text in the knowledge graph database as the training sample of the entity;
an entity vector model training unit, configured to train the entity vector coding model using the training sample of the entity.
Further, the entity vector model training unit specifically includes:
a first-level vector determining unit, configured to train the first-level vector model of each entity according to the context training samples of that entity, to determine the first-level vector of each entity;
an entity relation group determining unit, configured to determine entity relation groups from the knowledge graph database, and/or determine entity relation groups according to entity co-occurrence in the original text, where an entity relation group includes at least two entities and the relations between them;
a second-stage training and updating unit, configured to determine the entity relation training samples of each entity according to the first-level vectors of the entities and the entity relation groups, and input them into the second-level model corresponding to each entity for training, so as to update the first-level vector of each entity and obtain the final entity vector;
where the entity vector model includes the mapping between each entity and its entity vector obtained after training.
Further, the first-level model includes an NN model and a similarity function, and the second-level model is a skip-gram model.
Further, the training sample construction unit specifically includes:
an original statement obtaining subunit, configured to obtain an original statement;
an entity recognition subunit, configured to recognize at least one entity in the original statement based on the knowledge graph;
a positive sample determining subunit, configured to obtain the original statement annotated with a positive entity label as a positive training sample, where the entity in the positive training sample matches an entity in the knowledge graph;
a negative sample determining subunit, configured to determine negative training samples according to the positive entity, where the entity in a negative training sample does not match the entity in the knowledge graph;
a positive sample adding subunit, configured to obtain the entity description text of the positive entity from the knowledge graph database and add it to the positive training sample, as the context training sample.
Further, the negative sample determining subunit is specifically configured to:
determine, according to the positive entity, entity nodes with the same or different names from the knowledge graph as negative entities;
obtain the entity description text of each negative entity from the knowledge graph database as a negative training sample, serving as the context training sample.
Further, the word vector coding model is a word2vec model or a GloVe model, formed by unsupervised training on text samples.
Further, the to-be-classified text obtaining module 501 is specifically configured to obtain the text to be classified in at least one of the following ways:
obtaining user comment text from a social media application and extracting key sentences from the user comment text as the text to be classified;
obtaining user search statements from a search engine application as the text to be classified;
obtaining topics or key sentences of information entries from an information push application as the text to be classified;
obtaining key sentences of advertising information as the text to be classified.
The above text classification device can execute the text classification method provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to the execution of the text classification method.
Embodiment six
According to the embodiments of the present application, the present application further provides an electronic apparatus and a readable storage medium.
Fig. 6 is a block diagram of an electronic apparatus for executing the text classification method according to an embodiment of the present application. Electronic apparatus is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic apparatus may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementations of the application described and/or claimed herein.
As shown in Fig. 6, the electronic apparatus includes: one or more processors 601, a memory 602, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common mainboard or in other ways as needed. The processor can process instructions executed within the electronic apparatus, including instructions stored in or on the memory to display GUI graphical information on an external input/output device (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple electronic apparatuses may be connected, each providing part of the necessary operations (for example, as a server array, a group of blade servers, or a multiprocessor system). In Fig. 6, one processor 601 is taken as an example.
The memory 602 is the non-transitory computer-readable storage medium provided by the present application. The memory stores instructions executable by at least one processor, so that the at least one processor executes the text classification method provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the text classification method provided by the present application.
As a non-transitory computer-readable storage medium, the memory 602 can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the text classification method in the embodiments of the present application (for example, the text classification device 500 shown in Fig. 5, comprising the to-be-classified text obtaining module 501, the word vector sequence determining module 502, the entity vector sequence determining module 503, and the classification and identification module 504). By running the non-transitory software programs, instructions, and modules stored in the memory 602, the processor 601 executes the various functional applications and data processing of the server, i.e., implements the text classification method in the above method embodiments.
The memory 602 may include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required for at least one function, and the data storage area may store data created according to the use of the electronic apparatus executing the text classification method, etc. In addition, the memory 602 may include high-speed random access memory, and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 602 optionally includes memories remotely located relative to the processor 601, and these remote memories may be connected via a network to the electronic apparatus executing the text classification method. Examples of such networks include, but are not limited to, the Internet, enterprise intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device that executes the text classification method may further include an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603, and the output device 604 may be connected by a bus or in other manners; in Fig. 6, connection by a bus is taken as an example.
The input device 603 may receive input numeric or character information and generate key signal inputs related to the user settings and function control of the electronic device that executes the text classification method, and may be an input device such as a touch screen, a keypad, a mouse, a track pad, a touch panel, a pointing stick, one or more mouse buttons, a trackball, or a joystick. The output device 604 may include a display device, an auxiliary lighting device (for example, an LED), a haptic feedback device (for example, a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various embodiments of the systems and techniques described herein may be implemented in digital electronic circuitry, integrated circuit systems, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
These computing programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (for example, a magnetic disk, an optical disk, a memory, or a programmable logic device (PLD)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer that has a display device (for example, a CRT (cathode-ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, voice input, or tactile input.
The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, as a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser, through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by digital data communication in any form or medium (for example, a communication network). Examples of communication networks include: a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include a client and a server. The client and the server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solutions of the embodiments of the present application, a to-be-classified text is obtained, and the word sequence of the to-be-classified text is input into a word vector encoding model to determine the word vector sequence of the word sequence; the entity sequence of the to-be-classified text is input into an entity vector model trained on texts from an entity knowledge graph database, to determine the entity vector sequence corresponding to the entity sequence; and classification and identification are performed on the to-be-classified text according to the word vector sequence and the entity vector sequence. By using the word vector encoding model and the entity vector model to determine, respectively, the word vector sequence and the entity vector sequence corresponding to the to-be-classified text, the above technical solution avoids the construction of feature engineering and training samples and reduces the difficulty of building a text classification model; by combining the two different dimensions of the word vector sequence and the entity vector sequence to classify and identify the to-be-classified text, it improves the semantic sensitivity of the text classification model and thereby improves the accuracy of the classification results for the to-be-classified text.
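The two-branch encoding summarized above can be sketched as follows. This is a minimal illustration only: the lookup tables, vocabulary, entity names, and vector dimensions are invented for the example and are not taken from the patent.

```python
import numpy as np

# Stand-in lookup tables for the trained word vector encoding model and
# the entity vector model (all entries are illustrative assumptions).
word_vectors = {"apple": np.array([0.1, 0.3]), "pie": np.array([0.2, 0.0])}
entity_vectors = {"Apple_fruit": np.array([0.5, 0.1, 0.2])}

def encode_text(word_seq, entity_seq):
    # Branch 1: map the word sequence to its word vector sequence.
    wv_seq = [word_vectors[w] for w in word_seq if w in word_vectors]
    # Branch 2: map the entity sequence to its entity vector sequence.
    ev_seq = [entity_vectors[e] for e in entity_seq if e in entity_vectors]
    return wv_seq, ev_seq

# The two sequences are then combined to classify the text.
wv_seq, ev_seq = encode_text(["apple", "pie"], ["Apple_fruit"])
```

Note that the word vectors and entity vectors may have different dimensions, which is why the scheme keeps them as two separate sequences until the classification step.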
It should be understood that the various forms of processes shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above specific embodiments do not constitute a limitation on the protection scope of the present application. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall be included within the protection scope of the present application.
Claims (13)
1. A text classification method, characterized in that it comprises:
obtaining a to-be-classified text;
inputting a word sequence of the to-be-classified text into a word vector encoding model to determine a word vector sequence of the word sequence;
inputting an entity sequence of the to-be-classified text into an entity vector model to determine an entity vector sequence corresponding to the entity sequence, wherein the entity vector model determines entity vectors based on an entity vector encoding model, and the entity vector encoding model is trained using texts from an entity knowledge graph database as samples; and
performing classification and identification on the to-be-classified text according to the word vector sequence and the entity vector sequence.
2. The method according to claim 1, characterized in that, before performing classification and identification on the to-be-classified text according to the word vector sequence and the entity vector sequence, the method further comprises:
inputting the word vector sequence into a word vector attention mechanism model to determine an attention weight of each word vector; and
inputting the entity vector sequence into an entity vector attention mechanism model to determine an attention weight of each entity vector;
correspondingly, performing classification and identification on the to-be-classified text according to the word vector sequence and the entity vector sequence comprises:
performing classification and identification on the to-be-classified text according to the word vector sequence, the entity vector sequence, and the respective attention weights.
3. The method according to claim 2, characterized in that performing classification and identification on the to-be-classified text according to the word vector sequence, the entity vector sequence, and the respective attention weights comprises:
multiplying each word vector by its corresponding attention weight, and multiplying each entity vector by its corresponding attention weight;
splicing the attention-weighted word vector sequence and entity vector sequence head-to-tail to form a complete vector sequence; and
inputting the complete vector sequence into a classifier, and taking the output result as the classification result of the to-be-classified text.
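The weighting-and-splicing step of claim 3 admits a direct sketch. The vectors, raw attention scores, and the all-ones linear classifier below are assumptions made for illustration; a real implementation would use the trained attention mechanism models and a trained classifier.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Illustrative word/entity vectors and raw attention scores (assumptions).
word_vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
entity_vecs = [np.array([0.5, 0.5, 0.0])]
word_attn = softmax(np.array([2.0, 1.0]))    # one weight per word vector
entity_attn = softmax(np.array([0.0]))       # one weight per entity vector

# Multiply each vector by its attention weight, then splice the weighted
# word vector sequence and entity vector sequence head-to-tail.
weighted = [v * a for v, a in zip(word_vecs, word_attn)]
weighted += [v * a for v, a in zip(entity_vecs, entity_attn)]
complete = np.concatenate(weighted)          # the "complete vector sequence"

# Feed the complete vector into a stand-in linear classifier.
W = np.ones((3, complete.shape[0]))          # 3 hypothetical classes
label = int(np.argmax(W @ complete))
```

Head-to-tail splicing means the classifier's input dimension is the sum of the lengths of both weighted sequences (here 2 + 2 + 3 = 7).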
4. The method according to claim 1, characterized in that the training process of the entity vector encoding model comprises:
using entity description texts in the entity knowledge graph database as training samples of entities; and
training the entity vector encoding model using the training samples of the entities.
5. The method according to claim 4, characterized in that training the entity vector encoding model using the training samples of the entities comprises:
training a first-level model of each entity according to a context training sample of each entity, to determine a first-level vector of each entity;
determining entity relationship groups from the entity knowledge graph database, and/or determining entity relationship groups according to entity co-occurrences in original texts, wherein an entity relationship group comprises at least two entities and the relationship between the entities; and
determining an entity relationship training sample of each entity according to the first-level vectors of the entities and the entity relationship groups, and inputting it into the corresponding second-level model of each entity for training, to update the first-level vector of each entity and obtain the final entity vector;
wherein the entity vector model comprises the mapping relationship, obtained after training, between each entity and its entity vector.
6. The method according to claim 5, characterized in that the first-level model comprises an NN model and a similarity function, and the second-level model is a skip-gram model.
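Claim 6 names a skip-gram model as the second-level model. The fragment below only sketches the idea under stated assumptions: entity relationship groups are turned into center/context training pairs, and a simplified gradient step (without the negative sampling and softmax a real skip-gram would use) nudges related entity vectors together, updating the first-level vectors. The entity names and dimensions are invented for the example.

```python
import numpy as np

# Illustrative entity relationship groups (head entity, tail entity).
relation_groups = [("DiseaseA", "SymptomB"), ("DiseaseA", "DrugC")]
entities = sorted({e for pair in relation_groups for e in pair})
idx = {e: i for i, e in enumerate(entities)}

# First-level vectors serve as the initialization that the second-level
# (skip-gram-style) training then updates.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(len(entities), 4))

def skipgram_pairs(groups):
    # Each related pair yields center -> context samples in both directions,
    # mirroring skip-gram's prediction of context items from a center item.
    pairs = []
    for h, t in groups:
        pairs.append((idx[h], idx[t]))
        pairs.append((idx[t], idx[h]))
    return pairs

pairs = skipgram_pairs(relation_groups)
lr = 0.1
for c, o in pairs:
    grad = vecs[c] - vecs[o]   # simplified objective: pull related pairs together
    vecs[c] -= lr * grad
    vecs[o] += lr * grad
```

After training, the mapping from each entity to its row in `vecs` plays the role of the entity vector model described in claim 5.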
7. The method according to any one of claims 4-6, characterized in that using entity description texts in the entity knowledge graph database as training samples of entities comprises:
obtaining an original sentence;
identifying at least one entity in the original sentence based on the entity knowledge graph;
obtaining the original sentence annotated with positive-example entities as a positive-example training sample, wherein the entities in the positive-example training sample match entities in the entity knowledge graph;
determining a counter-example training sample according to the positive-example entities, wherein the entities in the counter-example training sample do not match the entities in the entity knowledge graph; and
obtaining the description text of a positive-example entity in the entity knowledge graph database and adding it to the positive-example training sample, as the context training sample.
8. The method according to claim 7, characterized in that determining a counter-example training sample according to the positive-example training sample comprises:
determining, according to a positive-example entity, a different entity from the entity knowledge graph whose name may be identical or different but whose content does not match, as a counter-example entity; and
obtaining the description text of the counter-example entity in the entity knowledge graph database as the counter-example training sample, which also serves as a context training sample.
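Claims 7 and 8 describe building context training samples from entity description texts. The toy sketch below follows that flow under illustrative data: the graph entries, entity names, and sentences are invented for the example and not taken from the patent.

```python
# Toy entity knowledge graph database with description texts (illustrative).
knowledge_graph = {
    "Jupiter_planet": "Jupiter is the fifth planet from the Sun.",
    "Jupiter_god": "Jupiter was the chief deity of Roman religion.",
}

def build_samples(sentence, positive_entity):
    """Append the matched entity's description text to the annotated
    sentence as the positive context sample; use a different entity's
    description text as the counter-example sample."""
    positive = sentence + " " + knowledge_graph[positive_entity]
    counter_entity = next(e for e in knowledge_graph if e != positive_entity)
    counter = knowledge_graph[counter_entity]
    return positive, counter

pos, neg = build_samples("Jupiter has dozens of moons.", "Jupiter_planet")
```

Using a same-name entity with different content (here the Roman deity versus the planet) as the counter-example forces the first-level model to discriminate between ambiguous entities.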
9. The method according to claim 1, characterized in that:
the word vector encoding model is a word2vec model or a GloVe model, formed by unsupervised training using text samples.
10. The method according to claim 1, characterized in that obtaining the to-be-classified text comprises at least one of the following:
obtaining a user comment text from a social media application and extracting a key sentence from the user comment text as the to-be-classified text;
obtaining a user search sentence from a search engine application as the to-be-classified text;
obtaining the title or a key sentence of a news information entry from an information push application as the to-be-classified text; and
obtaining a key sentence of advertising information as the to-be-classified text.
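Claim 10 leaves the key-sentence extraction unspecified. One hypothetical heuristic (keep the longest sentence of the comment) is sketched below purely as an illustration; it is not the patent's method.

```python
import re

def key_sentence(comment: str) -> str:
    # Split on sentence-ending punctuation and keep the longest sentence
    # as the to-be-classified text (a stand-in heuristic only).
    sentences = [s.strip() for s in re.split(r"[.!?]", comment) if s.strip()]
    return max(sentences, key=len) if sentences else ""

text = key_sentence("Great. This phone has an amazing battery life!")
```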
11. A text classification apparatus, characterized in that it comprises:
a to-be-classified text acquisition module, configured to obtain a to-be-classified text;
a word vector sequence determining module, configured to input a word sequence of the to-be-classified text into a word vector encoding model to determine a word vector sequence of the word sequence;
an entity vector sequence determining module, configured to input an entity sequence of the to-be-classified text into an entity vector model to determine an entity vector sequence corresponding to the entity sequence, wherein the entity vector model determines entity vectors based on an entity vector encoding model, and the entity vector encoding model is trained using texts from an entity knowledge graph database as samples; and
a classification and identification module, configured to perform classification and identification on the to-be-classified text according to the word vector sequence and the entity vector sequence.
12. An electronic device, characterized in that it comprises:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the text classification method according to any one of claims 1-10.
13. A non-transitory computer-readable storage medium storing computer instructions, characterized in that the computer instructions are used to cause a computer to perform the text classification method according to any one of claims 1-10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910816831.9A CN110516073A (en) | 2019-08-30 | 2019-08-30 | A kind of file classification method, device, equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110516073A true CN110516073A (en) | 2019-11-29 |
Family
ID=68629603
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910816831.9A Pending CN110516073A (en) | 2019-08-30 | 2019-08-30 | A kind of file classification method, device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110516073A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3144825A1 (en) * | 2015-09-16 | 2017-03-22 | Valossa Labs Oy | Enhanced digital media indexing and retrieval |
CN108280061A (en) * | 2018-01-17 | 2018-07-13 | 北京百度网讯科技有限公司 | Text handling method based on ambiguity entity word and device |
CN108595708A (en) * | 2018-05-10 | 2018-09-28 | 北京航空航天大学 | A kind of exception information file classification method of knowledge based collection of illustrative plates |
CN108733792A (en) * | 2018-05-14 | 2018-11-02 | 北京大学深圳研究生院 | A kind of entity relation extraction method |
CN108959482A (en) * | 2018-06-21 | 2018-12-07 | 北京慧闻科技发展有限公司 | Single-wheel dialogue data classification method, device and electronic equipment based on deep learning |
CN108984745A (en) * | 2018-07-16 | 2018-12-11 | 福州大学 | A kind of neural network file classification method merging more knowledge mappings |
CN109597997A (en) * | 2018-12-07 | 2019-04-09 | 上海宏原信息科技有限公司 | Based on comment entity, aspect grade sensibility classification method and device and its model training |
US20190138653A1 (en) * | 2017-11-03 | 2019-05-09 | Salesforce.Com, Inc. | Calculating relationship strength using an activity-based distributed graph |
CN109902171A (en) * | 2019-01-30 | 2019-06-18 | 中国地质大学(武汉) | Text Relation extraction method and system based on layering knowledge mapping attention model |
Cited By (42)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111128391B (en) * | 2019-12-24 | 2021-01-12 | 推想医疗科技股份有限公司 | Information processing apparatus, method and storage medium |
CN111128391A (en) * | 2019-12-24 | 2020-05-08 | 北京推想科技有限公司 | Information processing apparatus, method and storage medium |
CN111241234A (en) * | 2019-12-27 | 2020-06-05 | 北京百度网讯科技有限公司 | Text classification method and device |
CN111241234B (en) * | 2019-12-27 | 2023-07-18 | 北京百度网讯科技有限公司 | Text classification method and device |
CN111145914A (en) * | 2019-12-30 | 2020-05-12 | 四川大学华西医院 | Method and device for determining lung cancer clinical disease library text entity |
CN111145914B (en) * | 2019-12-30 | 2023-08-04 | 四川大学华西医院 | Method and device for determining text entity of lung cancer clinical disease seed bank |
CN111274815B (en) * | 2020-01-15 | 2024-04-12 | 北京百度网讯科技有限公司 | Method and device for mining entity focus point in text |
CN111274815A (en) * | 2020-01-15 | 2020-06-12 | 北京百度网讯科技有限公司 | Method and device for mining entity attention points in text |
US11775761B2 (en) | 2020-01-15 | 2023-10-03 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for mining entity focus in text |
CN111401066A (en) * | 2020-03-12 | 2020-07-10 | 腾讯科技(深圳)有限公司 | Artificial intelligence-based word classification model training method, word processing method and device |
CN111506702A (en) * | 2020-03-25 | 2020-08-07 | 北京万里红科技股份有限公司 | Knowledge distillation-based language model training method, text classification method and device |
CN111459959A (en) * | 2020-03-31 | 2020-07-28 | 北京百度网讯科技有限公司 | Method and apparatus for updating event set |
CN111666373A (en) * | 2020-05-07 | 2020-09-15 | 华东师范大学 | Chinese news classification method based on Transformer |
CN111797194A (en) * | 2020-05-20 | 2020-10-20 | 北京三快在线科技有限公司 | Text risk detection method and device, electronic equipment and storage medium |
CN111797194B (en) * | 2020-05-20 | 2024-04-02 | 北京三快在线科技有限公司 | Text risk detection method and device, electronic equipment and storage medium |
CN113762998A (en) * | 2020-07-31 | 2021-12-07 | 北京沃东天骏信息技术有限公司 | Category analysis method, device, equipment and storage medium |
CN112016601A (en) * | 2020-08-17 | 2020-12-01 | 华东师范大学 | Network model construction method based on knowledge graph enhanced small sample visual classification |
CN112016601B (en) * | 2020-08-17 | 2022-08-05 | 华东师范大学 | Network model construction method based on knowledge graph enhanced small sample visual classification |
CN111966836A (en) * | 2020-08-29 | 2020-11-20 | 深圳呗佬智能有限公司 | Knowledge graph vector representation method and device, computer equipment and storage medium |
CN112182249A (en) * | 2020-10-23 | 2021-01-05 | 四川大学 | Automatic classification method and device for aviation safety report |
CN112328653B (en) * | 2020-10-30 | 2023-07-28 | 北京百度网讯科技有限公司 | Data identification method, device, electronic equipment and storage medium |
CN112328653A (en) * | 2020-10-30 | 2021-02-05 | 北京百度网讯科技有限公司 | Data identification method and device, electronic equipment and storage medium |
CN112307752A (en) * | 2020-10-30 | 2021-02-02 | 平安科技(深圳)有限公司 | Data processing method and device, electronic equipment and storage medium |
CN112182230A (en) * | 2020-11-27 | 2021-01-05 | 北京健康有益科技有限公司 | Text data classification method and device based on deep learning |
CN112182230B (en) * | 2020-11-27 | 2021-03-16 | 北京健康有益科技有限公司 | Text data classification method and device based on deep learning |
CN112632971A (en) * | 2020-12-18 | 2021-04-09 | 上海明略人工智能(集团)有限公司 | Word vector training method and system for entity matching |
CN112632971B (en) * | 2020-12-18 | 2023-08-25 | 上海明略人工智能(集团)有限公司 | Word vector training method and system for entity matching |
CN113010669A (en) * | 2020-12-24 | 2021-06-22 | 华戎信息产业有限公司 | News classification method and system |
CN112800214B (en) * | 2021-01-29 | 2023-04-18 | 西安交通大学 | Theme co-occurrence network and external knowledge based theme identification method, system and equipment |
CN112800214A (en) * | 2021-01-29 | 2021-05-14 | 西安交通大学 | Theme co-occurrence network and external knowledge based theme identification method, system and equipment |
CN113011187A (en) * | 2021-03-12 | 2021-06-22 | 平安科技(深圳)有限公司 | Named entity processing method, system and equipment |
CN113643241A (en) * | 2021-07-15 | 2021-11-12 | 北京迈格威科技有限公司 | Interaction relation detection method, interaction relation detection model training method and device |
CN113963357A (en) * | 2021-12-16 | 2022-01-21 | 北京大学 | Knowledge graph-based sensitive text detection method and system |
CN113963357B (en) * | 2021-12-16 | 2022-03-11 | 北京大学 | Knowledge graph-based sensitive text detection method and system |
CN114579740A (en) * | 2022-01-20 | 2022-06-03 | 马上消费金融股份有限公司 | Text classification method and device, electronic equipment and storage medium |
CN114579740B (en) * | 2022-01-20 | 2023-12-05 | 马上消费金融股份有限公司 | Text classification method, device, electronic equipment and storage medium |
CN114266255B (en) * | 2022-03-01 | 2022-05-17 | 深圳壹账通科技服务有限公司 | Corpus classification method, apparatus, device and storage medium based on clustering model |
CN114266255A (en) * | 2022-03-01 | 2022-04-01 | 深圳壹账通科技服务有限公司 | Corpus classification method, apparatus, device and storage medium based on clustering model |
CN116975297A (en) * | 2023-09-22 | 2023-10-31 | 北京利久医药科技有限公司 | Method for evaluating clinical trial risk |
CN116975297B (en) * | 2023-09-22 | 2023-12-01 | 北京利久医药科技有限公司 | Method for evaluating clinical trial risk |
CN117493568A (en) * | 2023-11-09 | 2024-02-02 | 中安启成科技有限公司 | End-to-end software function point extraction and identification method |
CN117493568B (en) * | 2023-11-09 | 2024-04-19 | 中安启成科技有限公司 | End-to-end software function point extraction and identification method |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20191129 |