CN110516073A - Text classification method, apparatus, device, and medium - Google Patents

Text classification method, apparatus, device, and medium Download PDF

Info

Publication number
CN110516073A
CN110516073A (application CN201910816831.9A)
Authority
CN
China
Prior art keywords
entity
vector
sequence
text
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910816831.9A
Other languages
Chinese (zh)
Inventor
汪琦
冯知凡
张扬
朱勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910816831.9A priority Critical patent/CN110516073A/en
Publication of CN110516073A publication Critical patent/CN110516073A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a text classification method, apparatus, device, and medium, relating to the field of natural language processing. A specific implementation comprises: obtaining a text to be classified; inputting the word sequence of the text to be classified into a word vector encoding model to determine the word vector sequence of the word sequence; inputting the entity sequence of the text to be classified into an entity vector model to determine the entity vector sequence corresponding to the entity sequence, wherein the entity vector model determines entity vectors based on an entity vector encoding model trained on texts from a knowledge graph database; and classifying the text to be classified according to the word vector sequence and the entity vector sequence. The embodiments of the present application avoid the construction of feature engineering and training samples, reducing the difficulty of building a text classification model; by combining the word vector sequence and the entity vector sequence for classification, they improve the semantic sensitivity of the text classification model and hence the accuracy of the classification results for the text to be classified.

Description

Text classification method, apparatus, device, and medium
Technical field
Embodiments of the present application relate to computer data processing, and in particular to the field of natural language processing; specifically, they relate to a text classification method, apparatus, device, and medium.
Background technique
Text classification is one of the most fundamental tasks in machine learning and the one with the most common application scenarios. Its goal is to automatically assign documents in textual form to one or more predefined categories.
Text classification based on word vector conversion is a commonly used technique at present. However, existing schemes depend heavily on feature engineering and on the construction of training samples, which requires considerable human effort; they are also insufficiently sensitive to semantics, making it difficult to meet text classification requirements in complex scenarios.
Summary of the invention
The embodiments of the present application provide a text classification method, apparatus, device, and medium, to reduce the difficulty of building a text classification model and to improve its semantic sensitivity.
In a first aspect, an embodiment of the present application provides a text classification method, comprising:
obtaining a text to be classified;
inputting the word sequence of the text to be classified into a word vector encoding model to determine the word vector sequence of the word sequence;
inputting the entity sequence of the text to be classified into an entity vector model to determine the entity vector sequence corresponding to the entity sequence, wherein the entity vector model determines entity vectors based on an entity vector encoding model, and the entity vector encoding model is trained using texts from a knowledge graph database as samples;
classifying the text to be classified according to the word vector sequence and the entity vector sequence.
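The four steps above can be sketched end to end. In this minimal sketch, the embedding tables, the tiny knowledge-graph vocabulary, the whitespace tokenizer, and the linear scoring weights are all illustrative assumptions standing in for the trained models described in the claims:

```python
# Minimal sketch of the claimed pipeline: word vectors + entity vectors -> classifier.
# All tables and classifier weights below are toy stand-ins, not trained models.
word_vecs = {"great": [0.9, 0.1], "movie": [0.2, 0.8], "boring": [-0.8, 0.1]}
entity_vecs = {"movie": [0.1, 0.9]}      # entity vectors come from the knowledge graph
kg_entities = set(entity_vecs)           # entity vocabulary of the knowledge graph

def classify(text, weights=(1.0, 0.5)):
    tokens = text.lower().split()        # stand-in for real word segmentation
    word_seq = [word_vecs[t] for t in tokens if t in word_vecs]
    entity_seq = [entity_vecs[t] for t in tokens if t in kg_entities]

    # fuse the two views: average each sequence, then concatenate the averages
    def mean(seq):
        return [sum(col) / len(seq) for col in zip(*seq)] if seq else [0.0, 0.0]

    features = mean(word_seq) + mean(entity_seq)
    score = sum(w * f for w, f in zip(weights * 2, features))
    return "positive" if score > 0 else "negative"

print(classify("great movie"))   # -> positive
```

The averaging-plus-linear-score classifier is only there to make the sketch runnable; the patent leaves the classifier unspecified at this level.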
An embodiment of the above application has the following advantages or beneficial effects: it reduces the difficulty of building a text classification model while improving the model's semantic sensitivity. In this embodiment, a text to be classified is obtained, and its word sequence is input into a word vector encoding model to determine the word vector sequence; the entity sequence of the text is input into an entity vector model obtained by training on texts from a knowledge graph database, to determine the corresponding entity vector sequence; the text is then classified according to the word vector sequence and the entity vector sequence. Through the use of the word vector encoding model and the entity vector model, the above technical solution determines the word vector sequence and entity vector sequence corresponding to the text separately, avoiding the construction of feature engineering and training samples and reducing the difficulty of building the text classification model. By jointly classifying the text under the two different dimensions of word vector sequence and entity vector sequence, it improves the semantic sensitivity of the text classification model and hence the accuracy of the classification results for the text to be classified.
Optionally, before classifying the text to be classified according to the word vector sequence and the entity vector sequence, the method further comprises:
inputting the word vector sequence into a word vector attention mechanism model to determine the attention weight of each word vector;
inputting the entity vector sequence into an entity vector attention mechanism model to determine the attention weight of each entity vector;
correspondingly, classifying the text to be classified according to the word vector sequence and the entity vector sequence comprises:
classifying the text to be classified according to the word vector sequence, the entity vector sequence, and their respective attention weights.
By introducing a word vector attention mechanism model and an entity vector attention mechanism model, this embodiment assigns attention weights to the word vector sequence and the entity vector sequence and classifies the text according to the assigned weights, effectively balancing the classification results of the text under the word vector and entity vector dimensions and highlighting the important information in the text. This maximizes the model contribution of the different vector sequences, further improves the semantic sensitivity of the text classification model, and improves the accuracy of the text classification results.
Optionally, classifying the text to be classified according to the word vector sequence, the entity vector sequence, and their respective attention weights comprises:
multiplying each word vector by its corresponding attention weight, and multiplying each entity vector by its corresponding attention weight;
concatenating the weighted word vector sequence and entity vector sequence end to end to form a complete vector sequence;
inputting the complete vector sequence into a classifier, and taking the output as the classification result of the text to be classified.
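The three steps of this optional scheme (weight, concatenate head to tail, classify) can be sketched as follows. The attention weights here are fixed illustrative values rather than outputs of a trained attention mechanism model, and only the fusion stage before the classifier is shown:

```python
# Sketch of the optional fusion step: scale each vector by its attention
# weight, then join the two weighted sequences head-to-tail into one
# "complete vector sequence" ready for a classifier.

def scale(vectors, weights):
    # multiply each vector elementwise by its scalar attention weight
    return [[w * x for x in vec] for vec, w in zip(vectors, weights)]

def fuse(word_seq, word_attn, entity_seq, entity_attn):
    weighted_words = scale(word_seq, word_attn)
    weighted_entities = scale(entity_seq, entity_attn)
    # head-to-tail concatenation: word part first, then entity part
    return [x for vec in weighted_words + weighted_entities for x in vec]

word_seq = [[1.0, 2.0], [3.0, 4.0]]
entity_seq = [[5.0, 6.0]]
features = fuse(word_seq, [0.5, 0.25], entity_seq, [1.0])
print(features)  # -> [0.5, 1.0, 0.75, 1.0, 5.0, 6.0]
```

In a real implementation the resulting vector would be fed to the classifier of S104; here the fused vector is simply printed.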
By weighting the word vector sequence and the entity vector sequence with their respective attention weights, this embodiment effectively balances the contributions of the two sequences; by concatenating the weighted word vector sequence and entity vector sequence, it achieves feature fusion and classifies the text based on the fused complete vector sequence. This completes the classification mechanism of text classification recognition and improves both the semantic sensitivity of the text classification model and the accuracy of the classification results.
Optionally, the training process of the entity vector encoding model comprises:
constructing training samples for entities based on entity description texts in the knowledge graph database;
training the entity vector encoding model using the entity training samples.
By training the entity vector model on the entity description texts in the knowledge graph database, this embodiment introduces, during model training, description texts that have an association relation with the entities, enlarging the range of the training samples. This improves the rationality and validity of the entity vector encoding model when encoding different entities, and makes successful training of the model possible even when samples are scarce.
Optionally, training the entity vector encoding model using the entity training samples comprises:
training the first-level vector model of each entity according to the context training samples of the entity, to determine the first-level vector of each entity;
determining entity relation groups from the knowledge graph database, or determining entity relation groups according to entity co-occurrence in the original text, wherein an entity relation group comprises at least two entities and the relations between them;
determining the entity relation training samples of each entity according to the first-level vectors of the entities and the entity relation groups, and inputting them into the second-level model corresponding to each entity for training, so as to update the first-level vector of each entity and obtain the final entity vectors;
wherein the entity vector model comprises the mapping relations between each entity and its entity vector obtained after training.
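The two-stage idea above — first-level vectors subsequently refined by relation-group training — can be illustrated with a deliberately simplified update rule. The averaging step below is a stand-in for the second-level (skip-gram) training, not the patent's actual procedure, and all vectors and relation groups are made up:

```python
# Illustrative second-stage update: pull each entity's first-level vector
# toward the vectors of its relation-group partners, so related entities
# end up closer in vector space. A simple stand-in for the second-level
# (skip-gram) training that keeps the two-stage structure visible.

first_level = {"paris": [1.0, 0.0], "france": [0.0, 1.0], "berlin": [1.0, 0.5]}
relation_groups = [("paris", "france"), ("berlin", "france")]  # e.g. capital_of

def second_stage(vectors, groups, lr=0.5):
    updated = {e: list(v) for e, v in vectors.items()}
    for head, tail in groups:
        for i in range(len(updated[head])):
            delta = vectors[tail][i] - vectors[head][i]
            updated[head][i] += lr * delta          # move head toward tail
    return updated

entity_vectors = second_stage(first_level, relation_groups)
print(entity_vectors["paris"])  # -> [0.5, 0.5]
```

After the update, "paris" and "berlin" have moved toward "france", mimicking how relation-based training encodes graph structure into the final entity vectors.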
By performing a second round of training of the entity vector encoding model with the entity relation groups in the knowledge graph database, or with entity relation groups determined from entity co-occurrence, this embodiment introduces entities from different sources during model training, enlarging the range of the training samples and thereby improving the rationality and validity of the entity vector encoding model when encoding different entities.
Optionally, the first-level model comprises an NN model and a similarity function, and the second-level model is a skip-gram model.
By implementing the entity vector encoding model as an NN model and a skip-gram model, this embodiment selects suitable models and avoids model overfitting, thereby safeguarding the effective training of the entity vector encoding model.
Optionally, constructing training samples for entities based on entity description texts in the knowledge graph database comprises:
obtaining an original statement;
identifying at least one entity in the original statement based on the knowledge graph;
obtaining the original statement annotated with positive-example entities as a positive training sample, wherein the entities in the positive training sample match entities in the knowledge graph;
determining negative training samples according to the positive-example entities, wherein the entities in a negative training sample do not match the entities in the knowledge graph;
obtaining the entity description texts of the positive-example entities from the knowledge graph database and adding them to the positive training samples as the context training samples.
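The sample-construction steps above can be sketched as follows. The tiny "knowledge graph" of entity descriptions and the whitespace matching are illustrative assumptions; a real implementation would use proper entity linking against the knowledge graph database:

```python
# Sketch of sample construction: find knowledge-graph entities in a sentence,
# attach their description text as context (positive sample), then pair the
# sentence with a different entity to form a negative sample.
knowledge_graph = {                                   # made-up stand-in
    "apple": "technology company that designs consumer electronics",
    "orange": "citrus fruit rich in vitamin C",
}

def build_samples(sentence):
    tokens = sentence.lower().split()
    positives, negatives = [], []
    for token in tokens:
        if token in knowledge_graph:                  # entity match -> positive
            positives.append((sentence, token, knowledge_graph[token]))
            # any other KG entity serves as a non-matching counter-example
            for other in knowledge_graph:
                if other != token:
                    negatives.append((sentence, other, knowledge_graph[other]))
    return positives, negatives

pos, neg = build_samples("apple released a new phone")
print(pos[0][1], "|", neg[0][1])  # -> apple | orange
```

Each sample is a (sentence, entity, description-text) triple, so the description text acts as the extra context the claim describes.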
By determining the entities of the original statement through the knowledge graph and generating positive training samples based on the determined entities and the entity description texts, this embodiment increases the feature dimensions in the training samples; by determining negative training samples based on the positive-example entities and the knowledge graph, it completes the generation mechanism of training samples and safeguards the effective training of the entity vector encoding model.
Optionally, determining negative training samples according to the positive training samples comprises:
according to the positive-example entities, determining other entities from the knowledge graph, whose content is identical or different, as negative-example entities;
obtaining the entity description texts of the negative-example entities from the knowledge graph database as negative training samples, serving as the context training samples.
By determining negative-example entities from the positive-example entities and generating negative training samples from the entity description texts of those entities, this embodiment completes the generation mechanism of negative training samples and safeguards the effective training of the entity vector encoding model.
Optionally, the word vector encoding model is a word2vec model or a GloVe model, formed by unsupervised training on text samples.
Through the word2vec model or the GloVe model, this embodiment completes the training mechanism of the word vector encoding model; by selecting suitable models, it avoids the overfitting caused by choosing an overly complex model, thereby safeguarding the effective training of the word vector encoding model.
Optionally, obtaining the text to be classified comprises at least one of the following:
obtaining user comment texts from social media application software, and extracting key sentences from the user comment texts as the text to be classified;
obtaining user search statements from search engine application software as the text to be classified;
obtaining the topics or key sentences of information entries from information push application software as the text to be classified;
obtaining advertising key sentences as the text to be classified.
By obtaining the text to be classified from different application software, this embodiment adapts the text classification method of the present application to different application scenarios, demonstrating the universality of the text classification method.
In a second aspect, an embodiment of the present application further provides a text classification apparatus, comprising:
a text obtaining module, configured to obtain a text to be classified;
a word vector sequence determining module, configured to input the word sequence of the text to be classified into a word vector encoding model to determine the word vector sequence of the word sequence;
an entity vector sequence determining module, configured to input the entity sequence of the text to be classified into an entity vector model to determine the entity vector sequence corresponding to the entity sequence, wherein the entity vector model determines entity vectors based on an entity vector encoding model, and the entity vector encoding model is trained using texts from a knowledge graph database as samples;
a classification module, configured to classify the text to be classified according to the word vector sequence and the entity vector sequence.
In a third aspect, an embodiment of the present application further provides an electronic device, comprising:
at least one processor; and
a memory communicatively connected with the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can perform the text classification method provided by the embodiments of the first aspect.
In a fourth aspect, an embodiment of the present application further provides a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to perform the text classification method provided by the embodiments of the first aspect.
Other effects of the above optional implementations are explained below in conjunction with specific embodiments.
Detailed description of the invention
The accompanying drawings are provided for a better understanding of this scheme and do not constitute a limitation on the application. In the drawings:
Fig. 1 is a flowchart of a text classification method in Embodiment 1 of the present application;
Fig. 2 is a flowchart of a text classification method in Embodiment 2 of the present application;
Fig. 3 is a flowchart of a text classification method in Embodiment 3 of the present application;
Fig. 4A is a flowchart of a text classification method in Embodiment 4 of the present application;
Fig. 4B is a schematic diagram of an entity vector encoding model architecture in Embodiment 4 of the present application;
Fig. 4C is a schematic diagram of the overall architecture of a text classification model in Embodiment 4 of the present application;
Fig. 4D is a schematic diagram of a text classification result in Embodiment 4 of the present application;
Fig. 5 is a structural diagram of a text classification apparatus in Embodiment 5 of the present application;
Fig. 6 is a block diagram of an electronic device for implementing the text classification method of the embodiments of the present application.
Specific embodiment
Exemplary embodiments of the present application are explained below with reference to the accompanying drawings, including various details of the embodiments to aid understanding; these should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
Embodiment one
Fig. 1 is a flowchart of a text classification method in Embodiment 1 of the present application. This embodiment applies to classifying a text to be classified in order to determine the category to which it belongs. The method is performed by a text classification apparatus, which is implemented in software and/or hardware and specifically configured in an electronic device with a certain data processing capability.
A text classification method as shown in Fig. 1 comprises:
S101: obtain a text to be classified.
The text to be classified may be stored in advance locally on the electronic device, in another storage device associated with the electronic device, or in the cloud, and obtained when needed; alternatively, it may be acquired in real time or periodically from the application software that generates it.
Illustratively, obtaining the text to be classified includes, but is not limited to, at least one of the following: obtaining user comment texts from social media application software and extracting key sentences from them as the text to be classified; obtaining user search statements from search engine application software as the text to be classified; obtaining the topics or key sentences of information entries from information push application software as the text to be classified; and obtaining advertising key sentences as the text to be classified.
It can be understood that, by obtaining the text to be classified from different application software, the text classification method of the present application can be adapted to different application scenarios, demonstrating its universality.
Optionally, key sentences may be extracted from user comment texts by performing word segmentation on the texts and counting the occurrence frequency of each segmentation result, then extracting key sentences according to these frequencies. For example, the sentences containing segmentation results whose occurrence frequency exceeds a set frequency threshold may be taken as key sentences, and/or the sentences containing a set number of the most frequent segmentation results. The frequency threshold or the set number is determined as needed by the technician or from empirical values.
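This frequency-based extraction can be sketched as follows; whitespace splitting stands in for real word segmentation, and the frequency threshold is an assumed tunable parameter:

```python
# Sketch of key-sentence extraction by segmentation-result frequency:
# count token frequencies across all comments, then keep the sentences
# that contain at least one sufficiently frequent token.
from collections import Counter

def key_sentences(comments, freq_threshold=2):
    tokens = [t for c in comments for t in c.lower().split()]
    freq = Counter(tokens)
    frequent = {t for t, n in freq.items() if n >= freq_threshold}
    return [c for c in comments if frequent & set(c.lower().split())]

comments = ["battery life is great", "battery drains fast", "nice color"]
print(key_sentences(comments))  # -> ['battery life is great', 'battery drains fast']
```

Here "battery" appears twice, so both battery-related comments survive as key sentences while the one-off comment is dropped.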
S102: input the word sequence of the text to be classified into a word vector encoding model to determine the word vector sequence of the word sequence.
Illustratively, word segmentation may be performed on the text to be classified and the segmentation results combined to obtain the word sequence; the word sequence is input into a pre-trained word vector encoding model, and the word vector sequence is determined according to the output of the model.
The word vector encoding model maps each word in the word sequence to a corresponding word vector and combines the word vectors to obtain the word vector sequence. Moreover, the vector distance between the word vectors of identical or similar-meaning words is small, while the vector distance between the word vectors of words with different or opposite meanings is large.
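The distance property can be checked on toy vectors; the three 2-dimensional vectors below are hand-picked illustrations, not trained embeddings:

```python
# Illustration of the distance property: words with similar meanings should
# have a smaller vector distance than words with opposite meanings.
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

vec = {"good": [1.0, 0.2], "fine": [0.9, 0.3], "bad": [-1.0, 0.1]}
assert cosine_distance(vec["good"], vec["fine"]) < cosine_distance(vec["good"], vec["bad"])
print("similar words are closer")
```

A trained word2vec or GloVe model would exhibit the same ordering on its real vocabulary.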
Optionally, the word vector encoding model may use a word2vec model or a GloVe model, formed by unsupervised training on text samples. The word2vec model may be a continuous bag-of-words model (CBOW), a skip-gram model, etc.
It can be understood that a rational choice of word vector encoding model avoids the overfitting caused by selecting an overly complex model, thereby safeguarding the effective training of the word vector encoding model.
S103: input the entity sequence of the text to be classified into an entity vector model to determine the entity vector sequence corresponding to the entity sequence, wherein the entity vector model determines entity vectors based on an entity vector encoding model, and the entity vector encoding model is trained using texts from a knowledge graph database as samples.
Illustratively, word segmentation may be performed on the text to be classified, and the segmentation results filtered against the entity words contained in a preset knowledge graph database to obtain at least one entity word; the entity words are combined into an entity sequence, which is input into the previously obtained entity vector model, the entity vector model having been determined based on a pre-trained entity vector encoding model. The entity vector sequence is determined according to the output of the entity vector model. The entity vector model may be the entity vector encoding model itself, or a result determined based on the entity vector encoding model, for example a mapping table between each entity and its entity vector, the mappings having been determined through the trained entity vector encoding model.
The entity vector model maps each entity word in the entity sequence to a corresponding entity vector and combines the entity vectors to obtain the entity vector sequence. The distance between entity vectors likewise reflects the degree of similarity between entity words. Since the entity vector model is trained on the entity words themselves together with the rich description texts associated with the entities in the knowledge graph database, the amount of annotated samples required is small, the semantic relevance of the training samples is strong and rich, and the model's sensitivity to degrees of semantic similarity is high.
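The lookup-table variant described above can be sketched as follows; the entity-to-vector table and the token list are illustrative stand-ins for a trained model and real segmentation output:

```python
# Sketch of the lookup-table variant of the entity vector model: filter the
# segmentation results against the knowledge graph's entity vocabulary, then
# map each surviving entity word through a precomputed entity -> vector table.
entity_table = {"beijing": [0.1, 0.9, 0.3], "baidu": [0.8, 0.2, 0.5]}

def entity_vector_sequence(tokens):
    entity_words = [t for t in tokens if t in entity_table]  # filter by KG vocabulary
    return [entity_table[w] for w in entity_words]

tokens = ["baidu", "is", "based", "in", "beijing"]
seq = entity_vector_sequence(tokens)
print(len(seq))  # -> 2
```

Only the two tokens present in the knowledge-graph vocabulary contribute entity vectors; the remaining tokens are handled by the word vector branch of S102.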
Optionally, the entity vector encoding model may use a combined model formed from a neural network (NN) model and a skip-gram model.
It can be understood that the rational choice of the entity vector encoding model, and the combined use of the different selected models, safeguard its effective training and thereby indirectly improve the model accuracy of the entity vector encoding model.
S104: classify the text to be classified according to the word vector sequence and the entity vector sequence.
Optionally, classifying the text to be classified according to the word vector sequence and the entity vector sequence may be done by directly concatenating the word vector sequence and the entity vector sequence to achieve feature fusion, and inputting the fused vector sequence into a classifier to obtain the classification result of the text to be classified. That is, the classifier considers the word vector sequence and the entity vector sequence of the text simultaneously.
In this embodiment of the present application, a text to be classified is obtained, and its word sequence is input into a word vector encoding model to determine the word vector sequence; the entity sequence of the text is input into an entity vector encoding model trained on texts from a knowledge graph database, to determine the corresponding entity vector sequence; and the text is classified according to the word vector sequence and the entity vector sequence. Through the use of the word vector encoding model and the entity vector model, the above technical solution determines the word vector sequence and entity vector sequence corresponding to the text to be classified separately, avoiding the construction of feature engineering and training samples and reducing the difficulty of building the text classification model; by jointly classifying the text under the two different dimensions of word vector sequence and entity vector sequence, it improves the semantic sensitivity of the text classification model and hence the accuracy of the classification results for the text to be classified.
Embodiment two
Fig. 2 is a flowchart of a text classification method in Embodiment 2 of the present application. This embodiment is optimized and improved on the basis of the technical solutions of the above embodiments.
Further, before "classifying the text to be classified according to the word vector sequence and the entity vector sequence", the following operations are added: "inputting the word vector sequence into a word vector attention mechanism model to determine the attention weight of each word vector; inputting the entity vector sequence into an entity vector attention mechanism model to determine the attention weight of each entity vector". Correspondingly, the operation "classifying the text to be classified according to the word vector sequence and the entity vector sequence" is refined into "classifying the text to be classified according to the word vector sequence, the entity vector sequence, and their respective attention weights", so as to effectively balance the classification results under the word vector and entity vector dimensions and maximize the model contribution of the different vector sequences.
A text classification method as shown in Fig. 2 comprises:
S201: obtain a text to be classified.
S202: input the word sequence of the text to be classified into a word vector encoding model to determine the word vector sequence of the word sequence.
S203: input the entity sequence of the text to be classified into an entity vector model to determine the entity vector sequence corresponding to the entity sequence, wherein the entity vector model determines entity vectors based on an entity vector encoding model, and the entity vector encoding model is trained using texts from a knowledge graph database as samples.
S204: input the word vector sequence into a word vector attention mechanism model to determine the attention weight of each word vector.
S205: input the entity vector sequence into an entity vector attention mechanism model to determine the attention weight of each entity vector.
S206: classify the text to be classified according to the word vector sequence, the entity vector sequence, and their respective attention weights.
Attention mechanism model, also referred to as attention mechanism (Attention mechanism) model, is applied to natural language data processing In principle be, in conjunction with each term vector in context identification term vector sequence to the percentage contribution of result is obtained, i.e., as pay attention to Weight.
In a kind of optional embodiment of the embodiment of the present application, according to term vector sequence, entity sequence vector and each From attention weight, treat classifying text carry out Classification and Identification, may is that by the term vector respectively multiplied by corresponding attentions power Weight, by the entity vector respectively multiplied by corresponding attention weight;It will be multiplied by the term vector sequence and entity vector for paying attention to weight Sequence carries out head and the tail splicing, forms complete vector sequence;By the complete vector sequence inputting classifier, will output result as The classification results of the text to be sorted.
It can be understood that this optional scheme applies attention weighting to the word vector sequence and the entity vector sequence separately and then fuses the weighted features. When the classifier classifies, the word vector sequence and the entity vector sequence act synergistically across their different dimensions, complementing and correcting each other, while the important information in the text is highlighted, which enhances the comprehensiveness and reliability of the text classification result.

This embodiment introduces a word vector attention mechanism model and an entity vector attention mechanism model to assign attention weights to the word vector sequence and the entity vector sequence, and classifies the text according to the assigned attention weights. The classification result thus effectively balances the word vector and entity vector dimensions of the text to be classified, important information in the text is highlighted, and the contribution of each vector sequence to the model is maximized, further improving the semantic sensitivity of the text classification model and the accuracy of the text classification result.
Embodiment three
Fig. 3 is a flowchart of a text classification method in Embodiment Three of the present application. This embodiment optimizes and refines the technical solutions of the embodiments above.

Specifically, the operation of "training the entity vector encoding model" is detailed: the training process is refined into "using the entity description texts in the knowledge graph database as the training samples of an entity; and training the entity vector encoding model with the entity's training samples", thereby improving the training mechanism of the entity vector encoding model.

As shown in Fig. 3, the text classification method includes:
S301: use the entity description texts in the knowledge graph database as the training samples of an entity.

The training samples may include positive samples and negative samples. First, an original sentence is obtained and the entity word in it is identified. For that entity word, the description corpora associated with the correct entity point in the knowledge graph serve as positive samples, while the descriptions of every entity point in the knowledge graph other than the correct one can serve as negative samples.

Attaching the description texts of entity points as additional content introduces new feature information during model training and helps avoid underfitting.

For example, using the entity description texts in the knowledge graph database as the training samples of an entity may proceed as follows: obtain an original sentence, and identify at least one entity in the original sentence based on the knowledge graph; obtain the original sentence annotated with the positive entity as a positive sample, wherein the entity in the positive sample matches an entity in the knowledge graph; then fetch the positive entity's description text from the knowledge graph database and add it to the positive samples.

The original sentences may be web page data, search log data, and the like. They may be stored in advance on the electronic device itself, on other storage devices associated with the electronic device, or in the cloud, and fetched when needed; alternatively, original sentences may be crawled in real time as the device generates web page data, search logs, or other such text.

It can be understood that introducing description texts makes the feature information contained in the positive samples more comprehensive. When these positive samples are used for model training, description corpora can be collected richly and turned into training samples without manual annotation. Because the description corpora describe the entity word from many angles, entity semantics are recognized more sensitively.

For example, negative samples can be determined from the positive entity. The entity in a negative sample does not match the correct entity in the knowledge graph; that is, a negative entity is an entity point in the knowledge graph with the same name as the entity in the original sentence but a different description text, and/or any entity point in the knowledge graph other than the correct entity of the original sentence. Take the original sentence "the song sung by Liu Dehua is 'Wang Qing Shui'": "Liu Dehua" corresponds to the entity point Liu Dehua (singer) in the knowledge graph and is annotated as the positive example; this annotation can be done manually. If the knowledge graph also contains entity points such as Liu Dehua (professor) and Zhou Jielun (singer), those entity points are negative examples: once the positive entity point is determined, all other entity points are negatives.

Specifically, determining negative samples from the positive samples may be done as follows: based on the positive entity, select entity points with the same or different names from the knowledge graph as negative entities; then fetch each negative entity's description text from the knowledge graph database as a negative sample. Depending on how many negative samples are needed, the sampling can be random or rule-based. The positive samples and negative samples above serve, respectively, as the context training samples of the entity.
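The positive/negative sampling just described can be sketched with a toy in-memory graph. All entity names and description strings below are illustrative stand-ins for a real knowledge graph database; the data structure itself is invented for the example.

```python
# Toy knowledge graph: mention -> {entity point -> description texts}.
KG = {
    "Liu Dehua": {
        "Liu Dehua (singer)": ["Description text of the singer entity point."],
        "Liu Dehua (professor)": ["Description text of the professor entity point."],
    },
    "Zhou Jielun": {
        "Zhou Jielun (singer)": ["Description text of this singer entity point."],
    },
}

def context_samples(correct_entity):
    """Positive samples: descriptions attached to the correct entity point.
    Negative samples: descriptions of every other entity point in the graph
    (in practice thinned by random or rule-based sampling)."""
    positives, negatives = [], []
    for candidates in KG.values():
        for entity, descriptions in candidates.items():
            (positives if entity == correct_entity else negatives).extend(descriptions)
    return positives, negatives

pos, neg = context_samples("Liu Dehua (singer)")
print(len(pos), len(neg))  # 1 2
```

Note how the descriptions become training samples directly, without manual annotation beyond marking the positive entity point.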
For example, for the original sentence "the coach of badminton player Zhang San is Li Si", the positive entity point "Zhang San" is annotated in the knowledge graph. Although several people may share the name Zhang San, each entity point's identifier in the graph is unique. The entity point "Zhang San" may have many description texts on record, such as Zhang San's resume or news about him. Correspondingly, a negative entity may be another entity point also named "Zhang San" but referring to a different person, or any other entity point.
S302: train the entity vector encoding model with the entity's training samples.

The entity vector encoding model maps each entity word fed into it to vector form, producing the corresponding entity vector.

In an optional embodiment of the present application, the entity vector encoding model can be trained on the positive and negative samples obtained in the optional embodiments above, so as to optimize the model parameters of the entity vector encoding model. The hidden-layer weight parameters of the trained entity vector encoding model then serve as the entity's vector. A corresponding entity vector encoding model can be trained for each entity, so that the mapping between each entity and its entity vector is determined and saved in advance as the entity vector model.

In another optional embodiment of the present application, to further optimize the model output, the entity vector encoding model can also be trained in two stages, based on context training samples and entity relation training samples. That is, training the entity vector encoding model with the entity's training samples specifically includes:

training the first-level vector model of each entity on the entity's context training samples, to determine the first-level vector of each entity;

determining entity relation groups from the knowledge graph database, and/or determining entity relation groups from the co-occurrence of entities in the original text, wherein an entity relation group contains at least two entities and the relations between them;

determining the entity relation training samples of each entity from the entities' first-level vectors and the entity relation groups, and feeding them into each entity's corresponding second-level model for training, so as to update each entity's first-level vector and obtain the final entity vector;

wherein the entity vector model contains the mappings between the entities and the entity vectors obtained after training.

Each of the above steps is explained in turn below.

The training process of the first-level model is as follows:

The first-level model consists of an NN (neural network) model and a similarity function; the similarity function may be a sigmoid function. With this NN + sigmoid model, a separate model is trained for each entity: the entity's positive and negative context training samples are fed into the NN + sigmoid model to complete supervised training. The hidden-layer weight parameters of the trained NN + sigmoid model are the first-level vector of the entity.
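The first-level model can be sketched as a tiny one-hidden-layer network with a sigmoid output, trained to separate one entity's positive context samples from its negatives; the input-to-hidden weights then supply the first-level vector. The features, sizes, and training schedule below are invented for the illustration, not the patent's actual network.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy context samples for one entity (5-d bag-of-words rows): positive
# rows come from the entity's own descriptions, negative rows from other
# entity points. All numbers are invented for the example.
X = np.array([[1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 1, 1, 0]], dtype=float)
y = np.array([1.0, 1.0, 0.0, 0.0])           # 1 = positive context sample

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# NN + sigmoid: one hidden layer, sigmoid output as the similarity function.
W1 = rng.normal(scale=0.5, size=(5, 3))      # hidden-layer weight parameters
w2 = rng.normal(scale=0.5, size=3)

for _ in range(5000):                        # full-batch gradient descent
    h = np.tanh(X @ W1)
    p = sigmoid(h @ w2)
    g = (p - y) / len(y)                     # dLoss/dlogit for cross-entropy
    gh = np.outer(g, w2) * (1.0 - h ** 2)    # backprop into the hidden layer
    w2 -= 0.3 * h.T @ g
    W1 -= 0.3 * X.T @ gh

first_level_vector = W1.flatten()            # hidden-layer weights as the vector
preds = sigmoid(np.tanh(X @ W1) @ w2)
print(preds.round(2))                        # positives near 1, negatives near 0
```

After supervised training converges, the hidden-layer weights `W1` (here flattened) play the role of the entity's first-level vector.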
Entity relation groups are obtained as follows:

For example, for the original sentence "the coach of badminton player Zhang San is Li Si", the triple information corresponding to the edges connected to the entity "Zhang San" can be looked up in the knowledge graph and taken as entity relation groups; for instance, "Zhang San" may be one of the entities in a triple "entity 1 - relation - entity 2", or the entity or attribute value in a triple "entity - attribute - attribute value". Alternatively, since "Li Si" co-occurs with "Zhang San" in the original sentence and is likewise an entity in the knowledge graph, the triples covering the edge between "Zhang San" and "Li Si", such as "Zhang San - mentor/apprentice - Li Si" and "Li Si - mentor/apprentice - Zhang San", can serve as entity relation groups.

The training process of the second-level model is as follows:

The second-level model may be a skip-gram model. The entity relation groups obtained above reflect the contextual relations between entities. Take the relation group "Liu Dehua, Liang Chaowei, protagonist, Infernal Affairs" as an example: it contains three entities, namely Liu Dehua, Liang Chaowei, and Infernal Affairs. Combining these entities pairwise yields entity relation training samples, i.e. 2-tuples such as "Liu Dehua, Infernal Affairs", "Liu Dehua, Liang Chaowei", and "Liang Chaowei, Infernal Affairs". For all the samples of a given entity, say all the samples containing Liu Dehua, the first-level vector of one entity in each 2-tuple is used as the input of the skip-gram model and the first-level vector of the other entity as its output, so that the first-level vector of the entity "Liu Dehua" is updated through the other entities connected to it by edges. The updated vector is the second-level vector, which serves as the final entity vector of "Liu Dehua".
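The pairwise combination step that produces the skip-gram training tuples can be sketched as follows; each ordered pair would then be fed to the skip-gram model (input entity's first-level vector predicting the other entity). The list of entities is taken from the example above; "protagonist" is the relation label, so it is not combined.

```python
from itertools import permutations

def relation_pairs(group_entities):
    """Pairwise 2-tuples over the entities of one relation group; each
    ordered pair (input, target) is one skip-gram training instance."""
    return list(permutations(group_entities, 2))

# The three entities of the relation group
# "Liu Dehua, Liang Chaowei, protagonist, Infernal Affairs".
entities = ["Liu Dehua", "Liang Chaowei", "Infernal Affairs"]
pairs = relation_pairs(entities)
print(len(pairs))  # 6 ordered pairs
```

Collecting such pairs across all relation groups of an entity gives the entity relation training samples used to update its first-level vector into the second-level vector.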
In this two-stage training, the positive and negative samples are typically sentences containing the entity word in various contexts, so the NN model describes each entity by its context and preliminarily maps each entity to an entity vector. The skip-gram model then trains further on the specific entity-edge-entity correspondences in the entity relation groups. The goal is an entity vector that reflects not only the context but also the relations to other closely associated entities. For example, entities that appear in the same contexts end up with smaller distances between their entity vectors, and so do entities connected by close edge relations.
S303: obtain the text to be classified.

S304: input the word sequence of the text to be classified into the word vector encoding model, to determine the word vector sequence of the word sequence.

S305: input the entity sequence of the text to be classified into the entity vector model, to determine the entity vector sequence corresponding to the entity sequence.

S306: classify the text to be classified according to the word vector sequence and the entity vector sequence.

This embodiment trains the entity vector encoding model with the entity description texts in the knowledge graph database, so that description texts associated with the entities are introduced during model training. This enlarges the pool of training samples, improves the reasonableness and validity of the entity vector encoding model when encoding different entities, and makes it possible to train the entity vector encoding model successfully even with few samples.
Embodiment Four

Fig. 4A is a flowchart of a text classification method in Embodiment Four of the present application; this embodiment provides a preferred implementation on the basis of the technical solutions of the embodiments above. It is described in detail below with reference to the entity vector model architecture shown in Fig. 4B and the overall architecture of the text classification model shown in Fig. 4C.
As shown in Fig. 4A, the text classification method includes:

S410: a training sample preparation stage;

S420: a model training stage;

S430: a model serving stage.

The training sample preparation stage includes:

S411: obtain an original sentence.

S412: identify at least one entity word in the original sentence against the entity words in the knowledge graph.

S413: take the positively annotated original sentence and the description text of the positive entity as positive training samples.

S414: take the description texts of the negative entities in the knowledge graph as negative training samples.

S415: determine entity relation groups from the knowledge graph, and/or determine entity relation groups from the co-occurrence of entities in the original sentence.

Suppose the original sentence is "At the Australian Open final, Li Na won the women's singles championship". The positive entity is "tennis player Li Na", and the knowledge graph also contains "singer Li Na" as a negative entity. The knowledge graph further contains the triple "Li Na - mentor/apprentice - Jiang Shan" associated with "tennis player Li Na", and "Li Na - mentor/apprentice - Jiang Shan" serves as an entity relation group. Other entity relation groups are obtained similarly.
The model training stage includes:

S421: train the NN model in the entity vector model with the positive and negative training samples as input samples.

S422: take the NN model's output for each positive and negative training sample, i.e. the entity vectors, together with the converted entity relation groups, as input samples to train the skip-gram model in the entity vector model.

The model serving stage includes:

S431: obtain the text to be classified.

S432: segment the text to be classified into words, and combine the segmentation results into a word sequence [w1, w2, ..., wn].

S433: identify the entity words in the text to be classified against the knowledge graph, and combine the entity words into an entity sequence [e1, e2, ..., en].

S434: input the word sequence into the trained word vector encoding model to obtain the word vector sequence [h1, h2, ..., hn].

S435: input the entity sequence into the trained entity vector model to obtain the entity vector sequence [k1, k2, ..., kn].

S436: input the word vector sequence into the trained word vector attention mechanism model Uw to determine the word vector attention weights [a1, a2, ..., an].

S437: input the entity vector sequence into the trained entity vector attention mechanism model KGw to determine the entity vector attention weights [b1, b2, ..., bn].

S438: compute the weighted sum of the word vectors in the word vector sequence with the word vector attention weights, obtaining the word vector sequence S1 to be classified.

S439: compute the weighted sum of the entity vectors in the entity vector sequence with the entity vector attention weights, obtaining the entity vector sequence S2 to be classified.

S4310: splice and fuse the word vector sequence S1 to be classified and the entity vector sequence S2 to be classified to obtain the complete vector sequence.

S4311: input the complete vector sequence into the softmax classifier to obtain the category of the text to be classified.
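The serving stage S431 to S4311 can be strung together in a toy end-to-end sketch. The lookup tables, attention scoring, and classifier weights below are stubs invented for the example; in the patent they would be the trained word vector encoding model, entity vector model, attention models Uw and KGw, and the softmax classifier.

```python
import math

# Stub lookup tables standing in for the trained encoders (made-up values).
WORD_VECS = {"Li": [0.2, 0.1], "Na": [0.1, 0.3], "champion": [0.4, 0.0]}
ENT_VECS = {"Li Na (tennis)": [0.9, 0.1]}

def softmax(zs):
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def attend(vecs):
    """Stub attention model: score each vector by its coordinate sum."""
    return softmax([sum(v) for v in vecs])

def classify(words, entities, class_weights):
    h = [WORD_VECS[w] for w in words]                              # S434
    k = [ENT_VECS[e] for e in entities]                            # S435
    a, b = attend(h), attend(k)                                    # S436/S437
    s1 = [sum(w * v[j] for w, v in zip(a, h)) for j in range(2)]   # S438
    s2 = [sum(w * v[j] for w, v in zip(b, k)) for j in range(2)]   # S439
    full = s1 + s2                                                 # S4310 splice
    scores = [sum(w * x for w, x in zip(row, full)) for row in class_weights]
    return softmax(scores)                                         # S4311

W = [[1, 0, 1, 0], [0, 1, 0, 1]]   # two toy classes (e.g. sports vs. music)
probs = classify(["Li", "Na", "champion"], ["Li Na (tennis)"], W)
print(probs)
```

With these made-up weights the first class dominates, mirroring how the entity vector of "Li Na (tennis)" pulls the classification toward the sports category.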
The classifier may be a binary classifier, or a multi-class classifier composed of multiple binary models.
Take "At the Australian Open final, Li Na won the women's singles championship" as the text to be classified; the text classification result is shown in Fig. 4D.

For "Li Na" in this text, the probability of tennis player Li Na is determined to be 0.95, the probability of fencer Li Na 0.6, and the probability of singer Li Na 0.09. Understandably, because both a fencer and a tennis player fall within the scope of athletes, the probability of fencer Li Na is higher than that of singer Li Na.

In addition, if tennis player Li Na and the famous coach Jiang Shan have a mentor-apprentice relation in the knowledge graph, then after entity vector encoding, the vector distance between the entity vectors corresponding to tennis player Li Na and coach Jiang Shan is small, while the vector distance between the entity vectors corresponding to singer Li Na and coach Jiang Shan is large.

The above text classification method can be applied to public opinion monitoring in social media, classifying the emotional state of the audience; to search engine applications, classifying users' search content to meet their search needs; to news feed applications, classifying the topics or key sentences of news entries; and to advertising applications, classifying the key sentences of advertisements, so that information can be delivered and pushed accurately.
Embodiment Five

Fig. 5 is a structural diagram of a text classification apparatus in Embodiment Five of the present application. This embodiment applies to classifying a text to be classified to determine its category. The apparatus is implemented in software and/or hardware and is configured in an electronic device with a certain data processing capability.

As shown in Fig. 5, the text classification apparatus 500 includes: a text obtaining module 501, a word vector sequence determining module 502, an entity vector sequence determining module 503, and a classification module 504.

The text obtaining module 501 is configured to obtain the text to be classified.

The word vector sequence determining module 502 is configured to input the word sequence of the text to be classified into the word vector encoding model, to determine the word vector sequence of the word sequence.

The entity vector sequence determining module 503 is configured to input the entity sequence of the text to be classified into the entity vector model, to determine the entity vector sequence corresponding to the entity sequence, wherein the entity vector model holds entity vectors determined by an entity vector encoding model, and the entity vector encoding model is trained using text from the knowledge graph database as samples.

The classification module 504 is configured to classify the text to be classified according to the word vector sequence and the entity vector sequence.

In this embodiment, the text obtaining module obtains the text to be classified; the word vector sequence determining module inputs the word sequence of the text into the word vector encoding model to determine the word vector sequence; the entity vector sequence determining module inputs the entity sequence of the text into the entity vector model, trained on text from the knowledge graph database, to determine the corresponding entity vector sequence; and the classification module classifies the text according to the word vector sequence and the entity vector sequence. Through the combined use of the word vector encoding model and the entity vector model, the word vector sequence and entity vector sequence corresponding to the text to be classified are determined separately, which avoids feature engineering and the hand construction of training samples and reduces the difficulty of building the text classification model. Classifying the text comprehensively under the two dimensions of word vector sequence and entity vector sequence improves the semantic sensitivity of the text classification model and, in turn, the accuracy of the classification results of the text to be classified.
Further, the apparatus also includes an attention distribution module, which specifically includes:

a word vector attention weight determining unit, configured to input the word vector sequence into the word vector attention mechanism model to determine the attention weight of each word vector, before the text to be classified is classified according to the word vector sequence and the entity vector sequence;

an entity vector attention weight determining unit, configured to input the entity vector sequence into the entity vector attention mechanism model to determine the attention weight of each entity vector.

Correspondingly, the classification module 504 includes:

a classification unit, configured to classify the text to be classified according to the word vector sequence, the entity vector sequence, and their respective attention weights.

Further, the classification unit is specifically configured to:

multiply each word vector by its corresponding attention weight, and multiply each entity vector by its corresponding attention weight;

splice the weighted word vector sequence and the weighted entity vector sequence head to tail to form a complete vector sequence;

input the complete vector sequence into the classifier, and take the output as the classification result of the text to be classified.
Further, the apparatus also includes an entity vector model training module, which specifically includes:

a training sample constructing unit, configured to use the entity description texts in the knowledge graph database as the training samples of an entity;

an entity vector model training unit, configured to train the entity vector encoding model with the entity's training samples.

Further, the entity vector model training unit specifically includes:

a first-level vector determining unit, configured to train the first-level vector model of each entity on the entity's context training samples, to determine the first-level vector of each entity;

an entity relation group determining unit, configured to determine entity relation groups from the knowledge graph database, and/or determine entity relation groups from the co-occurrence of entities in the original text, wherein an entity relation group contains at least two entities and the relations between them;

a second training and updating unit, configured to determine the entity relation training samples of each entity from the entities' first-level vectors and the entity relation groups, and feed them into each entity's corresponding second-level model for training, so as to update each entity's first-level vector and obtain the final entity vector;

wherein the entity vector model contains the mappings between the entities and the entity vectors obtained after training.

Further, the first-level model includes an NN model and a similarity function, and the second-level model is a skip-gram model.
Further, the training sample constructing unit specifically includes:

an original sentence obtaining subunit, configured to obtain an original sentence;

an entity recognition subunit, configured to identify at least one entity in the original sentence based on the knowledge graph;

a positive sample determining subunit, configured to obtain the original sentence annotated with the positive entity as a positive training sample, wherein the entity in the positive training sample matches an entity in the knowledge graph;

a negative sample determining subunit, configured to determine negative training samples from the positive entity, wherein the entity in a negative training sample does not match the entity in the knowledge graph;

a positive sample supplementing subunit, configured to fetch the positive entity's description text from the knowledge graph database and add it to the positive training samples, which serve as the context training samples.

Further, the negative sample determining subunit is specifically configured to:

select, based on the positive entity, entity points with the same or different names from the knowledge graph as negative entities;

fetch each negative entity's description text from the knowledge graph database as a negative training sample, which serves as a context training sample.

Further, the word vector encoding model is a word2vec model or a GloVe model, formed by unsupervised training on text samples.
Further, the text obtaining module 501 is specifically configured to obtain the text to be classified in at least one of the following ways:

obtaining user comment text from social media applications and extracting key sentences from the user comment text as the text to be classified;

obtaining users' search queries from search engine applications as the text to be classified;

obtaining the topics or key sentences of news entries from news feed applications as the text to be classified;

obtaining the key sentences of advertisements as the text to be classified.

The above text classification apparatus can perform the text classification method provided by any embodiment of the present application, and has the corresponding functional modules and beneficial effects for performing the method.
Embodiment Six

According to the embodiments of the present application, the present application also provides an electronic device and a readable storage medium.

Fig. 6 is a block diagram of an electronic device for performing the text classification method according to an embodiment of the present application. Electronic devices here mean digital computers in their various forms, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also mean mobile apparatuses in their various forms, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relations, and their functions are merely examples and are not intended to limit the implementations of the application described and/or claimed herein.

As shown in Fig. 6, the electronic device includes one or more processors 601, a memory 602, and interfaces for connecting the components, including high-speed and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common mainboard or in other ways as needed. The processor can process instructions executed within the electronic device, including instructions stored in or on the memory for displaying GUI graphical information on an external input/output apparatus (such as a display device coupled to an interface). In other implementations, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple electronic devices may be connected, each providing part of the necessary operations (for example, as a server array, a set of blade servers, or a multiprocessor system). Fig. 6 takes one processor 601 as an example.
Memory 602 is non-transitory computer-readable storage medium provided herein.Wherein, the memory is deposited The instruction that can be executed by least one processor is contained, so that at least one described processor executes text provided herein Classification method.The non-transitory computer-readable storage medium of the application stores computer instruction, and the computer instruction is based on making Calculation machine executes file classification method provided herein.
Memory 602 is used as a kind of non-transitory computer-readable storage medium, can be used for storing non-instantaneous software program, non- Instantaneous computer executable program and module, such as the corresponding program instruction/mould of the file classification method in the embodiment of the present application Block is (for example, attached shown in fig. 5 including text to be sorted acquisition module 501, term vector sequence determining module 502, entity vector sequence The document sorting apparatus 500 of column determining module 503 and Classification and Identification module 504).Processor 601 is stored in storage by operation Non-instantaneous software program, instruction and module in device 602, at the various function application and data of server Reason, i.e. file classification method in realization above method embodiment.
Memory 602 may include storing program area and storage data area, wherein storing program area can store operation system Application program required for system, at least one function;Storage data area can be stored to be set according to the electronics for executing file classification method Standby uses created data etc..In addition, memory 602 may include high-speed random access memory, it can also include non- Volatile storage, for example, at least a disk memory, flush memory device or other non-instantaneous solid-state memories.Some In embodiment, it includes the memory remotely located relative to processor 601 that memory 602 is optional, these remote memories can be with By being connected to the network to the electronic equipment for executing file classification method.The example of above-mentioned network includes but is not limited to internet, enterprise Industry intranet, local area network, mobile radio communication and combinations thereof.
The electronic device for executing the text classification method may further include an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603, and the output device 604 may be connected by a bus or in other ways; in Fig. 6, connection by a bus is taken as an example.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for executing the text classification method, and may be an input device such as a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, or a joystick. The output device 604 may include a display device, an auxiliary lighting device (for example, an LED), a haptic feedback device (for example, a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some embodiments, the display device may be a touch screen.
Various embodiments of the systems and techniques described herein may be implemented in digital electronic circuitry, integrated circuit systems, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
These computing programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, device, and/or apparatus (for example, a magnetic disk, an optical disk, a memory, a programmable logic device (PLD)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having a display device (for example, a CRT (cathode-ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or haptic feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, as a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with an embodiment of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solutions of the embodiments of the present application, a to-be-classified text is obtained; the word sequence of the to-be-classified text is input into a word vector encoding model to determine the word vector sequence of the word sequence; the entity sequence of the to-be-classified text is input into an entity vector model trained on texts from an entity knowledge graph database, to determine the entity vector sequence corresponding to the entity sequence; and classification and identification are performed on the to-be-classified text according to the word vector sequence and the entity vector sequence. Through the use of the word vector encoding model and the entity vector model, the above technical solution determines the word vector sequence and the entity vector sequence corresponding to the to-be-classified text respectively, avoiding feature engineering and the construction of training samples, and reducing the difficulty of building a text classification model. By comprehensively classifying the to-be-classified text under the two different dimensions of the word vector sequence and the entity vector sequence, the semantic sensitivity of the text classification model is improved, and the accuracy of the classification results of the to-be-classified text is thereby improved.
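The flow summarized above can be sketched in miniature: look up a word-vector sequence and an entity-vector sequence for a to-be-classified text, splice them head to tail, and hand the spliced sequence to a classifier. The toy vectors, vocabularies, and nearest-centroid "classifier" below are invented for illustration only; they stand in for the trained models of the embodiments and are not the patent's actual implementation.

```python
# Hypothetical stand-ins for the word vector encoding model and the
# entity vector model (real versions would be trained, e.g. word2vec
# and a knowledge-graph-based entity encoder).
WORD_VECTORS = {"great": [0.9, 0.1], "match": [0.7, 0.3]}
ENTITY_VECTORS = {"World_Cup": [0.8, 0.2]}
CLASS_CENTROIDS = {"sports": [0.8, 0.2], "finance": [0.1, 0.9]}

def classify(words, entities):
    word_seq = [WORD_VECTORS[w] for w in words]        # word vector sequence
    entity_seq = [ENTITY_VECTORS[e] for e in entities] # entity vector sequence
    spliced = word_seq + entity_seq                    # head-to-tail splice
    # Mean-pool the spliced sequence and pick the nearest class centroid.
    dim = len(spliced[0])
    pooled = [sum(v[i] for v in spliced) / len(spliced) for i in range(dim)]
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(pooled, c))
    return min(CLASS_CENTROIDS, key=lambda k: dist(CLASS_CENTROIDS[k]))
```

Here the pooled vector combines both dimensions (word-level and entity-level evidence) before the classification decision, which is the point the paragraph above makes about classifying under two dimensions jointly.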
It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above specific embodiments do not constitute a limitation on the protection scope of the present application. It should be clear to those skilled in the art that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall be included within the protection scope of the present application.

Claims (13)

1. A text classification method, comprising:
obtaining a to-be-classified text;
inputting a word sequence of the to-be-classified text into a word vector encoding model, to determine a word vector sequence of the word sequence;
inputting an entity sequence of the to-be-classified text into an entity vector model, to determine an entity vector sequence corresponding to the entity sequence, wherein the entity vector model determines entity vectors based on an entity vector encoding model, and the entity vector encoding model is trained with texts from an entity knowledge graph database as samples; and
performing classification and identification on the to-be-classified text according to the word vector sequence and the entity vector sequence.
2. The method according to claim 1, wherein before performing classification and identification on the to-be-classified text according to the word vector sequence and the entity vector sequence, the method further comprises:
inputting the word vector sequence into a word vector attention mechanism model, to determine an attention weight of each word vector; and
inputting the entity vector sequence into an entity vector attention mechanism model, to determine an attention weight of each entity vector;
correspondingly, performing classification and identification on the to-be-classified text according to the word vector sequence and the entity vector sequence comprises:
performing classification and identification on the to-be-classified text according to the word vector sequence, the entity vector sequence, and the respective attention weights.
3. The method according to claim 2, wherein performing classification and identification on the to-be-classified text according to the word vector sequence, the entity vector sequence, and the respective attention weights comprises:
multiplying each word vector by its corresponding attention weight, and multiplying each entity vector by its corresponding attention weight;
splicing the attention-weighted word vector sequence and entity vector sequence head to tail, to form a complete vector sequence; and
inputting the complete vector sequence into a classifier, and taking the output result as the classification result of the to-be-classified text.
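The weighting and splicing steps of claim 3 can be sketched as follows; the vectors and weights below are illustrative numbers only, and the claim's attention mechanism models (which would produce the weights) are not shown.

```python
def weight_vectors(vectors, weights):
    """Multiply each vector elementwise by its scalar attention weight."""
    return [[w * x for x in vec] for vec, w in zip(vectors, weights)]

def build_complete_sequence(word_vecs, word_weights, entity_vecs, entity_weights):
    # Scale word vectors and entity vectors by their attention weights,
    # then splice the two weighted sequences head to tail.
    weighted_words = weight_vectors(word_vecs, word_weights)
    weighted_entities = weight_vectors(entity_vecs, entity_weights)
    return weighted_words + weighted_entities  # complete vector sequence
```

The resulting complete vector sequence is what would be handed to the classifier; the splice keeps the two modalities distinguishable by position while letting the attention weights control each vector's contribution.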
4. The method according to claim 1, wherein a training process of the entity vector encoding model comprises:
taking entity description texts in an entity knowledge graph database as training samples of entities; and
training the entity vector encoding model by using the training samples of the entities.
5. The method according to claim 4, wherein training the entity vector encoding model by using the training samples of the entities comprises:
training a first-level model of each entity according to a context training sample of the entity, to determine a first-level vector of each entity;
determining entity relationship groups from the entity knowledge graph database, and/or determining entity relationship groups according to entity co-occurrence in original texts, wherein each entity relationship group comprises at least two entities and the relationships between the entities; and
determining an entity relationship training sample of each entity according to the first-level vectors of the entities and the entity relationship groups, and inputting the samples into a second-level model corresponding to each entity for training, so as to update the first-level vector of each entity and obtain final entity vectors;
wherein the entity vector model comprises mappings, obtained after training, between the entities and the entity vectors.
6. The method according to claim 5, wherein the first-level model comprises an NN model and a similarity function, and the second-level model is a skip-gram model.
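The skip-gram model named in claim 6 is trained to predict context items around a center item within a sliding window. A minimal sketch of the (center, context) pair generation such a model consumes, applied here to a toy entity sequence, is shown below; the second-level model itself and its relationship training samples are the claim's, not reproduced here.

```python
def skipgram_pairs(sequence, window=1):
    """Generate (center, context) training pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(sequence):
        lo, hi = max(0, i - window), min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center item is not its own context
                pairs.append((center, sequence[j]))
    return pairs
```

Each pair becomes one prediction task during training, which is how co-occurring entities end up with nearby vectors.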
7. The method according to any one of claims 4-6, wherein taking entity description texts in an entity knowledge graph database as the training samples of entities comprises:
obtaining an original sentence;
identifying at least one entity in the original sentence based on an entity knowledge graph;
obtaining the original sentence annotated with positive-example entities, as a positive-example training sample, wherein the entities in the positive-example training sample match entities in the entity knowledge graph;
determining a counter-example training sample according to the positive-example entities, wherein the entities in the counter-example training sample do not match the entities in the entity knowledge graph; and
obtaining the entity description texts of the positive-example entities in the entity knowledge graph database, and adding them to the positive-example training sample as the context training sample.
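Claim 7's sample construction can be sketched as: sentence entities that match the knowledge graph become positive examples, each enriched with its graph description text as context, while another graph entity supplies a counter-example. The toy graph entries and the rule for picking the counter-example entity below are assumptions made for illustration; the claim does not fix a particular picking rule.

```python
# Toy knowledge graph: entity -> description text. Entries are invented.
KG = {
    "Apple_Inc": "technology company that designs consumer electronics",
    "apple_fruit": "edible fruit of the apple tree",
}

def build_samples(sentence, sentence_entities):
    positives, negatives = [], []
    for ent in sentence_entities:
        if ent not in KG:  # keep only entities that match the knowledge graph
            continue
        # Positive example: the annotated sentence plus the entity's
        # graph description text as context.
        positives.append((sentence, ent, KG[ent]))
        # Counter-example: some other graph entity and its description
        # (here simply the first graph entity that differs).
        other = next(e for e in KG if e != ent)
        negatives.append((sentence, other, KG[other]))
    return positives, negatives
```

Pairing each positive example with a counter-example in this way gives the entity vector encoding model contrastive signal about which entity a mention does and does not denote.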
8. The method according to claim 7, wherein determining the counter-example training sample according to the positive-example training sample comprises:
determining, according to a positive-example entity, a different entity with identical or different content from the entity knowledge graph, as a counter-example entity; and
obtaining the entity description text of the counter-example entity in the entity knowledge graph database, as the counter-example training sample and also as the context training sample.
9. The method according to claim 1, wherein:
the word vector encoding model is a word2vec model or a GloVe model, obtained through unsupervised training with text samples.
10. The method according to claim 1, wherein obtaining the to-be-classified text comprises at least one of the following:
obtaining a user comment text from a social media application, and extracting a key sentence from the user comment text as the to-be-classified text;
obtaining a user search statement from a search engine application as the to-be-classified text;
obtaining a title or key sentence of a news information entry from a news information push application as the to-be-classified text; and
obtaining an advertising key sentence as the to-be-classified text.
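Claim 10's first source can be sketched as: split a user comment into sentences and keep one key sentence as the to-be-classified text. The claim does not specify the extraction rule, so the longest-sentence heuristic below is a stand-in invented for illustration.

```python
import re

def key_sentence(comment):
    """Pick a key sentence from a user comment (stand-in heuristic:
    the sentence with the most words)."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", comment) if s.strip()]
    return max(sentences, key=lambda s: len(s.split()))
```

The extracted sentence would then flow into the classification method of claim 1 as the to-be-classified text.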
11. A text classification apparatus, comprising:
a to-be-classified text acquisition module, configured to obtain a to-be-classified text;
a word vector sequence determination module, configured to input a word sequence of the to-be-classified text into a word vector encoding model, to determine a word vector sequence of the word sequence;
an entity vector sequence determination module, configured to input an entity sequence of the to-be-classified text into an entity vector model, to determine an entity vector sequence corresponding to the entity sequence, wherein the entity vector model determines entity vectors based on an entity vector encoding model, and the entity vector encoding model is trained with texts from an entity knowledge graph database as samples; and
a classification and identification module, configured to perform classification and identification on the to-be-classified text according to the word vector sequence and the entity vector sequence.
12. An electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the text classification method according to any one of claims 1-10.
13. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the text classification method according to any one of claims 1-10.
CN201910816831.9A 2019-08-30 2019-08-30 A kind of file classification method, device, equipment and medium Pending CN110516073A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910816831.9A CN110516073A (en) 2019-08-30 2019-08-30 A kind of file classification method, device, equipment and medium


Publications (1)

Publication Number Publication Date
CN110516073A true CN110516073A (en) 2019-11-29

Family

ID=68629603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910816831.9A Pending CN110516073A (en) 2019-08-30 2019-08-30 A kind of file classification method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN110516073A (en)


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3144825A1 (en) * 2015-09-16 2017-03-22 Valossa Labs Oy Enhanced digital media indexing and retrieval
CN108280061A (en) * 2018-01-17 2018-07-13 北京百度网讯科技有限公司 Text handling method based on ambiguity entity word and device
CN108595708A (en) * 2018-05-10 2018-09-28 北京航空航天大学 A kind of exception information file classification method of knowledge based collection of illustrative plates
CN108733792A (en) * 2018-05-14 2018-11-02 北京大学深圳研究生院 A kind of entity relation extraction method
CN108959482A (en) * 2018-06-21 2018-12-07 北京慧闻科技发展有限公司 Single-wheel dialogue data classification method, device and electronic equipment based on deep learning
CN108984745A (en) * 2018-07-16 2018-12-11 福州大学 A kind of neural network file classification method merging more knowledge mappings
CN109597997A (en) * 2018-12-07 2019-04-09 上海宏原信息科技有限公司 Based on comment entity, aspect grade sensibility classification method and device and its model training
US20190138653A1 (en) * 2017-11-03 2019-05-09 Salesforce.Com, Inc. Calculating relationship strength using an activity-based distributed graph
CN109902171A (en) * 2019-01-30 2019-06-18 中国地质大学(武汉) Text Relation extraction method and system based on layering knowledge mapping attention model


Cited By (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111128391B (en) * 2019-12-24 2021-01-12 推想医疗科技股份有限公司 Information processing apparatus, method and storage medium
CN111128391A (en) * 2019-12-24 2020-05-08 北京推想科技有限公司 Information processing apparatus, method and storage medium
CN111241234A (en) * 2019-12-27 2020-06-05 北京百度网讯科技有限公司 Text classification method and device
CN111241234B (en) * 2019-12-27 2023-07-18 北京百度网讯科技有限公司 Text classification method and device
CN111145914A (en) * 2019-12-30 2020-05-12 四川大学华西医院 Method and device for determining lung cancer clinical disease library text entity
CN111145914B (en) * 2019-12-30 2023-08-04 四川大学华西医院 Method and device for determining text entity of lung cancer clinical disease seed bank
CN111274815B (en) * 2020-01-15 2024-04-12 北京百度网讯科技有限公司 Method and device for mining entity focus point in text
CN111274815A (en) * 2020-01-15 2020-06-12 北京百度网讯科技有限公司 Method and device for mining entity attention points in text
US11775761B2 (en) 2020-01-15 2023-10-03 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for mining entity focus in text
CN111401066A (en) * 2020-03-12 2020-07-10 腾讯科技(深圳)有限公司 Artificial intelligence-based word classification model training method, word processing method and device
CN111506702A (en) * 2020-03-25 2020-08-07 北京万里红科技股份有限公司 Knowledge distillation-based language model training method, text classification method and device
CN111459959A (en) * 2020-03-31 2020-07-28 北京百度网讯科技有限公司 Method and apparatus for updating event set
CN111666373A (en) * 2020-05-07 2020-09-15 华东师范大学 Chinese news classification method based on Transformer
CN111797194A (en) * 2020-05-20 2020-10-20 北京三快在线科技有限公司 Text risk detection method and device, electronic equipment and storage medium
CN111797194B (en) * 2020-05-20 2024-04-02 北京三快在线科技有限公司 Text risk detection method and device, electronic equipment and storage medium
CN113762998A (en) * 2020-07-31 2021-12-07 北京沃东天骏信息技术有限公司 Category analysis method, device, equipment and storage medium
CN112016601A (en) * 2020-08-17 2020-12-01 华东师范大学 Network model construction method based on knowledge graph enhanced small sample visual classification
CN112016601B (en) * 2020-08-17 2022-08-05 华东师范大学 Network model construction method based on knowledge graph enhanced small sample visual classification
CN111966836A (en) * 2020-08-29 2020-11-20 深圳呗佬智能有限公司 Knowledge graph vector representation method and device, computer equipment and storage medium
CN112182249A (en) * 2020-10-23 2021-01-05 四川大学 Automatic classification method and device for aviation safety report
CN112328653B (en) * 2020-10-30 2023-07-28 北京百度网讯科技有限公司 Data identification method, device, electronic equipment and storage medium
CN112328653A (en) * 2020-10-30 2021-02-05 北京百度网讯科技有限公司 Data identification method and device, electronic equipment and storage medium
CN112307752A (en) * 2020-10-30 2021-02-02 平安科技(深圳)有限公司 Data processing method and device, electronic equipment and storage medium
CN112182230A (en) * 2020-11-27 2021-01-05 北京健康有益科技有限公司 Text data classification method and device based on deep learning
CN112182230B (en) * 2020-11-27 2021-03-16 北京健康有益科技有限公司 Text data classification method and device based on deep learning
CN112632971A (en) * 2020-12-18 2021-04-09 上海明略人工智能(集团)有限公司 Word vector training method and system for entity matching
CN112632971B (en) * 2020-12-18 2023-08-25 上海明略人工智能(集团)有限公司 Word vector training method and system for entity matching
CN113010669A (en) * 2020-12-24 2021-06-22 华戎信息产业有限公司 News classification method and system
CN112800214B (en) * 2021-01-29 2023-04-18 西安交通大学 Theme co-occurrence network and external knowledge based theme identification method, system and equipment
CN112800214A (en) * 2021-01-29 2021-05-14 西安交通大学 Theme co-occurrence network and external knowledge based theme identification method, system and equipment
CN113011187A (en) * 2021-03-12 2021-06-22 平安科技(深圳)有限公司 Named entity processing method, system and equipment
CN113643241A (en) * 2021-07-15 2021-11-12 北京迈格威科技有限公司 Interaction relation detection method, interaction relation detection model training method and device
CN113963357A (en) * 2021-12-16 2022-01-21 北京大学 Knowledge graph-based sensitive text detection method and system
CN113963357B (en) * 2021-12-16 2022-03-11 北京大学 Knowledge graph-based sensitive text detection method and system
CN114579740A (en) * 2022-01-20 2022-06-03 马上消费金融股份有限公司 Text classification method and device, electronic equipment and storage medium
CN114579740B (en) * 2022-01-20 2023-12-05 马上消费金融股份有限公司 Text classification method, device, electronic equipment and storage medium
CN114266255B (en) * 2022-03-01 2022-05-17 深圳壹账通科技服务有限公司 Corpus classification method, apparatus, device and storage medium based on clustering model
CN114266255A (en) * 2022-03-01 2022-04-01 深圳壹账通科技服务有限公司 Corpus classification method, apparatus, device and storage medium based on clustering model
CN116975297A (en) * 2023-09-22 2023-10-31 北京利久医药科技有限公司 Method for evaluating clinical trial risk
CN116975297B (en) * 2023-09-22 2023-12-01 北京利久医药科技有限公司 Method for evaluating clinical trial risk
CN117493568A (en) * 2023-11-09 2024-02-02 中安启成科技有限公司 End-to-end software function point extraction and identification method
CN117493568B (en) * 2023-11-09 2024-04-19 中安启成科技有限公司 End-to-end software function point extraction and identification method

Similar Documents

Publication Publication Date Title
CN110516073A (en) A kind of file classification method, device, equipment and medium
CN109241524B (en) Semantic analysis method and device, computer-readable storage medium and electronic equipment
CN107491531B (en) Chinese network comment sensibility classification method based on integrated study frame
US9489625B2 (en) Rapid development of virtual personal assistant applications
US9081411B2 (en) Rapid development of virtual personal assistant applications
US20220004714A1 (en) Event extraction method and apparatus, and storage medium
CN108984530A (en) A kind of detection method and detection system of network sensitive content
CN109960800A (en) Weakly supervised file classification method and device based on Active Learning
CN111177569A (en) Recommendation processing method, device and equipment based on artificial intelligence
CN110447042A (en) It cooperative trains and/or using individual input and subsequent content neural network to carry out information retrieval
WO2018153215A1 (en) Method for automatically generating sentence sample with similar semantics
WO2021129123A1 (en) Corpus data processing method and apparatus, server, and storage medium
CN109582788A (en) Comment spam training, recognition methods, device, equipment and readable storage medium storing program for executing
JP2021131858A (en) Entity word recognition method and apparatus
CN107145514A (en) Chinese sentence pattern sorting technique based on decision tree and SVM mixed models
CN110060674A (en) Form management method, apparatus, terminal and storage medium
CN110517767A (en) Aided diagnosis method, device, electronic equipment and storage medium
CN110795565A (en) Semantic recognition-based alias mining method, device, medium and electronic equipment
CN109710760A (en) Clustering method, device, medium and the electronic equipment of short text
CN112287656A (en) Text comparison method, device, equipment and storage medium
Chen et al. A review and roadmap of deep learning causal discovery in different variable paradigms
Wang et al. Aspect-based sentiment analysis with graph convolutional networks over dependency awareness
Arafat et al. Analyzing public emotion and predicting stock market using social media
Chang et al. Multi-information preprocessing event extraction with BiLSTM-CRF attention for academic knowledge graph construction
CN116757195A (en) Implicit emotion recognition method based on prompt learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191129