CN110245232A - Text classification method, apparatus, medium and computing device - Google Patents
Text classification method, apparatus, medium and computing device
- Publication number
- CN110245232A (application number CN201910480256.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- classification
- sorted
- category
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Abstract
Embodiments of the present invention provide a text classification method. The method comprises: obtaining a text to be classified; obtaining first classification information of the text using a first deep learning model; and, in the case where the category characterized by the first classification information is not a first category, determining the category of the text using a second deep learning model. The sample data used to optimize the second deep learning model includes first texts, i.e. texts whose first classification information characterizes a category other than the first category. The method of the invention thus determines the text category with two deep learning models, where the sample data of the second model consists of texts recalled by the first model. The concentration of effective samples in the sample data is thereby increased, which reduces the labeling effort and improves the accuracy of the classification prediction. Embodiments of the present invention additionally provide a text classification apparatus, a medium and a computing device.
Description
Technical field
Embodiments of the present invention relate to the field of text processing and, more specifically, to a text classification method, apparatus, medium and computing device.
Background art
This section is intended to provide a background or context for the embodiments of the invention set forth in the claims. The description herein is not admitted to be prior art merely by virtue of its inclusion in this section.
In the Internet field, non-compliant texts generally need to be picked out from a large number of texts, so as to avoid the negative guidance to users that displaying such texts would cause. This screening can be carried out with text classification methods.
In general, most text classification methods in the field of natural language processing are applicable to the detection of non-compliant texts (such as "vulgar", or "three vulgarities", news). Early detection of non-compliant texts relied on manual review by editors, but with the rapid development of the Internet and self-media, the volume of news has grown to the point where manual review, being slow and expensive, can no longer meet the demand; combined human-and-machine review has therefore become mainstream. In machine review, the detection of non-compliant texts is mainly dictionary-based: a keyword list of non-compliant words is constructed and matched against the text content with regular expressions. In recent years, given the strong performance of machine learning and deep learning in text classification, some deep learning models have also been applied to this task.
Dictionary-based review, however, often suffers from low recall, because the vocabulary is never rich enough and semantic relationships in the text cannot be captured. Machine learning and deep learning methods, for their part, usually solve the text classification problem end to end; but in the news scenario, since non-compliant texts make up only a small fraction of all texts, training data are hard to obtain. Moreover, in actual use, because the distribution of non-compliant and compliant texts is extremely unbalanced, it is difficult to guarantee both the online classification accuracy and the recall of non-compliant texts.
Summary of the invention
In the prior art, therefore, when an existing text classification method is used to classify texts in Internet scenarios such as news, the uneven distribution of the different text types leads to low accuracy and a recall that is hard to guarantee. Furthermore, since training the model requires a large number of labeled samples, training data are also difficult to obtain.
A need therefore exists for an improved text classification method that can accurately classify texts with an unbalanced category distribution without requiring a large amount of training data.
In this context, embodiments of the present invention first increase the concentration of effective samples in the sample texts, and then train the classification model on these high-concentration samples as training data, thereby improving the accuracy of the classification model while reducing the total amount of training data.
In a first aspect of embodiments of the present invention, a text classification method is provided, comprising: obtaining a text to be classified; obtaining first classification information of the text using a first deep learning model; and, in the case where the category characterized by the first classification information is not a first category, determining the category of the text using a second deep learning model. The sample data used to optimize the second deep learning model includes first texts, where the category characterized by the first classification information of a first text is not the first category.
In one embodiment of the invention, determining the category of the text to be classified using the second deep learning model includes: obtaining second classification information of the text using the second deep learning model; and determining the category of the text to be the category characterized by the second classification information.
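The two-stage decision described above can be sketched as follows. This is an illustrative Python sketch only: the interfaces `first_model` and `second_model` and the category labels are hypothetical stand-ins, not an API prescribed by the patent.

```python
def classify(text, first_model, second_model, first_category="compliant"):
    """Two-stage classification: the first (recall-oriented) model screens the
    text; only texts it does not assign to the first category are passed on
    to the second (precision-oriented) model for the final decision."""
    if first_model(text) == first_category:
        # The first model already places the text in the first category.
        return first_category
    # Otherwise the second model determines the category.
    return second_model(text)

# Toy stand-ins for the two deep learning models.
first = lambda t: "compliant" if "weather" in t else "suspect"
second = lambda t: "vulgar" if "gossip" in t else "compliant"

print(classify("weather report", first, second))    # compliant (first stage)
print(classify("celebrity gossip", first, second))  # vulgar (second stage)
```

Note that the second model is only ever consulted on texts recalled by the first, which mirrors how its training data are constructed.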
In another embodiment of the invention, the first deep learning model is set with a first threshold and the second deep learning model is set with a second threshold. The first classification information includes first category information and multiple first confidence levels of the text to be classified with respect to multiple predetermined categories; the first category information is determined by comparing the first confidence level of the text with respect to a second category against the first threshold. Likewise, the second classification information includes second category information and multiple second confidence levels with respect to the predetermined categories; the second category information is determined by comparing the second confidence level with respect to the second category against the second threshold. The first threshold is less than the second threshold; the first and second category information characterize the category of the text to be classified, and the multiple predetermined categories include the first category and the second category.
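The dual-threshold logic can be illustrated with a minimal sketch; the threshold values and category names below are invented for illustration and are not specified by the patent.

```python
def label_from_confidence(conf_second_category, threshold,
                          second_category="vulgar", first_category="compliant"):
    """Assign a category by comparing the confidence level for the second
    category against a model-specific threshold."""
    return second_category if conf_second_category >= threshold else first_category

T1, T2 = 0.3, 0.7  # first threshold < second threshold, as required above

# A text with confidence 0.5 for the second category is recalled by the
# lenient first model, but the stricter second model rejects it.
print(label_from_confidence(0.5, T1))  # vulgar    -> forwarded to second model
print(label_from_confidence(0.5, T2))  # compliant -> second model rejects
```

Because the first threshold is lower, the first model over-recalls suspicious texts, and the higher second threshold lets the second model filter them precisely.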
In yet another embodiment of the invention, the first classification information and/or the second classification information include the multiple confidence levels of the text to be classified with respect to the multiple predetermined categories. After the category of the text is determined, the method further includes: determining a parameter value of the text to be classified as its confidence level with respect to the second category, where the multiple predetermined categories include the first category and the second category.
In yet another embodiment of the invention, the text classification method further includes: obtaining multiple second texts and their actual categories; taking the multiple second texts as input to the first deep learning model to obtain their first classification information; and optimizing the first deep learning model according to the first classification information and the actual categories of the multiple second texts. The first deep learning model includes a logistic regression model or a long short-term memory (LSTM) network model.
In yet another embodiment of the invention, the proportion of texts whose actual category is the first category is greater among the multiple second texts than among the multiple first texts.
In yet another embodiment of the invention, the text classification method further includes: obtaining multiple first texts and their actual categories; taking the multiple first texts as input to the second deep learning model to obtain their second classification information; and optimizing the second deep learning model according to the second classification information and the actual categories of the multiple first texts. The second deep learning model includes a support vector machine model, a random forest model, or a long short-term memory network model.
In yet another embodiment of the invention, obtaining the multiple first texts includes: obtaining multiple third texts; taking the multiple third texts as input to the first deep learning model to obtain their first classification information; and determining those third texts whose first classification information characterizes a category other than the first category to be the first texts.
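Obtaining the first texts by filtering the third texts through the first model can be sketched as follows; the toy model and corpus are invented for illustration.

```python
def collect_first_texts(third_texts, first_model, first_category="compliant"):
    """Keep only the texts that the first model does NOT place in the first
    category; these 'recalled' texts become the training samples of the
    second deep learning model."""
    return [t for t in third_texts if first_model(t) != first_category]

# Hypothetical first model: a crude keyword screen.
model = lambda t: "suspect" if "gossip" in t else "compliant"
corpus = ["weather report", "celebrity gossip", "stock news", "gossip column"]

print(collect_first_texts(corpus, model))
# ['celebrity gossip', 'gossip column']
```

The resulting subset has a much higher concentration of effective (non-compliant) samples than the raw corpus, which is the point of the scheme.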
In yet another embodiment of the invention, before the first classification information of the text to be classified is obtained, the method further includes: preprocessing the text to be classified with a preprocessing model to extract feature information of the text. The feature information of the text is then taken as the input of the first deep learning model, whose output is the first classification information. The preprocessing model includes a term frequency-inverse document frequency (tf-idf) model or a word vector model.
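The preprocess-then-classify step can be sketched as below. The fixed vocabulary, count-based features and linear scorer are toy stand-ins (the patent prescribes tf-idf or word vector features feeding a deep learning model, not this exact pipeline).

```python
def extract_features(text, vocabulary):
    """Toy stand-in for the preprocessing model: map a text to a
    fixed-length feature vector of term counts over a vocabulary."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

def first_model(features, weights, threshold=0.5):
    """Toy stand-in for the first deep learning model: a linear scorer
    whose output characterizes the category of the text."""
    score = sum(f * w for f, w in zip(features, weights))
    return "suspect" if score >= threshold else "compliant"

vocab = ["gossip", "weather", "news"]
features = extract_features("Gossip gossip news", vocab)
print(features)                                   # [2, 0, 1]
print(first_model(features, [0.4, -0.2, 0.1]))    # 0.9 >= 0.5 -> suspect
```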
In a second aspect of embodiments of the present invention, a text classification apparatus is provided, comprising: a text obtaining module for obtaining a text to be classified; a first category determining module for obtaining first classification information of the text using a first deep learning model; and a second category determining module for determining, in the case where the category characterized by the first classification information is not a first category, the category of the text using a second deep learning model. The sample data used to optimize the second deep learning model includes first texts, where the category characterized by the first classification information of a first text is not the first category.
In one embodiment of the invention, the second category determining module includes: a second classification information obtaining submodule for obtaining second classification information of the text using the second deep learning model; and a category determining submodule for determining the category of the text to be the category characterized by the second classification information.
In another embodiment of the invention, the first deep learning model is set with a first threshold and the second deep learning model with a second threshold. The first classification information includes first category information and multiple first confidence levels of the text with respect to multiple predetermined categories, the first category information being determined by comparing the first confidence level with respect to a second category against the first threshold; the second classification information includes second category information and multiple second confidence levels with respect to the predetermined categories, the second category information being determined by comparing the second confidence level with respect to the second category against the second threshold. The first threshold is less than the second threshold; the first and second category information characterize the category of the text, and the multiple predetermined categories include the first category and the second category.
In yet another embodiment of the invention, the first classification information and/or the second classification information include the multiple confidence levels of the text with respect to the multiple predetermined categories. The apparatus further includes a parameter value determining module for determining the parameter value of the text to be its confidence level with respect to the second category, the multiple predetermined categories including the first category and the second category.
In yet another embodiment of the invention, the apparatus further includes a first model optimization module configured to: obtain multiple second texts and their actual categories; take the multiple second texts as input to the first deep learning model to obtain their first classification information; and optimize the first deep learning model according to the first classification information and the actual categories of the multiple second texts. The first deep learning model includes a logistic regression model or a long short-term memory network model.
In yet another embodiment of the invention, the proportion of texts whose actual category is the first category is greater among the multiple second texts than among the multiple first texts.
In yet another embodiment of the invention, the apparatus further includes a second model optimization module configured to: obtain multiple first texts and their actual categories; take the multiple first texts as input to the second deep learning model to obtain their second classification information; and optimize the second deep learning model according to the second classification information and the actual categories of the multiple first texts. The second deep learning model includes a support vector machine model, a random forest model, or a long short-term memory network model.
In yet another embodiment of the invention, obtaining the multiple first texts includes: obtaining multiple third texts; taking the multiple third texts as input to the first deep learning model to obtain their first classification information; and determining those third texts whose first classification information characterizes a category other than the first category to be the first texts.
In yet another embodiment of the invention, the apparatus further includes a preprocessing module for preprocessing the text to be classified with a preprocessing model, before the first category determining module obtains the first classification information, so as to extract feature information of the text. The first category determining module is specifically configured to take the feature information of the text as the input of the first deep learning model, whose output is the first classification information. The preprocessing model includes a term frequency-inverse document frequency model or a word vector model.
In a third aspect of embodiments of the present invention, a computer-readable storage medium is provided on which executable instructions are stored; when executed by a processor, the instructions cause the processor to carry out the text classification method provided in the first aspect.
In a fourth aspect of embodiments of the present invention, a computing device is provided. The computing device includes one or more storage units storing executable instructions, and one or more processing units which execute the instructions so as to carry out the text classification method provided in the first aspect.
With the text classification method, apparatus, medium and computing device of embodiments of the present invention, the text category is determined by two deep learning models, where the sample data of the second model consists of texts recalled by the first model. The concentration of effective samples in the second model's sample data is therefore high, which improves the precision of the second model and, consequently, the accuracy of the classification result.
Brief description of the drawings
The above and other objects, features and advantages of exemplary embodiments of the invention will become easier to understand by reading the following detailed description with reference to the accompanying drawings, in which several embodiments of the invention are shown by way of example and not of limitation:
Fig. 1 schematically shows an application scenario of the text classification method, apparatus, medium and computing device according to an embodiment of the present invention;
Fig. 2A schematically shows a flowchart of a text classification method according to an embodiment of the present invention;
Fig. 2B schematically shows a flowchart of determining the category of a text to be classified using the second deep learning model, according to an embodiment of the present invention;
Fig. 3 schematically shows a flowchart of a text classification method according to another embodiment of the present invention;
Fig. 4 schematically shows a flowchart of optimizing the first deep learning model in the text classification method according to an embodiment of the present invention;
Fig. 5A schematically shows a flowchart of optimizing the second deep learning model in the text classification method according to an embodiment of the present invention;
Fig. 5B schematically shows a flowchart of obtaining the first texts according to an embodiment of the present invention;
Fig. 6 schematically shows a technical flowchart of the text classification method according to an embodiment of the present invention;
Fig. 7 schematically shows a block diagram of a text classification apparatus according to an embodiment of the present invention;
Fig. 8 schematically shows a diagram of a program product adapted to carry out the text classification method according to an embodiment of the present invention; and
Fig. 9 schematically shows a block diagram of a computing device adapted to carry out the text classification method according to an embodiment of the present invention.
In the drawings, identical or corresponding reference numerals indicate identical or corresponding parts.
Detailed description of embodiments
The principle and spirit of the invention are described below with reference to several illustrative embodiments. It should be appreciated that these embodiments are provided only so that those skilled in the art can better understand and implement the invention, and not to limit the scope of the invention in any way. Rather, they are provided so that the disclosure is thorough and complete, and fully conveys the scope of the invention to those skilled in the art.
Those skilled in the art will appreciate that embodiments of the present invention can be implemented as a system, an apparatus, a device, a method or a computer program product. Accordingly, the invention may take the form of complete hardware, complete software (including firmware, resident software, microcode, etc.), or a combination of hardware and software.
Embodiments of the present invention propose a text classification method, apparatus, medium and computing device.
In this context, the related terms are explained as follows.
Machine learning is a multi-disciplinary field, drawing on probability theory, statistics, approximation theory, convex analysis and other theories, which studies how computers simulate or realize human learning behavior, acquire new knowledge and skills, and reorganize existing knowledge structures.
Deep learning is a branch of machine learning that interprets and learns from data by building neural networks that simulate the human brain.
Natural language processing is an important branch of machine learning; it mainly studies theories and methods for effective communication between humans and computers in natural language, and is an interdisciplinary field combining linguistics, computer science and mathematics.
Text classification is a branch of natural language processing; it mainly studies how computers automatically classify a set of texts according to a given classification system or standard.
LR, logistic regression, is a linear regression analysis model commonly used in fields such as data mining and economic forecasting.
SVM, Support Vector Machine, is a linear classifier that performs binary classification of data by supervised learning; its decision boundary is the maximum-margin hyperplane solved for the learning samples.
LSTM, Long Short-Term Memory, is a recurrent neural network suited to processing and predicting important events in a time series separated by relatively long intervals and delays.
Tf-idf, term frequency-inverse document frequency, is a statistical method for assessing how important a word is to one document in a document collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears in the corpus.
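The tf-idf definition above can be made concrete with a small from-scratch computation; the toy corpus is invented for illustration, and this particular weighting (raw tf times natural-log idf) is one common variant among several.

```python
import math

def tf_idf(term, doc, corpus):
    """tf-idf weight: rises with the term's frequency in the document,
    falls with the number of corpus documents containing the term."""
    tf = doc.count(term) / len(doc)               # term frequency
    df = sum(1 for d in corpus if term in d)      # document frequency
    idf = math.log(len(corpus) / df)              # inverse document frequency
    return tf * idf

docs = [["gossip", "scandal"], ["weather", "report"], ["gossip", "rumor"]]
# "gossip" appears in 2 of 3 documents, so idf = ln(3/2)
print(round(tf_idf("gossip", docs[0], docs), 4))  # 0.5 * ln(1.5) ~ 0.2027
```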
Word2vec denotes a word vector model that represents words with good semantic expressiveness.
Accuracy is a performance evaluation index for machine learning algorithms; for a classification problem, the number of test samples classified correctly, divided by the total number of test samples, gives the accuracy of the classifier.
Recall is likewise a performance evaluation index; for a classification problem, the number of test samples correctly judged by the classifier to belong to a given category, divided by the actual number of samples of that category, gives the classifier's recall for that category.
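The two metrics defined above can be computed directly; the label vectors below are invented for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of test samples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive):
    """Correct predictions of the positive category divided by the actual
    number of positive samples."""
    true_pos = sum(t == p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos

y_true = [1, 1, 1, 0, 0]  # 1 = "vulgar" text (toy labels)
y_pred = [1, 0, 1, 0, 1]

print(accuracy(y_true, y_pred))   # 3 correct out of 5 -> 0.6
print(recall(y_true, y_pred, 1))  # 2 of 3 actual positives found -> 0.666...
```

The example also shows why accuracy alone misleads on unbalanced data: a classifier that labels everything "non-vulgar" would score high accuracy while its recall of the rare category is zero.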
The principle and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.
Overview of the invention
In the prior art, the reason why classifying texts with an unbalanced category distribution using a machine learning model gives unsatisfactory results is that training a machine learning model generally requires a large number of samples in order to guarantee a certain amount of effective samples (such as the "vulgar" texts among news texts). The presence of the many other, ineffective samples often prevents the model from learning accurate features, so that the recall of the classification is low. The inventors discovered that if a deep learning model is first used to screen the unevenly distributed samples and obtain a sample set with a higher concentration of effective samples, and this sample set is then used to train the deep learning model that performs the text classification, the precision of the latter model can be improved to a certain extent. Correspondingly, at classification time, the text to be classified is likewise passed through the two deep learning models, so that the feature distribution of the texts to be classified is closer to that of the training samples, which can further improve the classification accuracy.
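The training-time half of this scheme, screening the unbalanced samples with the first model and then training the second model only on the recalled subset, can be sketched as follows. The screening model, the "training" routine and the labels are all toy stand-ins invented for illustration.

```python
def train_second_model(samples, labels, first_model, train,
                       first_category="compliant"):
    """Screen the unbalanced samples with the first model, then train the
    second model only on the recalled, higher-concentration subset."""
    recalled = [(s, y) for s, y in zip(samples, labels)
                if first_model(s) != first_category]
    xs, ys = zip(*recalled)
    return train(list(xs), list(ys))

screen = lambda t: "suspect" if "gossip" in t else "compliant"
# Toy "training" routine: just remember the majority label of its data.
train_fn = lambda xs, ys: max(set(ys), key=ys.count)

samples = ["weather report", "celebrity gossip", "gossip rumor", "stock news"]
labels = ["ok", "vulgar", "vulgar", "ok"]

print(train_second_model(samples, labels, screen, train_fn))  # vulgar
```

Only the two "gossip" samples survive the screen, so the second model is fitted on data where the rare category dominates rather than drowns.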
Having introduced the basic principle of the invention, various non-limiting embodiments of the invention are described in detail below.
Application scenarios overview
Reference is first made to Fig. 1.
Fig. 1 schematically shows an application scenario of the text classification method, apparatus, medium and computing device according to an embodiment of the present invention. It should be noted that Fig. 1 is only an example of a scenario to which embodiments of the invention can be applied, given to help those skilled in the art understand the technical content of the invention; it does not mean that the embodiments cannot be used with other devices, systems, environments or scenarios.
As shown in Fig. 1, the application scenario 100 includes terminal devices 111, 112 and 113, a server 120 and a network 130. The network 130 provides the medium for communication links between the terminal devices 111, 112, 113 and the server 120, and may include various connection types, such as wired or wireless communication links or fiber-optic cables.
The terminal devices 111, 112 and 113 have, for example, a processing function for classifying a text to be classified, obtaining the category of the text and its confidence level of belonging to a preset category. According to embodiments of the invention, the terminal devices 111, 112, 113 include, but are not limited to, desktop computers, laptop computers, tablet computers, smartphones, smart wearable devices, smart home appliances and the like.
According to embodiments of the invention, a pre-trained deep learning model is integrated into the terminal devices 111, 112, 113, which classify the text to be classified through the deep learning model. The deep learning model may be obtained by the terminal devices 111, 112, 113 by training on a large number of training samples stored in the server 120, or may be trained by the server 120.
The text 121 to be classified may, for example, be a news text, and classifying it may consist of dividing it into "vulgar" and "non-vulgar" texts. The text 121 may be stored in the server 120 or stored locally on the terminal devices 111, 112, 113. To guarantee a sufficient number of training samples, the training samples may specifically be stored in the server 120.
The terminal devices 111, 112, 113 may, for example, have a display screen for showing the user the classification result of the text and its "vulgarity value" (for example, its confidence level of belonging to the "vulgar" category), so that the user can deal with news texts that belong to the "vulgar" category.
The server 120 may be a server providing various services, for example providing the terminal devices 111, 112, 113 with texts to be classified or training samples, or providing them with the pre-trained deep learning model (merely as an example). Alternatively, the server 120 may itself have a processing function and use the trained deep learning model to classify the stored text 121 to be classified.
It should be noted that the text classification method provided by the embodiments of the present invention can generally be executed by the terminal devices 111, 112, 113 or by the server 120. Correspondingly, the text classification apparatus provided by the embodiments can generally be arranged in the terminal devices 111, 112, 113 or in the server 120. The method can also be executed by a server or server cluster different from the server 120 that can communicate with the terminal devices 111, 112, 113 and/or the server 120; correspondingly, the apparatus can also be arranged in such a server or server cluster.
It should be understood that the numbers and types of terminal devices, networks, servers and texts in Fig. 1 are merely schematic; any number and type of terminal devices, networks, servers and texts may be present as required by the implementation.
Illustrative methods
The text classification method according to exemplary embodiments of the invention is described below with reference to Figs. 2A to 6 in connection with the application scenario of Fig. 1. It should be noted that the above application scenario is shown only to facilitate understanding of the spirit and principle of the invention; the embodiments are not limited in this respect, and can be applied to any applicable scenario.
Fig. 2A schematically shows a flowchart of a text classification method according to an embodiment of the present invention, and Fig. 2B schematically shows a flowchart of determining the category of a text to be classified using the second deep learning model according to an embodiment of the present invention.
As shown in Fig. 2A, the text classification method of this embodiment of the invention includes operations S201 to S203. The method may, for example, be executed by the terminal devices 111, 112, 113 or the server 120 of Fig. 1.
In operation S201, the text to be classified is obtained.
The text to be classified may be the text 121 stored in the server 120 of Fig. 1. The text to be classified may be, for example, a news text in the news domain, and the text classification method of the embodiment of the present invention determines whether the news text belongs to the "three customs" category (i.e., vulgar, kitsch, or low-brow content).
In operation S202, first classification information of the text to be classified is obtained from the text to be classified using a first deep learning model.
According to an embodiment of the invention, operation S202 may specifically take the text to be classified directly as the input of the first deep learning model, whose output is the first classification information. The first deep learning model may be, for example, a logistic regression model, a long short-term memory (LSTM) network model, or any deep learning model that can be used to solve classification problems. Specifically, for scenarios in which the text includes only short phrases and the semantic requirements are weak, the first deep learning model may be a logistic regression model. For scenarios in which the text includes long sentences and the requirements on semantic understanding are higher, an LSTM model may be used to improve the accuracy of the first classification information. The first deep learning model may, for example, be pre-optimized by the optimization method described with reference to Fig. 4, which is not detailed here.
According to an embodiment of the invention, the first classification information may, for example, characterize the category of the text to be classified as obtained by the first deep learning model. Specifically, the first classification information includes first category information characterizing the category of the text to be classified. For example, for a news text, the first classification information may characterize the text to be classified as belonging to "three customs" text or to "non-three customs" text; correspondingly, the first category information may be "three customs" or "non-three customs".
According to an embodiment of the invention, the first classification information may also specifically include, for example, multiple first confidences of the text to be classified relative to multiple predetermined categories, characterizing the probability that the text to be classified belongs to each of the multiple predetermined categories. Further, the first deep learning model may, for example, be set with a first threshold as the basis for determining the first category information from the multiple first confidences. Specifically, in a binary classification problem, the first category information may be determined by comparing the first confidence of the text to be classified relative to the second category with the first threshold. When the first confidence relative to the second category is greater than the first threshold, the first category information is determined to be the second category; otherwise, the first category information is determined to be the first category. For example, if the multiple predetermined categories include "three customs" and "non-three customs", then when the confidence of the text to be classified relative to "three customs" is greater than the first threshold, the first category information of the text is determined to be "three customs"; otherwise, it is determined to be "non-three customs". In this case, the first category above is "non-three customs" and the second category is "three customs". The value of the first threshold can be set according to actual needs, and the embodiment of the present invention does not limit it; for example, the first threshold may take any value greater than 0.5 and less than or equal to 0.7, or any other value.
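The threshold rule above can be sketched as follows. This is a minimal illustration: the function name, the default threshold, and the label strings are chosen for this sketch rather than taken from the patent.

```python
def first_stage_label(confidence_second: float, first_threshold: float = 0.6) -> str:
    """Map the stage-1 confidence for the second category to first category
    information. Labels and the default threshold are illustrative."""
    # A confidence above the (deliberately low) first threshold assigns the
    # sparse second category; otherwise the dense first category is assigned.
    if confidence_second > first_threshold:
        return "three customs"       # second category
    return "non-three customs"       # first category
```

Note that the comparison is strictly greater than, matching the "greater than the first threshold" wording above.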
According to an embodiment of the invention, when the text to be classified belongs to a text type with an imbalanced category distribution, the method is mainly used to determine whether the text belongs to the sparsely distributed category among the multiple predetermined categories. In order to improve the recall of the first deep learning model and avoid misclassifying text that belongs to the sparsely distributed category (the second category) into the densely distributed category (the first category), the value of the first threshold should be as small as possible. For example, for news texts, the first threshold should be set as small as possible, so as to avoid the case where the first deep learning model, having captured insufficient features of "three customs" text, classifies text that actually belongs to "three customs" as "non-three customs".
In operation S203, in the case where the category characterized by the first classification information of the text to be classified is not the first category, the category of the text to be classified is determined using a second deep learning model.
According to an embodiment of the invention, because the first threshold is set small, text that belongs to the first category is often misclassified into the second category. For example, text that belongs to "non-three customs" is often misclassified as "three customs" text. Therefore, in order to further improve the accuracy of determining whether the text to be classified belongs to the sparsely distributed category among the multiple predetermined categories, the text whose first classification information characterizes it as not belonging to the first category (i.e., as belonging to the second category) should be classified again with high precision.
In order to improve the discrimination accuracy of the second deep learning model for the sparsely distributed category, the sample data used to optimize the second deep learning model may be, for example, first texts whose first classification information characterizes a category that is not the first category (i.e., the second category). Since the first classification information is obtained by the first deep learning model, texts belonging to the sparsely distributed category are denser among the first texts than among texts obtained directly from the server 120. This makes it easier for the second deep learning model to learn the features of the sparsely distributed category more comprehensively.
According to an embodiment of the invention, operation S203 may specifically take the text to be classified directly as the input of the second deep learning model, and determine the category of the text to be classified according to the output of the second deep learning model. The second deep learning model may be, for example, a support vector machine model, a random forest model, an LSTM network model, or any deep learning model that can be used to solve classification problems. The second deep learning model and the first deep learning model may be models of different types, or models of the same type with different parameters. Specifically, the second deep learning model may be pre-optimized by the optimization method described with reference to Fig. 5A, which is not detailed here.
According to an embodiment of the invention, as shown in Fig. 2B, determining the category of the text to be classified using the second deep learning model may include, for example, operations S213 to S223. In operation S213, second classification information of the text to be classified is obtained using the second deep learning model. In operation S223, the category of the text to be classified is determined to be the category characterized by the second classification information.
The second classification information may, for example, characterize the category of the text to be classified as obtained by the second deep learning model. Specifically, the second classification information includes second category information characterizing the category of the text to be classified. For example, for a news text, the second classification information may characterize the text to be classified as belonging to "three customs" text or to "non-three customs" text; correspondingly, the second category information may be "three customs" or "non-three customs".
According to an embodiment of the invention, the second classification information may also specifically include, for example, multiple second confidences of the text to be classified relative to the multiple predetermined categories, characterizing the probability that the text to be classified belongs to each of the multiple predetermined categories. Further, the second deep learning model may, for example, be set with a second threshold as the basis for determining the second category information from the multiple second confidences. Similarly, in a binary classification problem, the second category information may be determined by comparing the second confidence of the text to be classified relative to the second category with the second threshold. When the second confidence relative to the second category is greater than the second threshold, the second category information is determined to be the second category; otherwise, the second category information is determined to be the first category. For example, if the multiple predetermined categories include "three customs" and "non-three customs", then when the confidence of the text to be classified relative to "three customs" is greater than the second threshold, the second category information of the text is determined to be "three customs" and the text is accordingly determined to be "three customs" text; otherwise, the second category information is determined to be "non-three customs" and the text is determined to be "non-three customs" text. In order to improve the accuracy of classifying text as "three customs" text, the value of the second threshold should be greater than the first threshold. It is understood that the second threshold can be set according to actual needs, and the embodiment of the present invention does not limit it; for example, the second threshold may take a value greater than 0.9, or any other value.
Since the first threshold of the first deep learning model is less than the second threshold of the second deep learning model, it rarely happens that text belonging to the second category is characterized by its first classification information as the first category. Therefore, when the first classification information characterizes the category of the text to be classified as the first category, the text to be classified can be determined to belong to the first category.
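Putting operations S202 and S203 together, the cascade can be sketched as below. This is a minimal illustration under assumed interfaces: `stage1` and `stage2` stand in for the two trained models, each returning the confidence for the second category; none of these names come from the patent.

```python
def cascade_classify(text, stage1, stage2, first_threshold=0.6, second_threshold=0.9):
    """Two-stage classification: a high-recall stage-1 model with a low
    threshold, then a high-precision stage-2 model with a higher threshold."""
    if stage1(text) <= first_threshold:
        # Stage 1 assigns the dense first category; stage 2 is skipped.
        return "non-three customs"
    # Only texts recalled by stage 1 reach the high-precision model.
    if stage2(text) > second_threshold:
        return "three customs"
    return "non-three customs"
```

Because `second_threshold` exceeds `first_threshold`, a text that reaches the final line has been recalled by the permissive filter and then rejected at the stricter bar, which is how the cascade trades a cheap permissive first pass for a precise second opinion.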
In summary, in the text classification method of the embodiment of the present invention, the category of the text to be classified is determined via two deep learning models, which can improve the accuracy and recall of text classification to a certain extent. Furthermore, since the sample data of the second deep learning model consists of texts recalled by the first deep learning model, the concentration of effective samples in that sample data is high, which can improve the precision of the second deep learning model and further improve the accuracy of the classification results it determines.
Fig. 3 schematically illustrates a flowchart of a text classification method according to another embodiment of the present invention.
In accordance with an embodiment of the present disclosure, after the category of the text to be classified has been determined, the probability that the text belongs to the second category (the sparsely distributed category) can serve as a reference for the user's subsequent processing of the text. Therefore, as shown in Fig. 3, after determining the category of the text to be classified through operations S201 to S203, the text classification method of the embodiment of the present invention may further include operation S304.
In operation S304, the parameter value of the text to be classified is determined to be the confidence of the text to be classified relative to the second category.
For a binary classification problem, when the first classification information obtained in operation S202 characterizes the text to be classified as the first category, the category of the text is determined to be the first category, and the parameter value can then be determined to be the confidence relative to the second category in the first classification information. When the first classification information obtained in operation S202 characterizes the text as not being the first category, and the category of the text is then determined to be the first category or the second category through operation S203, the parameter value can be determined to be the confidence relative to the second category in the second classification information.
When the multiple predetermined categories include "three customs" and "non-three customs", the parameter value may specifically be a "three customs value" characterizing the degree of vulgarity of the text to be classified. The user can then process the text accordingly based on this degree.
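Under the same assumed interfaces as before, the selection of the reported score in operation S304 could look like the following sketch; the function and argument names are hypothetical:

```python
def three_customs_value(stage1_conf, stage2_conf, first_threshold=0.6):
    """Report the confidence relative to the second category from whichever
    stage made the final decision (operation S304)."""
    if stage1_conf <= first_threshold:
        # Stage 1 decided (first category): report its confidence.
        return stage1_conf
    # Stage 2 was consulted: report the stage-2 confidence instead.
    return stage2_conf
```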
According to an embodiment of the invention, in order for the first deep learning model to obtain more accurate first classification information from its input, the text to be classified may also be preprocessed in advance to extract its features. As shown in Fig. 3, the text classification method of the embodiment of the present invention further includes operation S305 before operation S202.
In operation S305, the text to be classified is preprocessed using a preprocessing model, and characteristic information of the text to be classified is extracted. The input of the first deep learning model in operation S202 is then the extracted characteristic information of the text to be classified, and its output is the first classification information of the text.
Specifically, the preprocessing may first recognize the textual content of the text to be classified and extract key words according to the recognition result; the key words are then converted into vectors that characterize them, yielding the input of the first deep learning model. The preprocessing model used for extracting key words and obtaining vectors may include a term frequency-inverse document frequency (TF-IDF) model, a word vector model, or any prior-art model that can be used to extract text features; the present invention does not limit this.
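As one possibility among the models just listed, a toy TF-IDF featurizer is sketched below. A production system would use a trained word-vector or TF-IDF model with proper tokenization; this whitespace-tokenized version is only an illustration of the weighting.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Toy TF-IDF featurizer standing in for the preprocessing model:
    tokenize on whitespace, then weight each term by tf * idf."""
    n = len(docs)
    tokenized = [doc.split() for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(t for toks in tokenized for t in set(toks))
    vocab = sorted(df)
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append([tf[t] / len(toks) * math.log(n / df[t]) for t in vocab])
    return vocab, vecs
```

A term appearing in every document (here, with idf = log(n/n) = 0) is weighted to zero, which is why common words contribute nothing to the feature vector.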
Fig. 4 schematically illustrates a flowchart of optimizing the first deep learning model in a text classification method according to an embodiment of the present invention.
As shown in Fig. 4, the method of optimizing the first deep learning model in the embodiment of the present invention may include operations S406 to S408.
In operation S406, multiple second texts and the actual categories of the multiple second texts are obtained.
The multiple second texts may be, for example, multiple texts obtained at random from the text library stored in the server 120, and may specifically be multiple news texts. The actual categories of the multiple second texts are categories assigned in advance. Specifically, the actual category of each second text can be obtained from a label assigned to it by the user, the label indicating the actual category of that second text. Operation S406 may therefore also include, after obtaining the multiple second texts, an operation of obtaining the labels assigned by the user to the second texts. Further, in order to conveniently use the labels together with the texts as input for the first deep learning model, operation S406 may, after the labels are obtained, include the following operation: splicing each second text with its label to form second sample data corresponding to that second text.
Considering that after the first deep learning model, the second deep learning model is further used to determine the category of the text to be classified, the embodiment of the present invention does not place high precision requirements on the first deep learning model. When optimizing the first deep learning model, the number of second texts needed is therefore not large, which keeps the annotation amount low and reduces annotation cost.
In operation S407, the multiple second texts are used as the input of the first deep learning model, and the first classification information of the multiple second texts is obtained.
According to an embodiment of the invention, operation S407 may specifically input the multiple second texts one by one into the first deep learning model to obtain their first classification information one by one. The specific implementation of operation S407 is similar to that of operation S202, the only difference being that the first deep learning model here is the not-yet-optimized logistic regression model, LSTM model, or the like. The input of the first deep learning model may specifically be the second sample data, corresponding to each second text, obtained by the splicing described above.
In operation S408, the first deep learning model is optimized according to the first classification information of the multiple second texts and the actual categories of the multiple second texts.
According to an embodiment of the invention, operation S408 may specifically compute, for each of the multiple second texts, a first loss value using a first loss function, based on the category characterized by the first classification information of that text and its actual category. The parameters of the first deep learning model are then adjusted according to the first loss values, thereby optimizing the first deep learning model.
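A single optimization step of this kind, taking a logistic regression stage-1 model and a cross-entropy loss (one of the loss functions the patent names later), might look like the sketch below. The feature representation, the learning rate, and the single-sample update are assumptions of this sketch, not details from the patent.

```python
import math

def sgd_step(w, b, x, y, lr=0.1):
    """One gradient step on a single (feature vector, label) pair.
    y is 1 for the second category ("three customs"), 0 otherwise."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))                          # predicted confidence
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))   # cross-entropy loss
    grad = p - y                                            # d(loss)/dz for sigmoid + CE
    w = [wi - lr * grad * xi for wi, xi in zip(w, x)]       # adjust parameters
    b -= lr * grad
    return w, b, loss
```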
Fig. 5A schematically illustrates a flowchart of optimizing the second deep learning model in a text classification method according to an embodiment of the present invention, and Fig. 5B schematically illustrates a flowchart of obtaining the first texts according to an embodiment of the present invention.
As shown in Fig. 5A, the method of optimizing the second deep learning model in the embodiment of the present invention may include operations S509 to S511.
In operation S509, multiple first texts and the actual categories of the multiple first texts are obtained.
Since a first text is a text whose first classification information characterizes a category that is not the first category, the first texts are texts whose first classification information has been obtained through the first deep learning model. As shown in Fig. 5B, the operation of obtaining the multiple first texts may include operations S519 to S539.
In operation S519, multiple third texts are obtained. In operation S529, the multiple third texts are used as the input of the first deep learning model, and the first classification information of the multiple third texts is obtained. In operation S539, the third texts whose first classification information characterizes a category that is not the first category are determined to be the first texts.
According to an embodiment of the invention, the multiple third texts in operation S519 are multiple texts obtained at random from the text library stored in the server 120. The multiple third texts may or may not include the second texts described with reference to Fig. 4. The first classification information obtained in operation S529 may be obtained by operation S202 described with reference to Fig. 2A, or by operation S407 described with reference to Fig. 4. Operation S539 can then determine, from the first classification information of each third text, whether that third text is a first text.
According to an embodiment of the invention, the first texts are only those third texts in the text library whose first classification information is determined by operation S539 to characterize a category that is not the first category, whereas the second texts obtained in operation S406 of Fig. 4 are obtained at random directly from the text library. The proportion of texts whose actual category is the first category is therefore greater among the multiple second texts than among the multiple first texts. In the case where news texts are stored in the text library, the proportion of texts actually belonging to "three customs" among the multiple second texts is less than that among the multiple first texts.
According to an embodiment of the invention, the actual category of each first text can be obtained from a label assigned to it by the user, the label indicating the actual category of that first text. Operation S509 may therefore also include, after obtaining the multiple first texts, an operation of obtaining the labels assigned by the user to the first texts. Further, in order to conveniently use the labels together with the texts as input for the second deep learning model, operation S509 may, after the labels are obtained, include the following operation: splicing each first text with its label to form first sample data corresponding to that first text.
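The construction of this concentrated stage-2 sample set (operations S519 to S539 together with the label splicing) can be sketched as follows, again under the assumed interface that `stage1` returns the second-category confidence; all names are illustrative.

```python
def build_stage2_samples(third_texts, stage1, first_threshold, labels):
    """Filter the randomly drawn third texts down to those the stage-1
    model recalls, then pair each kept text with its human label to
    form the stage-2 sample data."""
    first_texts = [t for t in third_texts if stage1(t) > first_threshold]
    # Annotators only need to label the recalled texts, which is where
    # the reduction in annotation cost comes from.
    return [(t, labels[t]) for t in first_texts]
```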
In operation S510, the multiple first texts are used as the input of the second deep learning model, and the second classification information of the multiple first texts is obtained.
According to an embodiment of the invention, operation S510 may specifically input the multiple first texts one by one into the second deep learning model to obtain their second classification information one by one. The specific implementation of operation S510 is similar to that of operation S203, the only difference being that the second deep learning model here is the not-yet-optimized support vector machine model, random forest model, LSTM model, or the like. The input of the second deep learning model may specifically be the first sample data, corresponding to each first text, obtained by the splicing described above.
In operation S511, the second deep learning model is optimized according to the second classification information of the multiple first texts and the actual categories of the multiple first texts.
According to an embodiment of the invention, operation S511 may specifically compute, for each of the multiple first texts, a second loss value using a second loss function, based on the category characterized by the second classification information of that text and its actual category. The parameters of the second deep learning model are then adjusted according to the second loss values, thereby optimizing the second deep learning model. The second loss function and the aforementioned first loss function may each be, for example, a cross-entropy loss function, a sigmoid loss function, or any other loss function; the second loss function and the first loss function may be identical or different, and the present invention does not limit this.
In summary, the sample data for optimizing the second deep learning model is obtained by labeling the first texts that the first deep learning model has discriminated as not belonging to the first category. Compared with the texts in the text library (i.e., the input texts of the first deep learning model), the density of texts belonging to the first category ("non-three customs") among the first texts is effectively reduced, and the density of texts belonging to the second category ("three customs") is increased. For a given number of second-category texts in the sample data, the number of first texts needed is therefore greatly reduced compared with the number of texts needed from the text library in the prior art, which effectively reduces the amount of text to be labeled. This reduces, to a certain extent, the difficulty of obtaining sample data, lowers the data annotation amount, and improves annotation efficiency.
Fig. 6 schematically illustrates a technical flowchart of a text classification method according to an embodiment of the present invention.
As shown in Fig. 6, the text classification method of the embodiment of the present invention may include a training process and a detection process. The method involves two deep learning models. One of the two models takes full-volume data (texts obtained directly from the server 120) as its training sample data or detection data, serving as the high-recall model. The other model takes the texts recalled by the high-recall model (i.e., the "three customs" texts obtained by the high-recall model) as its training sample data or detection data, serving as the high-precision model that determines the classification results of the recalled texts.
In the training phase, the high-recall model is first trained with a small amount of training samples (samples in which the proportion of "three customs" texts follows the natural distribution), and is then used to recall from the full-volume data, yielding the training samples of the high-precision model (samples with a high proportion of "three customs" texts). To increase the recall of the high-recall model, it can be set with a relatively small threshold (the aforementioned first threshold). To ensure that the high-precision model has a sufficient number of training samples, after the high-recall model has been trained, the full-volume data can continue to be fed into it to obtain more training samples for the high-precision model. Once enough training samples of the high-precision model have been obtained, the high-precision model can be trained with this large sample set; the threshold of the high-precision model is greater than that of the high-recall model.
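The two-phase training just described can be sketched end to end as below. The trainer callables, the labelling function, and the 10% slice used for the small initial sample are all assumptions of this sketch, not details from the patent.

```python
def train_pipeline(full_corpus, label_fn, train_stage1, train_stage2, first_threshold=0.3):
    """Train the high-recall model on a small naturally distributed sample,
    recall from the full corpus, then train the high-precision model on the
    concentrated recalled set. Each trainer takes (text, label) pairs and
    returns a callable giving the second-category confidence."""
    small_sample = full_corpus[: max(1, len(full_corpus) // 10)]
    stage1 = train_stage1([(t, label_fn(t)) for t in small_sample])
    # High-recall pass over the full-volume data with a low threshold.
    recalled = [t for t in full_corpus if stage1(t) > first_threshold]
    stage2 = train_stage2([(t, label_fn(t)) for t in recalled])
    return stage1, stage2
```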
In the detection phase, each text to be detected (for example, a news article) first passes through the high-recall model, which judges whether it belongs to the first category ("non-three customs"). If the high-recall model judges it as "non-three customs", the text skips the high-precision model directly: the overall result is "non-three customs", and the "three customs value" is the confidence relative to the "three customs" category obtained by the high-recall model. If the high-recall model judges it as "three customs", the text enters the high-precision model for a second judgment. If the output of the high-precision model is "three customs", the overall result of the text is "three customs", and the "three customs value" is the confidence relative to the "three customs" category obtained by the high-precision model. If the output of the high-precision model is "non-three customs", the overall result is "non-three customs", and the "three customs value" is the confidence relative to the "three customs" category obtained by the high-precision model.
In summary, in the text classification method of the embodiment of the present invention, the training data of the high-precision model is obtained through determination by the recall model trained with a small amount of data, so its "three customs" concentration is much higher than that of a naturally distributed set, and annotation efficiency is therefore higher. Furthermore, during detection, since all texts to be detected pass through the high-recall model, the feature distribution of the detected texts is closer to that of the training samples, which can improve overall accuracy.
Exemplary Apparatus
Having described the method of the exemplary embodiments of the present invention, a text classification apparatus according to exemplary embodiments of the present invention is next described with reference to Fig. 7.
Fig. 7 schematically illustrates a block diagram of a text classification apparatus according to an embodiment of the present invention.
As shown in Fig. 7, according to an embodiment of the present invention, the text classification apparatus 700 may include a to-be-classified-text obtaining module 710, a first category determining module 720, and a second category determining module 730. The text classification apparatus 700 can be used to implement the text classification method according to the embodiments of the present invention.
The to-be-classified-text obtaining module 710 is configured to obtain the text to be classified (operation S201).
The first category determining module 720 is configured to obtain, from the text to be classified, the first classification information of the text using the first deep learning model (operation S202).
The second category determining module 730 is configured to determine, in the case where the category characterized by the first classification information of the text to be classified is not the first category, the category of the text using the second deep learning model (operation S203). The sample data from which the second deep learning model is optimized includes the first texts, and the category characterized by the first classification information of each first text is not the first category.
According to an embodiment of the invention, as shown in Fig. 7, the second category determining module 730 may include a second classification information obtaining submodule 731 and a category determining submodule 732. The second classification information obtaining submodule 731 is configured to obtain the second classification information of the text to be classified using the second deep learning model (operation S213). The category determining submodule 732 is configured to determine the category of the text to be classified to be the category characterized by the second classification information (operation S223).
According to an embodiment of the invention, the first deep learning model is set with a first threshold and the second deep learning model is set with a second threshold. The first classification information includes the first category information and multiple first confidences of the text to be classified relative to multiple predetermined categories; the first category information is determined by comparing the first confidence of the text relative to the second category with the first threshold. The second classification information includes the second category information and multiple second confidences of the text relative to the multiple predetermined categories; the second category information is determined by comparing the second confidence of the text relative to the second category with the second threshold. The first threshold is less than the second threshold, the first category information and the second category information are used to characterize the category of the text to be classified, and the multiple predetermined categories include the first category.
According to an embodiment of the invention, the first classification information and/or the second classification information includes multiple confidences of the text to be classified relative to the multiple predetermined categories. As shown in Fig. 7, the text classification apparatus 700 may further include a parameter value determining module 740, configured to determine the parameter value of the text to be classified to be the confidence of the text relative to the second category (operation S304). The multiple predetermined categories include the first category and the second category.
According to an embodiment of the invention, as shown in Fig. 7, the above-mentioned text classification apparatus 700 may also include a first model optimization module 750, configured to perform the following operations: obtaining multiple second texts and the actual categories of the multiple second texts (operation S406); using the multiple second texts as input of the first deep learning model to obtain the first classification information of the multiple second texts (operation S407); and optimizing the first deep learning model according to the first classification information of the multiple second texts and the actual categories of the multiple second texts (operation S408). The first deep learning model includes a logistic regression model or a long short-term memory (LSTM) network model.
According to an embodiment of the invention, the proportion of texts whose actual category is the first category among the above-mentioned multiple second texts is greater than the proportion of texts whose actual category is the first category among the multiple first texts.
According to an embodiment of the invention, as shown in Fig. 7, the above-mentioned text classification apparatus 700 may also include a second model optimization module 760, configured to perform the following operations: obtaining multiple first texts and the actual categories of the multiple first texts (operation S509); using the multiple first texts as input of the second deep learning model to obtain the second classification information of the multiple first texts (operation S510); and optimizing the second deep learning model according to the second classification information of the multiple first texts and the actual categories of the multiple first texts (operation S511). The second deep learning model includes a support vector machine model, a random forest model, or a long short-term memory network model.
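The same optimization loop for the second model, assuming the support-vector-machine option; the recalled first texts and their fine-grained categories are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# S509: recalled first texts and their actual categories (invented).
first_texts = ["win a free cruise", "claim your reward",
               "invoice attached please pay", "overdue payment notice"]
actual = ["scam", "scam", "billing", "billing"]

vec = TfidfVectorizer().fit(first_texts)
# S510-S511: predict and fit against the actual categories.
second_model = SVC().fit(vec.transform(first_texts), actual)

print(second_model.predict(vec.transform(["overdue payment notice"]))[0])
```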
According to an embodiment of the invention, obtaining the multiple first texts includes: obtaining multiple third texts (operation S519); using the multiple third texts as input of the first deep learning model to obtain the first classification information of the multiple third texts (operation S529); and determining that the third texts whose category characterized by the first classification information is not the first category are the first texts (operation S539).
According to an embodiment of the invention, as shown in Fig. 7, the above-mentioned text classification apparatus 700 may also include a preprocessing module 770, configured to preprocess the text to be classified using a preprocessing model before the first category determining module 720 obtains the first classification information of the text to be classified, and to extract the characteristic information of the text to be classified (operation S305). The first category determining module 720 is specifically configured to use the characteristic information of the text to be classified as input of the first deep learning model and to output the first classification information of the text to be classified. The preprocessing model includes a term frequency-inverse document frequency (TF-IDF) model or a word vector model.
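A sketch of the preprocessing step (operation S305), assuming the TF-IDF option; the corpus is invented, and a word vector model (e.g. word2vec) could stand in its place:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the dog barked", "cats and dogs"]
vectorizer = TfidfVectorizer().fit(corpus)

# The extracted characteristic information: one row per text, one column
# per vocabulary term, fed to the first deep learning model.
features = vectorizer.transform(["the cat barked"])
print(features.shape)
```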
Exemplary media
After describing the method for exemplary embodiment of the invention, next, with reference to Fig. 8 to the exemplary reality of the present invention
The computer readable storage medium for being adapted for carrying out file classification method for applying mode is introduced.
According to an embodiment of the invention, a computer-readable storage medium is also provided, on which executable instructions are stored; when executed by a processor, the instructions cause the processor to execute the text classification method according to an embodiment of the present invention.
In some possible embodiments, various aspects of the present invention may also be implemented in the form of a program product, which includes program code; when the program product is run on a computing device, the program code causes the computing device to execute the steps of the text classification method according to various exemplary embodiments of the present invention described in the above "Exemplary Method" section of this specification. For example, the computing device may execute step S201 as shown in Fig. 2A: obtaining a text to be classified; step S202: obtaining the first classification information of the text to be classified using a first deep learning model according to the text to be classified; and step S203: in the case where the category characterized by the first classification information of the text to be classified is not a first category, determining the category of the text to be classified using a second deep learning model.
The program product may adopt any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
As shown in Fig. 8, a program product 800 suitable for the text classification method according to an embodiment of the present invention is described; it may adopt a portable compact disc read-only memory (CD-ROM), include program code, and run on a computing device such as a personal computer. However, the program product of the present invention is not limited thereto. In this document, a readable storage medium may be any tangible medium that contains or stores a program, where the program may be used by, or in combination with, an instruction execution system, apparatus, or device.
A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which readable program code is carried. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium; such a medium may send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device.
The program code contained on the readable medium may be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, or any suitable combination of the above.
The program code for performing the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user computing device, partly on the user computing device, partly on the user computing device and partly on a remote computing device, or entirely on a remote computing device or server. In the case involving a remote computing device, the remote computing device may be connected to the user computing device through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
Exemplary computer device
After method, medium and the device for describing exemplary embodiment of the invention, next, with reference to Fig. 9 to this
The calculating equipment for being adapted for carrying out file classification method of invention illustrative embodiments is illustrated.
An embodiment of the present invention also provides a computing device. Those of ordinary skill in the art will understand that various aspects of the present invention may be implemented as a system, a method, or a program product. Therefore, various aspects of the present invention may be implemented in the following forms: a complete hardware implementation, a complete software implementation (including firmware, microcode, etc.), or an implementation combining hardware and software aspects, which may be collectively referred to herein as a "circuit", "module", or "system".
In some possible embodiments, a computing device according to the present invention may include at least one processing unit and at least one storage unit. The storage unit stores program code; when the program code is executed by the processing unit, the processing unit executes the steps of the text classification method according to various exemplary embodiments of the present invention described in the above "Exemplary Method" section of this specification. For example, the processing unit may execute step S201 as shown in Fig. 2A: obtaining a text to be classified; step S202: obtaining the first classification information of the text to be classified using a first deep learning model according to the text to be classified; and step S203: in the case where the category characterized by the first classification information of the text to be classified is not a first category, determining the category of the text to be classified using a second deep learning model.
A computing device 900 suitable for implementing the text classification method according to this embodiment of the present invention is described below with reference to Fig. 9. The computing device 900 shown in Fig. 9 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in Fig. 9, the computing device 900 is presented in the form of a general-purpose computing device. The components of the computing device 900 may include, but are not limited to: the above-mentioned at least one processing unit 901, the above-mentioned at least one storage unit 902, and a bus 903 connecting different system components (including the storage unit 902 and the processing unit 901).
The bus 903 may include a data bus, an address bus, and a control bus.
The storage unit 902 may include volatile memory, such as a random access memory (RAM) 9021 and/or a cache memory 9022, and may further include a read-only memory (ROM) 9023.
The storage unit 902 may also include a program/utility 9025 having a set of (at least one) program modules 9024. Such program modules 9024 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
The computing device 900 may also communicate with one or more external devices 904 (such as a keyboard, a pointing device, a Bluetooth device, etc.); such communication may be carried out through an input/output (I/O) interface 905. The computing device 900 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 906. As shown, the network adapter 906 communicates with the other modules of the computing device 900 through the bus 903. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the computing device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, etc.
It should be noted that although several units/modules or subunits/submodules of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. In fact, according to embodiments of the present invention, the features and functions of two or more units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided and embodied by multiple units/modules.
In addition, although the operations of the method of the present invention are described in a particular order in the accompanying drawings, this does not require or imply that these operations must be executed in that particular order, or that all of the operations shown must be executed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be merged into one step for execution, and/or one step may be decomposed into multiple steps for execution.
Although the spirit and principles of the present invention have been described with reference to several specific embodiments, it should be understood that the present invention is not limited to the disclosed specific embodiments, and the division into various aspects does not mean that features in these aspects cannot be combined to advantage; such division is merely for convenience of presentation. The present invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Claims (12)
1. A text classification method, comprising:
obtaining a text to be classified;
obtaining first classification information of the text to be classified using a first deep learning model according to the text to be classified; and
in the case where the category characterized by the first classification information of the text to be classified is not a first category, determining the category of the text to be classified using a second deep learning model,
wherein sample data used to optimize the second deep learning model includes a first text, and the category characterized by the first classification information of the first text is not the first category.
2. The method according to claim 1, wherein determining the category of the text to be classified using the second deep learning model comprises:
obtaining second classification information of the text to be classified using the second deep learning model; and determining that the category of the text to be classified is the category characterized by the second classification information.
3. The method according to claim 2, wherein the first deep learning model is set with a first threshold and the second deep learning model is set with a second threshold, wherein:
the first classification information includes first category information and multiple first confidence levels of the text to be classified relative to multiple predetermined categories, the first category information being determined by the magnitude relation between the first confidence level of the text to be classified relative to a second category and the first threshold; and
the second classification information includes second category information and multiple second confidence levels of the text to be classified relative to the multiple predetermined categories, the second category information being determined by the magnitude relation between the second confidence level of the text to be classified relative to the second category and the second threshold,
wherein the first threshold is less than the second threshold, the first category information and the second category information are used to characterize the category of the text to be classified, and the multiple predetermined categories include the first category and the second category.
4. The method according to claim 2, wherein the first classification information and/or the second classification information includes multiple confidence levels of the text to be classified relative to multiple predetermined categories, and after determining the category of the text to be classified, the method further comprises:
determining that a parameter value of the text to be classified is the confidence level of the text to be classified relative to a second category,
wherein the multiple predetermined categories include the first category and the second category.
5. The method according to claim 1, further comprising:
obtaining multiple second texts and actual categories of the multiple second texts;
using the multiple second texts as input of the first deep learning model to obtain first classification information of the multiple second texts; and
optimizing the first deep learning model according to the first classification information of the multiple second texts and the actual categories of the multiple second texts,
wherein the first deep learning model includes a logistic regression model or a long short-term memory network model.
6. The method according to claim 5, wherein the proportion of texts whose actual category is the first category among the multiple second texts is greater than the proportion of texts whose actual category is the first category among multiple first texts.
7. The method according to claim 1, further comprising:
obtaining multiple first texts and actual categories of the multiple first texts;
using the multiple first texts as input of the second deep learning model to obtain second classification information of the multiple first texts; and
optimizing the second deep learning model according to the second classification information of the multiple first texts and the actual categories of the multiple first texts,
wherein the second deep learning model includes a support vector machine model, a random forest model, or a long short-term memory network model.
8. The method according to claim 7, wherein obtaining the multiple first texts comprises:
obtaining multiple third texts;
using the multiple third texts as input of the first deep learning model to obtain first classification information of the multiple third texts; and
determining that the third texts whose category characterized by the first classification information is not the first category are the first texts.
9. The method according to claim 1, wherein before obtaining the first classification information of the text to be classified, the method further comprises:
preprocessing the text to be classified using a preprocessing model, and extracting characteristic information of the text to be classified,
wherein the characteristic information of the text to be classified is used as input of the first deep learning model, and the first classification information of the text to be classified is output; and the preprocessing model includes a term frequency-inverse document frequency model or a word vector model.
10. A text classification apparatus, comprising:
a text-to-be-classified obtaining module, configured to obtain a text to be classified;
a first category determining module, configured to obtain first classification information of the text to be classified using a first deep learning model according to the text to be classified; and
a second category determining module, configured to determine the category of the text to be classified using a second deep learning model in the case where the category characterized by the first classification information of the text to be classified is not a first category,
wherein sample data used to optimize the second deep learning model includes a first text, and the category characterized by the first classification information of the first text is not the first category.
11. A computer-readable storage medium on which executable instructions are stored, the instructions, when executed by a processor, implementing the method according to any one of claims 1 to 9.
12. A computing device, comprising:
one or more memories storing executable instructions; and
one or more processors executing the executable instructions to implement the method according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910480256.XA CN110245232B (en) | 2019-06-03 | 2019-06-03 | Text classification method, device, medium and computing equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110245232A true CN110245232A (en) | 2019-09-17 |
CN110245232B CN110245232B (en) | 2022-02-18 |
Family
ID=67886011
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910480256.XA Active CN110245232B (en) | 2019-06-03 | 2019-06-03 | Text classification method, device, medium and computing equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110245232B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102156885A (en) * | 2010-02-12 | 2011-08-17 | 中国科学院自动化研究所 | Image classification method based on cascaded codebook generation |
CN102521656A (en) * | 2011-12-29 | 2012-06-27 | 北京工商大学 | Integrated transfer learning method for classification of unbalance samples |
CN103593470A (en) * | 2013-11-29 | 2014-02-19 | 河南大学 | Double-degree integrated unbalanced data stream classification algorithm |
CN103824092A (en) * | 2014-03-04 | 2014-05-28 | 国家电网公司 | Image classification method for monitoring state of electric transmission and transformation equipment on line |
US20170032276A1 (en) * | 2015-07-29 | 2017-02-02 | Agt International Gmbh | Data fusion and classification with imbalanced datasets |
CN106453033A (en) * | 2016-08-31 | 2017-02-22 | 电子科技大学 | Multilevel Email classification method based on Email content |
CN107644057A (en) * | 2017-08-09 | 2018-01-30 | 天津大学 | A kind of absolute uneven file classification method based on transfer learning |
US20180357299A1 (en) * | 2017-06-07 | 2018-12-13 | Accenture Global Solutions Limited | Identification and management system for log entries |
Non-Patent Citations (1)
Title |
---|
Liu Xuying et al.: "A Classification Method for Class-Imbalanced Data Based on a Cascade Model", Journal of Nanjing University (Natural Sciences) *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112711940A (en) * | 2019-10-08 | 2021-04-27 | 台达电子工业股份有限公司 | Information processing system, information processing method, and non-transitory computer-readable recording medium |
CN111243607A (en) * | 2020-03-26 | 2020-06-05 | 北京字节跳动网络技术有限公司 | Method, apparatus, electronic device, and medium for generating speaker information |
CN111930939A (en) * | 2020-07-08 | 2020-11-13 | 泰康保险集团股份有限公司 | Text detection method and device |
CN113536806A (en) * | 2021-07-18 | 2021-10-22 | 北京奇艺世纪科技有限公司 | Text classification method and device |
CN113536806B (en) * | 2021-07-18 | 2023-09-08 | 北京奇艺世纪科技有限公司 | Text classification method and device |
CN114065759A (en) * | 2021-11-19 | 2022-02-18 | 深圳视界信息技术有限公司 | Model failure detection method and device, electronic equipment and medium |
CN114065759B (en) * | 2021-11-19 | 2023-10-13 | 深圳数阔信息技术有限公司 | Model failure detection method and device, electronic equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN110245232B (en) | 2022-02-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112632385B (en) | Course recommendation method, course recommendation device, computer equipment and medium | |
CN110245232A (en) | File classification method, device, medium and calculating equipment | |
CN109815487B (en) | Text quality inspection method, electronic device, computer equipment and storage medium | |
CN110309514A (en) | A kind of method for recognizing semantics and device | |
CN111143569B (en) | Data processing method, device and computer readable storage medium | |
CN109840287A (en) | A kind of cross-module state information retrieval method neural network based and device | |
CN108304468A (en) | A kind of file classification method and document sorting apparatus | |
CN110427463A (en) | Search statement response method, device and server and storage medium | |
CN111143226B (en) | Automatic test method and device, computer readable storage medium and electronic equipment | |
Ling et al. | Integrating extra knowledge into word embedding models for biomedical NLP tasks | |
WO2021139279A1 (en) | Data processing method and apparatus based on classification model, and electronic device and medium | |
CN110334186A (en) | Data query method, apparatus, computer equipment and computer readable storage medium | |
CN115328756A (en) | Test case generation method, device and equipment | |
CN111694937A (en) | Interviewing method and device based on artificial intelligence, computer equipment and storage medium | |
CN110162766A (en) | Term vector update method and device | |
CN109710760A (en) | Clustering method, device, medium and the electronic equipment of short text | |
CN107943940A (en) | Data processing method, medium, system and electronic equipment | |
CN110705255A (en) | Method and device for detecting association relation between sentences | |
CN110232128A (en) | Topic file classification method and device | |
CN111539612B (en) | Training method and system of risk classification model | |
CN116049376A (en) | Method, device and system for retrieving and replying information and creating knowledge | |
US20230014904A1 (en) | Searchable data structure for electronic documents | |
CN116167382A (en) | Intention event extraction method and device, electronic equipment and storage medium | |
CN108733702B (en) | Method, device, electronic equipment and medium for extracting upper and lower relation of user query | |
WO2022216462A1 (en) | Text to question-answer model system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||