CN110245232A - Text classification method, apparatus, medium and computing device - Google Patents


Info

Publication number
CN110245232A
CN110245232A
Authority
CN
China
Prior art keywords
text
classification
sorted
category
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910480256.XA
Other languages
Chinese (zh)
Other versions
CN110245232B (en)
Inventor
赵振宇
丁长林
张华�
Current Assignee
Netease Media Technology Beijing Co Ltd
Original Assignee
Netease Media Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Netease Media Technology Beijing Co Ltd
Priority to CN201910480256.XA
Publication of CN110245232A
Application granted
Publication of CN110245232B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Abstract

Embodiments of the present invention provide a text classification method. The method comprises: obtaining a text to be classified; obtaining, from the text to be classified, first classification information for the text using a first deep learning model; and, in the case where the category characterized by the first classification information is not a first category, determining the category of the text using a second deep learning model. The sample data used to optimize the second deep learning model includes first texts, i.e. texts whose first classification information characterizes a category other than the first category. The method of the invention thus determines text categories with two deep learning models, where the sample data of the second deep learning model consists of the texts recalled by the first deep learning model. This raises the concentration of effective samples in the sample data, reduces the amount of sample labeling required, and improves the accuracy of the classification prediction. Embodiments of the present invention further provide a text classification apparatus, a medium and a computing device.

Description

Text classification method, apparatus, medium and computing device
Technical field
Embodiments of the present invention relate to the field of text processing, and more specifically to a text classification method, apparatus, medium and computing device.
Background art
This section is intended to provide background or context for the embodiments of the invention set forth in the claims. The description herein is not admitted to be prior art merely because it is included in this section.
In the Internet field, it is often necessary to pick out non-compliant texts from a large volume of texts, so as to avoid the negative guidance that displaying such texts would give users. This screening can be performed with text classification methods.
In principle, most text classification methods in the field of natural language processing can be applied to the detection of non-compliant texts (such as "three customs" news, i.e. vulgar content). Early detection of non-compliant texts relied on manual review by editors; however, with the rapid growth of the Internet and self-media, the volume of news has increased to the point where manual review, being slow and expensive, can no longer meet demand, and a combination of manual and machine review has become mainstream. In machine review, detection of non-compliant texts has mainly been dictionary-based: a keyword table of non-compliant vocabulary is constructed and matched against the text content with regular expressions. In recent years, following the strong performance of machine learning and deep learning in text classification, some deep learning models have also been applied to this task.
Dictionary-based review often suffers from a low recall rate, because the vocabulary is not rich enough and semantic relationships in the text cannot be captured. Machine learning and deep learning methods, in turn, usually solve text classification end to end; but in a news scenario, because non-compliant texts make up a small fraction of all texts, training data is often hard to obtain. Moreover, in production use, the extremely unbalanced distribution of compliant and non-compliant texts makes it difficult to guarantee both the accuracy of online classification and the recall rate for non-compliant texts.
Summary of the invention
Therefore, in the prior art, classifying texts in Internet scenarios such as news with existing text classification methods suffers from low accuracy and hard-to-guarantee recall caused by the unbalanced distribution of the different text types. Furthermore, because a large number of labeled samples is needed to train the model, training data is difficult to obtain.
Thus, an improved text classification method is highly desirable, one that can accurately classify texts with an unbalanced category distribution without requiring a large amount of training data.
In this context, embodiments of the present invention first raise the concentration of effective samples among the sample texts, and then use the sample texts with this higher concentration of effective samples as training data to train the classification model, so that classification accuracy is improved while the total amount of training data is reduced.
In a first aspect of embodiments of the present invention, a text classification method is provided, comprising: obtaining a text to be classified; obtaining, from the text to be classified, first classification information for the text using a first deep learning model; and, in the case where the category characterized by the first classification information is not a first category, determining the category of the text using a second deep learning model. The sample data used to optimize the second deep learning model includes first texts, where the category characterized by the first classification information of each first text is not the first category.
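As a rough illustration of this two-model arrangement, the control flow can be sketched as follows. This is a minimal sketch under the assumption that each model is a callable returning a category label; all names and the toy stand-in models are hypothetical, not from the patent:

```python
def cascade_classify(text, first_model, second_model, first_category="compliant"):
    """Two-stage classification: only texts that the first (recall-oriented)
    model does NOT place in the first category are passed to the second model."""
    first_label = first_model(text)
    if first_label == first_category:
        # The first model is confident the text belongs to the first category.
        return first_label
    # Otherwise the text was "recalled": let the second model decide.
    return second_model(text)


# Toy stand-ins for the two deep learning models.
def first_model(text):
    return "suspect" if "bad" in text else "compliant"

def second_model(text):
    return "vulgar" if text.count("bad") >= 2 else "compliant"

print(cascade_classify("a good story", first_model, second_model))  # compliant
print(cascade_classify("bad bad news", first_model, second_model))  # vulgar
print(cascade_classify("one bad word", first_model, second_model))  # compliant
```

The second model never sees texts the first model filtered out, which is what lets its training data concentrate on the hard, recalled cases.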
In one embodiment of the invention, determining the category of the text to be classified using the second deep learning model comprises: obtaining second classification information for the text using the second deep learning model; and determining the category of the text to be the category characterized by the second classification information.
In another embodiment of the present invention, the first deep learning model is set with a first threshold and the second deep learning model is set with a second threshold. The first classification information includes first category information and multiple first confidence levels of the text to be classified with respect to multiple predetermined categories; the first category information is determined by comparing the first confidence level of the text with respect to a second category against the first threshold. The second classification information includes second category information and multiple second confidence levels of the text with respect to the multiple predetermined categories; the second category information is determined by comparing the second confidence level of the text with respect to the second category against the second threshold. The first threshold is smaller than the second threshold, the first category information and the second category information each characterize the category of the text to be classified, and the multiple predetermined categories include the first category and the second category.
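A minimal sketch of this threshold logic, assuming each model yields a confidence for the second (e.g. non-compliant) category. The deliberately lower first threshold makes the first model over-recall, while the higher second threshold makes the second model confirm precisely; the concrete values and names are illustrative, not from the patent:

```python
def categorize(confidence_second, threshold):
    """Return the category characterized by the classification information:
    the second category if its confidence clears the threshold, else the first."""
    return "second" if confidence_second >= threshold else "first"

FIRST_THRESHOLD = 0.3   # low: the first model recalls aggressively
SECOND_THRESHOLD = 0.8  # high: the second model confirms precisely

def cascade(conf_first_model, conf_second_model):
    if categorize(conf_first_model, FIRST_THRESHOLD) == "first":
        return "first"                        # filtered out by the first model
    return categorize(conf_second_model, SECOND_THRESHOLD)

print(cascade(0.2, 0.9))  # first  (never reaches the second model)
print(cascade(0.5, 0.9))  # second (recalled, then confirmed)
print(cascade(0.5, 0.6))  # first  (recalled, but not confirmed)
```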
In yet another embodiment of the present invention, the first classification information and/or the second classification information includes multiple confidence levels of the text to be classified with respect to the multiple predetermined categories. After the category of the text is determined, the text classification method further comprises: determining a parameter value of the text, namely its confidence level with respect to the second category. The multiple predetermined categories include the first category and the second category.
In yet another embodiment of the present invention, the text classification method further comprises: obtaining multiple second texts and the actual categories of the multiple second texts; taking the multiple second texts as input to the first deep learning model to obtain their first classification information; and optimizing the first deep learning model according to the first classification information and the actual categories of the multiple second texts. The first deep learning model includes a logistic regression model or a long short-term memory network model.
In yet another embodiment of the present invention, the proportion of texts whose actual category is the first category is greater among the multiple second texts than among the multiple first texts.
In yet another embodiment of the present invention, the text classification method further comprises: obtaining multiple first texts and the actual categories of the multiple first texts; taking the multiple first texts as input to the second deep learning model to obtain their second classification information; and optimizing the second deep learning model according to the second classification information and the actual categories of the multiple first texts. The second deep learning model includes a support vector machine model, a random forest model, or a long short-term memory network model.
In yet another embodiment of the present invention, obtaining the multiple first texts comprises: obtaining multiple third texts; taking the multiple third texts as input to the first deep learning model to obtain their first classification information; and determining the third texts whose first classification information characterizes a category other than the first category to be the first texts.
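This construction of the first texts — keeping only the third texts the first model recalls — can be sketched as below. The names are hypothetical and the toy keyword filter merely stands in for the first deep learning model:

```python
def build_first_texts(third_texts, first_model, first_category="compliant"):
    """Keep only texts whose first classification information does NOT
    characterize the first category; these become the training set of the
    second model, with a higher concentration of effective samples."""
    return [t for t in third_texts if first_model(t) != first_category]

def first_model(text):  # toy stand-in for the recall-oriented model
    return "suspect" if "bad" in text else "compliant"

third_texts = ["fine story", "bad rumor", "daily news", "very bad gossip"]
first_texts = build_first_texts(third_texts, first_model)
print(first_texts)  # ['bad rumor', 'very bad gossip']
```

The filtered list is then labeled and used to optimize the second deep learning model, so labeling effort is spent only on recalled texts.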
In yet another embodiment of the present invention, before the first classification information of the text to be classified is obtained, the text classification method further comprises: preprocessing the text to be classified with a preprocessing model to extract feature information of the text. The feature information of the text then serves as the input to the first deep learning model, whose output is the first classification information of the text. The preprocessing model includes a term frequency-inverse document frequency model or a word vector model.
In a second aspect of embodiments of the present invention, a text classification apparatus is provided, comprising: a text obtaining module for obtaining a text to be classified; a first category determining module for obtaining, from the text to be classified, first classification information of the text using a first deep learning model; and a second category determining module for determining the category of the text using a second deep learning model in the case where the category characterized by the first classification information is not a first category. The sample data used to optimize the second deep learning model includes first texts, where the category characterized by the first classification information of each first text is not the first category.
In one embodiment of the invention, the second category determining module comprises: a second classification information obtaining submodule for obtaining the second classification information of the text to be classified using the second deep learning model; and a category determining submodule for determining the category of the text to be the category characterized by the second classification information.
In another embodiment of the present invention, the first deep learning model is set with a first threshold and the second deep learning model is set with a second threshold. The first classification information includes first category information and multiple first confidence levels of the text to be classified with respect to multiple predetermined categories; the first category information is determined by comparing the first confidence level of the text with respect to a second category against the first threshold. The second classification information includes second category information and multiple second confidence levels of the text with respect to the multiple predetermined categories; the second category information is determined by comparing the second confidence level of the text with respect to the second category against the second threshold. The first threshold is smaller than the second threshold, the first category information and the second category information each characterize the category of the text to be classified, and the multiple predetermined categories include the first category and the second category.
In yet another embodiment of the present invention, the first classification information and/or the second classification information includes multiple confidence levels of the text to be classified with respect to the multiple predetermined categories. The text classification apparatus further includes a parameter value determining module for determining the parameter value of the text, namely its confidence level with respect to the second category. The multiple predetermined categories include the first category and the second category.
In yet another embodiment of the present invention, the text classification apparatus further includes a first model optimization module configured to: obtain multiple second texts and the actual categories of the multiple second texts; take the multiple second texts as input to the first deep learning model to obtain their first classification information; and optimize the first deep learning model according to the first classification information and the actual categories of the multiple second texts. The first deep learning model includes a logistic regression model or a long short-term memory network model.
In yet another embodiment of the present invention, the proportion of texts whose actual category is the first category is greater among the multiple second texts than among the multiple first texts.
In yet another embodiment of the present invention, the text classification apparatus further includes a second model optimization module configured to: obtain multiple first texts and the actual categories of the multiple first texts; take the multiple first texts as input to the second deep learning model to obtain their second classification information; and optimize the second deep learning model according to the second classification information and the actual categories of the multiple first texts. The second deep learning model includes a support vector machine model, a random forest model, or a long short-term memory network model.
In yet another embodiment of the present invention, obtaining the multiple first texts comprises: obtaining multiple third texts; taking the multiple third texts as input to the first deep learning model to obtain their first classification information; and determining the third texts whose first classification information characterizes a category other than the first category to be the first texts.
In yet another embodiment of the present invention, the text classification apparatus further includes a preprocessing module for preprocessing the text to be classified with a preprocessing model before the first category determining module obtains the first classification information, so as to extract the feature information of the text. The first category determining module is specifically configured to take the feature information of the text as input to the first deep learning model and output its first classification information. The preprocessing model includes a term frequency-inverse document frequency model or a word vector model.
In a third aspect of embodiments of the present invention, a computer-readable storage medium is provided, on which executable instructions are stored; when executed by a processor, the instructions cause the processor to perform the text classification method provided by the first aspect of embodiments of the present invention.
In a fourth aspect of embodiments of the present invention, a computing device is provided. The computing device includes one or more storage units storing executable instructions and one or more processing units; the processing units execute the instructions to implement the text classification method provided by the first aspect of embodiments of the present invention.
The text classification method, apparatus, medium and computing device of embodiments of the present invention determine the category of a text with two deep learning models, where the sample data of the second deep learning model consists of the texts recalled by the first deep learning model. The concentration of effective samples in the second model's sample data is therefore high, which can improve the precision of the second deep learning model and, consequently, the accuracy of the classification results.
Brief description of the drawings
The above and other objects, features and advantages of exemplary embodiments of the invention will become readily understood by reading the following detailed description with reference to the accompanying drawings, in which several embodiments of the invention are shown by way of example and not limitation:
Fig. 1 schematically shows an application scenario of the text classification method, apparatus, medium and computing device according to an embodiment of the present invention;
Fig. 2A schematically shows a flowchart of a text classification method according to an embodiment of the present invention;
Fig. 2B schematically shows a flowchart of determining the category of a text to be classified using the second deep learning model according to an embodiment of the present invention;
Fig. 3 schematically shows a flowchart of a text classification method according to another embodiment of the present invention;
Fig. 4 schematically shows a flowchart of optimizing the first deep learning model in a text classification method according to an embodiment of the present invention;
Fig. 5A schematically shows a flowchart of optimizing the second deep learning model in a text classification method according to an embodiment of the present invention;
Fig. 5B schematically shows a flowchart of obtaining the first texts according to an embodiment of the present invention;
Fig. 6 schematically shows a technical flowchart of a text classification method according to an embodiment of the present invention;
Fig. 7 schematically shows a block diagram of a text classification apparatus according to an embodiment of the present invention;
Fig. 8 schematically shows a program product adapted to implement the text classification method according to an embodiment of the present invention; and
Fig. 9 schematically shows a block diagram of a computing device adapted to implement the text classification method according to an embodiment of the present invention.
In the drawings, identical or corresponding reference numerals indicate identical or corresponding parts.
Detailed description of embodiments
The principle and spirit of the invention are described below with reference to several illustrative embodiments. It should be appreciated that these embodiments are provided solely to enable those skilled in the art to better understand and thereby implement the invention, and not to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Those skilled in the art will appreciate that embodiments of the present invention may be implemented as a system, apparatus, device, method or computer program product. Accordingly, the present invention may take the form of entirely hardware, entirely software (including firmware, resident software, microcode, etc.), or a combination of hardware and software.
According to embodiments of the present invention, a text classification method, apparatus, medium and computing device are proposed.
Herein, it is to be understood that the terms involved are explained as follows:
Machine learning is a multidisciplinary field that uses probability theory, statistics, approximation theory, convex analysis and related theories to study how computers simulate or realize human learning behavior, acquire new knowledge and skills, and reorganize existing knowledge structures.
Deep learning is a branch of machine learning that builds neural networks which simulate the human brain in order to interpret and learn from data.
Natural language processing is an important branch of machine learning that studies theories and methods for effective communication between humans and computers in natural language; it is a discipline fusing linguistics, computer science and mathematics.
Text classification is a branch of natural language processing that studies how computers automatically classify a set of texts according to a given classification system or standard.
LR, logistic regression, is a regression-based linear analysis model, commonly used in fields such as data mining and economic forecasting.
SVM, Support Vector Machine, is a linear classifier that performs binary classification of data by supervised learning; its decision boundary is the maximum-margin hyperplane solved for the learning samples.
LSTM, Long Short-Term Memory, is a kind of recurrent neural network suited to processing and predicting important events separated by relatively long intervals and delays in a time series.
Tf-idf, term frequency-inverse document frequency, is a statistical method for assessing the importance of a word to one document in a document collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears in the corpus.
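As a concrete illustration of this definition, a minimal tf-idf computation is sketched below. Libraries differ in how they smooth the idf term; this sketch uses the plain log(N/df) form, and the toy corpus is of course hypothetical:

```python
import math

def tf_idf(term, doc, corpus):
    """tf = term count in doc / doc length; idf = log(N / docs containing term)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [
    ["news", "about", "sports"],
    ["news", "about", "politics"],
    ["gossip", "and", "rumors"],
]
# "news" appears in 2 of 3 docs -> low idf; "gossip" in 1 of 3 -> higher idf.
print(round(tf_idf("news", corpus[0], corpus), 4))    # 0.1352
print(round(tf_idf("gossip", corpus[2], corpus), 4))  # 0.3662
```

The resulting per-term weights form the feature vector that a preprocessing model of this kind would feed to the first deep learning model.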
Word2vec denotes a word vector model that represents each word as a vector, with good semantic expressiveness.
Accuracy is a performance evaluation metric for machine learning algorithms; for a classification problem, it is the number of test samples the classifier classifies correctly divided by the total number of test samples.
Recall is a performance evaluation metric for machine learning algorithms; for a classification problem, the number of test samples the classifier correctly judges to belong to some category, divided by the actual number of samples in that category, is the classifier's recall rate for that category.
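These two definitions can be computed directly from predicted and true labels; a minimal sketch, with illustrative binary labels that are not from the patent:

```python
def accuracy(y_true, y_pred):
    """Fraction of test samples classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred, category):
    """Correctly predicted members of `category` / actual members of `category`."""
    hits = sum(t == category and p == category for t, p in zip(y_true, y_pred))
    actual = sum(t == category for t in y_true)
    return hits / actual

y_true = ["vulgar", "ok", "ok", "vulgar", "ok"]
y_pred = ["vulgar", "ok", "vulgar", "ok", "ok"]
print(accuracy(y_true, y_pred))           # 0.6
print(recall(y_true, y_pred, "vulgar"))   # 0.5
```

Note that with a heavily unbalanced distribution, accuracy alone can look high while recall for the rare category stays low, which is precisely the problem the two-model design targets.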
The principle and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.
Overview of the invention
In the prior art, the reason that classifying texts with an unbalanced category distribution using a machine learning model is unsatisfactory is that training such a model generally requires a large number of samples in order to guarantee a certain quantity of effective samples (such as "three customs" texts among news texts). Because of the presence of the many other, ineffective samples, the machine learning model often learns inaccurate features, so the recall rate of its classification is low. The inventors discovered that if the unbalanced samples are first screened by one deep learning model to obtain a sample set with a higher concentration of effective samples, and that sample set is then used to train the deep learning model that performs the text classification, the precision of the latter model can be improved to a certain extent. Correspondingly, in actual classification, the text to be classified is likewise passed through the two deep learning models, so that its feature distribution more closely matches the feature distribution of the training samples, which further improves classification accuracy.
Having introduced the basic principles of the present invention, various non-limiting embodiments of the invention are described below.
Application scenarios overview
Referring initially to Fig. 1.
Fig. 1 schematically shows an application scenario of the text classification method, apparatus, medium and computing device according to an embodiment of the present invention. It should be noted that Fig. 1 is only an example of an application scenario to which embodiments of the present invention can be applied, intended to help those skilled in the art understand the technical content of the invention; it does not mean that the embodiments cannot be used in other devices, systems, environments or scenarios.
As shown in Fig. 1, the application scenario 100 includes terminal devices 111, 112, 113, a server 120 and a network 130. The network 130 serves as a medium providing communication links between the terminal devices 111, 112, 113 and the server 120, and may include various connection types, such as wired or wireless communication links or fiber optic cables.
The terminal devices 111, 112, 113 have, for example, a processing function for classifying a text to be classified, obtaining its category and the confidence level with which it belongs to a preset category. According to embodiments of the invention, the terminal devices 111, 112, 113 include but are not limited to desktop computers, laptop computers, tablet computers, smartphones, smart wearable devices, smart home appliances, and the like.
According to embodiments of the invention, a pre-trained deep learning model is integrated in the terminal devices 111, 112, 113, so that texts to be classified are classified by the deep learning model. The deep learning model may be obtained by the terminal devices 111, 112, 113 through training on a large number of training samples stored in the server 120, or may be trained by the server 120.
The text to be classified 121 may, for example, be a news text, and classifying it may consist of dividing it into "three customs" text or "non-three-customs" text. The text to be classified 121 may be stored in the server 120 or locally on the terminal devices 111, 112, 113. In order to guarantee a sufficient number of training samples, the training samples may specifically be stored in the server 120.
The terminal devices 111, 112, 113 may, for example, have a display screen for showing the user the classification result of the text to be classified and its "three customs value" (for example, the confidence level with which it belongs to "three customs" text), so that the user can deal with news texts belonging to the "three customs" category.
The server 120 may be a server providing various services, such as providing the terminal devices 111, 112, 113 with texts to be classified or training samples, or providing them with the pre-trained deep learning model (by way of example only). Alternatively, the server 120 may itself have a processing function, classifying the stored texts to be classified 121 with the trained deep learning model.
It should be noted that the text classification method provided by embodiments of the present invention can generally be executed by the terminal devices 111, 112, 113 or the server 120. Correspondingly, the text classification apparatus provided by embodiments of the present invention can generally be arranged in the terminal devices 111, 112, 113 or the server 120. The text classification method can also be executed by a server or server cluster different from the server 120 that can communicate with the terminal devices 111, 112, 113 and/or the server 120; correspondingly, the text classification apparatus can also be arranged in such a server or server cluster.
It should be understood that the numbers and types of terminal devices, networks, servers and texts in Fig. 1 are merely illustrative. There may be any number and type of terminal devices, networks, servers and texts, as required by the implementation.
Exemplary Method
In the following, the text classification method according to exemplary embodiments of the present invention is described with reference to Figs. 2A to 6 in conjunction with the application scenario of Fig. 1. It should be noted that the above application scenario is shown only to facilitate understanding of the spirit and principles of the present invention, and embodiments of the present invention are not limited in this respect. On the contrary, embodiments of the present invention can be applied to any applicable scenario.
Fig. 2A schematically illustrates a flowchart of a text classification method according to an embodiment of the present invention, and Fig. 2B schematically illustrates a flowchart of determining the category of the text to be classified using the second deep learning model according to an embodiment of the present invention.
As shown in Fig. 2A, the text classification method of this embodiment includes operations S201 to S203. The method may be executed, for example, by the terminal devices 111, 112, 113 or the server 120 in Fig. 1.
In operation S201, a text to be classified is obtained.
The text to be classified may be the text 121 to be classified stored in the server 120 in Fig. 1. The text to be classified may be, for example, a news text in the news field, and the text classification method of the embodiments of the present invention determines whether the news text belongs to the "three vulgarities" category.
In operation S202, first classification information of the text to be classified is obtained according to the text to be classified using a first deep learning model.
According to an embodiment of the invention, operation S202 may specifically take the text to be classified directly as the input of the first deep learning model, whose output is the first classification information. The first deep learning model may be, for example, a logistic regression model, a long short-term memory (LSTM) network model, or any deep learning model that can be used to solve classification problems. Specifically, for scenarios involving only short phrases with weak semantic requirements, the first deep learning model may be a logistic regression model; for scenarios involving long sentences with higher demands on semantic understanding, an LSTM network model may be adopted to improve the accuracy of the first classification information. The first deep learning model may, for example, be pre-optimized by the optimization method described with reference to Fig. 4, which will not be detailed here.
According to an embodiment of the invention, the first classification information may, for example, characterize the category of the text to be classified as obtained by the first deep learning model. Specifically, the first classification information includes first category information characterizing the category of the text to be classified. For example, for a news text, the first classification information may characterize the text to be classified as a "three vulgarities" text or a "non-three-vulgarities" text, and correspondingly the first category information may be "three vulgarities" or "non-three-vulgarities".
According to an embodiment of the invention, the first classification information may also specifically include, for example, a plurality of first confidences of the text to be classified with respect to a plurality of predetermined categories, characterizing the probability that the text belongs to each of the predetermined categories. Further, the first deep learning model may be set with a first threshold as the basis for determining the first category information from the plurality of first confidences. Specifically, in a binary classification problem, the first category information may be determined by the relation between the first confidence of the text with respect to the second category and the first threshold: when that confidence is greater than the first threshold, the first category information is determined to be the second category; otherwise it is determined to be the first category. For example, if the plurality of predetermined categories include "three vulgarities" and "non-three-vulgarities", the first category information of the text is determined to be "three vulgarities" when the confidence of the text with respect to "three vulgarities" is greater than the first threshold, and "non-three-vulgarities" otherwise. In this example the aforementioned first category is "non-three-vulgarities" and the second category is "three vulgarities". The value of the first threshold can be set according to actual needs, and the embodiments of the present invention do not limit it; for example, the first threshold may take any value greater than 0.5 and less than or equal to 0.7, or any other value.
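The threshold comparison described above can be sketched as follows; the threshold value 0.55 and the category names are illustrative assumptions for a binary "three vulgarities" / "non-three-vulgarities" setting, not values fixed by the embodiment:

```python
def first_category_info(first_confidence: float, first_threshold: float = 0.55) -> str:
    """Map the first model's confidence with respect to the second (sparse)
    category onto first category information. A deliberately low threshold
    favors recall: any text whose confidence exceeds it is passed on to the
    second-stage model."""
    if first_confidence > first_threshold:
        return "three vulgarities"      # second category (sparsely distributed)
    return "non-three-vulgarities"      # first category (densely distributed)
```

Because the comparison is strict, a confidence exactly equal to the first threshold yields the first category.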
According to an embodiment of the invention, when the text to be classified belongs to a text type with an imbalanced category distribution, the method is mainly used to determine whether the text belongs to the sparsely distributed category among the plurality of predetermined categories. In order to improve the recall of the first deep learning model and avoid texts belonging to the sparse category (the second category) being misclassified into the densely distributed category (the first category), the value of the first threshold should be as small as possible. For example, for news texts, a small first threshold avoids cases where the first deep learning model, having learned insufficient features of "three vulgarities" texts, classifies a text that actually belongs to "three vulgarities" as "non-three-vulgarities".
In operation S203, in the case where the category characterized by the first classification information for the text to be classified is not the first category, the category of the text to be classified is determined using a second deep learning model.
According to an embodiment of the invention, since the first threshold is set small, texts belonging to the first category are often misclassified into the second category; for example, texts belonging to "non-three-vulgarities" are often misclassified as "three vulgarities" texts. Therefore, in order to further improve the accuracy of determining whether the text to be classified belongs to the sparsely distributed category, texts whose first classification information characterizes them as not belonging to the first category (i.e., as belonging to the second category) should be classified again with high precision.
To improve the discrimination accuracy of the second deep learning model for the sparsely distributed category, the sample data for optimizing the second deep learning model may be, for example, the first texts, i.e., texts whose first classification information characterizes a category that is not the first category (i.e., the second category). Since the first classification information is obtained by the first deep learning model, texts belonging to the sparse category are denser among the first texts than among texts obtained directly from the server 120. This allows the second deep learning model to learn the features of the sparsely distributed category more comprehensively.
According to an embodiment of the invention, operation S203 may specifically take the text to be classified directly as the input of the second deep learning model and determine the category of the text according to the output of the second deep learning model. The second deep learning model may be, for example, a support vector machine model, a random forest model, an LSTM network model, or any deep learning model that can be used to solve classification problems. The second deep learning model and the first deep learning model may be models of different types, or models of the same type with different parameters. Specifically, the second deep learning model may be pre-optimized by the optimization method described with reference to Fig. 5A, which will not be detailed here.
According to an embodiment of the invention, as shown in Fig. 2B, determining the category of the text to be classified using the second deep learning model may include, for example, operations S213 to S223. In operation S213, second classification information of the text to be classified is obtained using the second deep learning model. In operation S223, the category of the text to be classified is determined to be the category characterized by the second classification information.
The second classification information may, for example, characterize the category of the text to be classified as obtained by the second deep learning model. Specifically, the second classification information includes second category information characterizing the category of the text to be classified. For example, for a news text, the second classification information may characterize the text as a "three vulgarities" text or a "non-three-vulgarities" text, and correspondingly the second category information may be "three vulgarities" or "non-three-vulgarities".
According to an embodiment of the invention, the second classification information may also specifically include, for example, a plurality of second confidences of the text to be classified with respect to the plurality of predetermined categories, characterizing the probability that the text belongs to each of the predetermined categories. Further, the second deep learning model may be set with a second threshold as the basis for determining the second category information from the plurality of second confidences. Similarly, in a binary classification problem, the second category information may be determined by the relation between the second confidence of the text with respect to the second category and the second threshold: when that confidence is greater than the second threshold, the second category information is determined to be the second category; otherwise it is determined to be the first category. For example, if the plurality of predetermined categories include "three vulgarities" and "non-three-vulgarities", when the confidence of the text with respect to "three vulgarities" is greater than the second threshold, the second category information is determined to be "three vulgarities" and the text is determined to be a "three vulgarities" text; otherwise the second category information is determined to be "non-three-vulgarities" and the text is determined to be a "non-three-vulgarities" text. In order to improve the accuracy of classifying texts as "three vulgarities", the value of the second threshold should be greater than the first threshold. It can be understood that the second threshold may be set according to actual needs, and the embodiments of the present invention do not limit it; for example, it may take a value greater than 0.9, or any other value.
Since the first threshold of the first deep learning model is smaller than the second threshold of the second deep learning model, there is rarely a case where a text belonging to the second category is characterized by the first classification information as the first category. Therefore, when the first classification information characterizes the category of the text to be classified as the first category, it can be determined that the text belongs to the first category.
In summary, in the text classification method of the embodiments of the present invention, the category of the text to be classified is determined via two deep learning models, which can improve the accuracy and recall of text classification to a certain extent. Furthermore, since the sample data of the second deep learning model consists of texts recalled by the first deep learning model, the concentration of effective samples in that sample data is high, which can improve the precision of the second deep learning model and further improve the accuracy of the classification results determined via the second deep learning model.
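The two-stage flow of operations S201 to S203 can be sketched as below. The models are stand-in callables that return a confidence with respect to the second category, and the threshold values are illustrative assumptions, not prescribed by the embodiment:

```python
from typing import Callable, Tuple

def classify(text: str,
             high_recall_model: Callable[[str], float],
             high_precision_model: Callable[[str], float],
             first_threshold: float = 0.55,
             second_threshold: float = 0.9) -> Tuple[str, float]:
    """Return (category, score): the score is the confidence, with respect to
    the second category, of whichever model produced the final decision."""
    c1 = high_recall_model(text)
    if c1 <= first_threshold:
        # First classification information characterizes the first category:
        # the second model is skipped entirely (operation S203 is not reached).
        return "non-three-vulgarities", c1
    c2 = high_precision_model(text)
    if c2 > second_threshold:
        return "three vulgarities", c2
    return "non-three-vulgarities", c2
```

With a permissive first stage and a strict second stage, recall is preserved while precision is restored on the texts that reach the second model.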
Fig. 3 schematically illustrates a flowchart of a text classification method according to another embodiment of the present invention.
According to an embodiment of the present disclosure, after the category of the text to be classified has been determined, the probability that the text belongs to the second category (the sparsely distributed category) can serve as a reference for the user's subsequent processing of the text. Therefore, as shown in Fig. 3, after determining the category of the text to be classified through operations S201 to S203, the text classification method of this embodiment may further include operation S304.
In operation S304, the parameter value of the text to be classified is determined to be the confidence of the text with respect to the second category.
For a binary classification problem, when the first classification information obtained in operation S202 characterizes the text as the first category, the category of the text is determined to be the first category, and the parameter value can be determined to be the confidence with respect to the second category in the first classification information. When the first classification information obtained in operation S202 characterizes the text as not being the first category, and the category of the text is then determined to be the first category or the second category through operation S203, the parameter value can be determined to be the confidence of the text with respect to the second category in the second classification information.
When the plurality of predetermined categories include "three vulgarities" and "non-three-vulgarities", the parameter value may specifically be a "three-vulgarities score" characterizing the degree of vulgarity of the text to be classified. The user can then process the text accordingly based on this score.
According to an embodiment of the invention, in order to facilitate the first deep learning model obtaining more accurate first classification information from its input, the text to be classified may also be preprocessed in advance to extract its features. As shown in Fig. 3, the text classification method of this embodiment further includes operation S305 before operation S202.
In operation S305, the text to be classified is preprocessed using a preprocessing model, and the feature information of the text is extracted. The input of the first deep learning model in operation S202 is then the extracted feature information of the text, and its output is the first classification information of the text.
Specifically, the preprocessing may first recognize the textual content of the text to be classified and extract keywords according to the recognition result; the keywords are then converted into vectors that can characterize them, thereby obtaining the input of the first deep learning model. The preprocessing model used to extract keywords and obtain vectors may include a term frequency-inverse document frequency (TF-IDF) model, a word vector model, or any model in the prior art that can be used to extract text features, which is not limited by the present invention.
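A minimal sketch of the TF-IDF preprocessing named above, assuming whitespace tokenization over a small in-memory corpus (a production preprocessing model would add word segmentation and vocabulary pruning):

```python
import math
from collections import Counter
from typing import Dict, List

def tfidf_vectors(corpus: List[str]) -> List[Dict[str, float]]:
    """Convert each document into a sparse {term: tf-idf weight} mapping."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df: Counter = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors
```

Terms that occur in every document receive an idf of zero, so only discriminative keywords carry weight into the classifier.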
Fig. 4 schematically illustrates a flowchart of optimizing the first deep learning model in the text classification method according to an embodiment of the present invention.
As shown in Fig. 4, the method for optimizing the first deep learning model according to an embodiment of the present invention may include operations S406 to S408.
In operation S406, a plurality of second texts and the actual categories of the plurality of second texts are obtained.
The plurality of second texts may be, for example, a plurality of texts obtained at random from the text library stored in the server 120, and may specifically be a plurality of news texts. The actual categories of the second texts are categories assigned in advance. Specifically, the actual category of each second text may be obtained from a label assigned to it by the user, the label indicating the actual category of that text. Accordingly, after obtaining the plurality of second texts, operation S406 may further include an operation of obtaining the label assigned by the user to each second text. Further, in order to feed the label into the first deep learning model together with the text, after obtaining the labels, operation S406 may further include the following operation: splicing each second text and its label to form second sample data corresponding to that text.
Considering that, after the first deep learning model, the second deep learning model is further used to determine the category of the text to be classified, the embodiments of the present invention do not place high precision requirements on the first deep learning model. When optimizing the first deep learning model, the number of second texts needed is therefore not large, which guarantees a low annotation volume and reduces annotation cost.
In operation S407, the plurality of second texts are taken as the input of the first deep learning model to obtain the first classification information of the plurality of second texts.
According to an embodiment of the invention, operation S407 may specifically input the plurality of second texts into the first deep learning model one by one to obtain their first classification information one by one. The specific implementation of operation S407 is similar to operation S202, the only difference being that the first deep learning model here is a logistic regression model, an LSTM network model, or the like that has not yet been optimized. The input of the first deep learning model may specifically be the second sample data, obtained by the splicing described above, corresponding to each second text.
In operation S408, the first deep learning model is optimized according to the first classification information of the plurality of second texts and the actual categories of the plurality of second texts.
According to an embodiment of the invention, operation S408 may specifically be: for each second text, computing, using a first loss function, a first loss value corresponding to that text according to the category characterized by its first classification information and its actual category; and then adjusting the parameters of the first deep learning model according to the first loss value, thereby optimizing the first deep learning model.
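Assuming, for illustration, that the first deep learning model is a logistic regression over sparse feature vectors, the per-sample loss computation and parameter adjustment of operation S408 reduce to one stochastic-gradient step; the learning rate and feature encoding are assumptions, not part of the embodiment:

```python
import math
from typing import Dict, Tuple

def sgd_step(weights: Dict[str, float], bias: float,
             features: Dict[str, float], actual: int,
             lr: float = 0.1) -> Tuple[Dict[str, float], float, float]:
    """One stochastic-gradient update of a logistic regression.
    `actual` is 1 for the second category, 0 for the first category;
    returns the adjusted (weights, bias) and the pre-update loss value."""
    z = bias + sum(weights.get(t, 0.0) * v for t, v in features.items())
    p = 1.0 / (1.0 + math.exp(-z))            # confidence for the second category
    loss = -(actual * math.log(p) + (1 - actual) * math.log(1.0 - p))
    grad = p - actual                          # d(loss)/dz for cross-entropy
    for t, v in features.items():
        weights[t] = weights.get(t, 0.0) - lr * grad * v
    bias -= lr * grad
    return weights, bias, loss
```

Repeated over the labeled second sample data, these updates drive the first loss value down.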
Fig. 5 A, which is diagrammatically illustrated, optimizes the second deep learning model in file classification method according to an embodiment of the present invention Flow chart, Fig. 5 B diagrammatically illustrate it is according to an embodiment of the present invention obtain the first text flow chart.
As shown in Figure 5A, it may include operation S509~behaviour that the embodiment of the present invention, which optimizes the method for the second deep learning model, Make S511.
In operation S509, the concrete class of multiple first texts and multiple first texts is obtained.
Wherein, since first text is the text that the first classification information characterization is not first category, this first Text is to have obtained the text of the first classification information by the first deep learning model.As shown in Figure 5 B, multiple first texts are obtained This operation may include operation S519~operation S539.
In operation S519, multiple third texts are obtained;In operation S529, using multiple third texts as the first deep learning The input of model obtains the first classification information of multiple third texts;In operation S539, the class of the first classification information characterization is determined The third text for not being first category is the first text.
According to an embodiment of the invention, multiple third texts in operation S519 are the text stored from server 120 The multiple texts obtained at random in this library.Multiple third text may include the second text of Fig. 4 description, can not also include Second text of Fig. 4 description.The first classification information that operation S529 is obtained specifically can be the operation S202 described by Fig. 2A It acquires, is also possible to acquire by the operation S407 that Fig. 4 is described.Then operating S539 can be according to each third First classification information of text determines whether each third text is the first text.
According to an embodiment of the invention, since the first texts are only those third texts in the text library whose first classification information, as determined by operation S539, characterizes a category that is not the first category, whereas the second texts obtained in operation S406 of Fig. 4 are obtained directly at random from the text library, the proportion of texts whose actual category is the first category is greater among the plurality of second texts than among the plurality of first texts. In the case where news texts are stored in the text library, the proportion of texts actually belonging to "three vulgarities" is smaller among the plurality of second texts than among the plurality of first texts.
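The enrichment effect described above can be illustrated with a toy filter implementing operations S519 to S539; the stand-in model and threshold are assumptions for illustration:

```python
from typing import Callable, List

def mine_first_texts(third_texts: List[str],
                     high_recall_model: Callable[[str], float],
                     first_threshold: float = 0.55) -> List[str]:
    """Keep only the third texts whose first classification information is NOT
    the first category, i.e. the candidate sparse-category texts that become
    the first texts (training samples for the second model)."""
    return [t for t in third_texts if high_recall_model(t) > first_threshold]
```

Only these mined texts are then labeled, so annotation effort concentrates where second-category examples are dense.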
According to an embodiment of the invention, the actual category of each first text may be obtained from a label assigned to it by the user, the label indicating the actual category of that text. Accordingly, after obtaining the plurality of first texts, operation S509 may further include an operation of obtaining the label assigned by the user to each first text. Further, in order to feed the label into the second deep learning model together with the text, after obtaining the labels, operation S509 may further include the following operation: splicing each first text and its label to form first sample data corresponding to that text.
In operation S510, the plurality of first texts are taken as the input of the second deep learning model to obtain the second classification information of the plurality of first texts.
According to an embodiment of the invention, operation S510 may specifically input the plurality of first texts into the second deep learning model one by one to obtain their second classification information one by one. The specific implementation of operation S510 is similar to operation S203, the only difference being that the second deep learning model here is a support vector machine model, a random forest model, an LSTM network model, or the like that has not yet been optimized. The input of the second deep learning model may specifically be the first sample data, obtained by the splicing described above, corresponding to each first text.
In operation S511, the second deep learning model is optimized according to the second classification information of the plurality of first texts and the actual categories of the plurality of first texts.
According to an embodiment of the invention, operation S511 may specifically be: for each first text, computing, using a second loss function, a second loss value corresponding to that text according to the category characterized by its second classification information and its actual category; and then adjusting the parameters of the second deep learning model according to the second loss value, thereby optimizing the second deep learning model. The second loss function and the aforementioned first loss function may each be, for example, a cross-entropy loss function, a sigmoid loss function, or any other loss function; they may be identical or different loss functions, which is not limited by the present invention.
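For reference, the binary cross-entropy loss mentioned above can be written as follows, under the assumed notation that $y_i \in \{0, 1\}$ encodes the actual category of the $i$-th sample (1 for the second category) and $p_i$ is the model's confidence with respect to the second category:

```latex
\ell_i = -\bigl[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\bigr],
\qquad
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \ell_i .
```

The loss value $\ell_i$ is zero only when the predicted confidence matches the actual category exactly, so minimizing $\mathcal{L}$ adjusts the model parameters toward the labeled categories.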
In summary, the sample data for optimizing the second deep learning model is obtained by labeling the first texts that the first deep learning model has discriminated as not belonging to the first category. Compared with the texts in the text library (i.e., the input texts of the first deep learning model), the density of texts belonging to the first category ("non-three-vulgarities") among the first texts is effectively reduced, and the density of texts belonging to the second category ("three vulgarities") is increased. Thus, for a given number of second-category texts in the sample data, the number of first texts needed can be greatly reduced compared with the number of library texts required in the prior art, and the number of texts to be labeled can therefore be effectively reduced. This reduces the difficulty of sample data acquisition to a certain extent, reduces the data annotation volume, and improves annotation efficiency.
Fig. 6 schematically illustrates a technical flowchart of the text classification method according to an embodiment of the present invention.
As shown in Fig. 6, the text classification method of the embodiments of the present invention may include a training process and a detection process. Two deep learning models are involved: one takes the full dataset (texts obtained directly from the server 120) as its training sample data or detection data, serving as a high-recall model; the other takes the texts recalled by the high-recall model (i.e., the "three vulgarities" texts the high-recall model returns) as its training sample data or detection data, serving as a high-precision model that determines the classification results of the recalled texts.
In the training stage, the high-recall model is first trained with a small number of training samples (samples whose "three vulgarities" proportion matches the natural distribution), and is then used to recall texts from the full dataset, obtaining training samples for the high-precision model (samples with a high "three vulgarities" proportion). In order to increase the recall of the high-recall model, a smaller threshold (the aforementioned first threshold) may be set for it. To guarantee that the high-precision model has a sufficient number of training samples, after the high-recall model has been trained, the full dataset can be fed into it again to obtain further training samples for the high-precision model. After enough training samples for the high-precision model have been obtained, the high-precision model can be trained with this large set of training samples; the threshold of the high-precision model is greater than that of the high-recall model.
In the detection stage, each detection text (for example, a news item) first passes through the high-recall model, which judges whether it belongs to the first category ("non-three-vulgarities"). If the high-recall model judges it "non-three-vulgarities", the text skips the high-precision model directly: the overall result is "non-three-vulgarities", and the "three-vulgarities score" is the confidence with respect to the "three vulgarities" category obtained by the high-recall model. If the high-recall model judges it "three vulgarities", the text enters the high-precision model for a second judgment. If the output of the high-precision model is "three vulgarities", the overall result is "three vulgarities", and the "three-vulgarities score" is the confidence with respect to "three vulgarities" obtained by the high-precision model. If the output of the high-precision model is "non-three-vulgarities", the overall result is "non-three-vulgarities", and the "three-vulgarities score" is the confidence with respect to "three vulgarities" obtained by the high-precision model.
In summary, in the text classification method of the embodiments of the present invention, since the training data of the high-precision model is obtained by a recall model trained on a small amount of data, the "three vulgarities" concentration of that training data is much higher than in a naturally distributed set, so annotation efficiency is higher. Furthermore, during detection, since all detection texts pass through the high-recall model, the feature distribution of the detected texts is closer to that of the training samples, which can improve the overall accuracy.
Exemplary Apparatus
Having described the method of the exemplary embodiments of the present invention, the text classification apparatus of the exemplary embodiments of the present invention is next described with reference to Fig. 7.
Fig. 7 schematically illustrates a block diagram of a text classification apparatus according to an embodiment of the present invention.
As shown in fig. 7, according to embodiments of the present invention, text sorter 700 may include that text to be sorted obtains mould Block 710, first category determining module 720 and second category determining module 730.Text sorter 700 can be used to implement File classification method according to an embodiment of the present invention.
Text to be sorted obtains module 710 for obtaining text to be sorted (operation S201).
First category determining module 720 is used to be obtained using the first deep learning model to be sorted according to text to be sorted The first classification information (operation S202) of text.
Second category determining module 730 is used in the classification that the first classification information characterizes text to be sorted not be first category In the case where, the classification (operation S203) of text to be sorted is determined using the second deep learning model.Wherein, optimization obtains second The sample data of deep learning model includes the first text, and the classification of the first classification information characterization of first text is not first Classification.
According to an embodiment of the invention, as shown in fig. 7, above-mentioned second category determining module 730 may include the second classification Acquisition of information submodule 731 and classification determine submodule 732.Wherein, the second classification information acquisition submodule 731 is used for using the Two deep learning models obtain the second classification information (operation S213) of text to be sorted.Classification determines that submodule 732 is used for really The classification of fixed text to be sorted is the classification (operation S223) of the second classification information characterization.
According to an embodiment of the invention, above-mentioned first deep learning model specification has first threshold, the second deep learning mould Type is set with second threshold.Wherein, the first classification information includes first category information and text to be sorted relative to multiple predetermined Multiple first confidence levels of classification, first confidence level and first of the first category information by text to be sorted relative to second category The size relation of threshold value determines.Second classification information includes second category information and text to be sorted relative to multiple predetermined classifications Multiple second confidence levels, second confidence level and second threshold of the second category information by text to be sorted relative to second category Size relation determine.Wherein, first threshold is less than second threshold, first category information and second category information for characterize to The classification of classifying text, multiple predetermined classifications include first category.
According to an embodiment of the invention, the first classification information and/or the second classification information include multiple confidences of the text to be classified relative to multiple predetermined categories. As shown in FIG. 7, the text classification apparatus 700 may further include a parameter value determining module 740, configured to determine that the parameter value of the text to be classified is the confidence of the text to be classified relative to the second category (operation S304). The multiple predetermined categories include the first category and the second category.
According to an embodiment of the invention, as shown in FIG. 7, the text classification apparatus 700 may further include a first model optimization module 750, configured to perform the following operations: obtain multiple second texts and the actual categories of the multiple second texts (operation S406); obtain the first classification information of the multiple second texts using the multiple second texts as the input of the first deep learning model (operation S407); and optimize the first deep learning model according to the first classification information of the multiple second texts and the actual categories of the multiple second texts (operation S408). The first deep learning model includes a logistic regression model or a long short-term memory network model.
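A toy, standard-library-only sketch of optimizing a first model of the logistic-regression kind on labelled second texts (operations S406-S408). The corpus, vocabulary, and hyper-parameters are invented; a real system would typically use a library implementation rather than this hand-rolled stochastic gradient descent.

```python
import math

def featurize(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    words = text.split()
    return [words.count(w) for w in vocab]

def train_logreg(X, y, epochs=200, lr=0.5):
    """Fit logistic regression by SGD on log-loss (the optimization step)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            z = max(min(z, 30.0), -30.0)       # clamp for numerical safety
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi                          # log-loss gradient
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(text, vocab, w, b):
    z = sum(wj * xj for wj, xj in zip(w, featurize(text, vocab))) + b
    return 1 if z > 0 else 0                    # 1 = first category

texts = ["buy cheap watches", "win free money",
         "council meeting today", "weekend rain forecast"]
actual = [1, 1, 0, 0]                           # invented actual categories
vocab = sorted({w for t in texts for w in t.split()})
X = [featurize(t, vocab) for t in texts]
w, b = train_logreg(X, actual)

print(predict("win cheap money", vocab, w, b))       # 1
print(predict("rain at the meeting", vocab, w, b))   # 0
```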
According to an embodiment of the invention, the proportion of texts whose actual category is the first category among the multiple second texts is greater than the proportion of texts whose actual category is the first category among the multiple first texts.
According to an embodiment of the invention, as shown in FIG. 7, the text classification apparatus 700 may further include a second model optimization module 760, configured to perform the following operations: obtain multiple first texts and the actual categories of the multiple first texts (operation S509); obtain the second classification information of the multiple first texts using the multiple first texts as the input of the second deep learning model (operation S510); and optimize the second deep learning model according to the second classification information of the multiple first texts and the actual categories of the multiple first texts (operation S511). The second deep learning model includes a support vector machine model, a random forest model, or a long short-term memory network model.
According to an embodiment of the invention, obtaining the multiple first texts includes: obtaining multiple third texts (operation S519); obtaining the first classification information of the multiple third texts using the multiple third texts as the input of the first deep learning model (operation S529); and determining that the third texts whose category characterized by the first classification information is not the first category are the first texts (operation S539).
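Operations S519-S539 can be sketched in a few lines: candidate "third texts" are run through the first model, and only those it does not place in the first category are kept as the second model's training set (the "first texts"). The keyword-rule model and the texts are invented for illustration.

```python
# Invented stand-in for the trained first model.
def first_model(text):
    return "first" if "advert" in text else "other"

third_texts = ["advert: huge sale", "local election coverage",
               "advert: cheap flights", "match report from sunday"]

# S539: keep texts whose first-model label is NOT the first category.
first_texts = [t for t in third_texts if first_model(t) != "first"]

print(first_texts)  # ['local election coverage', 'match report from sunday']
```

This filtering is what makes the second model's sample data deliberately biased away from the first category, matching the definition of the first texts given above.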
According to an embodiment of the invention, as shown in FIG. 7, the text classification apparatus 700 may further include a preprocessing module 770, configured to preprocess the text to be classified using a preprocessing model and extract the feature information of the text to be classified (operation S305) before the first category determining module 720 obtains the first classification information of the text to be classified. The first category determining module 720 is specifically configured to use the feature information of the text to be classified as the input of the first deep learning model and output the first classification information of the text to be classified. The preprocessing model includes a term frequency-inverse document frequency model or a word vector model.
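As an illustrative, standard-library-only sketch of the term frequency-inverse document frequency preprocessing mentioned above: the toy corpus is invented, and the smoothing-free IDF formula below is one common variant chosen as an assumption; library implementations often differ in smoothing and normalization.

```python
import math

# Tiny tokenized corpus (invented for illustration).
corpus = [["news", "about", "sports"],
          ["news", "about", "politics"],
          ["politics", "scores", "today"]]

def tf_idf(doc, corpus):
    """TF-IDF features for one document: tf * log(N / df) per term."""
    n = len(corpus)
    feats = {}
    for term in set(doc):
        tf = doc.count(term) / len(doc)
        df = sum(1 for d in corpus if term in d)
        feats[term] = tf * math.log(n / df)
    return feats

features = tf_idf(corpus[0], corpus)
print(features)
```

As expected, "sports" (appearing in only one document) scores higher than "news" or "about" (each appearing in two), so the distinctive term dominates the feature vector fed to the first model.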
Exemplary media
After describing the method for exemplary embodiment of the invention, next, with reference to Fig. 8 to the exemplary reality of the present invention The computer readable storage medium for being adapted for carrying out file classification method for applying mode is introduced.
According to an embodiment of the invention, additionally providing a kind of computer readable storage medium, it is stored thereon with executable finger It enables, described instruction makes processor execute file classification method according to an embodiment of the present invention when being executed by processor.
In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product comprising program code. When the program product is run on a computing device, the program code causes the computing device to execute the steps of the text classification method according to the various exemplary embodiments of the present invention described in the "Exemplary Methods" section of this specification. For example, the computing device may execute step S201 as shown in FIG. 2A: obtaining a text to be classified; step S202: obtaining first classification information of the text to be classified using a first deep learning model according to the text to be classified; and step S203: determining a category of the text to be classified using a second deep learning model in a case where the category characterized by the first classification information of the text to be classified is not a first category.
The program product may employ any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
As shown in FIG. 8, a program product 800 for the text classification method according to an embodiment of the present invention is described; it may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a computing device, such as a personal computer. However, the program product of the present invention is not limited thereto. In this document, a readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries readable program code. Such a propagated data signal may take various forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium; the readable medium can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
The program code contained on a readable medium may be transmitted by any suitable medium, including, but not limited to, wireless, wired, optical cable, RF, or any suitable combination of the above.
The program code for performing the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's computing device, partly on a remote computing device, or entirely on a remote computing device or server. In cases involving a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
Exemplary Computing Device
Having described the methods, media, and apparatuses of the exemplary embodiments of the present invention, a computing device suitable for implementing the text classification method of the exemplary embodiments of the present invention is next described with reference to FIG. 9.

An embodiment of the present invention also provides a computing device. Those skilled in the art will appreciate that aspects of the present invention may be implemented as a system, a method, or a program product. Therefore, aspects of the present invention may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, and the like), or an embodiment combining hardware and software aspects, which may be collectively referred to herein as a "circuit", "module", or "system".

In some possible embodiments, a computing device according to the present invention may include at least one processing unit and at least one storage unit. The storage unit stores program code which, when executed by the processing unit, causes the processing unit to execute the steps of the text classification method according to the various exemplary embodiments of the present invention described in the "Exemplary Methods" section of this specification. For example, the processing unit may execute step S201 as shown in FIG. 2A: obtaining a text to be classified; step S202: obtaining first classification information of the text to be classified using a first deep learning model according to the text to be classified; and step S203: determining a category of the text to be classified using a second deep learning model in a case where the category characterized by the first classification information of the text to be classified is not a first category.
A computing device 900 suitable for implementing the text classification method according to this embodiment of the present invention is described below with reference to FIG. 9. The computing device 900 shown in FIG. 9 is only an example and should not impose any limitation on the functions or the scope of use of the embodiments of the present invention.

As shown in FIG. 9, the computing device 900 takes the form of a general-purpose computing device. Components of the computing device 900 may include, but are not limited to: the above-mentioned at least one processing unit 901, the above-mentioned at least one storage unit 902, and a bus 903 connecting different system components (including the storage unit 902 and the processing unit 901).

The bus 903 may include a data bus, an address bus, and a control bus.

The storage unit 902 may include volatile memory, such as a random access memory (RAM) 9021 and/or a cache memory 9022, and may further include a read-only memory (ROM) 9023.

The storage unit 902 may also include a program/utility 9025 having a set of (at least one) program modules 9024. Such program modules 9024 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.

The computing device 900 may also communicate with one or more external devices 904 (such as a keyboard, a pointing device, a Bluetooth device, and the like); such communication may be carried out through an input/output (I/O) interface 905. Moreover, the computing device 900 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through a network adapter 906. As shown, the network adapter 906 communicates with the other modules of the computing device 900 through the bus 903. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the computing device 900, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
It should be noted that although several units/modules or subunits/submodules of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. In fact, according to embodiments of the present invention, the features and functions of two or more units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided and embodied by multiple units/modules.

In addition, although the operations of the method of the present invention are described in a particular order in the drawings, this does not require or imply that these operations must be executed in that particular order, or that all of the illustrated operations must be executed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be merged into one step for execution, and/or one step may be decomposed into multiple steps for execution.

Although the spirit and principles of the present invention have been described with reference to several specific embodiments, it should be understood that the present invention is not limited to the specific embodiments disclosed, and the division into aspects does not mean that features in these aspects cannot be combined to advantage; such division is merely for convenience of expression. The present invention is intended to cover the various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (12)

1. A text classification method, comprising:
obtaining a text to be classified;
obtaining first classification information of the text to be classified using a first deep learning model according to the text to be classified; and
determining a category of the text to be classified using a second deep learning model in a case where the category characterized by the first classification information of the text to be classified is not a first category,
wherein sample data used to optimize the second deep learning model includes first texts, and the category characterized by the first classification information of each first text is not the first category.
2. The method according to claim 1, wherein determining the category of the text to be classified using the second deep learning model comprises:
obtaining second classification information of the text to be classified using the second deep learning model; and determining that the category of the text to be classified is the category characterized by the second classification information.
3. The method according to claim 2, wherein the first deep learning model is set with a first threshold and the second deep learning model is set with a second threshold, wherein:
the first classification information includes first category information and multiple first confidences of the text to be classified relative to multiple predetermined categories, the first category information being determined by a magnitude relation between the first confidence of the text to be classified relative to a second category and the first threshold; and
the second classification information includes second category information and multiple second confidences of the text to be classified relative to the multiple predetermined categories, the second category information being determined by a magnitude relation between the second confidence of the text to be classified relative to the second category and the second threshold,
wherein the first threshold is less than the second threshold, the first category information and the second category information are used to characterize the category of the text to be classified, and the multiple predetermined categories include the first category and the second category.
4. The method according to claim 2, wherein the first classification information and/or the second classification information include multiple confidences of the text to be classified relative to multiple predetermined categories, and after determining the category of the text to be classified, the method further comprises:
determining that a parameter value of the text to be classified is the confidence of the text to be classified relative to a second category,
wherein the multiple predetermined categories include the first category and the second category.
5. The method according to claim 1, further comprising:
obtaining multiple second texts and actual categories of the multiple second texts;
obtaining first classification information of the multiple second texts using the multiple second texts as the input of the first deep learning model; and
optimizing the first deep learning model according to the first classification information of the multiple second texts and the actual categories of the multiple second texts,
wherein the first deep learning model includes a logistic regression model or a long short-term memory network model.
6. The method according to claim 5, wherein a proportion of texts whose actual category is the first category among the multiple second texts is greater than a proportion of texts whose actual category is the first category among multiple first texts.
7. The method according to claim 1, further comprising:
obtaining multiple first texts and actual categories of the multiple first texts;
obtaining second classification information of the multiple first texts using the multiple first texts as the input of the second deep learning model; and
optimizing the second deep learning model according to the second classification information of the multiple first texts and the actual categories of the multiple first texts,
wherein the second deep learning model includes a support vector machine model, a random forest model, or a long short-term memory network model.
8. The method according to claim 7, wherein obtaining the multiple first texts comprises:
obtaining multiple third texts;
obtaining first classification information of the multiple third texts using the multiple third texts as the input of the first deep learning model; and
determining that third texts whose category characterized by the first classification information is not the first category are the first texts.
9. The method according to claim 1, wherein before obtaining the first classification information of the text to be classified, the method further comprises:
preprocessing the text to be classified using a preprocessing model, and extracting feature information of the text to be classified,
wherein the feature information of the text to be classified is used as the input of the first deep learning model, which outputs the first classification information of the text to be classified; and the preprocessing model includes a term frequency-inverse document frequency model or a word vector model.
10. A text classification apparatus, comprising:
a text-to-be-classified obtaining module, configured to obtain a text to be classified;
a first category determining module, configured to obtain first classification information of the text to be classified using a first deep learning model according to the text to be classified; and
a second category determining module, configured to determine a category of the text to be classified using a second deep learning model in a case where the category characterized by the first classification information of the text to be classified is not a first category,
wherein sample data used to optimize the second deep learning model includes first texts, and the category characterized by the first classification information of each first text is not the first category.
11. A computer-readable storage medium having executable instructions stored thereon, wherein the instructions, when executed by a processor, implement the method according to any one of claims 1 to 9.
12. A computing device, comprising:
one or more memories storing executable instructions; and
one or more processors that execute the executable instructions to implement the method according to any one of claims 1 to 9.
CN201910480256.XA 2019-06-03 2019-06-03 Text classification method, device, medium and computing equipment Active CN110245232B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910480256.XA CN110245232B (en) 2019-06-03 2019-06-03 Text classification method, device, medium and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910480256.XA CN110245232B (en) 2019-06-03 2019-06-03 Text classification method, device, medium and computing equipment

Publications (2)

Publication Number Publication Date
CN110245232A true CN110245232A (en) 2019-09-17
CN110245232B CN110245232B (en) 2022-02-18

Family

ID=67886011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910480256.XA Active CN110245232B (en) 2019-06-03 2019-06-03 Text classification method, device, medium and computing equipment

Country Status (1)

Country Link
CN (1) CN110245232B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111243607A (en) * 2020-03-26 2020-06-05 北京字节跳动网络技术有限公司 Method, apparatus, electronic device, and medium for generating speaker information
CN111930939A (en) * 2020-07-08 2020-11-13 泰康保险集团股份有限公司 Text detection method and device
CN112711940A (en) * 2019-10-08 2021-04-27 台达电子工业股份有限公司 Information processing system, information processing method, and non-transitory computer-readable recording medium
CN113536806A (en) * 2021-07-18 2021-10-22 北京奇艺世纪科技有限公司 Text classification method and device
CN114065759A (en) * 2021-11-19 2022-02-18 深圳视界信息技术有限公司 Model failure detection method and device, electronic equipment and medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156885A (en) * 2010-02-12 2011-08-17 中国科学院自动化研究所 Image classification method based on cascaded codebook generation
CN102521656A (en) * 2011-12-29 2012-06-27 北京工商大学 Integrated transfer learning method for classification of unbalance samples
CN103593470A (en) * 2013-11-29 2014-02-19 河南大学 Double-degree integrated unbalanced data stream classification algorithm
CN103824092A (en) * 2014-03-04 2014-05-28 国家电网公司 Image classification method for monitoring state of electric transmission and transformation equipment on line
US20170032276A1 (en) * 2015-07-29 2017-02-02 Agt International Gmbh Data fusion and classification with imbalanced datasets
CN106453033A (en) * 2016-08-31 2017-02-22 电子科技大学 Multilevel Email classification method based on Email content
CN107644057A (en) * 2017-08-09 2018-01-30 天津大学 A kind of absolute uneven file classification method based on transfer learning
US20180357299A1 (en) * 2017-06-07 2018-12-13 Accenture Global Solutions Limited Identification and management system for log entries


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Xuying et al.: "A classification method for class-imbalanced data based on a cascade model", Journal of Nanjing University (Natural Science) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112711940A (en) * 2019-10-08 2021-04-27 台达电子工业股份有限公司 Information processing system, information processing method, and non-transitory computer-readable recording medium
CN111243607A (en) * 2020-03-26 2020-06-05 北京字节跳动网络技术有限公司 Method, apparatus, electronic device, and medium for generating speaker information
CN111930939A (en) * 2020-07-08 2020-11-13 泰康保险集团股份有限公司 Text detection method and device
CN113536806A (en) * 2021-07-18 2021-10-22 北京奇艺世纪科技有限公司 Text classification method and device
CN113536806B (en) * 2021-07-18 2023-09-08 北京奇艺世纪科技有限公司 Text classification method and device
CN114065759A (en) * 2021-11-19 2022-02-18 深圳视界信息技术有限公司 Model failure detection method and device, electronic equipment and medium
CN114065759B (en) * 2021-11-19 2023-10-13 深圳数阔信息技术有限公司 Model failure detection method and device, electronic equipment and medium

Also Published As

Publication number Publication date
CN110245232B (en) 2022-02-18

Similar Documents

Publication Publication Date Title
CN112632385B (en) Course recommendation method, course recommendation device, computer equipment and medium
CN110245232A Text classification method, device, medium and computing equipment
CN109815487B (en) Text quality inspection method, electronic device, computer equipment and storage medium
CN110309514A (en) A kind of method for recognizing semantics and device
CN111143569B (en) Data processing method, device and computer readable storage medium
CN109840287A (en) A kind of cross-module state information retrieval method neural network based and device
CN108304468A (en) A kind of file classification method and document sorting apparatus
CN110427463A (en) Search statement response method, device and server and storage medium
CN111143226B (en) Automatic test method and device, computer readable storage medium and electronic equipment
Ling et al. Integrating extra knowledge into word embedding models for biomedical NLP tasks
WO2021139279A1 (en) Data processing method and apparatus based on classification model, and electronic device and medium
CN110334186A (en) Data query method, apparatus, computer equipment and computer readable storage medium
CN115328756A (en) Test case generation method, device and equipment
CN111694937A (en) Interviewing method and device based on artificial intelligence, computer equipment and storage medium
CN110162766A (en) Term vector update method and device
CN109710760A (en) Clustering method, device, medium and the electronic equipment of short text
CN107943940A (en) Data processing method, medium, system and electronic equipment
CN110705255A (en) Method and device for detecting association relation between sentences
CN110232128A (en) Topic file classification method and device
CN111539612B (en) Training method and system of risk classification model
CN116049376A (en) Method, device and system for retrieving and replying information and creating knowledge
US20230014904A1 (en) Searchable data structure for electronic documents
CN116167382A (en) Intention event extraction method and device, electronic equipment and storage medium
CN108733702B (en) Method, device, electronic equipment and medium for extracting upper and lower relation of user query
WO2022216462A1 (en) Text to question-answer model system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant