CN109726288A

CN109726288A - Text classification method and device based on artificial intelligence processing

Info

Publication number: CN109726288A
Application number: CN201811625414.8A
Authority: CN
Inventors: 李晖; 熊荣正; 张雨薇
Original assignee: Shanghai Point Information Technology Co Ltd
Current assignee: Shanghai Point Information Technology Co Ltd
Priority date: 2018-12-28
Filing date: 2018-12-28
Publication date: 2019-05-07

Abstract

An embodiment of the disclosure provides a text classification method based on artificial intelligence processing. The method comprises: classifying each text in an unlabeled first text set using a text classification model to determine a confidence for each text in the first text set, wherein the text classification model is generated from a labeled historical text set; determining one or more texts from the first text set based on the confidence of each text, and labeling the one or more texts with categories; and, when the labeled texts include a text of a new category that differs from the categories in the historical text set, updating the historical text set with the labeled texts. The method can discover new text categories automatically and improve the classification accuracy of the text classification model.

Description

Text classification method and device based on artificial intelligence processing

Technical field

The present disclosure belongs to the technical field of information processing, and in particular relates to a text classification method and device based on artificial intelligence processing, and a corresponding computer-readable storage medium.

Background

Artificial intelligence (AI) is a new technical science that studies and develops theories, methods, technologies, and application systems for simulating and extending human intelligence. Text classification refers to using natural language processing (NLP) techniques to automatically label a collection of texts (samples) according to a given classification system or standard. Text classification is widely used in many fields, such as positive and negative public opinion monitoring, intelligent customer service, spam detection, sentiment recognition in film reviews, and any other task that can be framed as classification. A traditional text classification method involves two processes: (1) training a model with a machine learning method on a large number of labeled samples; and (2) using the model to classify unlabeled samples. However, this approach assumes a fixed set of categories. When a new sample does not belong to any predefined category, the classification performance of the model degrades.

Summary of the invention

Embodiments of the disclosure provide a text classification method and device based on artificial intelligence processing, and a corresponding computer-readable storage medium, to at least partially solve the above and other potential problems.

A first aspect of the embodiments of the disclosure proposes a text classification method based on artificial intelligence processing. The text classification method includes the following steps:

A. classifying each text in an unlabeled first text set using a text classification model to determine a confidence for each text in the first text set, wherein the text classification model is generated from a labeled historical text set;

B. determining one or more texts from the first text set based on the confidence of each text in the first text set, and labeling the one or more texts with categories;

C. when the labeled one or more texts include a text of a new category that differs from the categories in the historical text set, updating the historical text set with the labeled one or more texts; and

D. generating a new text classification model from the updated historical text set, for classifying the other unlabeled texts in the first text set.

A second aspect of the embodiments of the disclosure proposes a text classification device based on artificial intelligence processing. The text classification device includes:

a processor; and

a memory for storing instructions that, when executed, cause the processor to perform the following steps:

A third aspect of the embodiments of the disclosure proposes a computer-readable storage medium comprising computer-executable instructions which, when run in a device, cause the device to perform the text classification method based on artificial intelligence processing according to the first aspect of the embodiments of the disclosure.

The text classification method and device based on artificial intelligence processing, and the corresponding computer-readable storage medium, according to embodiments of the disclosure make it possible to perform text classification with a small number of labeled samples, to discover new text categories automatically through incremental iteration, and to update the text classification model in a timely manner, thereby improving the classification accuracy of the model.

Brief description of the drawings

The features, advantages, and other aspects of the embodiments of the disclosure will become more apparent from the following detailed description taken in conjunction with the drawings. Several embodiments of the disclosure are shown here by way of example and not limitation. In the drawings:

Fig. 1 shows a flowchart of a text classification method 100 based on artificial intelligence processing according to an embodiment of the disclosure;

Fig. 2 shows a flowchart of a text classification method 200 based on artificial intelligence processing according to an embodiment of the disclosure;

Fig. 3 shows a flowchart of an exemplary method 300 for selecting new samples to label according to new classification results, according to an embodiment of the disclosure; and

Fig. 4 shows a schematic diagram of a text classification device 400 based on artificial intelligence processing according to an embodiment of the disclosure.

Detailed description

Exemplary embodiments of the disclosure are described in detail below with reference to the drawings. The flowcharts and block diagrams in the drawings illustrate the architecture, functionality, and operation of possible implementations of methods and systems according to various embodiments of the disclosure. Note that each box in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which may include one or more executable instructions for implementing the logical functions specified in each embodiment. Note also that, in some alternative implementations, the functions noted in the boxes may occur in an order different from that shown in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. Note further that each box in the flowcharts and/or block diagrams, and combinations of such boxes, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The terms "include", "comprise", and similar terms used herein are open-ended terms, meaning "including but not limited to", and indicate that other content may also be included. The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one other embodiment"; and so on.

Techniques, methods, and devices known to a person of ordinary skill in the relevant art may not be discussed in detail but, where appropriate, should be considered part of the specification. Lines connecting units in the drawings are for ease of explanation only: they indicate that at least the units at both ends of a line communicate with each other, and are not intended to imply that units without a connecting line cannot communicate.

For ease of description, some terms appearing in the present disclosure are explained below. Terms used in this application should be interpreted as having a meaning consistent with their meaning in the context of this specification and in the relevant field.

As noted above, a traditional text classification method needs a large number of labeled samples to train a model, and then uses the model to classify unlabeled samples. However, this approach only suits scenarios with a fixed set of categories. If a new sample does not belong to any predefined category, the performance of the model degrades.

To solve such problems, embodiments of the disclosure provide an improved text classification method that makes it possible to perform text classification with a small number of labeled samples, to discover new text categories automatically through incremental iteration, and to update the text classification model in a timely manner, thereby improving the classification accuracy of the model.

Fig. 1 shows a flowchart of a text classification method 100 based on artificial intelligence processing according to an embodiment of the disclosure. As shown in the flowchart, method 100 includes the following steps:

Step 101: Classify each text in an unlabeled first text set using a text classification model to determine a confidence for each text in the first text set, wherein the text classification model is generated from a labeled historical text set. In this step, each unlabeled text can be classified with the trained text classification model to produce a confidence for that text; the confidence can indicate how likely the text is to belong to a particular category. For example, the text classification model can be generated by training on the historical text set with a machine learning method, including but not limited to naive Bayes, support vector machines (SVM), and deep learning methods such as convolutional neural networks (CNN), recurrent neural networks (RNN), and long short-term memory (LSTM) models. The training process may include, for example, extracting feature information from the texts (for example, TF/IDF (term frequency/inverse document frequency) features or bag-of-words features) and feeding it into the model for training.

Step 102: Based on the confidence of each text in the first text set, determine one or more texts from the first text set, and label the one or more texts with categories. In this step, the one or more texts to be labeled (for example, pushed to a human annotator for judgment) can be determined based on confidence. For example, one or more texts with low confidence can be selected from the first text set for labeling, or the texts can be selected in some other way.

Step 103: When the labeled one or more texts include a text of a new category that differs from the categories in the historical text set, update the historical text set with the labeled one or more texts. In this step, the labeled texts can be used to expand the historical text set.

Step 104: Generate a new text classification model from the updated historical text set, for classifying the other unlabeled texts in the first text set. In this step, the expanded historical text set can be used to train a new classification model that classifies the remaining unlabeled texts. The process of generating the new text classification model is similar to the generation process described in step 101.
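One pass through steps 101 to 104 can be sketched as follows. This is a minimal illustration, not the implementation described in the disclosure: it assumes a small hand-written multinomial naive Bayes model over bag-of-words counts, takes the largest posterior probability as the confidence, and stands in for the human annotator with a hard-coded label. The class name `NaiveBayesTextClassifier`, the sample sentences, and the category names are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Whitespace tokenization is an assumption; any tokenizer could be used.
    return text.lower().split()


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict_proba(self, text):
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)
            for token in tokenize(text):
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            log_scores[label] = score
        # Normalize the log scores into posterior probabilities.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}

    def classify(self, text):
        """Return (predicted category, confidence)."""
        probs = self.predict_proba(text)
        label = max(probs, key=probs.get)
        return label, probs[label]


# Labeled historical text set (hypothetical data).
history = {
    "i will pay tomorrow": "willing",
    "i will pay next week": "willing",
    "i have no money": "unable",
    "no money this month": "unable",
}
first_text_set = ["i will pay on friday", "no money now", "stop calling me"]

# Step 101: classify every unlabeled text and record its confidence.
model = NaiveBayesTextClassifier().fit(list(history), list(history.values()))
confidence = {t: model.classify(t)[1] for t in first_text_set}

# Step 102: pick the least confident text and have it labeled.
to_label = min(first_text_set, key=confidence.get)
new_label = "refuses"  # stand-in for the human annotator's decision

# Step 103: the label is a new category, so the historical set is updated.
if new_label not in history.values():
    history[to_label] = new_label

# Step 104: retrain and classify the remaining unlabeled texts.
new_model = NaiveBayesTextClassifier().fit(list(history), list(history.values()))
remaining = {t: new_model.classify(t)[0] for t in first_text_set if t != to_label}
```

Any of the model families named in step 101 could replace the toy classifier, provided it exposes a per-text confidence.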

In some embodiments, the historical text set includes multiple subsets of different categories, and each of the subsets includes texts of the same category. For example, the historical text set can contain texts of different categories. A text can be, for example, a short text or a long text (including a sentence).

In some embodiments, step 102 may include: generating a confidence threshold corresponding to each category in the historical text set; and selecting the one or more texts from the first text set based on the confidence threshold. In this step, the texts to be labeled can be selected based on the confidence threshold. For example, one or more texts whose confidence is below the confidence threshold can be selected from the first text set for labeling, or the texts to be labeled can be selected in some other way. Setting a confidence threshold per category makes it possible to control effectively how many texts of each category are selected for labeling.
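The per-category selection can be sketched as below. The function name `select_for_labeling`, the threshold values, and the fallback `default_threshold` for categories without an explicit threshold are assumptions made for illustration; the disclosure does not specify them.

```python
def select_for_labeling(predictions, thresholds, default_threshold=0.5):
    """Return the texts whose confidence falls below their category's threshold.

    predictions: list of (text, predicted_category, confidence) tuples.
    thresholds:  dict mapping category name to its confidence threshold.
    """
    selected = []
    for text, category, confidence in predictions:
        if confidence < thresholds.get(category, default_threshold):
            selected.append(text)
    return selected


# Hypothetical classifier output and thresholds.
predictions = [
    ("i will pay on friday", "willing", 0.91),
    ("maybe next month", "willing", 0.55),
    ("stop calling me", "unable", 0.54),
    ("no money now", "unable", 0.91),
]
thresholds = {"willing": 0.6, "unable": 0.8}

to_label = select_for_labeling(predictions, thresholds)
```

A higher threshold for a category sends more of the texts predicted as that category to the annotator.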

In some embodiments, step 102 may further include: adjusting the confidence threshold based on the classification result of each text. In this step, the confidence threshold can be adjusted dynamically based on the classification results, so that texts likely to belong to a particular category are selected for labeling more often.

In some embodiments, adjusting the confidence threshold based on the classification result of each text may include: calculating the proportion of texts of each category based on the classification results of the texts in the first text set; and adjusting the confidence threshold based on those proportions. In this step, the confidence threshold is adjusted dynamically according to the category proportions in the classification results. For example, if the proportion of a particular category is low (a new category typically has a low proportion at first), the confidence threshold of that category can be raised, so that more texts likely to belong to that category (for example, the new category) are selected for labeling. This raises the likelihood of discovering that category, effectively balances the proportions of samples across categories, and improves the classification accuracy of the model.
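The disclosure does not give a formula for the adjustment, so the sketch below uses one possible rule as an assumption: each category's threshold is shifted from a base value in proportion to how far its share of the predictions lies below or above a uniform share, then clamped to a fixed range. The function name and the parameters `base`, `gain`, `low`, and `high` are hypothetical.

```python
from collections import Counter


def adjust_thresholds(predicted_categories, categories,
                      base=0.6, gain=0.5, low=0.1, high=0.95):
    """Raise thresholds for under-represented categories, lower them for common ones."""
    total = len(predicted_categories)
    if total == 0 or not categories:
        return {category: base for category in categories}
    counts = Counter(predicted_categories)
    uniform_share = 1 / len(categories)
    thresholds = {}
    for category in categories:
        share = counts[category] / total
        raw = base + gain * (uniform_share - share)
        thresholds[category] = min(high, max(low, raw))
    return thresholds


# Hypothetical classification results over the first text set.
predicted = ["willing"] * 6 + ["unable"] * 3 + ["refuses"] * 1
thresholds = adjust_thresholds(predicted, ["willing", "unable", "refuses"])
```

With these inputs the rarest category, `refuses`, receives the highest threshold, so more of its candidate texts fall below the threshold and are pushed to labeling.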

In some embodiments, step 103 may include: adding the labeled one or more texts to the historical text set. In this step, the historical text set is expanded by adding the labeled texts to it.

In some embodiments, adding the labeled one or more texts to the historical text set may include: based on the number of texts of the new category, calculating the similarity between each of the one or more texts and the texts of the new category; based on the similarity, determining at least one text among the one or more texts, and labeling the at least one text again; and adding the one or more texts, including the at least one relabeled text, to the historical text set. In this step, if the number of texts in the new category is still too small, the similarity to the texts of the new category can be calculated, and at least one text (for example, some or all of the texts) can be selected for relabeling based on that similarity, which further raises the likelihood of discovering texts of the new category. For example, the similarity between texts can be calculated with TF/IDF, PMI (pointwise mutual information), or similar measures, and at least one text whose similarity exceeds a given similarity threshold can be selected for labeling, or the text can be selected in some other way.
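The similarity check can be sketched with TF/IDF vectors and cosine similarity. The smoothed IDF formula, the `min_count` cutoff that decides when the new category counts as too small, the similarity threshold, and all sample sentences are assumptions for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Build sparse TF/IDF vectors (dict token -> weight) for a list of texts."""
    docs = [Counter(t.lower().split()) for t in texts]
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())
    vectors = []
    for doc in docs:
        length = sum(doc.values())
        vectors.append({
            token: (count / length) * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1)
            for token, count in doc.items()
        })
    return vectors


def cosine(a, b):
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_to_new_category(candidates, new_category_texts,
                            min_count=3, threshold=0.3):
    """Return candidates to relabel when the new category has too few texts."""
    if not new_category_texts or len(new_category_texts) >= min_count:
        return []
    vectors = tfidf_vectors(candidates + new_category_texts)
    candidate_vectors = vectors[:len(candidates)]
    new_vectors = vectors[len(candidates):]
    picked = []
    for text, vector in zip(candidates, candidate_vectors):
        if max(cosine(vector, nv) for nv in new_vectors) > threshold:
            picked.append(text)
    return picked


candidates = [
    "please stop calling my phone",
    "i will pay tomorrow",
    "do not call me again",
]
relabel = similar_to_new_category(candidates, ["stop calling me"])
```

Only the first candidate shares enough weighted vocabulary with the single new-category text to exceed the threshold; lowering the threshold would also pull in the third.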

In some embodiments, new unlabeled texts can be received over time to update the first text set. It should be understood that steps 101 to 104 above can be performed iteratively to classify the texts and thereby identify all categories. Because the first text set keeps expanding, all categories are discovered incrementally, and if the distribution of subsequent texts changes, the model can be updated in time to judge texts of newly appearing categories correctly.

Compared with traditional text classification methods, the embodiment described with reference to Fig. 1 provides an improved text classification method that makes it possible to perform text classification with a small number of labeled samples and to discover new text categories automatically through incremental iteration. If the distribution of subsequent samples changes, the model can be updated in time to judge samples of newly appearing categories correctly, thereby improving the classification accuracy of the model.

For example, the text classification model of method 100 can be applied in scenarios such as debt collection, sales, or customer service. In a debt collection scenario, for example, based on conversations between an employee of an institution such as an internet finance company (the collector) and a user (the person being collected from), the text classification model of method 100 can be applied to the historical dialogue to classify the intent of every user sentence. The categories might indicate, for example, whether the user is able and/or willing to repay: able and willing to repay; willing but unable; able but unwilling; neither able nor willing; or various other types of category. A corresponding response strategy is then established for each category, so that the employee knows how to reply when encountering a sentence of that category. The text classification model can similarly be applied to sales and customer service scenarios; the process is analogous and is not described in detail.

Fig. 2 shows a flowchart of a text classification method 200 based on artificial intelligence processing according to an embodiment of the disclosure. As shown in the flowchart, method 200 includes the following steps:

Step 201: Train a text classification model with a small number of labeled samples (texts). In this step, the initial text classification model can be trained in the manner described in step 101 of method 100.

Step 202: Classify the unlabeled samples with the text classification model to produce a confidence for each sample, and select one or more low-confidence samples to be labeled with categories. In this step, samples whose confidence is below a confidence threshold can be selected for labeling, or low-confidence samples can be selected in some other way.

Step 203: Train a new text classification model with the newly labeled samples and the previously labeled samples. In this step, the set of labeled samples is expanded to update the text classification model.

Step 204: Classify the other unlabeled texts with the new text classification model. In this step, the updated text classification model can classify the unlabeled texts more accurately.

Step 205: Select new samples to label according to the new classification results. After step 205 is complete, the method can return to step 203 and proceed iteratively until no new category appears.
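The control flow of steps 201 to 205 can be sketched as a loop that stops once a labeling round yields no new category. The toy token-count model, the confidence definition, the `annotate` callback standing in for the human annotator, and the `max_rounds` safety limit are all assumptions for illustration.

```python
from collections import Counter, defaultdict


def train(labeled):
    """Toy model: per-category token counts built from a dict text -> category."""
    model = defaultdict(Counter)
    for text, category in labeled.items():
        model[category].update(text.lower().split())
    return model


def classify(model, text):
    """Return (category, confidence); confidence is the best category's share of token hits."""
    tokens = text.lower().split()
    scores = {cat: sum(counts[t] for t in tokens) for cat, counts in model.items()}
    total = sum(scores.values())
    if total == 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / total


def incremental_labeling(labeled, unlabeled, annotate, threshold=0.7, max_rounds=10):
    labeled = dict(labeled)
    pending = list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)                      # steps 201 and 203
        known = set(labeled.values())
        low = [t for t in pending                   # steps 202, 204 and 205
               if classify(model, t)[1] < threshold]
        if not low:
            break
        new_labels = {t: annotate(t) for t in low}
        labeled.update(new_labels)
        pending = [t for t in pending if t not in new_labels]
        if not set(new_labels.values()) - known:
            break                                   # no new category appeared
    model = train(labeled)
    predictions = {t: classify(model, t)[0] for t in pending}
    return labeled, predictions


# Hypothetical data; the dict lookup stands in for a human annotator.
oracle = {"stop calling me": "refuses", "stop calling": "refuses"}
labeled, predictions = incremental_labeling(
    {"i will pay tomorrow": "willing", "i have no money": "unable"},
    ["will pay friday", "no money now", "stop calling me", "stop calling"],
    annotate=lambda text: oracle[text],
)
```

In this run the first round sends the two unmatched sentences to the annotator, which introduces the category `refuses`; the second round finds no low-confidence samples and the loop ends.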

The embodiment described with reference to Fig. 2 provides an improved text classification method that makes it possible to perform text classification with a small number of labeled samples and to discover new text categories automatically through incremental iteration. If the distribution of subsequent samples changes, the model can be updated in time to judge samples of newly appearing categories correctly, thereby improving the classification accuracy of the model.

For example, the text classification model of method 200 can be applied in scenarios such as debt collection, sales, or customer service. In a debt collection scenario, for example, based on conversations between an employee of an institution such as an internet finance company (the collector) and a user (the person being collected from), the text classification model of method 200 can be applied to the historical dialogue to classify the intent of every user sentence. A corresponding response strategy is then established for each category, so that the employee knows how to reply when encountering a sentence of that category. The text classification model can similarly be applied to sales and customer service scenarios; the process is analogous and is not described in detail.

Fig. 3 shows a flowchart of an exemplary method 300 for selecting new samples to label according to new classification results, according to an embodiment of the disclosure. That is, Fig. 3 shows an example implementation of step 205 in method 200 of Fig. 2. As shown in the flowchart, method 300 includes:

Step 301: Starting from per-category confidence thresholds, dynamically adjust the confidence thresholds according to the sample proportions in the classification results, and select samples whose confidence is below the confidence threshold for labeling. In this step, the confidence threshold can be adjusted dynamically based on the classification results, so that samples likely to belong to a particular category are selected for labeling more often. For example, if the proportion of a particular category is low (a new category typically has a low proportion at first), the confidence threshold of that category can be raised, so that more samples likely to belong to that category (for example, the new category) are selected for labeling. This raises the likelihood of discovering that category, effectively balances the proportions of samples across categories, and improves the classification accuracy of the model.

Step 302: Determine whether the number of samples of the new category is too small. In this step, it can be judged whether the number of samples labeled as the new category as a result of step 301 is too small.

Step 303: If step 302 determines that the number of samples of the new category is too small, calculate the similarity to the samples of the new category, and select samples whose similarity exceeds a given similarity threshold for labeling. Conversely, if step 302 determines that the number of samples of the new category is not too small, step 303 is skipped. In step 303, if the number of texts in the new category is still too small, the similarity to the samples of the new category can be calculated and the samples to be labeled can be selected based on that similarity, for example by selecting samples whose similarity exceeds a given similarity threshold, which further raises the likelihood of discovering samples of the new category. For example, the similarity between samples can be calculated with TF/IDF, PMI (pointwise mutual information), or similar measures, and samples whose similarity exceeds a given similarity threshold can be selected for labeling, or the samples can be selected in some other way.
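The branching in steps 301 to 303 can be sketched as a single selection function. Jaccard token overlap is used here only as a simple stand-in for the TF/IDF or PMI similarity mentioned in the text; the `min_new` cutoff, the similarity threshold, the `annotate` callback, and the sample data are likewise assumptions.

```python
def jaccard(a, b):
    """Token-set overlap between two texts (stand-in similarity measure)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def select_samples(predictions, thresholds, annotate, known_categories,
                   min_new=2, sim_threshold=0.3):
    """predictions: list of (text, predicted_category, confidence) tuples."""
    # Step 301: label samples whose confidence is below the category threshold.
    chosen = [text for text, category, confidence in predictions
              if confidence < thresholds.get(category, 0.5)]
    labels = {text: annotate(text) for text in chosen}
    new_texts = [t for t, cat in labels.items() if cat not in known_categories]

    # Step 302: are there new-category samples, but too few of them?
    if new_texts and len(new_texts) < min_new:
        # Step 303: also label unselected samples similar to the new category.
        for text, _, _ in predictions:
            if text in labels:
                continue
            if max(jaccard(text, new) for new in new_texts) > sim_threshold:
                labels[text] = annotate(text)
    return labels


# Hypothetical classifier output; the dict lookup stands in for the annotator.
oracle = {"stop calling me": "refuses", "please stop calling": "refuses"}
predictions = [
    ("stop calling me", "unable", 0.4),
    ("please stop calling", "unable", 0.8),
    ("i will pay", "willing", 0.9),
]
labels = select_samples(
    predictions,
    thresholds={"unable": 0.6, "willing": 0.6},
    annotate=lambda text: oracle[text],
    known_categories={"willing", "unable"},
)
```

Here step 301 alone finds one new-category sample, so step 303 runs and adds the second sentence, which the classifier had confidently assigned to an existing category.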

With the embodiment described with reference to Fig. 3, more samples likely to belong to a new category are pushed to labeling, which raises the likelihood of discovering the new category and effectively balances the proportions of samples across categories. Like methods 100 and 200, method 300 can also be applied in scenarios such as debt collection, sales, or customer service.

Fig. 4 shows a schematic diagram of a text classification device 400 based on artificial intelligence processing according to an embodiment of the disclosure. Device 400 may include a memory 401 and a processor 402 coupled to the memory 401. The memory 401 stores instructions that, when executed, cause the processor 402 to perform one or more actions or steps of the various methods described herein (such as method 100 of Fig. 1, method 200 of Fig. 2, and method 300 of Fig. 3).

The memory 401 may include volatile memory and non-volatile memory, such as ROM (read-only memory), RAM (random access memory), a removable disk, a magnetic disk, an optical disc, a USB flash drive, and so on. The processor 402 can be a central processing unit (CPU), a microcontroller, an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, or one or more integrated circuits configured to implement embodiments of the disclosure.

Additionally or alternatively, the above methods can be implemented by a computer program product, that is, a computer-readable storage medium. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for carrying out aspects of the present disclosure. The computer-readable storage medium can be a tangible device that can hold and store instructions for use by an instruction execution device. It can be, for example but without limitation, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of these. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove with instructions stored thereon, and any suitable combination of these. A computer-readable storage medium as used here is not to be interpreted as a transitory signal per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (for example, light pulses through a fiber-optic cable), or electrical signals transmitted through a wire.

In general, the various example embodiments of the disclosure can be implemented in hardware or special-purpose circuits, software, firmware, logic, or any combination thereof. Some aspects can be implemented in hardware, while other aspects can be implemented in firmware or software executable by a controller, microprocessor, or other computing device. Where aspects of the embodiments of the disclosure are illustrated or described as block diagrams, flowcharts, or some other graphical representation, it will be understood that the boxes, devices, systems, techniques, or methods described herein can be implemented, as non-limiting examples, in hardware, software, firmware, special-purpose circuits or logic, general-purpose hardware or controllers or other computing devices, or some combination thereof.

Note that although several modules or units of the device are mentioned in the detailed description above, this division is only exemplary and not mandatory. In fact, according to embodiments of the disclosure, the features and functions of two or more of the modules described above can be embodied in one module. Conversely, the features and functions of one module described above can be further divided and embodied in multiple modules.

The foregoing describes only optional embodiments of the disclosure and is not intended to limit the embodiments of the disclosure. For a person skilled in the art, the embodiments of the disclosure can have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the embodiments of the disclosure shall fall within the scope of protection of the embodiments of the disclosure.

Although embodiments of the disclosure have been described with reference to several specific embodiments, it should be understood that the embodiments of the disclosure are not limited to the specific embodiments disclosed. The embodiments of the disclosure are intended to cover the various modifications and equivalent arrangements included within the spirit and scope of the appended claims. The scope of the appended claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.

Claims

1. A text classification method based on artificial intelligence processing, characterized by comprising the following steps:

A. classifying each text in an unlabeled first text set using a text classification model to determine a confidence for each text in the first text set, wherein the text classification model is generated from a labeled historical text set;

B. determining one or more texts from the first text set based on the confidence of each text in the first text set, and labeling the one or more texts with categories;

C. when the labeled one or more texts include a text of a new category that differs from the categories in the historical text set, updating the historical text set with the labeled one or more texts; and

D. generating a new text classification model from the updated historical text set, for classifying the other unlabeled texts in the first text set.

2. The method according to claim 1, characterized in that the historical text set includes multiple subsets of different categories, each of the subsets including texts of the same category.

3. The text classification method according to claim 1, characterized in that determining, in step B, one or more texts from the first text set based on the confidence of each text in the first text set comprises:

generating a confidence threshold corresponding to each category in the historical text set; and

selecting the one or more texts from the first text set based on the confidence threshold.

4. The text classification method according to claim 3, characterized by further comprising:

adjusting the confidence threshold based on the classification result of each text.

5. The text classification method according to claim 4, characterized in that adjusting the confidence threshold based on the classification result of each text comprises:

calculating the proportion of texts of different categories based on the classification result of each text in the first text set; and

adjusting the confidence threshold based on the proportion of texts of the different categories.

6. The text classification method according to claim 1, characterized in that updating, in step C, the historical text set with the labeled one or more texts when the labeled one or more texts include a text of a new category that differs from the categories in the historical text set comprises:

adding the labeled one or more texts to the historical text set.

7. The text classification method according to claim 6, characterized in that adding the labeled one or more texts to the historical text set comprises:

based on the number of texts of the new category, calculating the similarity between each of the one or more texts and the texts of the new category;

based on the similarity, determining at least one text among the one or more texts, and labeling the at least one text again; and

adding the one or more texts, including the at least one relabeled text, to the historical text set.

8. A text classification device based on artificial intelligence processing, characterized by comprising:

a processor; and

9. The text classification device according to claim 8, characterized in that the historical text set includes multiple subsets of different categories, each of the subsets including texts of the same category.

10. The text classification device according to claim 8, characterized in that determining, in step B, one or more texts from the first text set based on the confidence of each text in the first text set comprises:

11. The text classification device according to claim 10, characterized in that the instructions, when executed, further cause the processor to perform the following steps:

12. The text classification device according to claim 11, characterized in that adjusting the confidence threshold based on the classification result of each text comprises:

calculating the proportion of texts of different categories based on the classification result of each text in the first text set; and

13. The text classification device according to claim 8, characterized in that updating, in step C, the historical text set with the labeled one or more texts when the labeled one or more texts include a text of a new category that differs from the categories in the historical text set comprises:

adding the labeled one or more texts to the historical text set.

14. The text classification device according to claim 13, characterized in that adding the labeled one or more texts to the historical text set comprises:

15. A computer-readable storage medium comprising computer-executable instructions which, when run in a device, cause the device to perform the text classification method based on artificial intelligence processing according to any one of claims 1-7.