CN109189932A - File classification method and device, computer readable storage medium - Google Patents

File classification method and device, computer readable storage medium

Publication number
CN109189932A
Authority
CN
China
Prior art keywords
classification
label
template
adjustment
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811035883.4A
Other languages
Chinese (zh)
Other versions
CN109189932B (en)
Inventor
林江华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201811035883.4A priority Critical patent/CN109189932B/en
Publication of CN109189932A publication Critical patent/CN109189932A/en
Application granted granted Critical
Publication of CN109189932B publication Critical patent/CN109189932B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a text classification method and device and a computer-readable storage medium. The text classification method comprises: classifying a plurality of annotated corpus entries with a text classification model to obtain the model classification label of each entry; selecting the entries whose model classification label is inconsistent with the corresponding annotated classification label as sample corpus entries; converting the text of each sample corpus entry into a word list; grouping the word combinations extracted from each word list by model classification label to obtain the word combinations under each model classification label; and generating classification adjustment templates from the word combinations. A classification adjustment template comprises an original classification label, template content, and an adjustment classification label: the template content comprises the word combination, the original classification label is the model classification label corresponding to the word combination, and the adjustment classification label is the annotated label of the sample corpus entry corresponding to the word combination.

Description

File classification method and device, computer readable storage medium
Technical field
The present disclosure relates to the field of computers, and in particular to a text classification method and device and a computer-readable storage medium.
Background
Text classification technology is widely used in processing electronic text information. The development of deep learning has further broadened the application scenarios of text classification.
Related text classification techniques based on deep learning generally comprise: determining a classification standard; collecting and annotating texts to form a corpus; training a classification model on the corpus; and classifying other texts with the trained classification model.
Summary of the invention
Owing to the limitations of the corpus and of deep learning itself, the accuracy of a classification model cannot reach 100%, and the missing portion of the accuracy is difficult to recover effectively by optimizing the classification model itself.
In view of this, the present disclosure proposes a text classification scheme that can further improve the accuracy of text classification.
According to some embodiments of the present disclosure, a text classification method is provided, comprising: classifying a plurality of annotated corpus entries with a text classification model to obtain the model classification label of each entry; selecting the entries whose model classification label is inconsistent with the corresponding annotated classification label as sample corpus entries; converting the text of each sample corpus entry into a word list; grouping the word combinations extracted from each word list by model classification label to obtain the word combinations under each model classification label; and generating classification adjustment templates from the word combinations, wherein a classification adjustment template comprises an original classification label, template content, and an adjustment classification label, the template content comprises the word combination, the original classification label is the model classification label corresponding to the word combination, and the adjustment classification label is the annotated label of the sample corpus entry corresponding to the word combination.
In some embodiments, the text classification method further comprises: deleting word combinations that appear under multiple model classification labels at the same time.
In some embodiments, the text classification method further comprises: deleting word combinations whose number of occurrences in the sample corpus is less than a threshold.
In some embodiments, a word combination that occurs several times within one sample corpus entry is counted only once.
In some embodiments, the classification adjustment template further comprises a priority, which reflects the likelihood that the adjustment classification label is correct.
In some embodiments, the priority is expressed as $\frac{b}{a}$, where a and b respectively denote the number of occurrences of the word combination of the template content in the sample corpus under the original classification label and under the adjustment classification label.
In some embodiments, the priority is expressed by a formula that additionally involves c, where c denotes the total number of sample corpus entries under the original classification label of the classification adjustment template.
In some embodiments, the text classification method further comprises: classifying a text to be classified with the text classification model to obtain the model classification label of the text to be classified; converting the text to be classified into a word list; taking as matching results the classification adjustment templates that satisfy the following conditions: the model classification label of the text to be classified is consistent with the original classification label of the classification adjustment template, and at least one word combination extracted from the word list of the text to be classified is contained in the template content of the classification adjustment template; when at least one matching result exists and the priority of the highest-priority matching result is greater than or equal to a priority threshold, determining the highest-priority matching result as the matching classification adjustment template; and changing the model classification label of the text to be classified to the adjustment classification label of the matching classification adjustment template, as the classification result.
In some embodiments, a text is converted into a word list by word segmentation and stop-word removal.
In some embodiments, the words in a word list appear in the same order as in the corresponding text.
According to other embodiments of the present disclosure, a text classification device is provided, comprising: a classification unit configured to classify a plurality of annotated corpus entries with a text classification model to obtain the model classification label of each entry; a selection unit configured to select the entries whose model classification label is inconsistent with the corresponding annotated classification label as sample corpus entries; a conversion unit configured to convert the text of each sample corpus entry into a word list; a grouping unit configured to group the word combinations extracted from each word list by model classification label to obtain the word combinations under each model classification label; and a generation unit configured to generate classification adjustment templates from the word combinations, wherein a classification adjustment template comprises an original classification label, template content, and an adjustment classification label, the template content comprises the word combination, the original classification label is the model classification label corresponding to the word combination, and the adjustment classification label is the annotated label of the sample corpus entry corresponding to the word combination.
In some embodiments, the text classification device further comprises: a deletion unit configured to delete word combinations that appear under multiple model classification labels at the same time, or to delete word combinations whose number of occurrences in the sample corpus is less than a threshold.
In some embodiments, the text classification device further comprises: a matching unit configured to take as matching results the classification adjustment templates that satisfy the following conditions: the model classification label of the text to be classified is consistent with the original classification label of the classification adjustment template, and at least one word combination extracted from the word list of the text to be classified is contained in the template content of the classification adjustment template; a determination unit configured to determine the highest-priority matching result as the matching classification adjustment template when at least one matching result exists and the priority of the highest-priority matching result is greater than or equal to a priority threshold; and an adjustment unit configured to change the model classification label of the text to be classified to the adjustment classification label of the matching classification adjustment template, as the classification result.
According to still other embodiments of the present disclosure, a text classification device is provided, comprising: a memory and a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the text classification method of any one of the above embodiments.
According to other embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, the program implementing the text classification method of any one of the above embodiments when executed by a processor.
In the above embodiments, the classification results of the text classification model are post-processed to generate classification adjustment templates, which improves the accuracy of text classification. Generating the classification adjustment templates in this way has no effect on the model training process or on external callers, and is compatible with different model training modes.
Brief description of the drawings
The accompanying drawings, which constitute a part of this specification, illustrate embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
The present disclosure can be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:
Fig. 1 shows a flowchart of some embodiments of the text classification method according to the present disclosure;
Fig. 2 shows a flowchart of other embodiments of the text classification method according to the present disclosure;
Fig. 3 shows a flowchart of still other embodiments of the text classification method according to the present disclosure;
Fig. 4 shows a block diagram of some embodiments of the text classification device according to the present disclosure;
Fig. 5 shows a block diagram of other embodiments of the text classification device according to the present disclosure;
Fig. 6 is a block diagram of a computer system for implementing some embodiments of the present disclosure.
Detailed description
Various exemplary embodiments of the present disclosure are now described in detail with reference to the accompanying drawings. It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.
It should also be understood that, for ease of description, the parts shown in the drawings are not drawn to actual scale.
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present disclosure, its application, or its uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but, where appropriate, should be regarded as part of the granted specification.
In all examples shown and discussed herein, any specific value should be interpreted as merely illustrative and not as a limitation. Other examples of the exemplary embodiments may therefore have different values.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; once an item is defined in one drawing, it need not be discussed further in subsequent drawings.
Fig. 1 shows a flowchart of some embodiments of the text classification method according to the present disclosure. As shown in Fig. 1, the text classification method includes steps S1 to S5.
In step S1, a plurality of annotated corpus entries are classified with a text classification model to obtain the model classification label of each entry.
In some embodiments, the text classification model is trained with a deep-learning neural network. The annotated corpus entries can be taken from the corpus used to train the text classification model. An annotated corpus entry may include fields such as the text and the annotated classification label.
In step S2, the annotated corpus entries whose model classification label is inconsistent with the corresponding annotated classification label are selected as sample corpus entries.
In some embodiments, the entries whose model classification label is inconsistent with the corresponding annotated classification label can be filtered out by field matching. When the text classification model is trained, the model classification labels of the annotated corpus entries can also be compared with the corresponding annotated classification labels, and the comparison result used to adjust the model so that the proportion of entries on which the two labels agree increases. However, limited by deep learning technology and by the scale of the corpus, this proportion cannot reach 100%, so the classification accuracy of the text classification model falls short of the desired value. Starting from a fully trained text classification model, the present disclosure further processes the annotated corpus entries on which the two labels disagree and generates classification adjustment templates, which are then used to adjust the classification results of the text classification model and thereby further improve the accuracy of text classification.
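Steps S1 and S2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `AnnotatedEntry`, `SampleEntry`, `select_samples`, and `toy_classifier` are hypothetical, and the classifier is a stand-in that returns a fixed label where a trained model would be called.

```python
from typing import Callable, List, NamedTuple


class AnnotatedEntry(NamedTuple):
    text: str
    annotated_label: str


class SampleEntry(NamedTuple):
    text: str
    annotated_label: str
    model_label: str


def select_samples(
    entries: List[AnnotatedEntry], classify: Callable[[str], str]
) -> List[SampleEntry]:
    """Classify every entry (S1) and keep those where the labels disagree (S2)."""
    samples = []
    for entry in entries:
        model_label = classify(entry.text)  # step S1
        if model_label != entry.annotated_label:  # step S2: field comparison
            samples.append(
                SampleEntry(entry.text, entry.annotated_label, model_label)
            )
    return samples


def toy_classifier(text: str) -> str:
    # Hypothetical stand-in for a trained model: labels everything "weather".
    return "weather"


entries = [
    AnnotatedEntry("Beijing temperature is very high", "weather"),
    AnnotatedEntry("Temperature is very high when playing games", "mobile phone"),
]
samples = select_samples(entries, toy_classifier)
```

Only the second entry ends up in `samples`, because it is the only one on which the stand-in classifier and the annotation disagree.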
In step S3, the text of each sample corpus entry is converted into a word list.
In some embodiments, the text is converted into a word list by word segmentation and stop-word removal. Stop words are, for example, words that have no influence on the semantics of the text.
Each sample corpus entry corresponds to one word list. In the word list, the words appear in the same order as in the corresponding text. For example, for a sample corpus entry whose annotated label is "weather" and whose text is "can wear how many, wear how many", the word list obtained after segmentation is "can", "wear", "how many", "wear", "how many". As this shows, the same word may appear at different positions in one list, such as "wear" and "how many" in the example above.
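The conversion in step S3 might look like the sketch below. It rests on two assumptions that are not in the source: segmentation is approximated with a regular expression over English text (Chinese text would need a dedicated word segmenter), and `STOP_WORDS` is a made-up list. What the sketch preserves from the description is that word order is kept and repeated words are not merged.

```python
import re
from typing import FrozenSet, List

# Hypothetical stop-word list; a real system would use a curated one.
STOP_WORDS: FrozenSet[str] = frozenset({"is", "the", "a", "of"})


def to_word_list(text: str, stop_words: FrozenSet[str] = STOP_WORDS) -> List[str]:
    """Segment text into words and drop stop words, keeping order and repeats."""
    words = re.findall(r"\w+", text.lower())  # naive segmentation (assumption)
    return [w for w in words if w not in stop_words]


word_list = to_word_list("Wear more, wear less: the weather is mild")
```

Here `word_list` keeps both occurrences of "wear" in their original positions, while "the" and "is" are dropped.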
In step S4, the word combinations extracted from each word list are grouped by model classification label to obtain the word combinations under each model classification label.
In some embodiments, word combinations are extracted from the word list of each sample corpus entry with the word order unchanged, that is, the order of the words within a word combination is preserved. The length of the extracted word combinations can be chosen according to actual needs; for example, the length can be 1 to 3. Taking again the sample corpus entry whose annotated label is "weather" and whose text is "can wear how many, wear how many", and assuming that only word combinations of length 2 are taken, the extracted word combinations are "can wear", "wear how many", "how many wear", and "wear how many".
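The extraction described above can be sketched as a contiguous n-gram generator. Treating word combinations as runs of adjacent words is an assumption drawn from the length-2 example; the function name `extract_combinations` and its parameters are hypothetical.

```python
from typing import List, Tuple


def extract_combinations(
    words: List[str], min_len: int = 1, max_len: int = 3
) -> List[Tuple[str, ...]]:
    """Return all runs of adjacent words with min_len..max_len words, in order."""
    combos = []
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            combos.append(tuple(words[i:i + n]))
    return combos


words = ["can", "wear", "how many", "wear", "how many"]
pairs = extract_combinations(words, min_len=2, max_len=2)
```

For the example word list, `pairs` contains ("can", "wear"), ("wear", "how many"), ("how many", "wear"), and a second ("wear", "how many"), matching the four combinations listed in the text.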
In step S5, classification adjustment templates are generated from the word combinations.
A classification adjustment template is a data set that specifies how a model classification result meeting certain conditions is adjusted to another class. A classification adjustment template includes an original classification label, template content, and an adjustment classification label. The template content includes the word combination, the original classification label is the model classification label corresponding to the word combination, and the adjustment classification label is the annotated label of the sample corpus entry corresponding to the word combination.
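One possible in-memory representation of such a template is shown below. The class name `AdjustmentTemplate` and its field names are hypothetical; only the three fields themselves come from the description.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AdjustmentTemplate:
    original_label: str       # model classification label of the word combination
    content: Tuple[str, ...]  # the word combination itself
    adjusted_label: str       # annotated label of the corresponding sample entry


template = AdjustmentTemplate(
    original_label="weather",
    content=("temperature", "very high"),
    adjusted_label="mobile phone",
)
```

Making the dataclass frozen keeps templates hashable, which is convenient if they are later stored in sets or used as dictionary keys.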
The generated classification adjustment templates are described below with reference to Tables 1 to 3. Table 1 shows a group of sample corpus entries with their model classification labels and annotated classification labels. As Table 1 shows, two classification adjustment templates can be generated for the word combination "temperature is very high". The template content of both is "temperature is very high" and the original classification label of both is "weather", but the adjustment classification label of the 1st classification adjustment template is "mobile phone", while that of the 2nd classification adjustment template is "universe".
Model classification label | Annotated classification label | Sample corpus entry
Weather | Weather | Beijing temperature is very high
Weather | Weather | How is the weather tomorrow
Weather | Mobile phone | Temperature is very high when playing games
Weather | Mobile phone | Charging temperature is very high, feels hot to the touch
Weather | Universe | Sun surface temperature is very high
Table 1
A classification adjustment template can also include a priority, which reflects the likelihood that the adjustment classification label is correct. In some embodiments, the priority can be expressed as $\frac{b}{a}$, where a and b respectively denote the number of occurrences of the word combination of the template content in the sample corpus under the original classification label and under the adjustment classification label. According to the data in Table 1, a = 4 and b = 2 for the 1st classification adjustment template, giving a priority of 0.5; for the 2nd classification adjustment template, a = 4 and b = 1, giving a priority of 0.25.
Tables 2 and 3 show examples of the generated 1st and 2nd classification adjustment templates, respectively.
Field name | Field value
Original classification label | Weather
Template content | Temperature is very high
Adjustment classification label | Mobile phone
Priority | 0.5
Table 2
Field name | Field value
Original classification label | Weather
Template content | Temperature is very high
Adjustment classification label | Universe
Priority | 0.25
Table 3
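The two templates in Tables 2 and 3 can be reproduced from the Table 1 data with a short sketch. Several things here are assumptions rather than the patent's method: the entries are pre-segmented by hand, "temperature is very high" is represented as the word combination ("temperature", "very high"), the priority is taken as b divided by a (which agrees with the worked values 0.5 and 0.25), and b is counted only among entries whose model label is the original label. The names `SAMPLES`, `contains`, and `build_templates` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (model label, annotated label, hand-segmented word list) for each Table 1 row.
SAMPLES: List[Tuple[str, str, List[str]]] = [
    ("weather", "weather", ["Beijing", "temperature", "very high"]),
    ("weather", "weather", ["tomorrow", "weather", "how"]),
    ("weather", "mobile phone", ["play", "game", "temperature", "very high"]),
    ("weather", "mobile phone", ["charging", "temperature", "very high", "hot"]),
    ("weather", "universe", ["sun", "surface", "temperature", "very high"]),
]


def contains(words: List[str], combo: Tuple[str, ...]) -> bool:
    """True if combo occurs as a run of adjacent words in the word list."""
    n = len(combo)
    return any(tuple(words[i:i + n]) == combo for i in range(len(words) - n + 1))


def build_templates(samples, combo: Tuple[str, ...], original_label: str) -> List[Dict]:
    """Build one template per annotated label that differs from the model label."""
    hits = [s for s in samples if s[0] == original_label and contains(s[2], combo)]
    a = len(hits)  # occurrences under the original classification label
    b_counts = Counter(s[1] for s in hits if s[1] != original_label)
    return [
        {
            "original_label": original_label,
            "content": combo,
            "adjusted_label": label,
            "priority": b / a,  # assumed formula: b / a
        }
        for label, b in b_counts.items()
    ]


templates = build_templates(SAMPLES, ("temperature", "very high"), "weather")
```

With this data, a is 4, and the two resulting templates carry the adjustment labels "mobile phone" (b = 2) and "universe" (b = 1).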
In other embodiments, the priority can also be expressed by a formula that additionally involves c, where c denotes the total number of sample corpus entries under the original classification label of the classification adjustment template. According to the data in Table 1, c = 5, so the priority of the 1st classification adjustment template becomes about 0.45 and that of the 2nd classification adjustment template becomes about 0.23.
According to actual needs, templates can also be added manually in the format of the classification adjustment template, as a supplement to the automatically generated classification adjustment templates. The classification adjustment templates generated in the above embodiments can also be modified or deleted manually to improve their effect. Classification adjustment templates that have been manually added or edited have a higher priority, which can be set to 1.
Fig. 2 shows a flowchart of other embodiments of the text classification method according to the present disclosure. Fig. 2 differs from Fig. 1 in that, after the word combinations under each model classification label have been obtained by grouping, the text classification method further includes steps S41 and S42.
In step S41, word combinations that appear under multiple model classification labels at the same time are deleted.
The same word combination may occur in different sample corpus entries. If the same word combination appears under different model classification labels, it has little influence on the classification result of a text. For example, words such as "I" may occur in many sample corpus entries but have little effect on the semantics of the text, so such word combinations can be deleted; that is, no classification adjustment template is generated for them. This reduces the work of generating classification adjustment templates without significantly affecting the improvement in classification accuracy.
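Step S41 can be sketched as a filter over a mapping from model classification label to word combinations. The mapping layout and the name `drop_shared_combinations` are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Combo = Tuple[str, ...]


def drop_shared_combinations(
    combos_by_label: Dict[str, List[Combo]]
) -> Dict[str, List[Combo]]:
    """Remove every word combination that occurs under more than one label."""
    labels_of: Dict[Combo, Set[str]] = {}
    for label, combos in combos_by_label.items():
        for combo in combos:
            labels_of.setdefault(combo, set()).add(label)
    return {
        label: [c for c in combos if len(labels_of[c]) == 1]
        for label, combos in combos_by_label.items()
    }


grouped = {
    "weather": [("I",), ("temperature", "very high")],
    "mobile phone": [("I",), ("battery",)],
}
filtered = drop_shared_combinations(grouped)
```

In this example ("I",) is dropped from both groups because it appears under two model classification labels, while the label-specific combinations are kept.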
In step S42, word combinations whose number of occurrences in the sample corpus is less than a threshold are deleted.
In some embodiments, when the occurrences of a word combination in the sample corpus are counted, a word combination that occurs several times within one sample corpus entry is counted only once. For example, for the aforementioned sample corpus entry whose annotated label is "weather" and whose text is "can wear how many, wear how many", the counts for the word combinations "can wear", "wear how many", "how many wear", and "wear how many" are: "can wear" = 1, "wear how many" = 1, "how many wear" = 1.
If the number of times a word combination occurs across all sample corpus entries is less than the threshold (for example, 5), its frequency is too low for it to be meaningful for classification adjustment. For example, if the word combination "can wear" occurs in only 4 of all the sample corpus entries and the threshold is set to 5, the word combination can be deleted; that is, no classification adjustment template is generated for it. This also reduces the work of generating classification adjustment templates without significantly affecting the improvement in classification accuracy.
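The counting rule and the threshold filter in step S42 can be sketched as below. The sketch assumes length-2 combinations of adjacent words, as in the example; `count_entries` and `keep_frequent` are hypothetical names.

```python
from collections import Counter
from typing import Dict, List, Tuple

Combo = Tuple[str, ...]


def count_entries(word_lists: List[List[str]], n: int = 2) -> Counter:
    """Count, for each length-n combination, how many entries contain it."""
    counts: Counter = Counter()
    for words in word_lists:
        # A set ensures a combination repeated within one entry counts once.
        combos = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(combos)
    return counts


def keep_frequent(counts: Counter, threshold: int = 5) -> Dict[Combo, int]:
    """Keep combinations that occur in at least `threshold` entries."""
    return {combo: k for combo, k in counts.items() if k >= threshold}


counts = count_entries([["can", "wear", "how many", "wear", "how many"]])
frequent = keep_frequent(counts, threshold=5)
```

For the single example entry, ("wear", "how many") is counted once even though it occurs twice, and with a threshold of 5 nothing survives the filter.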
It should be understood that only one of steps S41 and S42 may be executed. Step S42 may also be executed before step S41 or at the same time as step S41; that is, the execution order of steps S41 and S42 does not affect the implementation of the text classification scheme of the present disclosure.
In the above embodiments, the classification results of the text classification model are post-processed to generate classification adjustment templates, which improves the accuracy of text classification. Generating the classification adjustment templates in this way has no effect on the model training process or on external callers, and is compatible with different model training modes.
Fig. 3 shows a flowchart of still other embodiments of the text classification method according to the present disclosure. As shown in Fig. 3, the text classification method further includes steps S6 to S10.
In step S6, the text to be classified is classified with the text classification model to obtain the model classification label of the text to be classified.
The text to be classified is classified with the same text classification model that was used to classify the annotated corpus entries, yielding a preliminary classification result. The preliminary classification result includes fields such as the text and the model classification label.
The application of classification adjustment templates is explained below using the text to be classified "charging temperature is very high, easy to explode" as an example. In step S6, for example, the model classification label obtained by classifying this text is "weather".
In step S7, the text to be classified is converted into a word list. As in step S3, the text can be converted into a word list by word segmentation and stop-word removal. For example, the text to be classified "charging temperature is very high, easy to explode" can be converted into the word list "charging", "temperature", "very high", "easy", "explode".
In step S8, the classification adjustment templates that satisfy the following conditions are taken as matching results: the model classification label of the text to be classified is consistent with the original classification label of the classification adjustment template, and at least one word combination extracted from the word list of the text to be classified is contained in the template content of the classification adjustment template.
Word combinations are extracted in a way similar to step S4. For example, word combinations such as "charging temperature", "temperature very high", "very high easy", and "easy explode" can be extracted from the word list of the text to be classified.
As described above, the template content of both the 1st and the 2nd classification adjustment template includes the word combination "temperature is very high", and the original classification label of both is "weather". Both classification adjustment templates therefore satisfy the conditions and can serve as matching results. Assuming that no matching classification adjustment template is found for the other word combinations, 2 matching results are obtained for the text to be classified in this example.
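The matching condition of step S8 can be sketched as follows. The templates are written as plain dictionaries with hypothetical field names, the third template is invented to show a non-match, and word combinations are assumed to be runs of adjacent words of length 1 to 3.

```python
from typing import Dict, List, Set, Tuple

Combo = Tuple[str, ...]

# Templates 1 and 2 follow Tables 2 and 3; the third is a made-up non-match.
TEMPLATES: List[Dict] = [
    {"original_label": "weather", "content": ("temperature", "very high"),
     "adjusted_label": "mobile phone", "priority": 0.5},
    {"original_label": "weather", "content": ("temperature", "very high"),
     "adjusted_label": "universe", "priority": 0.25},
    {"original_label": "mobile phone", "content": ("temperature", "very high"),
     "adjusted_label": "weather", "priority": 0.3},
]


def match_templates(
    model_label: str, words: List[str], templates: List[Dict], max_len: int = 3
) -> List[Dict]:
    """Return templates whose original label and content both match the text."""
    combos: Set[Combo] = {
        tuple(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }
    return [
        t for t in templates
        if t["original_label"] == model_label and t["content"] in combos
    ]


words = ["charging", "temperature", "very high", "easy", "explode"]
matches = match_templates("weather", words, TEMPLATES)
```

With the model classification label "weather", the first two templates match and the third is rejected because its original classification label differs.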
In step S9, the matching classification adjustment template is determined from the matching results.
When at least one matching result exists, the matching result with the highest priority is selected. In some embodiments, a priority threshold can be set, and the highest-priority matching result is determined as the matching classification adjustment template only when its priority is greater than or equal to the threshold. The priority threshold can be set according to the actual application.
According to Tables 2 and 3, the priorities of the 1st and 2nd classification adjustment templates, which are the matching results, are 0.5 and 0.25 respectively. When the priority threshold is set to 0.5, the priority of the 1st classification adjustment template satisfies the condition, so the highest-priority matching result, namely the 1st classification adjustment template, can be taken as the matching classification adjustment template.
Conversely, if the priority threshold is set above 0.5, for example to 0.6, none of the matching results in the above example satisfies the condition. It is then determined that no matching classification adjustment template exists, no classification adjustment is made, and the model classification label obtained in step S6 is output directly as the classification result.
When no matching result exists, it is likewise determined that no matching classification adjustment template exists, no classification adjustment is made, and the model classification label obtained in step S6 is output directly as the classification result.
In step S10, the model classification label of the text to be classified is changed to the adjustment classification label of the matching classification adjustment template, as the classification result.
According to Table 2, the adjustment classification label of the 1st classification adjustment template, which is the matching classification adjustment template, is "mobile phone", so "mobile phone" can be taken as the classification result of the text "charging temperature is very high, easy to explode".
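Steps S9 and S10 reduce to picking the highest-priority match and applying it only if it clears the threshold. The sketch below uses dictionaries with hypothetical field names; the function name `adjust_label` is likewise an assumption.

```python
from typing import Dict, List


def adjust_label(
    model_label: str, matches: List[Dict], priority_threshold: float
) -> str:
    """Return the adjusted label if the best match clears the threshold."""
    if not matches:
        return model_label  # no matching result: keep the model's label
    best = max(matches, key=lambda t: t["priority"])  # step S9
    if best["priority"] >= priority_threshold:
        return best["adjusted_label"]  # step S10
    return model_label  # best match is below the threshold


matches = [
    {"adjusted_label": "mobile phone", "priority": 0.5},
    {"adjusted_label": "universe", "priority": 0.25},
]
result_low_threshold = adjust_label("weather", matches, priority_threshold=0.5)
result_high_threshold = adjust_label("weather", matches, priority_threshold=0.6)
result_no_match = adjust_label("weather", [], priority_threshold=0.5)
```

These three calls correspond to the three cases in the text: a threshold of 0.5 yields "mobile phone", a threshold of 0.6 leaves "weather" unchanged, and an empty match list also leaves "weather" unchanged.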
In the above embodiments, classification adjustment templates are applied after the text classification model to adjust its classification results. This can further improve the accuracy of text classification and also allows classification errors to be corrected in a targeted way.
Fig. 4 shows the block diagram of some embodiments of the document sorting apparatus according to the disclosure.
As shown in figure 4, document sorting apparatus 4 includes taxon 41, selecting unit 42, conversion unit 43, sorts out unit 44 and generation unit 45.
Taxon 41 is configured as classifying to text using textual classification model.In some embodiments, classify Unit 41 is configured as classifying to multiple mark corpus using textual classification model, obtains the model point of each mark corpus Class label, such as execute step S1.In further embodiments, taxon 41 can be additionally configured to utilize text classification mould Type treats classifying text and classifies, and obtains the category of model label of text to be sorted, such as execute step S6.
Selecting unit 42 is configured as the preference pattern tag along sort mark language inconsistent with corresponding mark tag along sort Material as sample corpus, such as executes step S2.
Conversion unit 43 is configured as text being separately converted to word list.In some embodiments, conversion unit 43 It is configured as the text in each sample corpus being separately converted to word list, such as executes step S3.In other implementations In example, conversion unit 43 is configured as converting text to be sorted to word list, such as executes step S7.
Sort out unit 44 be configured as according to category of model label to the word combination extracted from each word list into Row is sorted out, and obtains the word combination under each category of model label, such as execute step S4.
Generation unit 45 is configured as generating classification adjustment template according to word combination, such as executes step S5.Such as preceding institute It states, the classification adjustment template of generation includes that original classification label, template content and adjustment tag along sort, the template content include The word combination, the original classification label are the corresponding category of model label of the word combination, and the adjustment tag along sort is The mark label of the corresponding sample corpus of the word combination.
In some embodiments, document sorting apparatus 4 further includes deleting unit 46.In some embodiments, unit is deleted 46 are configured as deleting while appearing in the word combination under multiple category of model labels, such as execute step S41.At other In embodiment, deletion unit 46 is configured as deleting the frequency of occurrence in sample corpus and is less than the word combination of threshold value, such as holds Row step S42.Utilize deletion unit, it is possible to reduce generate the workload of classification adjustment template, and not appreciably affect classification accurately The promotion of rate.
In further embodiments, the text classification apparatus 4 further includes a matching unit 47, a determination unit 48, and an adjustment unit 49.
The matching unit 47 is configured to take, as matching results, the classification adjustment templates that satisfy the following conditions: the model classification label of the text to be classified is consistent with the original classification label of the classification adjustment template, and at least one word combination extracted from the word list of the text to be classified is included in the template content of the classification adjustment template. For example, the matching unit 47 may execute step S8.
The determination unit 48 is configured to determine the matching classification adjustment template from the matching results, for example by executing step S9. In some embodiments, when there is at least one matching result and the priority of the highest-priority matching result is greater than or equal to a priority threshold, the highest-priority matching result is determined as the matching classification adjustment template.
The adjustment unit 49 is configured to change the model classification label of the text to be classified to the adjustment classification label of the matching classification adjustment template, and to use it as the classification result, for example by executing step S10.
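The cooperation of the matching, determination, and adjustment units can be sketched as below. The names `Template` and `adjust_label` are hypothetical, each template is assumed to hold a single word combination as its content, and the priority is treated as a precomputed number because its formula is not reproduced here.

```python
from typing import NamedTuple, Tuple


class Template(NamedTuple):
    original_label: str
    content: Tuple[str, ...]  # assumed: one word combination per template
    adjustment_label: str
    priority: float           # assumed to be precomputed


def adjust_label(model_label, combinations, templates, priority_threshold):
    """Return the final classification result for one text to be classified."""
    # Matching: same original label, and the template's combination occurs in the text.
    matches = [
        t for t in templates
        if t.original_label == model_label and t.content in combinations
    ]
    if not matches:
        return model_label  # no matching result: keep the model's label
    # Determination: take the highest-priority match and check the threshold.
    best = max(matches, key=lambda t: t.priority)
    if best.priority < priority_threshold:
        return model_label
    # Adjustment: replace the model label with the adjustment label.
    return best.adjustment_label
```

When no template matches, or the best match falls below the threshold, the sketch leaves the model classification label unchanged.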
Fig. 5 shows a block diagram of other embodiments of the text classification apparatus according to the present disclosure.
As shown in Fig. 5, the apparatus 5 of this embodiment includes a memory 51 and a processor 52 coupled to the memory 51. The memory 51 is used to store instructions for executing the corresponding embodiments of the text classification method. The processor 52 is configured to execute the text classification method of any of the embodiments of the present disclosure based on the instructions stored in the memory 51.
In the above embodiments, the classification results are adjusted using the classification adjustment templates of the text classification apparatus, which can improve the accuracy of text classification.
In addition to the text classification method and apparatus, embodiments of the present disclosure may also take the form of a computer program product implemented on one or more non-volatile storage media containing computer program instructions. Accordingly, embodiments of the present disclosure also include a computer-readable storage medium having computer instructions stored thereon which, when executed by a processor, implement the text classification method of any of the foregoing embodiments.
Fig. 6 is a block diagram showing a computer system for implementing some embodiments of the present disclosure.
As shown in Fig. 6, the computer system 60 may take the form of a general-purpose computing device. The computer system 60 includes a memory 610, a processor 620, and a bus 600 connecting the different system components.
The memory 610 may include, for example, a system memory and a non-volatile storage medium. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs. The system memory may include volatile storage media, such as random access memory (RAM) and/or cache memory. The non-volatile storage medium stores, for example, instructions for executing the corresponding embodiments of the text classification method. Non-volatile storage media include, but are not limited to, magnetic disk storage, optical storage, and flash memory.
The processor 620 may be implemented as a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, or discrete hardware components such as discrete gates or transistors. Accordingly, each module, such as a judgment module and a determination module, may be implemented by a central processing unit (CPU) running instructions stored in a memory that perform the corresponding steps, or by a dedicated circuit that performs the corresponding steps.
The bus 600 may use any of a variety of bus structures. For example, bus structures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, and a Peripheral Component Interconnect (PCI) bus.
The computer system 60 may also include an input/output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, and 650, the memory 610, and the processor 620 may be connected to one another by the bus 600. The input/output interface 630 may provide a connection interface for input/output devices such as a display, a mouse, and a keyboard. The network interface 640 provides a connection interface for various networked devices. The storage interface 650 provides a connection interface for external storage devices such as floppy disks, USB flash drives, and SD cards.
Various aspects of the present disclosure are described here with reference to flowcharts and/or block diagrams of methods, apparatuses, and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable text classification apparatus to produce a machine, such that execution of the instructions by the processor produces means for implementing the functions specified in one or more blocks of the flowcharts and/or block diagrams.
These computer-readable program instructions may also be stored in a computer-readable memory and cause a computer to operate in a particular manner, thereby producing an article of manufacture that includes instructions implementing the functions specified in one or more blocks of the flowcharts and/or block diagrams.
The present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
Thus far, some embodiments of the present disclosure have been described in detail by way of example. It should be understood that the above examples are for illustration only and are not intended to limit the scope of the present disclosure. Those skilled in the art may change, modify, replace, vary, or combine the above embodiments without departing from the scope of the present disclosure.

Claims (15)

1. A text classification method, comprising:
classifying multiple annotated corpora using a text classification model to obtain a model classification label for each annotated corpus;
selecting, as sample corpora, the annotated corpora whose model classification label is inconsistent with the corresponding annotated classification label;
converting the text in each sample corpus into a word list;
grouping the word combinations extracted from each word list according to model classification label to obtain the word combinations under each model classification label; and
generating a classification adjustment template from a word combination, the classification adjustment template including an original classification label, template content, and an adjustment classification label, wherein the template content includes the word combination, the original classification label is the model classification label corresponding to the word combination, and the adjustment classification label is the annotated label of the sample corpus corresponding to the word combination.
2. The text classification method according to claim 1, further comprising: deleting word combinations that appear under multiple model classification labels at the same time.
3. The text classification method according to claim 1, further comprising: deleting word combinations whose number of occurrences in the sample corpora is less than a threshold.
4. The text classification method according to claim 3, wherein a word combination that occurs multiple times in one sample corpus is counted only once.
5. The text classification method according to claim 1, wherein the classification adjustment template further includes a priority, the priority reflecting the likelihood that the adjustment classification label is correct.
6. The text classification method according to claim 5, wherein the priority is expressed by a formula in a and b (the formula is not reproduced in this text), where a and b respectively denote the number of occurrences of the word combination in the template content in the sample corpora under the original classification label and under the adjustment classification label.
7. The text classification method according to claim 6, wherein the priority is expressed by a formula that also involves c (the formula is not reproduced in this text), where c denotes the total number of sample corpora under the original classification label of the classification adjustment template.
8. The text classification method according to claim 5, further comprising:
classifying a text to be classified using the text classification model to obtain the model classification label of the text to be classified;
converting the text to be classified into a word list;
taking, as matching results, the classification adjustment templates that satisfy the following conditions: the model classification label of the text to be classified is consistent with the original classification label of the classification adjustment template, and at least one word combination extracted from the word list of the text to be classified is included in the template content of the classification adjustment template;
when there is at least one matching result and the priority of the highest-priority matching result is greater than or equal to a priority threshold, determining the highest-priority matching result as the matching classification adjustment template; and
changing the model classification label of the text to be classified to the adjustment classification label of the matching classification adjustment template, as the classification result.
9. The text classification method according to any one of claims 1 to 8, wherein the text is converted into a word list by performing word segmentation and stop-word removal on the text.
10. The text classification method according to any one of claims 1 to 8, wherein the order of the words in the word list is the same as in the corresponding text.
11. A text classification apparatus, comprising:
a classification unit configured to classify multiple annotated corpora using a text classification model to obtain a model classification label for each annotated corpus;
a selecting unit configured to select, as sample corpora, the annotated corpora whose model classification label is inconsistent with the corresponding annotated classification label;
a conversion unit configured to convert the text in each sample corpus into a word list;
a sorting unit configured to group the word combinations extracted from each word list according to model classification label to obtain the word combinations under each model classification label; and
a generation unit configured to generate a classification adjustment template from a word combination, the classification adjustment template including an original classification label, template content, and an adjustment classification label, wherein the template content includes the word combination, the original classification label is the model classification label corresponding to the word combination, and the adjustment classification label is the annotated label of the sample corpus corresponding to the word combination.
12. The text classification apparatus according to claim 11, further comprising:
a deleting unit configured to delete word combinations that appear under multiple model classification labels at the same time, or to delete word combinations whose number of occurrences in the sample corpora is less than a threshold.
13. The text classification apparatus according to claim 11, further comprising:
a matching unit configured to take, as matching results, the classification adjustment templates that satisfy the following conditions: the model classification label of the text to be classified is consistent with the original classification label of the classification adjustment template, and at least one word combination extracted from the word list of the text to be classified is included in the template content of the classification adjustment template;
a determination unit configured to, when there is at least one matching result and the priority of the highest-priority matching result is greater than or equal to a priority threshold, determine the highest-priority matching result as the matching classification adjustment template; and
an adjustment unit configured to change the model classification label of the text to be classified to the adjustment classification label of the matching classification adjustment template, as the classification result.
14. A text classification apparatus, comprising:
a memory; and
a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the text classification method according to any one of claims 1-10.
15. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the text classification method according to any one of claims 1-10.
CN201811035883.4A 2018-09-06 2018-09-06 Text classification method and device and computer-readable storage medium Active CN109189932B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811035883.4A CN109189932B (en) 2018-09-06 2018-09-06 Text classification method and device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811035883.4A CN109189932B (en) 2018-09-06 2018-09-06 Text classification method and device and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN109189932A true CN109189932A (en) 2019-01-11
CN109189932B CN109189932B (en) 2021-02-26

Family

ID=64914969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811035883.4A Active CN109189932B (en) 2018-09-06 2018-09-06 Text classification method and device and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN109189932B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130138641A1 (en) * 2009-12-30 2013-05-30 Google Inc. Construction of text classifiers
CN104182423A (en) * 2013-05-27 2014-12-03 华东师范大学 Conditional random field-based automatic Chinese personal name recognition method
US20140358539A1 (en) * 2013-05-29 2014-12-04 Tencent Technology (Shenzhen) Company Limited Method and apparatus for building a language model
US20160283583A1 (en) * 2014-03-14 2016-09-29 Tencent Technology (Shenzhen) Company Limited Method, apparatus, and storage medium for text information processing
CN107291775A (en) * 2016-04-11 2017-10-24 北京京东尚科信息技术有限公司 The reparation language material generation method and device of error sample
CN105955975A (en) * 2016-04-15 2016-09-21 北京大学 Knowledge recommendation method for academic literature
CN106951472A (en) * 2017-03-06 2017-07-14 华侨大学 A kind of multiple sensibility classification method of network text
CN107894980A (en) * 2017-12-06 2018-04-10 陈件 A kind of multiple statement is to corpus of text sorting technique and grader
CN108108355A (en) * 2017-12-25 2018-06-01 北京牡丹电子集团有限责任公司数字电视技术中心 Text emotion analysis method and system based on deep learning

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674263A (en) * 2019-12-04 2020-01-10 广联达科技股份有限公司 Method and device for automatically classifying model component files
CN110674263B (en) * 2019-12-04 2022-02-08 广联达科技股份有限公司 Method and device for automatically classifying model component files

Also Published As

Publication number Publication date
CN109189932B (en) 2021-02-26

Similar Documents

Publication Publication Date Title
CN108804512B (en) Text classification model generation device and method and computer readable storage medium
US11138250B2 (en) Method and device for extracting core word of commodity short text
CN106055538B (en) The automatic abstracting method of the text label that topic model and semantic analysis combine
CN106445919A (en) Sentiment classifying method and device
CN103294817A (en) Text feature extraction method based on categorical distribution probability
CN111160452A (en) Multi-modal network rumor detection method based on pre-training language model
CN104142912A (en) Accurate corpus category marking method and device
CN109598307B (en) Data screening method and device, server and storage medium
CN109241297B (en) Content classification and aggregation method, electronic equipment, storage medium and engine
CN108090099B (en) Text processing method and device
CN105205124A (en) Semi-supervised text sentiment classification method based on random feature subspace
CN109684476A (en) A kind of file classification method, document sorting apparatus and terminal device
CN112507704A (en) Multi-intention recognition method, device, equipment and storage medium
CN110110035A (en) Data processing method and device and computer readable storage medium
CN110717040A (en) Dictionary expansion method and device, electronic equipment and storage medium
CN105446955A (en) Adaptive word segmentation method
CN105912525A (en) Sentiment classification method for semi-supervised learning based on theme characteristics
CN103631874A (en) UGC label classification determining method and device for social platform
CN103020167A (en) Chinese text classification method for computer
CN110287311A (en) File classification method and device, storage medium, computer equipment
CN110738046A (en) Viewpoint extraction method and device
CN107818173B (en) Vector space model-based Chinese false comment filtering method
CN109993216A (en) A kind of file classification method and its equipment based on K arest neighbors KNN
CN107704869B (en) Corpus data sampling method and model training method
CN109902157A (en) A kind of training sample validation checking method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant