CN109189932A - Text classification method and device, computer-readable storage medium - Google Patents
- Publication number
- CN109189932A CN201811035883.4A
- Authority
- CN
- China
- Prior art keywords
- classification
- label
- template
- adjustment
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present disclosure relates to a text classification method and device, and a computer-readable storage medium. The text classification method comprises: classifying a plurality of annotated corpus entries using a text classification model to obtain a model classification label for each annotated corpus entry; selecting, as sample corpus entries, the annotated corpus entries whose model classification label is inconsistent with the corresponding annotated classification label; converting the text in each sample corpus entry into a word list; grouping the word combinations extracted from each word list by model classification label to obtain the word combinations under each model classification label; and generating classification adjustment templates from the word combinations. A classification adjustment template comprises an original classification label, template content, and an adjusted classification label. The template content comprises the word combination, the original classification label is the model classification label corresponding to the word combination, and the adjusted classification label is the annotated label of the sample corpus entry corresponding to the word combination.
Description
Technical field
The present disclosure relates to the field of computers, and in particular to a text classification method and device, and a computer-readable storage medium.
Background
Text classification technology is widely used in electronic text processing. The development of deep learning has further expanded the application scenarios of text classification.
Related text classification techniques based on deep learning generally include: determining a classification standard; collecting and annotating texts to form a corpus; training a classification model on the corpus; and classifying other texts with the trained classification model.
Summary of the invention
Owing to the limitations of the corpus and of deep learning itself, the accuracy of a classification model cannot reach 100%, and the missing portion of accuracy is difficult to recover effectively by optimizing the classification model itself.
In view of this, the present disclosure proposes a text classification scheme that can further improve the accuracy of text classification.
According to some embodiments of the present disclosure, a text classification method is provided, comprising: classifying a plurality of annotated corpus entries using a text classification model to obtain a model classification label for each annotated corpus entry; selecting, as sample corpus entries, the annotated corpus entries whose model classification label is inconsistent with the corresponding annotated classification label; converting the text in each sample corpus entry into a word list; grouping the word combinations extracted from each word list by model classification label to obtain the word combinations under each model classification label; and generating classification adjustment templates from the word combinations, wherein a classification adjustment template comprises an original classification label, template content, and an adjusted classification label, the template content comprises the word combination, the original classification label is the model classification label corresponding to the word combination, and the adjusted classification label is the annotated label of the sample corpus entry corresponding to the word combination.
In some embodiments, the text classification method further comprises: deleting word combinations that appear under multiple model classification labels at the same time.
In some embodiments, the text classification method further comprises: deleting word combinations whose number of occurrences in the sample corpus entries is less than a threshold.
In some embodiments, a word combination that occurs multiple times in one sample corpus entry is counted only once.
In some embodiments, the classification adjustment template further comprises a priority, which reflects the likelihood that the adjusted classification label is correct.
In some embodiments, the priority is expressed as $b/a$, where a and b respectively denote the number of occurrences of the word combination in the template content in the sample corpus entries under the original classification label and under the adjusted classification label.
In some embodiments, the priority is expressed by a formula involving c, where c denotes the total number of sample corpus entries under the original classification label of the classification adjustment template.
In some embodiments, the text classification method further comprises: classifying a text to be classified using the text classification model to obtain the model classification label of the text to be classified; converting the text to be classified into a word list; taking as matching results the classification adjustment templates that meet the following conditions: the model classification label of the text to be classified is consistent with the original classification label of the classification adjustment template, and at least one word combination extracted from the word list of the text to be classified is included in the template content of the classification adjustment template; when at least one matching result exists and the priority of the highest-priority matching result is greater than or equal to a priority threshold, determining the highest-priority matching result as the matching classification adjustment template; and modifying the model classification label of the text to be classified to the adjusted classification label of the matching classification adjustment template, as the classification result.
In some embodiments, text is converted into a word list by word segmentation and stop-word removal.
In some embodiments, the order of the words in the word list is the same as in the corresponding text.
According to other embodiments of the present disclosure, a text classification device is provided, comprising: a classification unit configured to classify a plurality of annotated corpus entries using a text classification model to obtain a model classification label for each annotated corpus entry; a selection unit configured to select, as sample corpus entries, the annotated corpus entries whose model classification label is inconsistent with the corresponding annotated classification label; a conversion unit configured to convert the text in each sample corpus entry into a word list; a grouping unit configured to group the word combinations extracted from each word list by model classification label to obtain the word combinations under each model classification label; and a generation unit configured to generate classification adjustment templates from the word combinations, wherein a classification adjustment template comprises an original classification label, template content, and an adjusted classification label, the template content comprises the word combination, the original classification label is the model classification label corresponding to the word combination, and the adjusted classification label is the annotated label of the sample corpus entry corresponding to the word combination.
In some embodiments, the text classification device further comprises: a deletion unit configured to delete word combinations that appear under multiple model classification labels at the same time, and to delete word combinations whose number of occurrences in the sample corpus entries is less than a threshold.
In some embodiments, the text classification device further comprises: a matching unit configured to take as matching results the classification adjustment templates that meet the following conditions: the model classification label of the text to be classified is consistent with the original classification label of the classification adjustment template, and at least one word combination extracted from the word list of the text to be classified is included in the template content of the classification adjustment template; a determination unit configured to determine the highest-priority matching result as the matching classification adjustment template when at least one matching result exists and the priority of the highest-priority matching result is greater than or equal to a priority threshold; and an adjustment unit configured to modify the model classification label of the text to be classified to the adjusted classification label of the matching classification adjustment template, as the classification result.
According to still other embodiments of the present disclosure, a text classification device is provided, comprising: a memory and a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the text classification method described in any of the above embodiments.
According to further embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, the program implementing the text classification method described in any of the above embodiments when executed by a processor.
In the above embodiments, the classification results of the text classification model are post-processed and classification adjustment templates are generated, thereby improving the accuracy of text classification. Generating classification adjustment templates in this way does not affect the model training process or external callers, and is adaptable to different model training approaches.
Brief description of the drawings
The accompanying drawings, which constitute a part of the specification, illustrate embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
The present disclosure can be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:
Fig. 1 shows a flowchart of some embodiments of the text classification method according to the present disclosure;
Fig. 2 shows a flowchart of other embodiments of the text classification method according to the present disclosure;
Fig. 3 shows a flowchart of still other embodiments of the text classification method according to the present disclosure;
Fig. 4 shows a block diagram of some embodiments of the text classification device according to the present disclosure;
Fig. 5 shows a block diagram of other embodiments of the text classification device according to the present disclosure;
Fig. 6 is a block diagram showing a computer system for implementing some embodiments of the present disclosure.
Detailed description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.
At the same time, it should be understood that, for ease of description, the sizes of the parts shown in the drawings are not drawn to actual scale.
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present disclosure or its application or use.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be considered part of the specification.
In all examples shown and discussed here, any specific value should be interpreted as merely illustrative and not as a limitation. Other examples of the exemplary embodiments may therefore have different values.
It should be noted that similar reference numerals and letters denote similar items in the following drawings, so once an item is defined in one drawing, it need not be discussed further in subsequent drawings.
Fig. 1 shows a flowchart of some embodiments of the text classification method according to the present disclosure. As shown in Fig. 1, the text classification method includes steps S1-S5.
In step S1, a plurality of annotated corpus entries are classified using a text classification model to obtain the model classification label of each annotated corpus entry.
In some embodiments, the text classification model is trained based on a deep learning neural network. The annotated corpus entries may be obtained from the corpus used to train the text classification model. An annotated corpus entry may include fields such as the text and the annotated classification label.
In step S2, the annotated corpus entries whose model classification label is inconsistent with the corresponding annotated classification label are selected as sample corpus entries.
In some embodiments, the annotated corpus entries whose model classification label is inconsistent with the corresponding annotated classification label can be selected by field matching. When training the text classification model, the model classification label of an annotated corpus entry may also be compared with the corresponding annotated classification label, and the comparison result used to adjust the text classification model so that the proportion of entries on which the two labels agree increases. However, limited by deep learning technology and the scale of the corpus, this proportion cannot reach 100%, so the classification accuracy of the text classification model cannot reach the desired value. On the basis of a fully trained text classification model, the present disclosure further processes the annotated corpus entries on which the two labels disagree and generates classification adjustment templates, with which the classification results of the text classification model are adjusted to further improve the accuracy of text classification.
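Steps S1 and S2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosure's implementation: the `LabeledText` record and its field names are hypothetical, and the model labels are supplied by hand rather than produced by a trained classifier.

```python
from dataclasses import dataclass


@dataclass
class LabeledText:
    # Hypothetical record layout; the disclosure only says an entry has
    # fields such as the text and the annotated classification label.
    text: str
    annotated_label: str  # label assigned during annotation
    model_label: str      # label predicted by the text classification model (step S1)


def select_samples(corpus):
    """Step S2: keep entries whose model label differs from the annotated label."""
    return [e for e in corpus if e.model_label != e.annotated_label]


corpus = [
    LabeledText("Beijing temperature is very high", "weather", "weather"),
    LabeledText("Temperature is very high when playing games", "mobile phone", "weather"),
    LabeledText("The sun's surface temperature is very high", "universe", "weather"),
]
samples = select_samples(corpus)
print([s.annotated_label for s in samples])  # ['mobile phone', 'universe']
```

Only the entries on which the model and the annotation disagree move on to template generation.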
In step S3, the text in each sample corpus entry is converted into a word list.
In some embodiments, text is converted into a word list by word segmentation and stop-word removal. Stop words are, for example, words that have no influence on the semantics of the text.
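The conversion in step S3 might look like the sketch below. It assumes English text split on whitespace and a tiny hand-written stop-word set; both are placeholders, since the disclosure does not name a segmenter or a stop-word list, and `to_word_list` is a hypothetical helper name.

```python
# Hypothetical stop-word set, for illustration only.
STOP_WORDS = {"the", "is", "a"}


def to_word_list(text, segment=str.split, stop_words=STOP_WORDS):
    """Segment the text, then drop stop words while keeping the original order."""
    return [w for w in segment(text.lower()) if w not in stop_words]


words = to_word_list("The temperature is very high")
print(words)  # ['temperature', 'very', 'high']
```

A real segmenter for the target language can be passed in through the `segment` parameter without changing the rest of the pipeline.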
Each sample corpus entry corresponds to one word list. In the word list, the order of the words is the same as in the corresponding text. For example, for a sample corpus entry whose annotated label is "weather" and whose text is, glossed word for word, "can wear how many wear how many", the word list obtained after segmentation is "can", "wear", "how many", "wear", "how many". As can be seen, a word list may contain identical words at different positions, such as "wear" and "how many" in this example.
In step S4, the word combinations extracted from each word list are grouped by model classification label to obtain the word combinations under each model classification label.
In some embodiments, word combinations are extracted from the word list of each sample corpus entry while keeping the word order unchanged, that is, the order of the words within a word combination stays the same as in the list. The length of the extracted word combinations can be chosen as needed, for example from 1 to 3. Taking again the sample corpus entry whose annotated label is "weather" and whose text is "can wear how many wear how many", and assuming that only word combinations of length 2 are taken, the extracted word combinations are "can wear", "wear how many", "how many wear", and "wear how many".
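The extraction in step S4 can be sketched as below, assuming that a word combination is a contiguous run of words. The text only states that word order is preserved, so contiguity is an assumption that happens to agree with the length-2 example; `extract_combinations` is a hypothetical name.

```python
def extract_combinations(words, min_len=1, max_len=3):
    """Return every contiguous run of min_len..max_len words, in list order."""
    combos = []
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            combos.append(tuple(words[i:i + n]))
    return combos


words = ["can", "wear", "how many", "wear", "how many"]
pairs = extract_combinations(words, min_len=2, max_len=2)
print(pairs)
# [('can', 'wear'), ('wear', 'how many'), ('how many', 'wear'), ('wear', 'how many')]
```

Duplicates such as the second "wear how many" are kept at this stage; whether they are counted more than once is decided later when frequencies are tallied.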
In step S5, classification adjustment templates are generated from the word combinations.
A classification adjustment template is a data set that specifies how a model classification result meeting certain conditions is to be adjusted to another class. A classification adjustment template includes an original classification label, template content, and an adjusted classification label. The template content includes the word combination, the original classification label is the model classification label corresponding to the word combination, and the adjusted classification label is the annotated label of the sample corpus entry corresponding to the word combination.
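One possible in-memory shape for such a template is sketched below. The class and field names are hypothetical; only the three fields themselves come from the description.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AdjustmentTemplate:
    original_label: str       # model classification label of the word combination
    content: Tuple[str, ...]  # the word combination, stored here as a tuple of words
    adjusted_label: str       # annotated label of the corresponding sample corpus entry


template = AdjustmentTemplate(
    original_label="weather",
    content=("temperature", "very high"),
    adjusted_label="mobile phone",
)
```

Making the record immutable and hashable allows templates to be deduplicated or used as dictionary keys, which is a design convenience rather than a requirement of the method.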
The generated classification adjustment templates are described below with reference to Tables 1-3. Table 1 shows a set of sample corpus entries with their corresponding model classification labels and annotated classification labels. As shown in Table 1, two classification adjustment templates can be generated for the word combination "temperature is very high". The template content of both includes "temperature is very high" and the original classification label of both is "weather", but the adjusted classification label of the 1st template is "mobile phone" and that of the 2nd template is "universe".
Model classification label | Annotated classification label | Sample corpus |
Weather | Weather | Beijing temperature is very high |
Weather | Weather | How is the weather tomorrow |
Weather | Mobile phone | Temperature is very high when playing games |
Weather | Mobile phone | Charging temperature is very high, it feels hot to the touch |
Weather | Universe | The sun's surface temperature is very high |
Table 1
A classification adjustment template may also include a priority, which reflects the likelihood that the adjusted classification label is correct. In some embodiments, the priority can be expressed as $b/a$, where a and b respectively denote the number of occurrences of the word combination in the template content in the sample corpus entries under the original classification label and under the adjusted classification label. According to the data in Table 1, a = 4 and b = 2 for the 1st classification adjustment template, so its priority is 0.5; for the 2nd classification adjustment template, a = 4 and b = 1, so its priority is 0.25.
Table 2 and Table 3 show examples of the generated 1st and 2nd classification adjustment templates, respectively.
Field name | Field value |
Original classification label | Weather |
Template content | Temperature is very high |
Adjusted classification label | Mobile phone |
Priority | 0.5 |
Table 2
Field name | Field value |
Original classification label | Weather |
Template content | Temperature is very high |
Adjusted classification label | Universe |
Priority | 0.25 |
Table 3
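The two templates in Tables 2 and 3 can be reproduced from the Table 1 data with the sketch below. It assumes the priority is b/a as in the worked figures (a = 4, b = 2 and b = 1), and it uses plain substring matching on English text instead of word lists to keep the example short; `build_templates` and the dictionary keys are hypothetical.

```python
from collections import Counter

# Rows of Table 1: (model label, annotated label, text).
rows = [
    ("weather", "weather", "Beijing temperature is very high"),
    ("weather", "weather", "How is the weather tomorrow"),
    ("weather", "mobile phone", "Temperature is very high when playing games"),
    ("weather", "mobile phone", "Charging temperature is very high, it feels hot to the touch"),
    ("weather", "universe", "The sun's surface temperature is very high"),
]


def build_templates(rows, phrase, original_label):
    # a: entries under the original (model) label that contain the phrase.
    hits = [r for r in rows if r[0] == original_label and phrase in r[2].lower()]
    a = len(hits)
    # b: of those, the entries carrying each differing annotated label.
    by_annotation = Counter(r[1] for r in hits if r[1] != original_label)
    return [
        {"original_label": original_label, "content": phrase,
         "adjusted_label": label, "priority": b / a}
        for label, b in by_annotation.items()
    ]


templates = build_templates(rows, "temperature is very high", "weather")
for t in templates:
    print(t["adjusted_label"], t["priority"])
# mobile phone 0.5
# universe 0.25
```

The first row of Table 1 contributes to a but produces no template, because its annotated label already equals the model label.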
In other embodiments, the priority can also be expressed by a formula involving c, where c denotes the total number of sample corpus entries under the original classification label of the classification adjustment template. According to the data in Table 1, c = 5, so the priority of the 1st classification adjustment template becomes about 0.45 and that of the 2nd classification adjustment template becomes about 0.23.
As needed, templates can also be added manually in the format of the classification adjustment template, as a supplement to the automatically generated classification adjustment templates. The classification adjustment templates generated in the above embodiments can also be modified or deleted manually to improve their effect. Classification adjustment templates with manual involvement have a higher priority, which can be set to 1.
Fig. 2 shows a flowchart of other embodiments of the text classification method according to the present disclosure. Fig. 2 differs from Fig. 1 in that, after the word combinations under each model classification label are obtained by grouping, the text classification method further includes steps S41-S42.
In step S41, word combinations that appear under multiple model classification labels at the same time are deleted.
The same word combination may occur in different sample corpus entries. When the same word combination appears under different model classification labels, it has little influence on the classification result of the text. For example, words such as "I" may occur in many sample corpus entries but have little effect on the semantics of the text, so such word combinations can be deleted. That is, no classification adjustment template is generated for such word combinations. This reduces the workload of generating classification adjustment templates without noticeably affecting the improvement in classification accuracy.
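Step S41 can be sketched as below, assuming the grouped word combinations are held in a dictionary keyed by model classification label. The layout, the label "music", and the name `drop_shared` are hypothetical.

```python
def drop_shared(combos_by_label):
    """Remove word combinations that occur under more than one model label."""
    label_count = {}
    for combos in combos_by_label.values():
        for combo in set(combos):  # count each label at most once per combination
            label_count[combo] = label_count.get(combo, 0) + 1
    return {
        label: [c for c in combos if label_count[c] == 1]
        for label, combos in combos_by_label.items()
    }


combos_by_label = {
    "weather": [("i",), ("temperature", "very high")],
    "music": [("i",), ("play", "song")],
}
filtered = drop_shared(combos_by_label)
print(filtered)
# {'weather': [('temperature', 'very high')], 'music': [('play', 'song')]}
```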
In step S42, word combinations whose number of occurrences in the sample corpus entries is less than a threshold are deleted.
In some embodiments, when counting the occurrences of a word combination in the sample corpus entries, a word combination that occurs multiple times in one sample corpus entry is counted only once. For example, for the aforementioned sample corpus entry whose annotated label is "weather" and whose text is "can wear how many wear how many", the counts for the extracted word combinations "can wear", "wear how many", "how many wear", and "wear how many" are: "can wear" = 1, "wear how many" = 1, "how many wear" = 1.
A word combination that occurs fewer times than the threshold (for example, 5 times) across all sample corpus entries has too low a frequency to be meaningful for classification adjustment. For example, if the word combination "can wear" occurs in only 4 of all the sample corpus entries and the threshold is set to 5, this word combination can be deleted. That is, no classification adjustment template is generated for such word combinations. This also reduces the workload of generating classification adjustment templates without noticeably affecting the improvement in classification accuracy.
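The counting rule and the threshold filter of step S42 can be sketched as follows. The function names are hypothetical, and the threshold of 5 is the example value from the text.

```python
from collections import Counter


def count_once_per_sample(combos_per_sample):
    """Count in how many sample corpus entries each word combination occurs."""
    counts = Counter()
    for combos in combos_per_sample:
        counts.update(set(combos))  # repeats within one entry count once
    return counts


def drop_rare(counts, threshold=5):
    """Keep only word combinations that occur at least `threshold` times."""
    return {combo: n for combo, n in counts.items() if n >= threshold}


sample = [("can", "wear"), ("wear", "how many"), ("how many", "wear"), ("wear", "how many")]
counts = count_once_per_sample([sample])
print(counts[("wear", "how many")])  # 1
print(drop_rare(counts))             # {}
```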
It should be understood that only one of steps S41 and S42 may be executed. Moreover, step S42 may be executed before step S41, or simultaneously with step S41. That is, the execution order of steps S41 and S42 does not affect the implementation of the text classification scheme of the present disclosure.
In the above embodiments, the classification results of the text classification model are post-processed and classification adjustment templates are generated, thereby improving the accuracy of text classification. Generating classification adjustment templates in this way does not affect the model training process or external callers, and is adaptable to different model training approaches.
Fig. 3 shows a flowchart of still other embodiments of the text classification method according to the present disclosure. As shown in Fig. 3, the text classification method further includes steps S6-S10.
In step S6, the text to be classified is classified using the text classification model to obtain the model classification label of the text to be classified.
The text to be classified is classified with the same text classification model that was used to classify the annotated corpus entries, yielding a preliminary classification result. The preliminary classification result includes fields such as the text and the model classification label.
The application of classification adjustment templates is explained below using the text to be classified "charging temperature is very high, easy to explode" as an example. In step S6, for example, the model classification label obtained for this text is "weather".
In step S7, the text to be classified is converted into a word list. As in step S3, the text can be converted into a word list by word segmentation and stop-word removal. For example, the text to be classified "charging temperature is very high, easy to explode" can be converted into the word list "charging", "temperature", "very high", "easy", "explosion".
In step S8, the classification adjustment templates that meet the following conditions are taken as matching results: the model classification label of the text to be classified is consistent with the original classification label of the classification adjustment template, and at least one word combination extracted from the word list of the text to be classified is included in the template content of the classification adjustment template.
Word combinations are extracted in the same way as in step S4. For example, word combinations such as "charging temperature", "temperature is very high", "very high easy", and "easy explosion" can be extracted from the word list of the text to be classified.
As described above, the template content of both the 1st and the 2nd classification adjustment template includes the word combination "temperature is very high", and the original classification label of both is "weather". Both classification adjustment templates therefore meet the conditions and can serve as matching results. Assuming that no matching classification adjustment template is found for the other word combinations, 2 matching results are obtained for the text to be classified in this example.
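The matching in step S8 can be sketched as below. It assumes each template stores a single word combination as a tuple of words and that combinations are contiguous runs of up to three words; the dictionary keys, the third template, and `match_templates` are hypothetical.

```python
def match_templates(model_label, word_list, templates, max_len=3):
    """Return templates whose original label and content both match the text."""
    combos = {
        tuple(word_list[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(word_list) - n + 1)
    }
    return [
        t for t in templates
        if t["original_label"] == model_label and t["content"] in combos
    ]


templates = [
    {"original_label": "weather", "content": ("temperature", "very high"),
     "adjusted_label": "mobile phone", "priority": 0.5},
    {"original_label": "weather", "content": ("temperature", "very high"),
     "adjusted_label": "universe", "priority": 0.25},
    # Hypothetical template under another original label; it must not match.
    {"original_label": "music", "content": ("temperature", "very high"),
     "adjusted_label": "weather", "priority": 0.9},
]
words = ["charging", "temperature", "very high", "easy", "explosion"]
matches = match_templates("weather", words, templates)
print([m["adjusted_label"] for m in matches])  # ['mobile phone', 'universe']
```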
In step S9, the matching classification adjustment template is determined from the matching results.
When at least one matching result exists, the matching result with the highest priority is selected. In some embodiments, a priority threshold can be set, and the highest-priority matching result is determined as the matching classification adjustment template only when its priority is greater than or equal to the threshold. The priority threshold can be set according to the practical application.
According to Table 2 and Table 3, the priorities of the 1st and 2nd classification adjustment templates, which are the matching results, are 0.5 and 0.25 respectively. When the priority threshold is set to 0.5, the priority of the 1st classification adjustment template meets the condition, so the highest-priority matching result, i.e., the 1st classification adjustment template, is taken as the matching classification adjustment template.
Conversely, if the priority threshold is set above 0.5, for example to 0.6, none of the matching results in the above example meets the condition. It is then determined that no matching classification adjustment template exists, no classification adjustment is performed, and the model classification label obtained in step S6 is output directly as the classification result.
When no matching result exists, it is likewise determined that no matching classification adjustment template exists, no classification adjustment is performed, and the model classification label obtained in step S6 is output directly as the classification result.
In step S10, the model classification label of the text to be classified is modified to the adjusted classification label of the matching classification adjustment template, as the classification result.
According to Table 2, the adjusted classification label of the 1st classification adjustment template, which is the matching classification adjustment template, is "mobile phone", so "mobile phone" can be used as the classification result of the text "charging temperature is very high, easy to explode".
In the above embodiments, introducing classification adjustment templates after the text classification model to adjust the classification results can further improve the accuracy of text classification and also allows classification errors to be corrected in a targeted manner.
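The selection and adjustment in steps S9 and S10 can be sketched as follows, using the priorities from Tables 2 and 3. The function name and dictionary keys are hypothetical, and ties between equal priorities are resolved by list order, a detail the description does not specify.

```python
def adjust_label(model_label, matches, priority_threshold=0.5):
    """Return the final label: adjusted if a template qualifies, else the model label."""
    if not matches:
        return model_label  # no matching result: keep the model label
    best = max(matches, key=lambda t: t["priority"])
    if best["priority"] >= priority_threshold:
        return best["adjusted_label"]  # step S10: apply the adjustment
    return model_label  # highest priority is below the threshold


matches = [
    {"adjusted_label": "mobile phone", "priority": 0.5},
    {"adjusted_label": "universe", "priority": 0.25},
]
print(adjust_label("weather", matches, 0.5))  # mobile phone
print(adjust_label("weather", matches, 0.6))  # weather
print(adjust_label("weather", [], 0.5))       # weather
```

The three calls correspond to the three cases walked through above: a qualifying template, a threshold set too high, and no matching result at all.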
Fig. 4 shows the block diagram of some embodiments of the document sorting apparatus according to the disclosure.
As shown in figure 4, document sorting apparatus 4 includes taxon 41, selecting unit 42, conversion unit 43, sorts out unit
44 and generation unit 45.
Taxon 41 is configured as classifying to text using textual classification model.In some embodiments, classify
Unit 41 is configured as classifying to multiple mark corpus using textual classification model, obtains the model point of each mark corpus
Class label, such as execute step S1.In further embodiments, taxon 41 can be additionally configured to utilize text classification mould
Type treats classifying text and classifies, and obtains the category of model label of text to be sorted, such as execute step S6.
Selecting unit 42 is configured as the preference pattern tag along sort mark language inconsistent with corresponding mark tag along sort
Material as sample corpus, such as executes step S2.
Conversion unit 43 is configured as text being separately converted to word list.In some embodiments, conversion unit 43
It is configured as the text in each sample corpus being separately converted to word list, such as executes step S3.In other implementations
In example, conversion unit 43 is configured as converting text to be sorted to word list, such as executes step S7.
Sort out unit 44 be configured as according to category of model label to the word combination extracted from each word list into
Row is sorted out, and obtains the word combination under each category of model label, such as execute step S4.
Generation unit 45 is configured as generating classification adjustment template according to word combination, such as executes step S5.Such as preceding institute
It states, the classification adjustment template of generation includes that original classification label, template content and adjustment tag along sort, the template content include
The word combination, the original classification label are the corresponding category of model label of the word combination, and the adjustment tag along sort is
The mark label of the corresponding sample corpus of the word combination.
In some embodiments, document sorting apparatus 4 further includes deleting unit 46.In some embodiments, unit is deleted
46 are configured as deleting while appearing in the word combination under multiple category of model labels, such as execute step S41.At other
In embodiment, deletion unit 46 is configured as deleting the frequency of occurrence in sample corpus and is less than the word combination of threshold value, such as holds
Row step S42.Utilize deletion unit, it is possible to reduce generate the workload of classification adjustment template, and not appreciably affect classification accurately
The promotion of rate.
In further embodiments, document sorting apparatus 4 further includes matching unit 47, determination unit 48 and adjustment unit
49。
Matching unit 47 is configured as to meet the classification adjustment template of following conditions as matching result: text to be sorted
Category of model label and the classification adjustment original classification label of template it is consistent, and taken out from the word list of text to be sorted
At least one word combination taken is included in the template content of classification adjustment template.For example, matching unit 47 can execute
Step S8.
The determination unit 48 is configured to determine the matching classification adjustment template from the matching results, for example by executing step S9. In some embodiments, when at least one matching result exists and the priority of the highest-priority matching result is greater than or equal to a priority threshold, the highest-priority matching result is determined to be the matching classification adjustment template.
The adjustment unit 49 is configured to change the model classification label of the text to be classified to the adjustment classification label of the matching classification adjustment template and to use it as the classification result, for example by executing step S10.
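As a non-limiting illustration, the cooperation of the matching unit, the determination unit, and the adjustment unit may be sketched as follows. The names `Template` and `adjust_label`, the default threshold of 0.5, and the use of ordered word pairs as word combinations are assumptions made only for this sketch. The priority is treated as a precomputed number.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Template:
    """Hypothetical classification adjustment template with a priority."""
    original_label: str
    content: frozenset    # word combinations held in the template content
    adjusted_label: str
    priority: float       # assumed to be computed beforehand


def adjust_label(words, model_label, templates, priority_threshold=0.5, size=2):
    """Return the final label for a text given its word list and model label."""
    combos = set(combinations(words, size))  # assumed extraction scheme
    # Matching: same original label and at least one shared word combination.
    matches = [t for t in templates
               if t.original_label == model_label and combos & t.content]
    if not matches:
        return model_label
    # Determination: take the highest-priority match if it reaches the threshold.
    best = max(matches, key=lambda t: t.priority)
    if best.priority >= priority_threshold:
        return best.adjusted_label           # adjustment of the model label
    return model_label
```

When no template matches, or the best match falls below the threshold, the sketch keeps the model classification label unchanged.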
Fig. 5 shows a block diagram of other embodiments of the text classification apparatus according to the present disclosure.
As shown in Fig. 5, the apparatus 5 of this embodiment includes a memory 51 and a processor 52 coupled to the memory 51. The memory 51 stores instructions for executing the corresponding embodiments of the text classification method. The processor 52 is configured to execute the text classification method of any embodiment of the present disclosure based on the instructions stored in the memory 51.
In the above embodiments, the classification results are adjusted using the classification adjustment templates of the text classification apparatus, which can improve the accuracy of text classification.
In addition to the text classification method and apparatus, embodiments of the present disclosure may also take the form of a computer program product implemented on one or more non-volatile storage media containing computer program instructions. Accordingly, embodiments of the present disclosure also include a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the text classification method of any of the foregoing embodiments.
Fig. 6 is a block diagram showing a computer system for implementing some embodiments of the present disclosure.
As shown in Fig. 6, the computer system 60 may take the form of a general-purpose computing device. The computer system 60 includes a memory 610, a processor 620, and a bus 600 connecting the different system components.
The memory 610 may include, for example, system memory and a non-volatile storage medium. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs. The system memory may include volatile storage media such as random access memory (RAM) and/or cache memory. The non-volatile storage medium stores, for example, instructions for executing the corresponding embodiments of the text classification method. Non-volatile storage media include, but are not limited to, magnetic disk storage, optical storage, and flash memory.
The processor 620 may be implemented as a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, or discrete hardware components such as discrete gates or transistors. Accordingly, each module, such as a judgment module and a determination module, may be implemented by a central processing unit (CPU) running instructions in memory that execute the corresponding steps, or by a dedicated circuit that executes the corresponding steps.
The bus 600 may use any of a variety of bus structures. For example, bus structures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, and a Peripheral Component Interconnect (PCI) bus.
The computer system 60 may also include an input/output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650, the memory 610, and the processor 620 may be connected to one another by the bus 600. The input/output interface 630 may provide a connection interface for input/output devices such as a display, a mouse, and a keyboard. The network interface 640 provides a connection interface for various networked devices. The storage interface 650 provides a connection interface for external storage devices such as floppy disks, USB flash drives, and SD cards.
Various aspects of the present disclosure are described here with reference to flowcharts and/or block diagrams of methods, apparatuses, and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable text classification apparatus to produce a machine, such that the instructions, when executed by the processor, create means for implementing the functions specified in one or more blocks of the flowcharts and/or block diagrams.
These computer-readable program instructions may also be stored in a computer-readable memory. They cause a computer to operate in a particular manner, thereby producing an article of manufacture that includes instructions implementing the functions specified in one or more blocks of the flowcharts and/or block diagrams.
The present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
So far, some embodiments of the present disclosure have been described in detail by way of example. It should be understood that the above examples are for illustration only and are not intended to limit the scope of the present disclosure. Those skilled in the art may change, modify, replace, vary, or combine the above embodiments without departing from the scope of the present disclosure.
Claims (15)
1. A text classification method, comprising:
classifying a plurality of annotated corpora using a text classification model to obtain a model classification label for each annotated corpus;
selecting, as sample corpora, the annotated corpora whose model classification label is inconsistent with the corresponding annotated classification label;
converting the text in each sample corpus into a word list;
grouping the word combinations extracted from each word list by model classification label to obtain the word combinations under each model classification label; and
generating classification adjustment templates from the word combinations, wherein each classification adjustment template includes an original classification label, template content, and an adjustment classification label, the template content includes the word combination, the original classification label is the model classification label corresponding to the word combination, and the adjustment classification label is the annotated label of the sample corpus corresponding to the word combination.
2. The text classification method according to claim 1, further comprising: deleting word combinations that appear under multiple model classification labels at the same time.
3. The text classification method according to claim 1, further comprising: deleting word combinations whose number of occurrences in the sample corpora is less than a threshold.
4. The text classification method according to claim 3, wherein a word combination that occurs multiple times in one sample corpus is counted only once.
5. The text classification method according to claim 1, wherein the classification adjustment template further includes a priority, and the priority reflects the likelihood that the adjustment classification label is correct.
6. The text classification method according to claim 5, wherein the priority is expressed as [formula not reproduced in the source text], where a and b respectively denote the number of occurrences of the word combination in the template content in the sample corpora under the original classification label and under the adjustment classification label.
7. The text classification method according to claim 6, wherein the priority is expressed as [formula not reproduced in the source text], where c denotes the total number of sample corpora under the original classification label of the classification adjustment template.
8. The text classification method according to claim 5, further comprising:
classifying a text to be classified using the text classification model to obtain the model classification label of the text to be classified;
converting the text to be classified into a word list;
taking, as matching results, the classification adjustment templates that satisfy the following conditions: the model classification label of the text to be classified is consistent with the original classification label of the classification adjustment template, and at least one word combination extracted from the word list of the text to be classified is included in the template content of the classification adjustment template;
when at least one matching result exists and the priority of the highest-priority matching result is greater than or equal to a priority threshold, determining the highest-priority matching result to be the matching classification adjustment template; and
changing the model classification label of the text to be classified to the adjustment classification label of the matching classification adjustment template, as the classification result.
9. The text classification method according to any one of claims 1 to 8, wherein a text is converted into a word list by performing word segmentation and stop-word removal on the text.
10. The text classification method according to any one of claims 1 to 8, wherein the order of the words in the word list is the same as in the corresponding text.
11. A text classification apparatus, comprising:
a classification unit configured to classify a plurality of annotated corpora using a text classification model to obtain a model classification label for each annotated corpus;
a selection unit configured to select, as sample corpora, the annotated corpora whose model classification label is inconsistent with the corresponding annotated classification label;
a conversion unit configured to convert the text in each sample corpus into a word list;
a grouping unit configured to group the word combinations extracted from each word list by model classification label to obtain the word combinations under each model classification label; and
a generation unit configured to generate classification adjustment templates from the word combinations, wherein each classification adjustment template includes an original classification label, template content, and an adjustment classification label, the template content includes the word combination, the original classification label is the model classification label corresponding to the word combination, and the adjustment classification label is the annotated label of the sample corpus corresponding to the word combination.
12. The text classification apparatus according to claim 11, further comprising:
a deletion unit configured to delete word combinations that appear under multiple model classification labels at the same time, or to delete word combinations whose number of occurrences in the sample corpora is less than a threshold.
13. The text classification apparatus according to claim 11, further comprising:
a matching unit configured to take, as matching results, the classification adjustment templates that satisfy the following conditions: the model classification label of the text to be classified is consistent with the original classification label of the classification adjustment template, and at least one word combination extracted from the word list of the text to be classified is included in the template content of the classification adjustment template;
a determination unit configured to, when at least one matching result exists and the priority of the highest-priority matching result is greater than or equal to a priority threshold, determine the highest-priority matching result to be the matching classification adjustment template; and
an adjustment unit configured to change the model classification label of the text to be classified to the adjustment classification label of the matching classification adjustment template, as the classification result.
14. A text classification apparatus, comprising:
a memory; and
a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the text classification method according to any one of claims 1 to 10.
15. A computer-readable storage medium storing a computer program that, when executed by a processor, implements the text classification method according to any one of claims 1 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811035883.4A CN109189932B (en) | 2018-09-06 | 2018-09-06 | Text classification method and device and computer-readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811035883.4A CN109189932B (en) | 2018-09-06 | 2018-09-06 | Text classification method and device and computer-readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109189932A (en) | 2019-01-11 |
CN109189932B (en) | 2021-02-26 |
Family ID: 64914969
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811035883.4A Active CN109189932B (en) | 2018-09-06 | 2018-09-06 | Text classification method and device and computer-readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109189932B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110674263A (en) * | 2019-12-04 | 2020-01-10 | 广联达科技股份有限公司 | Method and device for automatically classifying model component files |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130138641A1 (en) * | 2009-12-30 | 2013-05-30 | Google Inc. | Construction of text classifiers |
CN104182423A (en) * | 2013-05-27 | 2014-12-03 | 华东师范大学 | Conditional random field-based automatic Chinese personal name recognition method |
US20140358539A1 (en) * | 2013-05-29 | 2014-12-04 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for building a language model |
CN105955975A (en) * | 2016-04-15 | 2016-09-21 | 北京大学 | Knowledge recommendation method for academic literature |
US20160283583A1 (en) * | 2014-03-14 | 2016-09-29 | Tencent Technology (Shenzhen) Company Limited | Method, apparatus, and storage medium for text information processing |
CN106951472A (en) * | 2017-03-06 | 2017-07-14 | 华侨大学 | A kind of multiple sensibility classification method of network text |
CN107291775A (en) * | 2016-04-11 | 2017-10-24 | 北京京东尚科信息技术有限公司 | The reparation language material generation method and device of error sample |
CN107894980A (en) * | 2017-12-06 | 2018-04-10 | 陈件 | A kind of multiple statement is to corpus of text sorting technique and grader |
CN108108355A (en) * | 2017-12-25 | 2018-06-01 | 北京牡丹电子集团有限责任公司数字电视技术中心 | Text emotion analysis method and system based on deep learning |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110674263A (en) * | 2019-12-04 | 2020-01-10 | 广联达科技股份有限公司 | Method and device for automatically classifying model component files |
CN110674263B (en) * | 2019-12-04 | 2022-02-08 | 广联达科技股份有限公司 | Method and device for automatically classifying model component files |
Also Published As
Publication number | Publication date |
---|---|
CN109189932B (en) | 2021-02-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108804512B (en) | Text classification model generation device and method and computer readable storage medium | |
US11138250B2 (en) | Method and device for extracting core word of commodity short text | |
CN106055538B (en) | The automatic abstracting method of the text label that topic model and semantic analysis combine | |
CN106445919A (en) | Sentiment classifying method and device | |
CN103294817A (en) | Text feature extraction method based on categorical distribution probability | |
CN111160452A (en) | Multi-modal network rumor detection method based on pre-training language model | |
CN104142912A (en) | Accurate corpus category marking method and device | |
CN109598307B (en) | Data screening method and device, server and storage medium | |
CN109241297B (en) | Content classification and aggregation method, electronic equipment, storage medium and engine | |
CN108090099B (en) | Text processing method and device | |
CN105205124A (en) | Semi-supervised text sentiment classification method based on random feature subspace | |
CN109684476A (en) | A kind of file classification method, document sorting apparatus and terminal device | |
CN112507704A (en) | Multi-intention recognition method, device, equipment and storage medium | |
CN110110035A (en) | Data processing method and device and computer readable storage medium | |
CN110717040A (en) | Dictionary expansion method and device, electronic equipment and storage medium | |
CN105446955A (en) | Adaptive word segmentation method | |
CN105912525A (en) | Sentiment classification method for semi-supervised learning based on theme characteristics | |
CN103631874A (en) | UGC label classification determining method and device for social platform | |
CN103020167A (en) | Chinese text classification method for computer | |
CN110287311A (en) | File classification method and device, storage medium, computer equipment | |
CN110738046A (en) | Viewpoint extraction method and device | |
CN107818173B (en) | Vector space model-based Chinese false comment filtering method | |
CN109993216A (en) | A kind of file classification method and its equipment based on K arest neighbors KNN | |
CN107704869B (en) | Corpus data sampling method and model training method | |
CN109902157A (en) | A kind of training sample validation checking method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||