CN109145116A - Text classification method and apparatus, electronic device and storage medium - Google Patents
Text classification method and apparatus, electronic device and storage medium
- Publication number: CN109145116A
- Application number: CN201811018629.3A
- Authority: CN (China)
- Prior art keywords: text, vector, sample, training set, sample class
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- Landscapes: Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An embodiment of the invention discloses a text classification method and apparatus, an electronic device and a storage medium. The method comprises: obtaining the text vector corresponding to a text to be classified; calculating the similarity between the text vector and the center vector of each sample class in a training set; smoothing each similarity based on the total number of samples in the corresponding sample class, to obtain, for each sample class in the training set, a membership weight of the text to be classified; determining the maximum of the membership weights; and classifying the text to be classified into the sample class corresponding to that maximum. This technical scheme considerably reduces the amount of computation in the classification process, improves classification efficiency, and solves the problem of classification errors caused by an unbalanced number of samples across the classes of the training set, thereby improving classification accuracy.
Description
Technical field
The embodiments of the present invention relate to the technical field of data processing, and in particular to a text classification method and apparatus, an electronic device and a storage medium.
Background art
With the growth of information on the internet, the speed and accuracy of text classification have become more and more important. Text classification largely solves the problem of network information being disorderly and unsystematic; through text classification, users are able to quickly and accurately locate the information they need. Text classification uses a classification model or function, trained on a large number of labeled training samples, to map a text to be classified to a specified sample class.
At present, a typical text classification algorithm is the KNN (K-Nearest Neighbour) algorithm. Its classification principle is as follows: the class of a text to be classified is judged by querying a training set of texts of known classes. Specifically, the similarity between the feature vector of the text to be classified and the feature vector of each sample in the training set is calculated; the K nearest-neighbour training samples are selected according to the similarities; the classes of these K nearest neighbours are counted; and finally the text to be classified is assigned to the class with the largest count.
Hence, the existing KNN algorithm has the following defects. On the one hand, since the similarity between the feature vector of the text to be classified and the feature vector of every sample in the training set must be calculated, the amount of computation becomes very large when there are many samples in the training set, so the classification efficiency of the KNN algorithm is low. On the other hand, when the samples in the training set are unevenly distributed across classes, for example when one class has a very large number of samples while other classes have very few, it easily happens that, among the classes of the K training samples nearest to the text to be classified, the class with the very large sample count dominates; yet the text to be classified may not actually belong to that dominating class, so the text is classified incorrectly and classification accuracy suffers.
Summary of the invention
The embodiments of the present invention provide a text classification method and apparatus, an electronic device and a storage medium, which reduce the amount of computation, increase classification speed, and improve classification accuracy.
To achieve the above object, the embodiments of the present invention adopt the following technical schemes:
In a first aspect, an embodiment of the invention provides a text classification method, the method comprising:
obtaining the text vector corresponding to a text to be classified;
calculating the similarity between the text vector and the center vector of each sample class in a training set;
smoothing each similarity based on the total number of samples in the corresponding sample class, to obtain the membership weight of the text to be classified for each sample class in the training set;
determining the maximum of the membership weights;
classifying the text to be classified into the sample class corresponding to the maximum.
Further, before the calculating of the similarity between the text vector and the center vector of each sample class in the training set, the method also includes:
calculating the center vector of each sample class in the training set.
Further, the calculating of the center vector of each sample class in the training set comprises:
calculating the center vector of each sample class in the training set according to the following formula:

    O_i = (1 / C_in) * Σ_{n=1}^{C_in} w_n

wherein O_i denotes the center vector of sample class i in the training set, C_in denotes the total number of samples in sample class i, and w_n denotes the n-th sample vector in sample class i.
Further, the calculating of the similarity between the text vector and the center vector of each sample class in the training set comprises:
calculating the similarity between the text vector and the center vector of each sample class in the training set according to the following formula:

    Sim(d_j, O_i) = ( Σ_{k=1}^{V} w_jk * w_ik ) / ( sqrt(Σ_{k=1}^{V} w_jk^2) * sqrt(Σ_{k=1}^{V} w_ik^2) )

wherein Sim(d_j, O_i) denotes the similarity between text vector d_j and the center vector O_i of sample class i in the training set, V is the dimension of the text vector and of the center vector of each sample class, w_ik is the value of center vector O_i on dimension k, w_jk is the value of text vector d_j on dimension k, and j is the identifier of the text vector.
Further, the smoothing of each similarity based on the total number of samples in the corresponding sample class, to obtain the membership weight of the text to be classified for each sample class in the training set, comprises:
smoothing each similarity according to the following formula, to obtain the membership weight of the text to be classified for each class in the training set:

    g(d, C_i) = Sim(d_j, O_i) * (1 / log(C_in))

wherein g(d, C_i) denotes the membership weight of text d to be classified for the i-th sample class C_i in the training set, Sim(d_j, O_i) denotes the similarity between the text vector d_j of text d and the center vector O_i of the i-th sample class C_i in the training set, and C_in is the total number of samples in the i-th sample class C_i in the training set.
Further, the obtaining of the text vector corresponding to the text to be classified comprises:
processing the text to be classified with word2vec to obtain the corresponding text vector.
Further, each sample in the training set is a sample vector obtained through word2vec processing.
In a second aspect, an embodiment of the invention provides a text classification apparatus, the apparatus comprising:
an obtaining module, for obtaining the text vector corresponding to a text to be classified;
a computing module, for calculating the similarity between the text vector and the center vector of each sample class in a training set;
a processing module, for smoothing each similarity based on the total number of samples in the corresponding sample class, to obtain the membership weight of the text to be classified for each sample class in the training set;
a statistics module, for determining the maximum of the membership weights;
a classification module, for classifying the text to be classified into the sample class corresponding to the maximum.
In a third aspect, an embodiment of the invention provides an electronic device, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the text classification method according to the first aspect.
In a fourth aspect, an embodiment of the invention provides a storage medium comprising computer-executable instructions which, when executed by a computer processor, implement the text classification method according to the first aspect.
In the text classification method provided by the embodiments of the present invention, only the similarity between the text vector of the text to be classified and the center vector of each sample class in the training set is calculated, which considerably reduces the amount of computation in the classification process and improves classification efficiency; each similarity is smoothed based on the total number of samples in the corresponding sample class, yielding the membership weight of the text to be classified for each sample class in the training set; finally the text to be classified is classified into the sample class corresponding to the maximum of the membership weights, which solves the problem of classification errors caused by an unbalanced number of samples across the classes of the training set, thereby improving classification accuracy.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from the contents of the embodiments and these drawings without creative effort.
Fig. 1 is a flow diagram of a text classification method provided by Embodiment one of the present invention;
Fig. 2 is a structural schematic diagram of a text classification apparatus provided by Embodiment two of the present invention;
Fig. 3 is a structural schematic diagram of an electronic device provided by Embodiment three of the present invention.
Detailed description of the embodiments
To make the technical problems solved, the technical schemes adopted and the technical effects achieved by the invention clearer, the technical schemes of the embodiments of the invention are described in further detail below in conjunction with the drawings. Obviously, the described embodiments are only a part of the embodiments of the invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative work shall fall within the protection scope of the invention.
Embodiment one
Fig. 1 is a flow diagram of a text classification method provided by Embodiment one of the present invention. The text classification method disclosed in this embodiment is an improvement on the existing KNN classification algorithm, intended to reduce its computational complexity and improve its classification accuracy. The method can be executed by a text classification apparatus, which can be realized in software and/or hardware and is typically integrated in a terminal, such as a server. Referring to Fig. 1, the method includes the following steps:
110. Obtain the text vector corresponding to the text to be classified.
Here, the text vector corresponding to the text to be classified is a digitized representation of that text. Only when the words of natural language are converted into vectors in digital form can a computer identify them and perform classification operations on them. For example, the words of a text can be digitized according to some transformation rule: the text vector corresponding to the word "Hangzhou" might be the six-dimensional vector [0 0 0 0 0 1], and the text vector corresponding to the word "Shanghai" might be the six-dimensional vector [0 1 0 0 0 0].
Specifically, the obtaining of the text vector corresponding to the text to be classified comprises:
processing the text to be classified with word2vec to obtain the corresponding text vector.
Here, word2vec is a tool for computing word vectors; in essence, word2vec applies a shallow neural network to the words in a text, and processing the text to be classified with word2vec converts the words of natural language into word-vector representations. Word vectors obtained with word2vec can measure the similarity between words well.
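As a minimal sketch of this step: the patent does not specify how word vectors are combined into a single text vector, so the sketch below assumes the text vector is the average of the word vectors, and uses a tiny hand-made embedding table in place of a trained word2vec model (in practice a library such as gensim would supply the vectors).

```python
# Hypothetical 4-dimensional word vectors standing in for word2vec output.
WORD_VECTORS = {
    "stock":  [0.9, 0.1, 0.0, 0.2],
    "market": [0.8, 0.2, 0.1, 0.1],
    "goal":   [0.1, 0.9, 0.2, 0.0],
    "match":  [0.0, 0.8, 0.3, 0.1],
}

def text_to_vector(text):
    """Average the word vectors of the known words in the text (assumed combination rule)."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        raise ValueError("no known words in text")
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

print(text_to_vector("stock market"))  # ≈ [0.85, 0.15, 0.05, 0.15]
```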
120. Calculate the similarity between the text vector and the center vector of each sample class in the training set.
Here, each sample in each sample class of the training set is a sample vector obtained through word2vec processing. Each sample class contains a certain number of samples belonging to that class; for example, the sample classes may include news, sport, entertainment and finance, where the sample class "news" contains a certain number of news-related texts, the sample class "sport" contains a certain number of sport-related texts, and so on. By comparing the text to be classified with the samples of known classes in the training set according to certain rules, the text to be classified is finally classified into the corresponding class.
In the traditional classification method, the similarity between the text vector of the text to be classified and every sample vector in the training set must be calculated; when the number of samples in the training set is large, the amount of computation is very large, which significantly reduces classification efficiency. This embodiment improves on this: when classifying a text, only the similarity between its text vector and the center vector of each sample class in the training set needs to be calculated, which significantly reduces the computational complexity.
Further, before the calculating of the similarity between the text vector and the center vector of each sample class in the training set, the method also includes:
calculating the center vector of each sample class in the training set.
Specifically, the center vector of each sample class in the training set is calculated according to the following formula:

    O_i = (1 / C_in) * Σ_{n=1}^{C_in} w_n

wherein O_i denotes the center vector of sample class i in the training set, C_in denotes the total number of samples in sample class i, and w_n denotes the n-th sample vector in sample class i.
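The centroid computation can be sketched as follows (pure Python, no external dependencies; the per-dimension mean is the reconstruction of the formula above, whose image is missing from this text):

```python
def class_center(sample_vectors):
    """Center vector O_i: the per-dimension mean of the class's sample vectors."""
    n = len(sample_vectors)        # C_in, the class's sample count
    dim = len(sample_vectors[0])
    return [sum(v[k] for v in sample_vectors) / n for k in range(dim)]

# Example: three sample vectors of one class
print(class_center([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]))  # ≈ [0.9, 0.1]
```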
On the basis of the center vector of each sample class in the training set, the similarity between the text vector and the center vector of each sample class is calculated according to the following formula:

    Sim(d_j, O_i) = ( Σ_{k=1}^{V} w_jk * w_ik ) / ( sqrt(Σ_{k=1}^{V} w_jk^2) * sqrt(Σ_{k=1}^{V} w_ik^2) )

wherein Sim(d_j, O_i) denotes the similarity between text vector d_j and the center vector O_i of sample class i in the training set, V is the dimension of the text vector and of the center vector of each sample class, w_ik is the value of center vector O_i on dimension k, w_jk is the value of text vector d_j on dimension k, and j is the identifier of the text vector.
130. Smooth each similarity based on the total number of samples in the corresponding sample class, to obtain the membership weight of the text to be classified for each sample class in the training set.
Illustratively, this smoothing comprises:
smoothing each similarity according to the following formula, to obtain the membership weight of the text to be classified for each class in the training set:

    g(d, C_i) = Sim(d_j, O_i) * (1 / log(C_in))

wherein g(d, C_i) denotes the membership weight of text d to be classified for the i-th sample class C_i in the training set, Sim(d_j, O_i) denotes the similarity between the text vector d_j of text d and the center vector O_i of the i-th sample class C_i, and C_in is the total number of samples in the i-th sample class C_i in the training set.
The purpose of smoothing each similarity is to reduce the probability of classification errors caused by an unbalanced number of samples. If some sample class in the training set has a great many samples while the other sample classes have very few, it easily happens that the similarity between the text vector to be classified and the center vector of the sample-rich class is the largest, in which case the text would be classified into that sample-rich class; however, in many cases the text to be classified does not belong to it, and a classification error occurs. To reduce the probability of such errors, this embodiment multiplies the similarity between the text vector to be classified and the center vector of each sample class by the factor 1/log(C_in). From the characteristics of the log function, the larger the total sample number C_in of a sample class, the smaller the value of 1/log(C_in), while its variation remains gentle; the factor therefore suppresses the inflation of the similarity between the text to be classified and a sample class with a very large sample count. Extensive testing shows that smoothing each similarity in this way can greatly improve classification accuracy for texts to be classified.
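A sketch of the smoothing step, using the 1/log(C_in) factor as reconstructed above (the exact form of the factor is inferred from the description of its behaviour, since the formula image is missing; note it requires every class to contain at least two samples, because log(1) = 0). The toy numbers illustrate the balancing effect: the sample-rich class's slightly higher raw similarity no longer wins.

```python
import math

def membership_weight(similarity, class_size):
    """g(d, C_i) = Sim(d_j, O_i) * 1/log(C_in); class_size must be >= 2."""
    return similarity * (1.0 / math.log(class_size))

# Hypothetical raw similarities: the large class scores a bit higher...
g_large = membership_weight(0.80, 1000)  # ≈ 0.80 / 6.91 ≈ 0.116
g_small = membership_weight(0.60, 10)    # ≈ 0.60 / 2.30 ≈ 0.261

# ...but after smoothing, the small class's weight is larger.
print(g_small > g_large)  # True
```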
140. Determine the maximum of the membership weights.
Specifically, the maximum of the membership weights is determined as:

    C = argmax_{1 <= i <= N} g(d, C_i)

wherein N denotes the total number of sample classes in the training set.
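This step is a plain argmax over the per-class weights; a sketch (the class names and weight values are hypothetical):

```python
# Membership weights g(d, C_i) per class, with hypothetical values
weights = {"news": 0.42, "sport": 0.17, "finance": 0.31}

# The class whose membership weight is the maximum
best_class = max(weights, key=weights.get)
print(best_class)  # news
```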
150. Classify the text to be classified into the sample class corresponding to the maximum.
For example, if the maximum is the membership weight obtained for the third sample class in the training set, and the third sample class in the training set is "news", then the text to be classified is classified into the "news" class.
The above text classification process is illustrated as follows:
Assume the training set contains two sample classes, sample class A and sample class B; sample class A contains three samples, a1, a2 and a3, and sample class B contains two samples, b1 and b2.
The set C of texts to be classified contains two texts, c1 and c2.
Samples a1, a2 and a3, samples b1 and b2, and texts c1 and c2 are each processed with word2vec, yielding the sample vectors of a1, a2 and a3, the sample vectors of b1 and b2, and the text vectors of c1 and c2.
First, the center vector O_A of sample class A and the center vector O_B of sample class B are calculated separately.
Secondly, the similarities between the text vector of c1 and the center vectors O_A and O_B, and between the text vector of c2 and the center vectors O_A and O_B, are calculated.
Finally, each similarity is smoothed, yielding the membership weight of c1 and of c2 for each sample class in the training set.
Each text to be classified is then classified into the sample class corresponding to the maximum of its membership weights; assume the resulting weights are such that text c1 is classified into sample class B and text c2 into sample class A.
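The worked example can be sketched end to end as follows. The sample vectors and text vectors are hypothetical (the patent's concrete numbers are not reproduced in this text); they are merely chosen so that, as in the example, c1 ends up in class B and c2 in class A.

```python
import math

def class_center(vectors):
    """Per-dimension mean of a class's sample vectors (center vector O_i)."""
    n, dim = len(vectors), len(vectors[0])
    return [sum(v[k] for v in vectors) / n for k in range(dim)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def classify(text_vector, training_set):
    """Return the class with the largest smoothed membership weight."""
    weights = {}
    for name, samples in training_set.items():
        sim = cosine_similarity(text_vector, class_center(samples))
        weights[name] = sim * (1.0 / math.log(len(samples)))  # smoothing; needs >= 2 samples
    return max(weights, key=weights.get)

# Hypothetical word2vec outputs for classes A (3 samples) and B (2 samples)
training_set = {
    "A": [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]],
    "B": [[0.0, 1.0], [0.1, 0.9]],
}
c1 = [0.1, 1.0]   # hypothetical text vector of c1
c2 = [1.0, 0.1]   # hypothetical text vector of c2

print(classify(c1, training_set))  # B
print(classify(c2, training_set))  # A
```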
In the text classification method provided by this embodiment, only the similarity between the text vector of the text to be classified and the center vector of each sample class in the training set is calculated, which considerably reduces the amount of computation in the classification process and improves classification efficiency; each similarity is smoothed based on the total number of samples in the corresponding sample class, yielding the membership weight of the text to be classified for each sample class in the training set; finally the text to be classified is classified into the sample class corresponding to the maximum of the membership weights, which solves the problem of classification errors caused by an unbalanced number of samples across the classes of the training set, thereby improving classification accuracy.
Embodiment two
Fig. 2 is a structural schematic diagram of a text classification apparatus provided by Embodiment two of the present invention. As shown in Fig. 2, the apparatus comprises an obtaining module 210, a computing module 220, a processing module 230, a statistics module 240 and a classification module 250;
wherein the obtaining module 210 is for obtaining the text vector corresponding to a text to be classified;
the computing module 220 is for calculating the similarity between the text vector and the center vector of each sample class in a training set;
the processing module 230 is for smoothing each similarity based on the total number of samples in the corresponding sample class, to obtain the membership weight of the text to be classified for each sample class in the training set;
the statistics module 240 is for determining the maximum of the membership weights;
the classification module 250 is for classifying the text to be classified into the sample class corresponding to the maximum.
Further, the apparatus also comprises:
a center-vector computing module, for calculating the center vector of each sample class in the training set.
Further, the center-vector computing module is specifically for:
calculating the center vector of each sample class in the training set according to the following formula:

    O_i = (1 / C_in) * Σ_{n=1}^{C_in} w_n

wherein O_i denotes the center vector of sample class i in the training set, C_in denotes the total number of samples in sample class i, and w_n denotes the n-th sample vector in sample class i.
Further, the computing module 220 is specifically for:
calculating the similarity between the text vector and the center vector of each sample class in the training set according to the following formula:

    Sim(d_j, O_i) = ( Σ_{k=1}^{V} w_jk * w_ik ) / ( sqrt(Σ_{k=1}^{V} w_jk^2) * sqrt(Σ_{k=1}^{V} w_ik^2) )

wherein Sim(d_j, O_i) denotes the similarity between text vector d_j and the center vector O_i of sample class i in the training set, V is the dimension of the text vector and of the center vector of each sample class, w_ik is the value of center vector O_i on dimension k, and w_jk is the value of text vector d_j on dimension k.
Further, the processing module 230 is specifically for:
smoothing each similarity according to the following formula, to obtain the membership weight of the text to be classified for each class in the training set:

    g(d, C_i) = Sim(d_j, O_i) * (1 / log(C_in))

wherein g(d, C_i) denotes the membership weight of text d to be classified for the i-th sample class C_i in the training set, Sim(d_j, O_i) denotes the similarity between the text vector d_j of text d and the center vector O_i of the i-th sample class C_i, and C_in is the total number of samples in the i-th sample class C_i in the training set.
Further, the obtaining module 210 is specifically for: processing the text to be classified with word2vec to obtain the corresponding text vector.
Further, each sample in the training set is a sample vector obtained through word2vec processing.
In the text classification apparatus provided by this embodiment, only the similarity between the text vector of the text to be classified and the center vector of each sample class in the training set is calculated, which considerably reduces the amount of computation in the classification process and improves classification efficiency; each similarity is smoothed based on the total number of samples in the corresponding sample class, yielding the membership weight of the text to be classified for each sample class in the training set; finally the text to be classified is classified into the sample class corresponding to the maximum of the membership weights, which solves the problem of classification errors caused by an unbalanced number of samples across the classes of the training set, thereby improving classification accuracy.
Embodiment three
Fig. 3 is a structural schematic diagram of an electronic device provided by Embodiment three of the present invention. As shown in Fig. 3, the electronic device comprises a processor 670, a memory 671, and a computer program stored on the memory 671 and runnable on the processor 670; the number of processors 670 can be one or more, with one processor 670 taken as an example in Fig. 3; the processor 670, when executing the computer program, implements the text classification method described in Embodiment one above. As shown in Fig. 3, the electronic device can also comprise an input apparatus 672 and an output apparatus 673. The processor 670, memory 671, input apparatus 672 and output apparatus 673 can be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 3.
The memory 671, as a computer-readable storage medium, can be used to store software programs, computer-executable programs and modules, such as the modules of the text classification apparatus in the embodiments of the invention (for example, the obtaining module 210, computing module 220, processing module 230, statistics module 240 and classification module 250 of the text classification apparatus). The processor 670, by running the software programs, instructions and modules stored in the memory 671, executes the various functional applications and data processing of the electronic device, that is, realizes the above text classification method.
The memory 671 can mainly comprise a program storage area and a data storage area, wherein the program storage area can store the operating system and the application program needed for at least one function, and the data storage area can store data created according to the use of the terminal, etc. In addition, the memory 671 may comprise a high-speed random access memory and may also comprise a non-volatile memory, such as at least one disk memory device, flash memory device or other non-volatile solid-state memory device. In some examples, the memory 671 can further comprise memories remotely located relative to the processor 670, and these remote memories can be connected to the electronic device/storage medium through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks and combinations thereof.
The input apparatus 672 can be used to receive input numeric or character information and to generate key-signal inputs related to the user settings and function control of the electronic device. The output apparatus 673 can comprise a display device such as a display screen.
Embodiment five
Embodiment five of the present invention also provides a storage medium comprising computer-executable instructions which, when executed by a computer processor, are used to execute a text classification method, the method comprising:
obtaining the text vector corresponding to a text to be classified;
calculating the similarity between the text vector and the center vector of each sample class in a training set;
smoothing each similarity based on the total number of samples in the corresponding sample class, to obtain the membership weight of the text to be classified for each sample class in the training set;
determining the maximum of the membership weights;
classifying the text to be classified into the sample class corresponding to the maximum.
Of course, in the storage medium comprising computer-executable instructions provided by the embodiments of the present invention, the computer-executable instructions are not limited to the method operations described above, and can also perform related operations of the text classification provided by any embodiment of the invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be realized by software plus the necessary common hardware, and naturally can also be realized by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical scheme of the present invention, in essence, or the part contributing to the prior art, can be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a computer floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk or an optical disk, and which includes instructions to cause a computer device (which can be a personal computer, a server or a network device, etc.) to execute the methods described in the embodiments of the present invention.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the invention is not limited to the specific embodiments described herein; various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the protection scope of the invention. Therefore, although the present invention has been described in further detail through the above embodiments, the invention is not limited to the above embodiments; without departing from the inventive concept, it can also include more other equivalent embodiments, and the scope of the invention is determined by the scope of the appended claims.
Claims (10)
1. A text classification method, characterized by comprising:
obtaining the text vector corresponding to a text to be classified;
calculating the similarity between the text vector and the center vector of each sample class in a training set;
smoothing each similarity based on the total number of samples in the corresponding sample class, to obtain the membership weight of the text to be classified for each sample class in the training set;
determining the maximum of the membership weights;
classifying the text to be classified into the sample class corresponding to the maximum.
2. The method according to claim 1, characterized in that before the calculating of the similarity between the text vector and the center vector of each sample class in the training set, the method also includes:
calculating the center vector of each sample class in the training set.
3. The method according to claim 2, characterized in that the calculating of the center vector of each sample class in the training set comprises:
calculating the center vector of each sample class in the training set according to the following formula:

    O_i = (1 / C_in) * Σ_{n=1}^{C_in} w_n

wherein O_i denotes the center vector of sample class i in the training set, C_in denotes the total number of samples in sample class i, and w_n denotes the n-th sample vector in sample class i.
4. The method according to claim 1, characterized in that calculating the similarity between the text vector and the center vector of each sample class in the training set comprises:
calculating the similarity between the text vector and the center vector of each sample class in the training set according to the following formula:
Sim(d_j, O_i) = ( Σ_{k=1}^{V} w_jk · w_ik ) / ( sqrt(Σ_{k=1}^{V} w_jk²) · sqrt(Σ_{k=1}^{V} w_ik²) )
wherein Sim(d_j, O_i) denotes the similarity between the text vector d_j and the center vector O_i of sample class i in the training set, V is the dimension of the text vector and of the center vector of each sample class, w_ik is the value of the center vector O_i on dimension k, w_jk is the value of the text vector d_j on dimension k, and j is the identifier of the text vector.
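The formula of claim 4 is the standard cosine similarity; a direct transcription in Python, with illustrative vectors:

```python
import math

def cosine_sim(d, o):
    """Sim(d_j, O_i): dot product divided by the product of the norms."""
    dot = sum(d_k * o_k for d_k, o_k in zip(d, o))
    norm_d = math.sqrt(sum(d_k ** 2 for d_k in d))
    norm_o = math.sqrt(sum(o_k ** 2 for o_k in o))
    return dot / (norm_d * norm_o)

print(cosine_sim([3.0, 4.0], [4.0, 3.0]))  # 0.96
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))  # 0.0
```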
5. The method according to claim 1, characterized in that smoothing each corresponding similarity based on the total sample number of each sample class, to obtain the ownership weight value of each sample class in the training set for the text to be classified, comprises:
smoothing each corresponding similarity according to the following formula, to obtain the ownership weight value of each class in the training set for the text to be classified:
g(d, C_i) = Sim(d_j, O_i) / C_in
wherein g(d, C_i) denotes the ownership weight value of the i-th sample class C_i in the training set for the text d to be classified, Sim(d_j, O_i) denotes the similarity between the text vector d_j of the text d to be classified and the center vector O_i of the i-th sample class C_i in the training set, and C_in is the total number of samples in the i-th sample class C_i in the training set.
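The formula image for the smoothing step is not reproduced in this publication; under a division-by-sample-count reading, which is one plausible interpretation consistent with the symbol definitions, its effect on an imbalanced training set looks like this (all numbers illustrative):

```python
# similarities of one text to two class centroids (illustrative values)
sims = {"A": 0.9, "B": 0.7}
# total sample counts per class: class A is heavily over-represented
counts = {"A": 900, "B": 70}

# g(d, C_i) = Sim(d_j, O_i) / C_in  -- smoothing by class size
weights = {c: sims[c] / counts[c] for c in sims}
best = max(weights, key=weights.get)
print(best)  # B wins: 0.7/70 = 0.01 > 0.9/900 = 0.001
```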
6. The method according to claim 1, characterized in that obtaining the text vector corresponding to the text to be classified comprises:
processing the text to be classified by word2vec to obtain the corresponding text vector.
7. The method according to claim 6, characterized in that each sample in the training set is a sample vector that has been processed by word2vec.
8. A text classification apparatus, characterized in that the apparatus comprises:
an obtaining module, configured to obtain a text vector corresponding to a text to be classified;
a computing module, configured to calculate a similarity between the text vector and a center vector of each sample class in a training set;
a processing module, configured to smooth each corresponding similarity based on a total sample number of each sample class, to obtain an ownership weight value of each sample class in the training set for the text to be classified;
a statistics module, configured to determine a maximum value among the ownership weight values; and
a classification module, configured to classify the text to be classified into the sample class corresponding to the maximum value.
9. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, characterized in that the processor, when executing the computer program, implements the text classification method according to any one of claims 1-7.
10. A storage medium comprising computer-executable instructions, characterized in that the computer-executable instructions, when executed by a computer processor, implement the text classification method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811018629.3A CN109145116A (en) | 2018-09-03 | 2018-09-03 | A kind of file classification method, device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109145116A true CN109145116A (en) | 2019-01-04 |
Family
ID=64826251
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811018629.3A Pending CN109145116A (en) | 2018-09-03 | 2018-09-03 | A kind of file classification method, device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109145116A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111813940A (en) * | 2020-07-14 | 2020-10-23 | 科大讯飞股份有限公司 | Text field classification method, device, equipment and storage medium |
CN112819018A (en) * | 2019-10-31 | 2021-05-18 | 北京沃东天骏信息技术有限公司 | Method and device for generating sample, electronic equipment and storage medium |
CN112988954A (en) * | 2021-05-17 | 2021-06-18 | 腾讯科技(深圳)有限公司 | Text classification method and device, electronic equipment and computer-readable storage medium |
WO2021151343A1 (en) * | 2020-09-09 | 2021-08-05 | 平安科技(深圳)有限公司 | Test sample category determination method and apparatus for siamese network, and terminal device |
CN113920373A (en) * | 2021-10-29 | 2022-01-11 | 平安银行股份有限公司 | Object classification method and device, terminal equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2459741B1 (en) * | 2009-07-28 | 2014-01-01 | Universiteit Maastricht | In vitro method for predicting whether a compound is genotoxic in vivo |
CN104063472A (en) * | 2014-06-30 | 2014-09-24 | 电子科技大学 | KNN text classifying method for optimizing training sample set |
US20170206904A1 (en) * | 2016-01-19 | 2017-07-20 | Knuedge Incorporated | Classifying signals using feature trajectories |
CN107609121A (en) * | 2017-09-14 | 2018-01-19 | 深圳市玛腾科技有限公司 | Newsletter archive sorting technique based on LDA and word2vec algorithms |
CN108268458A (en) * | 2016-12-30 | 2018-07-10 | 广东精点数据科技股份有限公司 | A kind of semi-structured data sorting technique and device based on KNN algorithms |
2018-09-03: application CN201811018629.3A filed in China (CN); publication CN109145116A (en); status: Pending
Non-Patent Citations (3)
Title |
---|
LIU Congshan et al.: "A Text Classification Algorithm Based on Neighborhood Components Analysis", Computer Engineering * |
WANG Zhijin, LI Pei, LI Ying et al.: "Theory and Methods of Knowledge Organization", 30 June 2009 * |
HAO Xiulan et al.: "A Strategy for Handling the Class-Skew Problem of KNN Text Classifiers", Journal of Computer Research and Development * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109145116A (en) | A kind of file classification method, device, electronic equipment and storage medium | |
US20190243900A1 (en) | Automatic questioning and answering processing method and automatic questioning and answering system | |
CN107609121A (en) | Newsletter archive sorting technique based on LDA and word2vec algorithms | |
CN107844560A (en) | A data access method and apparatus, computer device, and readable storage medium | |
CN111159404B (en) | Text classification method and device | |
CN103336766A (en) | Short text garbage identification and modeling method and device | |
WO2020087774A1 (en) | Concept-tree-based intention recognition method and apparatus, and computer device | |
CN106446931A (en) | Feature extraction and classification method and system based on support vector data description | |
CN109508378A (en) | A kind of sample data processing method and processing device | |
CN103294817A (en) | Text feature extraction method based on categorical distribution probability | |
CN110059156A (en) | Associated-word-based coordinated retrieval method, apparatus, device, and readable storage medium | |
CN110263854A (en) | Live streaming label determines method, apparatus and storage medium | |
WO2022121163A1 (en) | User behavior tendency identification method, apparatus, and device, and storage medium | |
CN112287656B (en) | Text comparison method, device, equipment and storage medium | |
CN107885849A (en) | A kind of moos index analysis system based on text classification | |
CN108090178A (en) | A kind of text data analysis method, device, server and storage medium | |
CN109558587A (en) | A kind of classification method for the unbalanced public opinion orientation identification of category distribution | |
CN110309234A (en) | A knowledge-graph-based client position early-warning method, apparatus, and storage medium | |
CN109960791A (en) | Judge the method and storage medium, terminal of text emotion | |
CN109902157A (en) | A kind of training sample validation checking method and device | |
CN110069546A (en) | A kind of data classification method, device for classifying data and terminal device | |
CN109635755A (en) | Face extraction method, apparatus and storage medium | |
CN106372237A (en) | Fraudulent mail identification method and device | |
CN103218419B (en) | Web tab clustering method and system | |
CN108959568A (en) | Intelligent file dissemination system and distribution method |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190104 |