CN106897262A - Text classification method and device, and processing method and device - Google Patents


Info

Publication number
CN106897262A
Authority
CN
China
Prior art keywords
keyword
categories
data
word
mapping relations
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611128580.8A
Other languages
Chinese (zh)
Inventor
周扬
何帝君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201611128580.8A
Publication of CN106897262A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns

Abstract

This application discloses a text classification method and device, and a processing method and device. The text classification method includes: performing word segmentation on data to be predicted to obtain each word in the data to be predicted; determining, according to a preset first mapping relation between categories and seed keywords and a second mapping relation between categories and expanded keywords, the likelihood values with which the seed keywords and/or expanded keywords corresponding to the words in the data to be predicted belong to each category; and classifying the data to be predicted according to those likelihood values. The second mapping relation is established from the first mapping relation and training data by using a semi-supervised topic model. The technical solution of this application can reduce the cost of text classification and improve its efficiency.

Description

Text classification method and device, and processing method and device
Technical field
This application relates to the technical field of data analysis, and in particular to a text classification method and device and a processing method and device.
Background
With the rapid development and spread of Internet technology, how to classify, organize, and manage vast amounts of data has become a research topic of great practical importance. Among these data, text is the most abundant type. In general, text classification is the process of using computer technology to automatically determine the category of a text from its content, according to a given taxonomy or standard.
Text classification usually relies on machine learning. Based on statistical theory, machine learning uses algorithms to give machines a human-like ability to "learn": known training data is analyzed statistically to obtain rules, and the rules are then used to make predictions about unknown data. In general, a machine learning method for text classification comprises three stages. Labeling: professionals label and classify a batch of documents, which serve as the training set. Training: a computer mines the training set for effective rules that can be used for classification and generates a classifier. Classification: the generated classifier is applied to the documents to be classified to obtain their classification results.
In practice, an online service platform, for example, often receives a large volume of text such as user inquiries and feedback. The users' opinions then need to be analyzed and classified in order to better understand customer demands and provide targeted solutions. In the prior art, the labeling stage requires professionals to label the text to be classified manually at the sentence level, that is, to decide by hand which category each sentence belongs to. If a particular application needs a fairly large training set, a large number of sentences must be labeled manually; for example, a dedicated person has to read, understand, and judge every sentence. Such manual labeling is therefore very time-consuming and laborious.
Accordingly, a text classification method and device are needed to solve the problems of high cost and low efficiency of text classification in the prior art.
Summary of the invention
In view of this, the embodiments of this application provide a text classification method and device, and a processing method and device, to solve the problems of high cost and low efficiency of text classification in the prior art.
An embodiment of this application provides a text classification method, including:
performing word segmentation on data to be predicted to obtain each word in the data to be predicted;
determining, according to a preset first mapping relation between categories and seed keywords and a second mapping relation between categories and expanded keywords, the likelihood values with which the seed keywords and/or expanded keywords corresponding to the words in the data to be predicted belong to each category; and
classifying the data to be predicted according to the likelihood values with which the seed keywords and/or expanded keywords corresponding to the words in the data to be predicted belong to each category;
wherein the second mapping relation is established according to the first mapping relation and training data by using a semi-supervised topic model.
An embodiment of this application also provides a text classification device, including:
a word segmentation module, configured to perform word segmentation on data to be predicted to obtain each word in the data to be predicted;
a determining module, configured to determine, according to a preset first mapping relation between categories and seed keywords and a second mapping relation between categories and expanded keywords, the likelihood values with which the seed keywords and/or expanded keywords corresponding to the words in the data to be predicted belong to each category; and
a classification module, configured to classify the data to be predicted according to the likelihood values with which the seed keywords and/or expanded keywords corresponding to the words in the data to be predicted belong to each category;
wherein the text classification device further includes a second establishing module, configured to establish the second mapping relation according to the first mapping relation and training data by using a semi-supervised topic model.
An embodiment of this application also provides a processing method, including:
establishing a preset first mapping relation between categories and seed keywords;
performing word segmentation on training data to obtain each word in the training data; and
determining, based on the preset first mapping relation between categories and seed keywords and on the words in the training data, expanded keywords from the words in the training data by means of a semi-supervised topic model, and establishing a second mapping relation between the categories and the expanded keywords;
wherein the first mapping relation and the second mapping relation are used to perform text classification on data to be predicted after the data to be predicted is received.
An embodiment of this application also provides a processing device, including:
a first establishing module, configured to establish a preset first mapping relation between categories and seed keywords;
a word segmentation module, configured to perform word segmentation on training data to obtain each word in the training data; and
a second establishing module, configured to determine, based on the preset first mapping relation between categories and seed keywords and on the words in the training data, expanded keywords from the words in the training data by means of a semi-supervised topic model, and to establish a second mapping relation between the categories and the expanded keywords;
wherein the processing device further includes a classification module, configured to perform text classification on data to be predicted according to the first mapping relation and the second mapping relation after the data to be predicted is received.
According to the embodiments of this application, a user presets several categories and several seed keywords belonging to each category, and then selects unlabeled training data of a suitable scale as needed. The semi-supervised topic model automatically generates expanded keywords and establishes the mapping relation between categories and expanded keywords, so that data to be predicted can be classified. No large number of sentences needs to be labeled manually. Moreover, a small number of seed keywords, the expanded keywords, and the many ways in which keywords can combine cover a lexical or semantic range comparable to that of a large number of labeled sentences. The embodiments of this application can therefore reduce the cost of text classification and improve its efficiency.
Brief description of the drawings
The drawings described here provide a further understanding of this application and form part of it. The exemplary embodiments of this application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a schematic flowchart of a text classification method according to an embodiment of this application.
Fig. 2 is a schematic diagram of the relation between specified categories and seed keywords according to an embodiment of this application.
Fig. 3 is a schematic diagram of a text classification method according to an embodiment of this application, showing the detailed procedure and information flow of text classification.
Fig. 4 is a schematic structural diagram of a text classification device according to an embodiment of this application.
Fig. 5 is a schematic structural diagram of a processing device according to an embodiment of this application.
Detailed description
To achieve the purpose of this application, the embodiments provide a text classification method and device that perform word segmentation on data to be predicted to obtain each word in the data to be predicted; determine, according to a preset first mapping relation between categories and seed keywords and a second mapping relation between categories and expanded keywords, the likelihood values with which the seed keywords and/or expanded keywords corresponding to the words in the data to be predicted belong to each category; and classify the data to be predicted according to those likelihood values. The second mapping relation is established according to the first mapping relation and training data by using a semi-supervised topic model.
The text classification method provided by the embodiments of this application involves the use of a semi-supervised topic model.
A topic model is a statistical model used in fields such as machine learning and natural language processing to discover abstract topics in a collection of documents. Intuitively, if an article has a central idea, certain specific words will appear more often. In reality, an article usually contains several topics, each in a different proportion. A topic model tries to capture this property of documents in a mathematical framework. It analyzes each document automatically, counts the words in it, and infers from those statistics which topics the document contains and what proportion each topic accounts for.
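The idea that a document mixes several topics in different proportions is commonly written as the formula below. This is the standard mixture form used by topic models such as pLSA and LDA, not a formula stated in this application; the symbols d (document), w (word), z (topic), and K (number of topics) are notation introduced here for illustration.

```latex
P(w \mid d) \;=\; \sum_{k=1}^{K} P(w \mid z = k)\, P(z = k \mid d),
\qquad \sum_{k=1}^{K} P(z = k \mid d) = 1
```

Here P(z = k | d) plays the role of the proportion of topic k in document d, and P(w | z = k) describes which words are typical of topic k.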
Semi-supervised learning (SSL) is a key research problem in pattern recognition and machine learning. It is a learning approach that combines supervised and unsupervised learning, and it is mainly concerned with how to train and classify using a small number of labeled samples together with a large number of unlabeled samples.
Topic models are generally a form of unsupervised learning, used to discover the structure of the data itself. The semi-supervised topic model used in the embodiments of this application combines the ideas of semi-supervised learning and topic models, yielding a class of methods that can solve classification problems without requiring a large amount of labeled data.
The ordinals "first" and "second" in "first mapping relation" and "second mapping relation" have no special meaning in the embodiments of this application; they merely distinguish different mapping relations.
To make the purpose, technical solutions, and advantages of this application clearer, the technical solutions are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative effort fall within the scope of protection of this application.
The technical solutions provided by the embodiments of this application are described in detail below with reference to the drawings.
Fig. 1 is a schematic flowchart of a text classification method according to an embodiment of this application.
The text classification method of the embodiment shown in Fig. 1 comprises the following steps.
Step S110: perform word segmentation on the data to be predicted to obtain each word in the data to be predicted.
Specifically, the data to be predicted includes Chinese text that needs to be classified and that has not been labeled. For ease of description, the data to be predicted is described below in the form of a single Chinese sentence.
A Chinese sentence in the data to be predicted can be cut by a word segmentation tool into individual words separated by spaces. The word segmentation tool mentioned here can be any Chinese word segmentation tool known to those skilled in the art, including Chinese word segmentation algorithms, programs, and the like.
For example, suppose the data to be predicted includes a sentence such as "Yao Ming retires from the CBA". After segmentation with a word segmentation tool, the sentence takes the space-separated form "Yao Ming from CBA retired", that is, four words are generated: "Yao Ming", "from", "CBA", and "retired".
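The segmentation step can be illustrated with forward maximum matching, one simple dictionary-based approach. This is only a sketch: the application leaves the choice of segmentation tool open, the function name and the toy vocabulary are hypothetical, and the Chinese string is our assumed back-translation of the example sentence rather than text quoted from the source.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy forward maximum matching: at each position take the longest vocabulary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy vocabulary; a real tool would use a large dictionary or a statistical model.
VOCAB = {"姚明", "从", "CBA", "退役"}
sentence = "姚明从CBA退役"  # assumed rendering of "Yao Ming retires from the CBA"
tokens = forward_max_match(sentence, VOCAB)
print(" ".join(tokens))  # space-separated words, as described in the text
```

Greedy matching is easy to follow but can mis-segment ambiguous strings, which is one reason production segmenters usually rely on statistical models.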
Step S120: according to the preset first mapping relation between categories and seed keywords and the second mapping relation between categories and expanded keywords, determine the likelihood values with which the seed keywords and/or expanded keywords corresponding to the words in the data to be predicted belong to each category.
Specifically, the first mapping relation is the preset mapping relation between categories and seed keywords; that is, several categories and several seed keywords belonging to each category are preset according to actual needs and the application scenario.
In the prior art, the labeling stage requires professionals to label text manually at the sentence level, that is, to decide by hand which category each sentence belongs to. Table 1 gives an example.
Table 1: Categories and sentences in the prior art
That is, several categories are set first (the two categories "sports" and "finance" in Table 1), and then each sentence (labeled sample 1 and labeled sample 2) is read and understood by a person, who judges which category it belongs to (by manual judgment, labeled sample 1 belongs to the "sports" category and labeled sample 2 to the "finance" category). To obtain a training set of sufficient scale, this work must be repeated many times. Each classified sentence can then be segmented with a word segmentation tool and, after noise is removed, a mapping relation between each word and its category is formed.
The embodiments of this application instead use seed keywords: the categories, and the seed keywords that each category contains, are specified. In other words, samples are labeled at the word level. Table 2 gives an example.
Table 2: Categories and seed keywords
Category | Seed keywords
Sports | Yao Ming, match, ...
Finance | Yao Ming, investment, ...
Table 2 specifies two categories, "sports" and "finance". The two seed keywords "Yao Ming" and "match" are specified for the "sports" category, and the two seed keywords "Yao Ming" and "investment" for the "finance" category. Those skilled in the art should understand that these categories and seed keywords are examples; any number of categories and any number of seed keywords can be specified according to actual needs and application scenarios.
Fig. 2 is a schematic diagram of the relation between specified categories and seed keywords according to an embodiment of this application. As Fig. 2 shows, a many-to-many mapping relation is established between categories and seed keywords. For example, the category "sports" corresponds to several seed keywords ("Yao Ming", "match", and so on), and the seed keyword "Yao Ming" corresponds to several categories ("sports", "finance", and so on). This many-to-many mapping helps reveal the similarity between words and plays an important role in obtaining expanded keywords automatically.
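The many-to-many relation between categories and seed keywords can be held in a pair of dictionaries, one per direction. The sketch below is illustrative only: the variable and function names are hypothetical, the two example categories are written here as `sports` and `finance`, and only the seed keywords named in Table 2 are included.

```python
from collections import defaultdict

# Table 2 as a category -> seed keywords mapping (only the keywords named in the text).
CATEGORY_TO_SEEDS = {
    "sports": {"Yao Ming", "match"},
    "finance": {"Yao Ming", "investment"},
}


def invert(category_to_seeds):
    """Build the reverse view: seed keyword -> set of categories it belongs to."""
    seed_to_categories = defaultdict(set)
    for category, seeds in category_to_seeds.items():
        for seed in seeds:
            seed_to_categories[seed].add(category)
    return dict(seed_to_categories)


SEED_TO_CATEGORIES = invert(CATEGORY_TO_SEEDS)
print(sorted(SEED_TO_CATEGORIES["Yao Ming"]))  # ['finance', 'sports']
print(sorted(SEED_TO_CATEGORIES["match"]))     # ['sports']
```

Keeping both directions makes it cheap to ask which categories a word found in a sentence may point to.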
After the categories and the seed keywords contained in each category have been specified, the embodiments of this application use a mapping algorithm to determine the likelihood value with which each seed keyword belongs to each category.
Specifically, the embodiments of this application use the TF-IDF (term frequency-inverse document frequency) mapping algorithm. TF-IDF is a statistical method for assessing how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document (its term frequency), but decreases as the word appears more frequently across the corpus (which is captured by the inverse document frequency).
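A common textbook form of TF-IDF is sketched below, assuming term frequency is the within-document relative count and inverse document frequency is log(N / df). The application does not specify which variant it uses, nor how the weights are turned into the 0-to-1 likelihood values it reports, so the function name, the toy corpus, and the weighting formula are all assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # tf = relative frequency in this document, idf = log(N / df)
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already segmented into words.
docs = [
    ["Yao Ming", "match", "match"],
    ["Yao Ming", "investment"],
    ["stock", "investment", "rise"],
]
w = tf_idf(docs)
# "match" is frequent in the first document and rare elsewhere, so it outweighs "Yao Ming".
print(w[0]["match"] > w[0]["Yao Ming"])  # True
```

A word that occurred in every document would receive weight 0 under this variant, which matches the intuition that such a word does not distinguish one document from another.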
In the embodiments of this application, the TF-IDF mapping algorithm is used to determine the likelihood value with which each seed keyword belongs to a predetermined category. Table 3 gives an example.
Table 3: Categories, seed keywords, and likelihood values
Category | Seed keywords and likelihood values
Sports | Yao Ming 0.5, match 0.8, ...
Finance | Yao Ming 0.5, investment 0.9, ...
In Table 3, the likelihood value with which the seed keyword "Yao Ming" belongs to the predetermined category "sports" is determined to be 0.5; similarly, the likelihood value with which "match" belongs to "sports" is 0.8. Likewise, the likelihood value with which "Yao Ming" belongs to the category "finance" is 0.5, and that with which "investment" belongs to "finance" is 0.9.
The established first mapping relation reflects the correspondence between categories and seed keywords and also contains the likelihood with which each seed keyword belongs to a category. According to the embodiments of this application, this correspondence is many-to-many. Each category contains several seed keywords; each seed keyword belongs to one or more categories and has a likelihood value for each category it belongs to. This likelihood value is expressed as a probability-like number in the range 0 to 1.
Those skilled in the art will understand that, without departing from the spirit and scope of the invention, other mapping algorithms capable of determining the likelihood with which a seed keyword belongs to each category may also be used, and that other forms of likelihood value are also within the scope of the invention.
The second mapping relation is the mapping relation between categories and expanded keywords. According to the embodiments of this application, the second mapping relation is established according to the first mapping relation and training data by using a semi-supervised topic model.
Establishing the second mapping relation is a training process. It includes performing word segmentation on the training data to obtain each word in the training data. Specifically, unlabeled training data of a suitable scale, for example a number of unlabeled Chinese sentences, is selected according to actual needs and the application scenario. A word segmentation tool then cuts each sentence into individual words separated by spaces.
Expanded keywords are then generated from the words in the training data. Specifically, the segmented training data (the individual words) and the previously established first mapping relation (the correspondence between categories and seed keywords) are fed to the semi-supervised topic model, which learns the many-to-many relation between categories and words and yields the mapping relation between categories and expanded keywords.
For example, suppose the training data includes the following Chinese sentences: "Yao Ming played in the CBA and the NBA", "Yao Ming invests in the Shanghai Sharks", "Yao Ming recorded a triple-double in the last match", "Yao Ming sells a luxury home", "Yao Ming's assets rose compared with last year", "Stocks fall and Yao Ming's assets shrink", and so on. Through learning with the semi-supervised topic model, expanded keywords such as "NBA", "CBA", "shark", "sale", "rise", and "fall" are generated from the training data (the words obtained after segmentation).
Then, according to the similarity between the expanded keywords and the seed keywords, the category to which each expanded keyword belongs and the likelihood value with which it belongs to that category are determined.
For example, as shown in Table 2 and Table 3, seed keywords such as "Yao Ming" and "match" have been specified for the "sports" category, and seed keywords such as "Yao Ming" and "investment" for the "finance" category. From the similarity between an expanded keyword and these seed keywords, the category of the expanded keyword and the likelihood value with which it belongs to that category can be determined.
For example, the expanded keyword "NBA" is highly similar to the seed keyword "match" (which belongs to the "sports" category), so "NBA" is determined to belong to the "sports" category with a relatively high likelihood value of 0.8. Likewise, the expanded keyword "sale" is highly similar to the seed keyword "investment" (which belongs to the "finance" category), so "sale" is determined to belong to the "finance" category with a relatively high likelihood value of 0.8. Table 4 lists example categories, expanded keywords, and likelihood values.
Table 4: Categories, expanded keywords, and likelihood values
Category | Expanded keywords and likelihood values
Sports | NBA 0.8, CBA 0.8, shark 0.4, ...
Finance | sale 0.8, rise 0.66, fall 0.66, ...
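The step of assigning an expanded keyword to the category whose seed keywords it most resembles can be sketched as follows. This is not the semi-supervised topic model of the application, whose internals are not given here; it is a deliberately simple stand-in that measures similarity by sentence co-occurrence. The function names, the toy training sentences, and the resulting scores are all hypothetical and do not reproduce the values in Table 4.

```python
import math


def occurrence_sets(sentences):
    """Map each word to the set of sentence indices in which it occurs."""
    occ = {}
    for idx, words in enumerate(sentences):
        for word in words:
            occ.setdefault(word, set()).add(idx)
    return occ


def similarity(a, b, occ):
    """Cosine similarity between the binary sentence-occurrence vectors of a and b."""
    sa, sb = occ.get(a, set()), occ.get(b, set())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / math.sqrt(len(sa) * len(sb))


def expand(sentences, category_to_seeds):
    """Assign every non-seed word to the category whose seeds it most resembles."""
    occ = occurrence_sets(sentences)
    all_seeds = set().union(*category_to_seeds.values())
    expanded = {category: {} for category in category_to_seeds}
    for word in occ:
        if word in all_seeds:
            continue
        scores = {
            category: max(similarity(word, seed, occ) for seed in seeds)
            for category, seeds in category_to_seeds.items()
        }
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            expanded[best][word] = round(scores[best], 2)
    return expanded


# Hypothetical segmented training sentences and the seed keywords of Table 2.
training = [
    ["Yao Ming", "match", "NBA"],
    ["match", "NBA", "CBA"],
    ["Yao Ming", "investment", "sale"],
    ["investment", "stock", "sale"],
]
seeds = {"sports": {"Yao Ming", "match"}, "finance": {"Yao Ming", "investment"}}
second_mapping = expand(training, seeds)
print(second_mapping)
```

In this toy run "NBA" lands in the sports category because it always co-occurs with "match", and "sale" lands in finance because it always co-occurs with "investment"; a topic model would infer such groupings from far richer statistics.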
Step S130: classify the data to be predicted according to the likelihood values with which the seed keywords and/or expanded keywords corresponding to the words in the data to be predicted belong to each category.
Specifically, the classification includes: for each category, summing the likelihood values with which the seed keywords and/or expanded keywords corresponding to the words in the data to be predicted belong to that category; and determining the category of the data to be predicted from the sums obtained for the categories.
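The summation rule can be written compactly as below. The notation is introduced here for illustration and is not part of the application: W is the set of words obtained by segmenting the data to be predicted, C is the set of categories, and p(w, c) is the likelihood value with which word w belongs to category c under the first or second mapping relation, taken as 0 when w is neither a seed keyword nor an expanded keyword of c.

```latex
S(c) \;=\; \sum_{w \in W} p(w, c), \qquad
\hat{c} \;=\; \arg\max_{c \in C} S(c)
```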
For example, as described above, the sentence "Yao Ming retires from the CBA" in the data to be predicted yields four words after segmentation: "Yao Ming", "from", "CBA", and "retired". Each word is looked up among the seed keywords and the expanded keywords. In the "sports" category, the lookup finds the seed keyword "Yao Ming" with likelihood value 0.5 and the expanded keyword "CBA" with likelihood value 0.8. Summing all likelihood values for the "sports" category gives 0.5 + 0.8 = 1.3.
Similarly, for the same words, the lookup in the "finance" category finds the seed keyword "Yao Ming" with likelihood value 0.5. Summing all likelihood values for the "finance" category gives 0.5.
"From" and "retired" appear neither among the seed keywords nor among the expanded keywords; they contribute nothing to the likelihood value of any category and are left out of the calculation.
Because the sum of 1.3 for the "sports" category is greater than the sum of 0.5 for the "finance" category, the sentence "Yao Ming retires from the CBA" in the data to be predicted is determined to belong to the "sports" category.
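The worked example can be reproduced with a few lines of code. The sketch below takes the likelihood values from Table 3 and Table 4 and applies the summation rule; the function and variable names are hypothetical, and the English keyword spellings are our renderings of the translated terms.

```python
# First mapping (Table 3) and second mapping (Table 4): category -> {keyword: likelihood value}.
SEED = {
    "sports": {"Yao Ming": 0.5, "match": 0.8},
    "finance": {"Yao Ming": 0.5, "investment": 0.9},
}
EXPANDED = {
    "sports": {"NBA": 0.8, "CBA": 0.8, "shark": 0.4},
    "finance": {"sale": 0.8, "rise": 0.66, "fall": 0.66},
}


def category_scores(words, seed, expanded):
    """Sum, per category, the likelihood values of the words that are seed or expanded keywords."""
    scores = {}
    for category in seed:
        table = {**seed[category], **expanded.get(category, {})}
        # Words that are not keywords of this category contribute 0.
        scores[category] = sum(table.get(word, 0.0) for word in words)
    return scores


def classify(words, seed=SEED, expanded=EXPANDED):
    """Return the category with the largest sum, together with all sums."""
    scores = category_scores(words, seed, expanded)
    return max(scores, key=scores.get), scores


label, scores = classify(["Yao Ming", "from", "CBA", "retired"])
print(label, scores)  # sports wins: about 1.3 for sports versus 0.5 for finance
```

How ties between categories should be broken is not addressed in the text; this sketch simply returns the first category that reaches the maximum.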
In the same way, every sentence in the data to be predicted is segmented, the likelihood values for each category are summed using the first and second mapping relations, and the category of each sentence is determined by comparing the resulting sums. All sentences in the data to be predicted can be classified in this manner.
Compared with the sentence-level labeling of the prior art, the word-level labeling of the embodiments of this application effectively reduces cost and improves the efficiency of text classification. First, in terms of the cost of reading, understanding, and judging, processing a word costs less than processing a sentence. Second, the user does not need to label a large number of sentences manually and only needs to preset several categories and a relatively small number of seed keywords. The semi-supervised topic model, combined with the training data, automatically generates expanded keywords and establishes the mapping relations. Seed keywords and expanded keywords can form a large number of combinations, covering a lexical or semantic range comparable to that of a large number of labeled sentences. With the sentence-level labeling of the prior art, by contrast, the range covered by each labeled sentence is essentially fixed; a larger range can be covered only by adding many more sentences, and sentences can only be accumulated, not combined in different ways.
The detailed procedure and information flow of the text classification method according to the embodiments of this application are also shown in Fig. 3.
In Fig. 3, a word segmentation tool segments the data to be predicted 310 to obtain the words 312 in the data to be predicted. A mapping algorithm establishes the first mapping relation 322 from the specified categories and seed keywords 320. A word segmentation tool segments the training data 330 to obtain the words 332 in the training data. The words 332 in the training data and the established first mapping relation 322 are fed to the semi-supervised topic model 340 to establish the second mapping relation 342.
According to the first mapping relation 322 and the second mapping relation 342, the likelihood values with which the seed keywords and/or expanded keywords corresponding to the words 312 in the data to be predicted belong to each category are determined and summed 350, and the category 352 of the data to be predicted is determined by comparing the resulting sums.
Fig. 4 is a schematic structural diagram of a text classification apparatus according to an embodiment of the present application.
The text classification apparatus 400 according to the embodiment of the present application includes a word segmentation module 410, a determining module 420, a classification module 430, a first establishing module 440, and a second establishing module 450.
The word segmentation module 410 is configured to perform word segmentation on the data to be predicted to obtain each word in the data to be predicted.
The determining module 420 is configured to determine, according to the first mapping relation between the preset categories and the seed keywords and the second mapping relation between the categories and the expanded keywords, the possibility characterization values with which the seed keywords and/or expanded keywords corresponding to each word in the data to be predicted respectively belong to each category.
The classification module 430 is configured to classify the data to be predicted according to the possibility characterization values with which the seed keywords and/or expanded keywords corresponding to each word in the data to be predicted respectively belong to each category.
The first establishing module 440 is configured to establish the first mapping relation in the following manner: specifying the categories and the seed keywords that each category includes; and determining, by a mapping algorithm, the possibility characterization value with which each seed keyword belongs to each category.
The second establishing module 450 is configured to establish the second mapping relation from the first mapping relation and the training data by using a semi-supervised topic model.
According to the embodiment of the present application, the mapping algorithm used by the first establishing module 440 includes a TF-IDF mapping algorithm.
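One way to picture a TF-IDF style mapping from seed keywords to categories is the sketch below. It rests on an assumption that this passage does not confirm: each category's seed keyword list is treated as one "document", so that a seed keyword shared by several categories receives a lower weight than one unique to a single category. The function name `build_first_mapping`, the smoothed IDF formula, and the sample keywords are illustrative only.

```python
import math
from collections import Counter


def build_first_mapping(seeds_by_category):
    """Return keyword -> {category: TF-IDF weight}, one 'document' per category."""
    n = len(seeds_by_category)
    # Document frequency: number of categories whose seed list contains the keyword.
    df = Counter()
    for seeds in seeds_by_category.values():
        df.update(set(seeds))
    mapping = {}
    for category, seeds in seeds_by_category.items():
        counts = Counter(seeds)
        for keyword, count in counts.items():
            tf = count / len(seeds)
            # Smoothed IDF (an assumption): stays positive even for shared keywords.
            idf = math.log((1 + n) / (1 + df[keyword])) + 1.0
            mapping.setdefault(keyword, {})[category] = tf * idf
    return mapping


# Hypothetical categories and seed keywords
seeds_by_category = {
    "after_sales": ["refund", "return", "order"],
    "account": ["login", "password", "order"],
}
first_mapping = build_first_mapping(seeds_by_category)
```

In this sketch "order" appears in both categories and therefore gets a lower value in each than "refund" gets in "after_sales". The resulting values play the role of the possibility characterization values.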
The second establishing module 450 is further configured to: perform word segmentation on the training data to obtain each word in the training data; generate expanded keywords based on the words in the training data; and determine, according to the degree of approximation between each expanded keyword and the seed keywords, the category to which the expanded keyword belongs and the possibility characterization value with which the expanded keyword belongs to each category.
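The idea of scoring candidate words by their degree of approximation to the seed keywords can be sketched as follows. This is not the semi-supervised topic model itself: a simple cosine similarity over document co-occurrence is swapped in purely for illustration. The function name `build_second_mapping`, the `threshold` parameter, and the sample data are assumptions.

```python
import math
from collections import defaultdict


def build_second_mapping(train_docs, first_mapping, threshold=0.6):
    """Return expanded keyword -> {category: value} from segmented training documents."""
    # Document-incidence sets: word -> indices of the training documents containing it.
    occurs = defaultdict(set)
    for i, words in enumerate(train_docs):
        for w in words:
            occurs[w].add(i)

    def similarity(a, b):
        # Cosine similarity between binary document-incidence vectors.
        shared = len(occurs[a] & occurs[b])
        return shared / math.sqrt(len(occurs[a]) * len(occurs[b]))

    second_mapping = {}
    for word in list(occurs):
        if word in first_mapping:
            continue  # seed keywords are already covered by the first mapping
        scores = defaultdict(float)
        for seed, categories in first_mapping.items():
            if seed not in occurs:
                continue  # seed never appears in the training data
            sim = similarity(word, seed)
            for category, value in categories.items():
                # Keep the strongest link to any seed keyword of the category.
                scores[category] = max(scores[category], sim * value)
        kept = {c: s for c, s in scores.items() if s >= threshold}
        if kept:
            second_mapping[word] = kept
    return second_mapping


# Hypothetical segmented training data and first mapping relation
train_docs = [
    ["refund", "money", "back"],
    ["refund", "money", "order"],
    ["login", "password", "forgot"],
    ["login", "password", "reset"],
]
first_mapping = {"refund": {"after_sales": 1.0}, "login": {"account": 1.0}}
second_mapping = build_second_mapping(train_docs, first_mapping)
```

Words that always co-occur with a seed keyword, such as "money" with "refund", inherit that seed keyword's category with a high value, while weakly related words fall below the threshold and are not kept as expanded keywords.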
The classification module 430 is further configured to: for each category, sum the possibility characterization values with which the seed keywords and/or expanded keywords corresponding to each word in the data to be predicted belong to that category; and determine the category to which the data to be predicted belongs according to the sum obtained for each category.
In addition, the embodiment of the present application also provides a processing method and apparatus. The processing method includes: establishing a first mapping relation between preset categories and seed keywords; performing word segmentation on training data to obtain each word in the training data; and, based on the first mapping relation between the preset categories and the seed keywords and on each word in the training data, determining expanded keywords from the words in the training data by means of a semi-supervised topic model, and establishing a second mapping relation between the categories and the expanded keywords. The first mapping relation and the second mapping relation are used to perform text classification on data to be predicted after the data to be predicted is received.
Fig. 5 is a schematic structural diagram of a processing apparatus according to an embodiment of the present application.
The processing apparatus 500 according to the embodiment of Fig. 5 includes a word segmentation module 510, a classification module 520, a first establishing module 530, and a second establishing module 540.
The first establishing module 530 is configured to establish the first mapping relation between the preset categories and the seed keywords.
The word segmentation module 510 is configured to perform word segmentation on the training data to obtain each word in the training data.
The second establishing module 540 is configured to determine, based on the first mapping relation between the preset categories and the seed keywords and on each word in the training data, expanded keywords from the words in the training data by means of a semi-supervised topic model, and to establish the second mapping relation between the categories and the expanded keywords.
The classification module 520 is configured to perform text classification on the data to be predicted according to the first mapping relation and the second mapping relation after the data to be predicted is received.
In addition, the first establishing module 530 is further configured to: specify the categories and the seed keywords that each category includes; and determine, by a mapping algorithm, the possibility characterization value with which each seed keyword belongs to each category.
According to the embodiment of the present application, the mapping algorithm used by the first establishing module 530 includes a TF-IDF mapping algorithm.
The second establishing module 540 is further configured to: determine, according to the degree of approximation between each expanded keyword and the seed keywords, the category to which the expanded keyword belongs and the possibility characterization value with which the expanded keyword belongs to each category.
It should be noted that all steps of the method provided by the embodiment of the present application may be executed by the same device, or the method may be executed by different devices. For example, steps S110 and S120 may be executed by device 1 and step S130 by device 2; or step S110 may be executed by device 1 and steps S120 and S130 by device 2; and so on.
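When different devices execute different steps, intermediate results have to be passed from one device to the other. The sketch below shows one possible handoff in which the two mapping relations are serialized as JSON on one device and restored on another. The function names, the JSON layout, and the choice of the mapping relations as the transferred data are assumptions made for illustration; the application does not prescribe a transfer format.

```python
import json


def export_mappings(first_mapping, second_mapping):
    """Device 1: serialize both mapping relations for transfer."""
    payload = {"first": first_mapping, "second": second_mapping}
    # ensure_ascii=False keeps non-ASCII keywords (e.g. Chinese words) readable.
    return json.dumps(payload, ensure_ascii=False)


def import_mappings(payload):
    """Device 2: restore the mapping relations before classifying."""
    data = json.loads(payload)
    return data["first"], data["second"]


# Hypothetical mapping relations produced on device 1
payload = export_mappings(
    {"refund": {"after_sales": 0.9}},
    {"return": {"after_sales": 0.6}},
)
# Device 2 receives the payload and rebuilds the dictionaries
first_mapping, second_mapping = import_mappings(payload)
```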
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction means. The instruction means implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in computer-readable media, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, commodity, or device that includes the element.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The above are merely embodiments of the present application and are not intended to limit the present application. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.

Claims (18)

1. A text classification method, characterized in that the text classification method comprises:
performing word segmentation on data to be predicted to obtain each word in the data to be predicted;
determining, according to a first mapping relation between preset categories and seed keywords and a second mapping relation between the categories and expanded keywords, the possibility characterization values with which the seed keywords and/or expanded keywords corresponding to each word in the data to be predicted respectively belong to each category; and
classifying the data to be predicted according to the possibility characterization values with which the seed keywords and/or expanded keywords corresponding to each word in the data to be predicted respectively belong to each category;
wherein the second mapping relation is established from the first mapping relation and training data by using a semi-supervised topic model.
2. The text classification method according to claim 1, characterized in that the first mapping relation between the preset categories and the seed keywords is established in the following manner:
specifying the categories and the seed keywords that each category includes; and
determining, by a mapping algorithm, the possibility characterization value with which each seed keyword belongs to each category.
3. The text classification method according to claim 2, characterized in that the mapping algorithm includes a TF-IDF mapping algorithm.
4. The text classification method according to claim 1, characterized in that establishing the second mapping relation comprises:
performing word segmentation on the training data to obtain each word in the training data;
generating expanded keywords based on the words in the training data; and
determining, according to the degree of approximation between each expanded keyword and the seed keywords, the category to which the expanded keyword belongs and the possibility characterization value with which the expanded keyword belongs to each category.
5. The text classification method according to claim 4, characterized in that classifying the data to be predicted comprises:
for each category, summing the possibility characterization values with which the seed keywords and/or expanded keywords corresponding to each word in the data to be predicted belong to that category;
determining the category to which the data to be predicted belongs according to the sum obtained for each category.
6. A text classification apparatus, characterized in that the text classification apparatus comprises:
a word segmentation module, configured to perform word segmentation on data to be predicted to obtain each word in the data to be predicted;
a determining module, configured to determine, according to a first mapping relation between preset categories and seed keywords and a second mapping relation between the categories and expanded keywords, the possibility characterization values with which the seed keywords and/or expanded keywords corresponding to each word in the data to be predicted respectively belong to each category; and
a classification module, configured to classify the data to be predicted according to the possibility characterization values with which the seed keywords and/or expanded keywords corresponding to each word in the data to be predicted respectively belong to each category;
wherein the text classification apparatus further comprises a second establishing module, configured to establish the second mapping relation from the first mapping relation and training data by using a semi-supervised topic model.
7. The text classification apparatus according to claim 6, characterized in that the text classification apparatus further comprises a first establishing module, configured to establish the first mapping relation in the following manner:
specifying the categories and the seed keywords that each category includes; and
determining, by a mapping algorithm, the possibility characterization value with which each seed keyword belongs to each category.
8. The text classification apparatus according to claim 7, characterized in that the mapping algorithm includes a TF-IDF mapping algorithm.
9. The text classification apparatus according to claim 6, characterized in that the second establishing module is further configured to:
perform word segmentation on the training data to obtain each word in the training data;
generate expanded keywords based on the words in the training data; and
determine, according to the degree of approximation between each expanded keyword and the seed keywords, the category to which the expanded keyword belongs and the possibility characterization value with which the expanded keyword belongs to each category.
10. The text classification apparatus according to claim 9, characterized in that the classification module is further configured to:
for each category, sum the possibility characterization values with which the seed keywords and/or expanded keywords corresponding to each word in the data to be predicted belong to that category;
determine the category to which the data to be predicted belongs according to the sum obtained for each category.
11. A processing method, characterized in that the processing method comprises:
establishing a first mapping relation between preset categories and seed keywords;
performing word segmentation on training data to obtain each word in the training data; and
determining, based on the first mapping relation between the preset categories and the seed keywords and on each word in the training data, expanded keywords from the words in the training data by means of a semi-supervised topic model, and establishing a second mapping relation between the categories and the expanded keywords;
wherein the first mapping relation and the second mapping relation are used to perform text classification on data to be predicted after the data to be predicted is received.
12. The processing method according to claim 11, characterized in that establishing the first mapping relation between the preset categories and the seed keywords comprises:
specifying the categories and the seed keywords that each category includes; and
determining, by a mapping algorithm, the possibility characterization value with which each seed keyword belongs to each category.
13. The processing method according to claim 12, characterized in that the mapping algorithm includes a TF-IDF mapping algorithm.
14. The processing method according to claim 11, characterized in that establishing the second mapping relation between the categories and the expanded keywords comprises:
determining, according to the degree of approximation between each expanded keyword and the seed keywords, the category to which the expanded keyword belongs and the possibility characterization value with which the expanded keyword belongs to each category.
15. A processing apparatus, characterized in that the processing apparatus comprises:
a first establishing module, configured to establish a first mapping relation between preset categories and seed keywords;
a word segmentation module, configured to perform word segmentation on training data to obtain each word in the training data; and
a second establishing module, configured to determine, based on the first mapping relation between the preset categories and the seed keywords and on each word in the training data, expanded keywords from the words in the training data by means of a semi-supervised topic model, and to establish a second mapping relation between the categories and the expanded keywords;
wherein the processing apparatus further comprises a classification module, configured to perform text classification on data to be predicted according to the first mapping relation and the second mapping relation after the data to be predicted is received.
16. The processing apparatus according to claim 15, characterized in that the first establishing module is further configured to:
specify the categories and the seed keywords that each category includes; and
determine, by a mapping algorithm, the possibility characterization value with which each seed keyword belongs to each category.
17. The processing apparatus according to claim 16, characterized in that the mapping algorithm includes a TF-IDF mapping algorithm.
18. The processing apparatus according to claim 15, characterized in that the second establishing module is further configured to:
determine, according to the degree of approximation between each expanded keyword and the seed keywords, the category to which the expanded keyword belongs and the possibility characterization value with which the expanded keyword belongs to each category.
CN201611128580.8A 2016-12-09 2016-12-09 A kind of file classification method and device and treating method and apparatus Pending CN106897262A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611128580.8A CN106897262A (en) 2016-12-09 2016-12-09 A kind of file classification method and device and treating method and apparatus

Publications (1)

Publication Number Publication Date
CN106897262A true CN106897262A (en) 2017-06-27

Family

ID=59197828

Country Status (1)

Country Link
CN (1) CN106897262A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107844558A (en) * 2017-10-31 2018-03-27 金蝶软件(中国)有限公司 The determination method and relevant apparatus of a kind of classification information
CN108009248A (en) * 2017-11-30 2018-05-08 国信优易数据有限公司 A kind of data classification method and system
CN108415959A (en) * 2018-02-06 2018-08-17 北京捷通华声科技股份有限公司 A kind of file classification method and device
CN108875072A (en) * 2018-07-05 2018-11-23 第四范式(北京)技术有限公司 File classification method, device, equipment and storage medium
CN109146395A (en) * 2018-06-29 2019-01-04 阿里巴巴集团控股有限公司 A kind of method, device and equipment of data processing
CN109460468A (en) * 2018-10-23 2019-03-12 出门问问信息科技有限公司 Classifying method, categorization arrangement and the corresponding electronic equipment of law related text
CN109766440A (en) * 2018-12-17 2019-05-17 航天信息股份有限公司 A kind of method and system for for the determining default categories information of object text description
CN111242237A (en) * 2020-01-20 2020-06-05 北京明略软件系统有限公司 Bert model training and classifying method, system, medium and computer equipment
CN111339290A (en) * 2018-11-30 2020-06-26 北京嘀嘀无限科技发展有限公司 Text classification method and system
CN111563165A (en) * 2020-05-11 2020-08-21 北京中科凡语科技有限公司 Statement classification method based on anchor word positioning and training statement augmentation
WO2021092871A1 (en) * 2019-11-13 2021-05-20 北京数字联盟网络科技有限公司 Application preference text classification method based on textrank
CN113761911A (en) * 2021-03-17 2021-12-07 中科天玑数据科技股份有限公司 Domain text labeling method based on weak supervision

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101770498A (en) * 2009-01-05 2010-07-07 李铭 Step searching method
CN104391942A (en) * 2014-11-25 2015-03-04 中国科学院自动化研究所 Short text characteristic expanding method based on semantic atlas

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107844558A (en) * 2017-10-31 2018-03-27 金蝶软件(中国)有限公司 The determination method and relevant apparatus of a kind of classification information
CN108009248A (en) * 2017-11-30 2018-05-08 国信优易数据有限公司 A kind of data classification method and system
CN108415959A (en) * 2018-02-06 2018-08-17 北京捷通华声科技股份有限公司 A kind of file classification method and device
CN108415959B (en) * 2018-02-06 2021-06-25 北京捷通华声科技股份有限公司 Text classification method and device
CN109146395A (en) * 2018-06-29 2019-01-04 阿里巴巴集团控股有限公司 A kind of method, device and equipment of data processing
CN109146395B (en) * 2018-06-29 2022-04-05 创新先进技术有限公司 Data processing method, device and equipment
CN108875072A (en) * 2018-07-05 2018-11-23 第四范式(北京)技术有限公司 File classification method, device, equipment and storage medium
CN109460468A (en) * 2018-10-23 2019-03-12 出门问问信息科技有限公司 Classifying method, categorization arrangement and the corresponding electronic equipment of law related text
CN111339290A (en) * 2018-11-30 2020-06-26 北京嘀嘀无限科技发展有限公司 Text classification method and system
CN109766440A (en) * 2018-12-17 2019-05-17 航天信息股份有限公司 A kind of method and system for for the determining default categories information of object text description
CN109766440B (en) * 2018-12-17 2023-09-01 航天信息股份有限公司 Method and system for determining default classification information for object text description
WO2021092871A1 (en) * 2019-11-13 2021-05-20 北京数字联盟网络科技有限公司 Application preference text classification method based on textrank
CN111242237A (en) * 2020-01-20 2020-06-05 北京明略软件系统有限公司 Bert model training and classifying method, system, medium and computer equipment
CN111563165A (en) * 2020-05-11 2020-08-21 北京中科凡语科技有限公司 Statement classification method based on anchor word positioning and training statement augmentation
CN111563165B (en) * 2020-05-11 2020-12-18 北京中科凡语科技有限公司 Statement classification method based on anchor word positioning and training statement augmentation
CN113761911A (en) * 2021-03-17 2021-12-07 中科天玑数据科技股份有限公司 Domain text labeling method based on weak supervision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201016

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201016

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20170627