CN107844559A

CN107844559A - A kind of file classifying method, device and electronic equipment

Info

Publication number: CN107844559A
Application number: CN201711051376.5A
Authority: CN
Inventors: 张斌德; 夏耘海; 王甲樑
Original assignee: Guoxin Youe Data Co Ltd
Current assignee: Guoxin Youe Data Co Ltd
Priority date: 2017-10-31
Filing date: 2017-10-31
Publication date: 2018-03-27

Abstract

Embodiments of the present invention provide a file classification method, device, and electronic device, belonging to the technical field of data processing. The method includes: performing word segmentation on a complaint text to be classified to obtain multiple words to be matched; matching the multiple words to be matched against dictionaries that each characterize a different complaint problem, to obtain a matching result; and determining, according to the matching result, the complaint category to which the complaint text to be classified belongs, so as to classify that text. The dictionaries characterizing different complaint problems are obtained by training on multiple historical complaint texts. Because the words to be matched are compared against multiple dictionaries obtained by training in advance, a more accurate matching result can be obtained and the complaint text to be classified can be classified accurately, which gives higher classification accuracy for complaint texts about different complaint problems and improves text classification performance.

Description

A file classification method, device, and electronic device

Technical field

The present invention relates to the technical field of data processing, and in particular to a file classification method, device, and electronic device.

Background art

With the development of computer technology, more and more enterprises, organizations, and government agencies rely on computers to handle all kinds of affairs, and in the process they continuously generate large numbers of electronic documents. In routine work or records management, these electronic documents often need to be divided into specific categories. However, now that data volumes are growing explosively, some enterprises may generate several terabytes of data in a single day, corresponding to thousands of electronic documents, and identifying and managing them manually is undoubtedly inefficient. Computer-implemented automatic classification has brought great convenience, but because text classification involves high-dimensional and highly sparse data, its performance still cannot meet practical needs, and there remains considerable room for improvement.

With the rapid development of e-government, the focus of government website construction has shifted. In the early stage of construction, websites mainly published news and information resources for the various government departments; they are now aimed at improving the government's supervisory function and service level. Starting from the actual work of the website, this means establishing a working system that standardizes government websites and raising their service awareness and capacity to handle affairs; strengthening cooperation between the website and government affairs, and expanding interaction between government websites and the public; and establishing an efficient complaint system to strengthen supervision. Large volumes of complaint and suggestion text are generated every day, so how to classify complaint texts quickly and accurately is a problem that urgently needs to be solved.

Summary of the invention

In view of this, the purpose of the embodiments of the present invention is to provide a file classification method, device, and electronic device that can effectively solve the prior-art problem of low accuracy in classifying complaint texts.

In a first aspect, an embodiment of the present invention provides a file classification method, the method including: performing word segmentation on a complaint text to be classified to obtain multiple words to be matched; matching the multiple words to be matched against dictionaries characterizing different complaint problems, respectively, to obtain a matching result; and determining, according to the matching result, the complaint category to which the complaint text to be classified belongs; wherein the dictionaries characterizing different complaint problems are obtained by training on multiple historical complaint texts.

In a second aspect, an embodiment of the present invention provides a file classification device, the device including: a word segmentation module, configured to perform word segmentation on a complaint text to be classified to obtain multiple words to be matched; a matching module, configured to match the multiple words to be matched against dictionaries characterizing different complaint problems, respectively, to obtain a matching result; and a classification module, configured to determine, according to the matching result, the complaint category to which the complaint text to be classified belongs; wherein the dictionaries characterizing different complaint problems are obtained by training on multiple historical complaint texts.

In a third aspect, an embodiment of the present invention provides an electronic device including a processor and a memory, the memory being coupled to the processor and storing instructions that, when executed by the processor, cause the electronic device to perform the following operations: performing word segmentation on a complaint text to be classified to obtain multiple words to be matched; matching the multiple words to be matched against dictionaries characterizing different complaint problems, respectively, to obtain a matching result; and determining, according to the matching result, the complaint category to which the complaint text to be classified belongs; wherein the dictionaries characterizing different complaint problems are obtained by training on multiple historical complaint texts.

In a fourth aspect, an embodiment of the present invention provides a readable storage medium stored in a computer, the readable storage medium including a plurality of instructions configured to cause the computer to perform the file classification method provided in the first aspect.

Embodiments of the present invention provide a file classification method, device, and electronic device. Word segmentation is first performed on the complaint text to be classified to obtain multiple words to be matched. The words to be matched are then matched against dictionaries characterizing different complaint problems, respectively, to obtain a matching result, where the dictionaries are obtained by training on multiple historical complaint texts. The complaint category to which the complaint text belongs is then determined according to the matching result, so as to classify the text. Because the words to be matched are compared against multiple dictionaries obtained by training in advance, a more accurate matching result can be obtained and the complaint text can be classified accurately, which gives higher classification accuracy for complaint texts about different complaint problems and improves text classification performance.

Other features and advantages of the present invention will be set forth in the following description, and will in part become apparent from the description or be understood by practicing the embodiments of the present invention. The objectives and other advantages of the present invention can be realized and obtained through the structures particularly pointed out in the written description, the claims, and the accompanying drawings.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly introduced below. It should be understood that the following drawings show only certain embodiments of the present invention and should therefore not be regarded as limiting its scope; those of ordinary skill in the art can derive other related drawings from these drawings without creative effort.

Fig. 1 is a structural block diagram of an electronic device applicable to the embodiments of the present invention;

Fig. 2 is a flowchart of a file classification method provided by the first embodiment of the present invention;

Fig. 3 is a structural block diagram of a file classification device provided by the second embodiment of the present invention;

Fig. 4 is a structural block diagram of a matching module provided by the second embodiment of the present invention;

Fig. 5 is a structural schematic diagram of another electronic device provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The components of the embodiments generally described and illustrated in the drawings herein can be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further defined or explained in subsequent drawings. In addition, in the description of the present invention, the terms "first", "second", and the like are used only to distinguish descriptions and are not to be understood as indicating or implying relative importance.

Fig. 1 shows a structural block diagram of an electronic device 100 applicable to the embodiments of the present invention. As shown in Fig. 1, the electronic device 100 includes a memory 101, a memory controller 102, one or more processors 103 (only one is shown in the figure), a peripheral interface 104, a radio-frequency module 105, an audio module 106, a touch screen 107, and so on. These components communicate with one another over one or more communication buses/signal lines 108.

The memory 101 can store software programs and modules, such as the program instructions/modules corresponding to the file classification method in the embodiments of the present invention. The processor 103 executes various functional applications and data processing, such as the file classification method provided by the embodiments of the present invention, by running the software programs and modules stored in the memory 101.

The memory 101 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. Access to the memory 101 by the processor 103 and other possible components may be performed under the control of the memory controller 102.

The peripheral interface 104 couples various input/output devices to the processor 103 and the memory 101. In some embodiments, the peripheral interface 104, the processor 103, and the memory controller 102 may be implemented in a single chip. In other embodiments, they may each be implemented on a separate chip.

The radio-frequency module 105 receives and transmits electromagnetic waves and converts between electromagnetic waves and electrical signals, so as to communicate with a communication network or other devices.

The audio module 106 provides an audio interface to the user and may include one or more microphones, one or more speakers, and audio circuitry.

The touch screen 107 provides both an output interface and an input interface between the electronic device 100 and the user. Specifically, the touch screen 107 displays video output to the user, and the content of this video output may include text, graphics, video, and any combination thereof.

It can be understood that the structure shown in Fig. 1 is only illustrative; the electronic device 100 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1. Each component shown in Fig. 1 can be implemented in hardware, software, or a combination thereof.

First embodiment

Referring to Fig. 2, Fig. 2 is a flowchart of a file classification method provided by the first embodiment of the present invention. The method is applied to a file classification device that runs on the electronic device described above, and includes:

Step S110: Perform word segmentation on the complaint text to be classified to obtain multiple words to be matched.

For an electronic document, "keywords" can be used to represent the features involved in analyzing and understanding the document, for example "taxi", "carpooling", and "taximeter". Of course, different entities, such as banks, government agencies, and ordinary enterprises, may rely on different keywords when determining the category of an electronic document; when classifying the electronic documents of a particular enterprise, these keywords can be predefined on the basis of experience.

When, for example, a government agency needs to classify the electronic documents of multiple user complaints it has received, the electronic documents are first preprocessed, that is, word segmentation is performed on the obtained complaint text to be classified (the complaint text to be classified is the electronic document mentioned above). Word segmentation first requires identifying the smallest semantic units in the text. As one implementation, the Chinese word segmentation that ships with the Lucene search engine can be used. Lucene has its own analyzers for Chinese, chiefly StandardAnalyzer and CJKAnalyzer: StandardAnalyzer segments character by character, while CJKAnalyzer segments into two-character units (bigrams).

The Chinese word segmentation of the Lucene search engine is based on the most common approach, string matching, on top of which there is a forward word-by-word maximum matching algorithm. The idea is to prepare a segmentation dictionary and then scan the input sentence from left to right, matching the character strings in the sentence one by one against the entries in the segmentation dictionary. The matching field starts from a single character and is extended one character at a time until no further match is possible; at the end of each round, the longest field that matched successfully is taken as the result. For example, suppose a sentence scanned in the complaint text to be classified reads "today the weather is gloomy", and the segmentation dictionary contains entries such as "today", "weather", "day", and "gloomy". Scanning starts from the first character and proceeds through the sentence, taking successively longer strings that begin with that character and matching each one against the segmentation dictionary. The longest matching string in the dictionary is "today", so that word is split off. Scanning then resumes from the next character and the operation is repeated, giving the result "today / weather / gloomy /".

Each word is then tagged with its part of speech; nouns, verbs, numerals, adjectives, prepositions, auxiliary words, conjunctions, and punctuation are marked n, v, m, a, p, u, c, and wp respectively, so that "today", for example, is tagged as a noun. The words (today, weather, gloomy) are taken as the initial word set. To improve the accuracy of subsequent matching, common words of little significance, known as stop words (for example "is"), must also be deleted from the initial word set. After stop-word removal the remaining words are: today, weather, gloomy. These words serve as the words to be matched obtained by segmenting one sentence of the complaint text to be classified, and in the same way the multiple words to be matched for the entire complaint text to be classified can be obtained.
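As a non-limiting illustration, forward maximum matching followed by stop-word removal can be sketched in Python as follows. This is a minimal sketch and not Lucene's implementation: it uses the common variant that tries the longest candidate first and shrinks it, which yields the same longest match as growing the field character by character. The function names, the `max_len` parameter, the sample sentence, the segmentation dictionary, and the stop-word list are all assumptions made for the example.

```python
def forward_max_match(sentence, lexicon, max_len=4):
    """Split `sentence` left to right, always taking the longest lexicon entry."""
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until the lexicon has it.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # A single character is accepted as a fallback so the scan always advances.
            if candidate in lexicon or size == 1:
                words.append(candidate)
                i += size
                break
    return words


def remove_stop_words(words, stop_words):
    """Drop common words that carry little meaning for matching."""
    return [w for w in words if w not in stop_words]


# Hypothetical segmentation dictionary and stop-word list.
LEXICON = {"今天", "天气", "天", "阴沉沉"}
STOP_WORDS = {"的", "是"}

tokens = forward_max_match("今天的天气阴沉沉", LEXICON)
words_to_match = remove_stop_words(tokens, STOP_WORDS)
print(tokens)          # ['今天', '的', '天气', '阴沉沉']
print(words_to_match)  # ['今天', '天气', '阴沉沉']
```

Note that "天" is in the dictionary but is never emitted on its own here, because the longer entries "今天" and "天气" win at the positions where it could match.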

Step S120: Match the multiple words to be matched against the dictionaries characterizing different complaint problems, respectively, to obtain a matching result.

The dictionaries characterizing different complaint problems are obtained by training on multiple historical complaint texts. Each dictionary corresponds to a different complaint problem category, and the weight of every word in each dictionary lies within a preset weight range.

For example, when government complaint documents are classified, they can be classified by complaint problem, such as taxi problems, communications and media problems, bus problems, parking problems, and so on. Multiple keywords are then predefined for each complaint problem. For instance, the keywords for taxi problems may be words such as taxi, carpooling, taximeter, hailing a taxi, and price hike, and the keywords for communications and media problems may be words such as broadband, telephone, Unicom, China Unicom, and dialing. These keywords are the words of the semantic sets determined below.

The multiple historical complaint texts that have been obtained are then used for training. First, each historical complaint text is segmented as described above to determine the dictionaries, each formed from the words that characterize one complaint problem. For each dictionary, the words are divided into multiple semantic sets according to the semantics of each word and how strongly it is associated with the complaint problem the dictionary characterizes, and a weight range is assigned to each semantic set. Each semantic set may contain multiple keywords of the corresponding complaint problem category. A weight is then determined for each word from the weight range of the semantic set to which it belongs; the more strongly a semantic set is associated with the complaint problem, the higher the weights in its assigned range.

For example, in the taxi problem category, semantic set 1 is (taxi, carpooling, taximeter) and semantic set 2 is (hailing a taxi, price hike). The words in semantic set 1 are the most strongly associated with taxi problems, so that set can be assigned a weight range of 0.9-0.98, while semantic set 2 is assigned a weight range of 0.8-0.89. If the weight calculated for a word in semantic set 1 is not within 0.9-0.98, the calculated weight is probably inaccurate and may eventually lead to misclassification, so the weights in that semantic set can be reassigned. For example, if the calculated weight of "taxi" is 0.85, which is outside 0.9-0.98, the word "taxi" is assigned a new weight that falls within 0.9-0.98. One way to do this is to pick a weight at random from the preset weight range, that is, from 0.9-0.98; if 0.95 is picked, the new weight of "taxi" is set to 0.95.
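As a non-limiting illustration, the random reassignment of an out-of-range weight can be sketched as follows. The function name and the use of a uniform draw are assumptions; the text only says that a weight is picked at random from the preset range.

```python
import random


def reassign_if_out_of_range(weight, low, high, rng=random):
    """Keep `weight` if it lies in [low, high]; otherwise draw a new one from that range."""
    if low <= weight <= high:
        return weight
    return rng.uniform(low, high)


rng = random.Random(0)  # seeded only so that the sketch is repeatable
new_weight = reassign_if_out_of_range(0.85, 0.90, 0.98, rng)   # out of range, redrawn
kept_weight = reassign_if_out_of_range(0.93, 0.90, 0.98, rng)  # in range, unchanged
```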

Alternatively, a weight range can be assigned in advance to each individual word of a determined semantic set. For example, in the taxi problem category, the word "taxi" can be regarded as a word that occurs with high frequency in this kind of problem, so it can be assigned a relatively high weight range, such as 0.9-0.98, while the word "carpooling" is assigned the range 0.87-0.89. If the calculated weight of "taxi" is then 0.85, the weight is outside the preset weight range; the calculated weight is probably inaccurate and may eventually lead to misclassification. The weight of "taxi" can therefore be reassigned so that the new weight falls within the preset weight range, that is, 0.9-0.98. One way to do this is to pick a weight at random from the preset weight range; if 0.95 is picked, the new weight of "taxi" is set to 0.95.

In addition, as one implementation, a calculation rule can be set. For example, if the calculated weight of "taxi" is not within the preset weight range, a preset value is added to the current weight of "taxi" to give a new weight, so that the new weight falls within the preset weight range. The preset value can be set fairly small, such as 0.1 or 0.05. If the new weight obtained after adding the preset value is still not within the preset weight range, the preset value can be added again to the new weight, and so on until the resulting weight is within the preset weight range.
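As a non-limiting illustration, the additive rule can be sketched as follows. The function name is hypothetical, and the cap at the upper bound is an added assumption: the text does not say what happens if a step would overshoot the range.

```python
def raise_into_range(weight, low, high, step=0.05):
    """Add `step` repeatedly until the weight reaches `low`, then cap it at `high`."""
    if step <= 0:
        raise ValueError("step must be positive")
    while weight < low:
        weight += step
    # Assumption: a step wider than the range could overshoot, so cap at `high`.
    return min(weight, high)


adjusted = raise_into_range(0.85, 0.90, 0.98, step=0.05)
far_below = raise_into_range(0.50, 0.90, 0.98, step=0.1)
```

Because the additions are done in floating point, the sketch checks against the range rather than expecting an exact value such as 0.9.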

Of course, as an alternative implementation, the dictionaries formed from the words characterizing different complaint problems can first be determined from the historical complaint texts, with no weights yet assigned to the words in the dictionaries. A corresponding weight range is assigned to each semantic set, and a weight is then determined for each word from the weight range of the semantic set to which it belongs. For example, if semantic set 1 (taxi, carpooling, taximeter) is assigned the weight range 0.9-0.98, each word in the set is randomly assigned a weight within 0.9-0.98: "taxi" might be assigned 0.97, "carpooling" 0.95, and "taximeter" 0.9.

In this way a new weight is obtained for every word in each semantic set. Multiple dictionaries are then established according to the categories, such as the taxi problems and communications and media problems described above; that is, one dictionary is established for each category, containing multiple words and their corresponding new weights.
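As a non-limiting illustration, one possible in-memory layout for the per-category dictionaries is a mapping from category to a word-to-weight mapping, with each weight drawn from the range of the word's semantic set. The data layout, the names, and the English keywords below are assumptions made for the example.

```python
import random

# Hypothetical layout: each semantic set is (words, (low, high)).
SEMANTIC_SETS = {
    "taxi problems": [
        (["taxi", "carpooling", "taximeter"], (0.90, 0.98)),
        (["hailing a taxi", "price hike"], (0.80, 0.89)),
    ],
}


def build_dictionaries(semantic_sets, rng):
    """Return {category: {word: weight}}, drawing each weight from its set's range."""
    dictionaries = {}
    for category, sets in semantic_sets.items():
        weights = {}
        for words, (low, high) in sets:
            for word in words:
                weights[word] = rng.uniform(low, high)
        dictionaries[category] = weights
    return dictionaries


dictionaries = build_dictionaries(SEMANTIC_SETS, random.Random(42))
```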

In this embodiment, the TF-IDF algorithm can be used to obtain the TF-IDF value of each word to be matched in the complaint text to be classified, and the TF-IDF value of a word to be matched is used as the weight of that word.

TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. It is a statistical method for assessing how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to how frequently it appears across the corpus.

The main idea of TF-IDF is that if a word or phrase appears with a high frequency TF in one article and rarely appears in other articles, it is considered to have good power to discriminate between categories and to be suitable for classification. TF-IDF is in fact TF*IDF, where TF is the term frequency and IDF is the inverse document frequency. TF represents the frequency with which a term appears in document d. The main idea of IDF is that the fewer the documents containing term t, the larger the IDF, and the better term t discriminates between categories. Suppose the number of documents in a class C that contain term t is m, and the number of documents in the other classes that contain t is k; the total number of documents containing t is then n = m + k. When m is large, n is also large, so the IDF value given by the IDF formula is small, which suggests that term t discriminates poorly between categories. In practice, however, if a term appears frequently in the documents of one class, that term represents the features of that class's text well; such terms should be given higher weights and selected as feature words of that class to distinguish it from documents of other classes.

Specifically, to obtain the TF-IDF value of each word, the term frequency TF of each word within the complaint text to be classified is calculated first: TF = (number of occurrences of the word in the complaint text to be classified) / (total number of words in that text). The formula is $tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$, where $n_{i,j}$ is the number of times the word appears in the text and the denominator is the sum of the occurrence counts of all words in the text. For example, if the word "taxi" appears 300 times in the complaint text to be classified and the text contains 1200 words in total, the term frequency of "taxi" is TF = 300/1200 = 0.25. The inverse document frequency IDF of each word is then obtained: IDF = log(total number of documents in the corpus / (number of documents containing the word + 1)). The formula is $idf_{i} = \log\frac{|D|}{1 + |\{j : t_i \in d_j\}|}$, where $|D|$ is the total number of documents in the corpus and $|\{j : t_i \in d_j\}|$ is the number of documents containing the word. Finally, the TF-IDF value of each word is obtained from its term frequency and inverse document frequency: TF-IDF = TF * IDF.
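As a non-limiting illustration, the TF and IDF definitions above can be sketched as follows. The function names and the toy corpus are assumptions, and the natural logarithm is assumed because the text does not state the base of the logarithm.

```python
import math
from collections import Counter


def term_frequency(term, tokens):
    """Occurrences of `term` divided by the total number of words in the text."""
    return Counter(tokens)[term] / len(tokens)


def inverse_document_frequency(term, corpus):
    """log(total documents / (documents containing `term` + 1))."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (containing + 1))


def tf_idf(term, tokens, corpus):
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)


# Toy data: "taxi" appears 300 times in a 1200-word complaint text.
complaint = ["taxi"] * 300 + ["other"] * 900
corpus = [complaint, ["bus", "late"], ["parking", "fee"], ["broadband", "slow"]]

tf = term_frequency("taxi", complaint)             # 300 / 1200
idf = inverse_document_frequency("taxi", corpus)   # log(4 / (1 + 1))
weight = tf_idf("taxi", complaint, corpus)
```

With the "+1" in the denominator, a word that occurs in every document of a small corpus gets a negative IDF, so a practical implementation may want to guard against that case.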

In this way the TF-IDF value of each word to be matched in the complaint text to be classified can be obtained. Likewise, for a historical complaint text, the TF-IDF value of each word in each dictionary can be obtained by the same method. The words in the historical complaint text are sorted in descending order of TF-IDF value and, as one approach, the top 100 words of each historical complaint text can be taken as a semantic set to form the dictionary.

Refer to Table 1 and Table 2, which relate to the multiple dictionaries that a government agency established for different complaint problems: Table 1 shows the multiple historical complaint texts obtained, and Table 2 shows the multiple dictionaries established for the different complaint problem categories.
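As a non-limiting illustration, selecting the highest-ranked words by TF-IDF value can be sketched as follows. The function name and the sample scores are assumptions made for the example.

```python
def top_k_words(scores, k=100):
    """Return the k words with the highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]


# Hypothetical TF-IDF values for the words of one historical complaint text.
scores = {"taxi": 0.17, "taximeter": 0.12, "weather": 0.01, "today": 0.02}
top_two = top_k_words(scores, k=2)
```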

Table 1

Table 2

The multiple words to be matched obtained above are then matched against each of the dictionaries established above; that is, the words to be matched obtained by segmenting the complaint text to be classified are matched against the words in the multiple dictionaries. Specifically, the weight of each word to be matched within the complaint text to be classified is obtained first, and these weights form the first word-frequency vector. For example, the sentence "this leather boot's size is large, that one's size fits" is segmented into "this / leather boots / size / large, that / size / fits", and the word frequency, that is, the weight, of each word is calculated. The weights are: this 1, leather boots 1, size 2, large 1, that 1, fits 1, not 0, small 0, more 0.
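As a non-limiting illustration, building the first word-frequency vector by counting words over a fixed vocabulary can be sketched as follows. The vocabulary order, the function name, and the English tokens (illustrative stand-ins for the segmented words of the example sentence) are assumptions.

```python
from collections import Counter

# Assumed vocabulary: every word that occurs in the text or in the compared dictionary.
VOCABULARY = ["this", "leather boots", "size", "large", "that", "fits", "not", "small", "more"]


def frequency_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in `tokens`."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


tokens = ["this", "leather boots", "size", "large", "that", "size", "fits"]
first_vector = frequency_vector(tokens, VOCABULARY)
print(first_vector)  # [1, 1, 2, 1, 1, 1, 0, 0, 0]
```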

Then, for each dictionary, the weights assigned to the words in that dictionary are obtained to give the second word-frequency vector corresponding to the dictionary. The multiple second word-frequency vectors are thus the word-frequency vectors of the different complaint problem categories, one per category, as shown in Table 2 above. Following a preset similarity matching algorithm, the first word-frequency vector is matched for similarity against the second word-frequency vector of each dictionary in turn, until a matching second word-frequency vector is determined, at which point matching stops and the matching result is obtained.

Similarity can be calculated with methods such as KNN (k-nearest neighbor), naive Bayes, support vector machines, neural networks, decision trees, and the included-angle cosine. In this embodiment the preset similarity matching algorithm is the included-angle cosine algorithm, which is used as the example below.

The included-angle cosine between the first word-frequency vector and any second word-frequency vector is determined as follows to complete the similarity matching:

If the first word-frequency vector is written $A = [A_1, A_2, \ldots, A_n]$ and the second word-frequency vector is written $B = [B_1, B_2, \ldots, B_n]$, the included-angle cosine formula is $\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$. Specifically, take the weights calculated above for the sentence "this leather boot's size is large, that one's size fits" as the first word-frequency vector A = [1, 1, 2, 1, 1, 1, 0, 0, 0]. Suppose the words under a first complaint category are "this / leather boots / size / not / small, that / more / fits", with the following weights in the dictionary: this 1, leather boots 1, size 1, large 0, that 1, fits 1, not 1, small 1, more 1. Suppose the words under a second complaint category are "this / BMW / very / stylish", with the following weights in the dictionary: this 1, BMW 2, very 0, stylish 1. The corresponding second word-frequency vectors are then B1 = [1, 1, 1, 0, 1, 1, 1, 1, 1] and B2 = [1, 2, 0, 1], and the included-angle cosine of A with B1 and of A with B2 is each computed from the formula above. The matching result is the set of values obtained from the included-angle cosine formula.
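As a non-limiting illustration, the included-angle cosine can be sketched as follows. Because B2 as written has a different length from A, the sketch stores each vector as a word-to-weight mapping so that the two are aligned on their shared words; that alignment, the function name, and the English tokens are assumptions and are not stated in the text. Under this assumption the two similarities work out to roughly 0.71 and 0.14.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as {word: weight}."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm_a * norm_b)


# Zero-weight words are simply left out of the mappings.
A = {"this": 1, "leather boots": 1, "size": 2, "large": 1, "that": 1, "fits": 1}
B1 = {"this": 1, "leather boots": 1, "size": 1, "that": 1, "fits": 1,
      "not": 1, "small": 1, "more": 1}
B2 = {"this": 1, "BMW": 2, "stylish": 1}

sim_1 = cosine(A, B1)  # 6 / (3 * sqrt(8))
sim_2 = cosine(A, B2)  # 1 / (3 * sqrt(6))
```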

Step S130: Determine, according to the matching result, the complaint category to which the complaint text to be classified belongs.

After the matching results of the multiple words to be matched against each dictionary are obtained, for example the similarities between the first word frequency vector and the second word frequency vectors obtained in step S120, the obtained similarities are compared. For example, if the similarity between the first word frequency vector A and the second word frequency vector B1 is higher than that between A and the second word frequency vector B2, the complaint text to be classified is assigned to the above first complaint category, which completes the classification of the complaint text to be classified.

Alternatively, a threshold may be set: if the similarity obtained with a certain word frequency vector reaches the set threshold, the complaint text to be classified is determined to belong to the category corresponding to that word frequency vector.
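The two decision rules of step S130 can be sketched as follows. This is a minimal Python illustration, not the patented implementation: the function names, the category labels, and the similarity values are hypothetical.

```python
def classify_by_max(similarities):
    """Return the category whose similarity to the text is highest."""
    if not similarities:
        raise ValueError("no similarities to compare")
    return max(similarities, key=similarities.get)

def classify_by_threshold(similarities, threshold):
    """Return the first category whose similarity reaches the threshold, else None."""
    for category, score in similarities.items():
        if score >= threshold:
            return category
    return None

# Illustrative similarity values (assumed, not taken from the source).
scores = {"first complaint category": 0.71, "second complaint category": 0.20}

print(classify_by_max(scores))
print(classify_by_threshold(scores, 0.5))
```

The comparison rule always yields a category, while the threshold rule may yield none when no dictionary is similar enough; which behavior is preferable depends on whether unclassified texts are acceptable.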

For example, the results of classifying complaint texts to be classified by the above method are shown in Table 3 below.

Table 3

It can be seen that establishing dictionaries by the above method and then classifying the complaint texts to be classified achieves higher classification accuracy.

The first embodiment of the present invention provides a text classification method. First, word segmentation is performed on the complaint text to be classified to obtain multiple words to be matched. The multiple words to be matched are then matched against each of the dictionaries characterizing different complaint problems to obtain a matching result, where the dictionaries characterizing different complaint problems are obtained by training on multiple historical complaint texts. The complaint category to which the complaint text to be classified belongs is then determined according to the matching result, so as to classify the complaint text. Because the method matches the words to be matched against multiple dictionaries obtained by training in advance, a more accurate matching result can be obtained and the complaint text to be classified can be classified accurately, which achieves higher classification accuracy for complaint texts concerning different complaint problems and improves the performance of text classification.

Second embodiment

Referring to Fig. 3, which is a structural block diagram of a text classification device 200 provided by the second embodiment of the present invention, the device is used to perform the text classification method provided by the first embodiment and includes:

A word segmentation processing module 210, configured to perform word segmentation on the complaint text to be classified to obtain multiple words to be matched.

A matching module 220, configured to match the multiple words to be matched against each of the dictionaries characterizing different complaint problems to obtain a matching result.

A classification module 230, configured to determine, according to the matching result, the complaint category to which the complaint text to be classified belongs.

The dictionaries characterizing different complaint problems are obtained by training on multiple historical complaint texts.

The device further includes:

A dictionary acquisition module, configured to perform word segmentation on each of the multiple historical complaint texts and determine the dictionaries formed by the words characterizing different complaint problems.

A weight distribution module, configured to, for each dictionary, divide the words into semantic sets according to the degree of association between the semantics of each word in the dictionary and the complaint problem characterized by the dictionary, and to assign a corresponding weight range to each semantic set.

A weight determination module, configured to determine the weight of each word within the weight range corresponding to the semantic set to which the word belongs.

The higher the degree of association between a semantic set and the complaint problem, the larger the weights in the weight range assigned to that semantic set.
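The weight assignment performed by the weight distribution and weight determination modules can be sketched as follows. This is a hedged Python illustration: the semantic set names, the numeric ranges, the example words, and the rule of taking the midpoint of a range are all assumptions made for the example. The text only requires that each weight lie within the range of the word's semantic set and that more closely associated sets receive larger weights.

```python
# Hypothetical semantic sets for one dictionary, each mapped to a (low, high) weight range.
# Sets more closely associated with the complaint problem get ranges with larger values.
WEIGHT_RANGES = {
    "high": (0.7, 1.0),
    "medium": (0.4, 0.7),
    "low": (0.0, 0.4),
}

def assign_weights(word_to_set, ranges=WEIGHT_RANGES):
    """Assign each word a weight inside the range of its semantic set.

    Assumption: the midpoint of the range is used as the weight.
    """
    weights = {}
    for word, set_name in word_to_set.items():
        low, high = ranges[set_name]
        weights[word] = (low + high) / 2
    return weights

# Hypothetical words and set memberships.
membership = {"refund": "high", "delay": "medium", "this": "low"}
example = assign_weights(membership)
print(example)
```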

Referring to Fig. 4, the matching module 220 includes:

A first word frequency vector acquiring unit 221, configured to obtain the weight, in the complaint text to be classified, of each of the multiple words to be matched, and to take the weights of the words to be matched as the first word frequency vector.

The first word frequency vector acquiring unit 221 is further configured to obtain the TF-IDF value of each word to be matched in the complaint text to be classified using the TF-IDF algorithm, to take the TF-IDF value of a word to be matched as the weight of that word, and to take the weights of the words to be matched as the first word frequency vector.

A second word frequency vector acquiring unit 222, configured to, for each dictionary, obtain the weight assigned to each word in the dictionary so as to obtain the second word frequency vector corresponding to the dictionary.

A matching unit 223, configured to perform, according to a preset similarity matching algorithm, similarity matching between the first word frequency vector and the second word frequency vector corresponding to each dictionary in turn, to stop matching once a matching second word frequency vector is determined, and to obtain the matching result.
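The TF-IDF weighting used by the first word frequency vector acquiring unit 221 can be sketched in Python as follows. The source names the TF-IDF algorithm but not a specific variant, so the smoothed inverse document frequency used here, the function name `tf_idf`, and the sample tokens are assumptions for illustration only.

```python
import math
from collections import Counter

def tf_idf(tokens, corpus):
    """TF-IDF weight of each distinct word of one segmented text.

    tokens: list of words of the text to be classified.
    corpus: list of token lists (for example, historical complaint texts).
    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    """
    counts = Counter(tokens)
    total = len(tokens)
    n_docs = len(corpus)
    weights = {}
    for word, count in counts.items():
        tf = count / total                                # term frequency in this text
        df = sum(1 for doc in corpus if word in doc)      # number of texts containing the word
        idf = math.log((1 + n_docs) / (1 + df)) + 1       # smoothed inverse document frequency
        weights[word] = tf * idf
    return weights

# Hypothetical segmented texts.
corpus = [["this", "boots", "size", "small"], ["this", "car", "stylish"]]
tokens = ["this", "boots", "size", "big", "that", "size", "fits"]
w = tf_idf(tokens, corpus)
print(sorted(w, key=w.get, reverse=True))
```

Words that occur often in the text but rarely in the corpus receive the largest weights, which is why "size" outranks "this" in this example.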

The preset similarity matching algorithm is the included angle cosine algorithm, and the matching unit 223 further includes an included angle cosine algorithm unit, configured to determine the cosine of the angle between the first word frequency vector and any second word frequency vector in the following way to complete the similarity matching:

Express the first word frequency vector as $A = [A_1, A_2, \ldots, A_n]$ and the second word frequency vector as $B = [B_1, B_2, \ldots, B_n]$, and perform similarity matching based on the included angle cosine formula

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

to obtain the matching result.
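The behavior of the matching unit 223, which compares the first word frequency vector with each dictionary's second word frequency vector in turn and stops at the first match, can be sketched as follows. This is a minimal Python illustration under stated assumptions: a "match" is taken to mean that the cosine similarity reaches a threshold, all vectors are assumed to be aligned to the same vocabulary and length, and the names `cosine`, `match_sequentially`, and the vector `C` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def match_sequentially(first_vector, dictionary_vectors, threshold):
    """Compare against each dictionary's vector in order and stop at the first match.

    dictionary_vectors: list of (category, second_vector) pairs.
    Returns (category, similarity), or (None, None) if nothing reaches the threshold.
    """
    for category, second_vector in dictionary_vectors:
        similarity = cosine(first_vector, second_vector)
        if similarity >= threshold:  # assumption: reaching the threshold counts as a match
            return category, similarity
    return None, None

A = [1, 1, 2, 1, 1, 1, 0, 0, 0]    # first word frequency vector from the description
B1 = [1, 1, 1, 0, 1, 1, 1, 1, 1]   # second word frequency vector from the description
C = [0, 0, 0, 0, 0, 0, 1, 1, 1]    # hypothetical vector sharing no words with A

result = match_sequentially(A, [("other category", C), ("first complaint category", B1)], 0.5)
print(result[0])
```

Stopping early avoids computing similarities for the remaining dictionaries, at the cost of possibly missing a later dictionary with an even higher similarity.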

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working process of the device described above may refer to the corresponding process in the foregoing method and is not repeated here.

In summary, the embodiments of the present invention provide a text classification method, device, and electronic device. First, word segmentation is performed on the complaint text to be classified to obtain multiple words to be matched. The multiple words to be matched are then matched against each of the dictionaries characterizing different complaint problems to obtain a matching result, where the dictionaries characterizing different complaint problems are obtained by training on multiple historical complaint texts. The complaint category to which the complaint text to be classified belongs is then determined according to the matching result, so as to classify the complaint text. Because the method matches the words to be matched against multiple dictionaries obtained by training in advance, a more accurate matching result can be obtained and the complaint text to be classified can be classified accurately, which achieves higher classification accuracy for complaint texts concerning different complaint problems and improves the performance of text classification.

Corresponding to the text classification method in Fig. 2, the embodiments of the present application further provide an electronic device. As shown in Fig. 5, the device includes a memory 1000, a processor 2000, and a computer program that is stored on the memory 1000 and can run on the processor 2000, where the processor 2000 implements the steps of the above text classification method when executing the computer program.

Specifically, the memory 1000 and the processor 2000 may be a general-purpose memory and processor, which are not specifically limited here. When the processor 2000 runs the computer program stored in the memory 1000, it can perform the above text classification method, so that the complaint text to be classified can be classified accurately, achieving higher classification accuracy for complaint texts concerning different complaint problems and improving the performance of text classification.

Corresponding to the text classification method in Fig. 1, the embodiments of the present application further provide a computer-readable storage medium on which a computer program is stored, and the computer program performs the steps of the above text classification method when run by a processor.

Specifically, the storage medium may be a general-purpose storage medium, such as a removable disk or a hard disk. When the computer program on the storage medium is run, it can perform the above text classification method, so that the complaint text to be classified can be classified accurately, achieving higher classification accuracy for complaint texts concerning different complaint problems and improving the performance of text classification.

In the several embodiments provided in this application, it should be understood that the disclosed device and method may also be implemented in other ways. The device embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the drawings show the architecture, functions, and operations that may be implemented by devices, methods, and computer program products according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and each combination of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

In addition, the functional modules in the embodiments of the present invention may be integrated to form an independent part, may each exist separately, or two or more modules may be integrated to form an independent part.

If the functions are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The above are only preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention. It should be noted that similar reference numerals and letters denote similar items in the accompanying drawings; therefore, once an item is defined in one drawing, it need not be further defined or explained in subsequent drawings.

The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.

Claims

1. A text classification method, characterized in that the method comprises:

performing word segmentation on a complaint text to be classified to obtain multiple words to be matched;

matching the multiple words to be matched against each of the dictionaries characterizing different complaint problems to obtain a matching result;

determining, according to the matching result, the complaint category to which the complaint text to be classified belongs;

2. The method according to claim 1, characterized by further comprising:

performing word segmentation on each of the multiple historical complaint texts, and determining the dictionaries formed by the words characterizing different complaint problems;

for each dictionary, dividing the words into semantic sets according to the degree of association between the semantics of each word in the dictionary and the complaint problem characterized by the dictionary, and assigning a corresponding weight range to each semantic set; and

determining the weight of each word within the weight range corresponding to the semantic set to which the word belongs;

3. The method according to claim 2, characterized in that matching the multiple words to be matched against each of the dictionaries characterizing different complaint problems to obtain a matching result specifically comprises:

obtaining the weight, in the complaint text to be classified, of each of the multiple words to be matched, and taking the weights of the words to be matched as a first word frequency vector;

for each dictionary, obtaining the weight assigned to each word in the dictionary to obtain a second word frequency vector corresponding to the dictionary;

performing, according to a preset similarity matching algorithm, similarity matching between the first word frequency vector and the second word frequency vector corresponding to each dictionary in turn, stopping the matching once a matching second word frequency vector is determined, and obtaining the matching result.

4. The method according to claim 3, characterized in that obtaining the weight, in the complaint text to be classified, of each of the multiple words to be matched specifically comprises:

obtaining the TF-IDF value of each word to be matched in the complaint text to be classified using the TF-IDF algorithm, and taking the TF-IDF value of a word to be matched as the weight of that word.

5. The method according to any one of claims 3-4, characterized in that the preset similarity matching algorithm is the included angle cosine algorithm;

the cosine of the angle between the first word frequency vector and any second word frequency vector is determined in the following way to complete the similarity matching:

expressing the first word frequency vector as $A = [A_1, A_2, \ldots, A_n]$ and the second word frequency vector as $B = [B_1, B_2, \ldots, B_n]$, and performing similarity matching based on the included angle cosine formula

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

to obtain the matching result.

6. A text classification device, characterized in that the device comprises:

a word segmentation processing module, configured to perform word segmentation on a complaint text to be classified to obtain multiple words to be matched;

a matching module, configured to match the multiple words to be matched against each of the dictionaries characterizing different complaint problems to obtain a matching result;

a classification module, configured to determine, according to the matching result, the complaint category to which the complaint text to be classified belongs;

7. The device according to claim 6, characterized in that the device further comprises:

a dictionary acquisition module, configured to perform word segmentation on each of the multiple historical complaint texts and determine the dictionaries formed by the words characterizing different complaint problems;

a weight distribution module, configured to, for each dictionary, divide the words into semantic sets according to the degree of association between the semantics of each word in the dictionary and the complaint problem characterized by the dictionary, and to assign a corresponding weight range to each semantic set; and

a weight determination module, configured to determine the weight of each word within the weight range corresponding to the semantic set to which the word belongs;

8. The device according to claim 7, characterized in that the matching module comprises:

a first word frequency vector acquiring unit, configured to obtain the weight, in the complaint text to be classified, of each of the multiple words to be matched, and to take the weights of the words to be matched as a first word frequency vector;

a second word frequency vector acquiring unit, configured to, for each dictionary, obtain the weight assigned to each word in the dictionary to obtain a second word frequency vector corresponding to the dictionary;

a matching unit, configured to perform, according to a preset similarity matching algorithm, similarity matching between the first word frequency vector and the second word frequency vector corresponding to each dictionary in turn, to stop the matching once a matching second word frequency vector is determined, and to obtain the matching result.

9. An electronic device, characterized in that the electronic device comprises a processor and a memory, the memory being coupled to the processor and storing instructions which, when executed by the processor, cause the electronic device to perform the following operations:

10. A readable storage medium, characterized in that the readable storage medium is stored in a computer and includes a plurality of instructions, the plurality of instructions being configured to cause the computer to perform the method according to any one of claims 1-5.