CN110321557A - Text classification method, device, electronic equipment and storage medium

Info

Publication number: CN110321557A
Application number: CN201910519424.1A
Authority: CN
Inventors: 徐波
Original assignee: GUANGDONG LIWEI NETWORK TECHNOLOGY CO LTD; Multi Benefit Network Co Ltd; Guangzhou Duoyi Network Co Ltd
Current assignee: GUANGDONG LIWEI NETWORK TECHNOLOGY CO LTD; Multi Benefit Network Co Ltd; Guangzhou Duoyi Network Co Ltd
Priority date: 2019-06-14
Filing date: 2019-06-14
Publication date: 2019-10-11

Abstract

The invention discloses a text classification method, device, electronic equipment and storage medium. The method comprises: obtaining a text to be classified and preprocessing it; extracting a text feature vector of the text to be classified from the preprocessed text, wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector; and inputting the text feature vector into a preset text classification model to obtain a classification result for the text to be classified, wherein the text classification model is learned from preset text samples and the text categories corresponding to those samples. The invention can improve the accuracy of text classification.

Description

Text classification method, device, electronic equipment and storage medium

Technical field

The present invention relates to the technical field of natural language processing, and in particular to a text classification method, device, electronic equipment and storage medium.

Background art

With the continuous development of science and technology, mankind has entered the era of artificial intelligence and often needs to interact with intelligent products so that those products can provide services to the user. Text interaction remains one of the important ways in which users interact with intelligent products today: the intelligent product processes the text entered by the user, for example by recognizing and classifying it.

In some application scenarios an intelligent product needs to classify text and serve the user according to the classification result. For example, to forbid users from entering abusive speech, the text must be classified to judge whether it is abusive. However, current text classification techniques classify essentially by the characters and words in the text, so once the text entered by the user contains misspelled characters or words, classification errors arise easily and classification accuracy is low.

Summary of the invention

The technical problem to be solved by the embodiments of the present invention is to provide a text classification method, device, electronic equipment and storage medium that can improve the accuracy of text classification.

In a first aspect, an embodiment of the present invention provides a text classification method, the method comprising:

obtaining a text to be classified, and preprocessing the text to be classified;

extracting a text feature vector of the text to be classified according to the preprocessed text to be classified; wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector;

inputting the text feature vector of the text to be classified into a preset text classification model to obtain a classification result of the text to be classified; wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.

Preferably, the text feature vector further includes a character quantity feature vector; wherein the character quantity feature vector includes elements indicating the number of characters in the preprocessed text to be classified that correspond to each preset character class.

Preferably, the character classes include a punctuation character class, a special symbol class, a simplified Chinese character class and a traditional Chinese character class.

Preferably, extracting the text feature vector of the text to be classified according to the preprocessed text to be classified specifically includes:

obtaining a pinyin sequence from the pinyin of each character in the preprocessed text to be classified, and extracting the pinyin feature vector from the pinyin sequence;

serializing each character in the preprocessed text to be classified to obtain a character sequence, and extracting the character sequence feature vector from the character sequence;

inputting the preprocessed text to be classified into a preset segmenter to obtain a word sequence, and extracting the word sequence feature vector from the word sequence;

fusing the pinyin feature vector, the character sequence feature vector and the word sequence feature vector according to a preset fusion mode to obtain the text feature vector of the text to be classified.
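
The three sequences named above can be sketched as follows. This is a minimal illustration, not the patented implementation: the lookup table `PINYIN`, the dictionary `VOCAB`, and the example characters 祛斑 / 去斑 (a homophone pair chosen here for illustration) are all assumptions, and a simple forward-maximum-matching routine stands in for the preset segmenter, which the text does not specify at this point.

```python
# Hypothetical toy tables standing in for a real pinyin library and a segmenter dictionary.
PINYIN = {"祛": "qu", "去": "qu", "斑": "ban", "产": "chan", "品": "pin"}
VOCAB = {"祛斑", "产品"}


def pinyin_sequence(text):
    """Map each character to toneless pinyin; unknown characters pass through unchanged."""
    return [PINYIN.get(ch, ch) for ch in text]


def char_sequence(text):
    """Serialize the text into a sequence of single characters."""
    return list(text)


def word_sequence(text, vocab=VOCAB, max_len=4):
    """Forward maximum matching: take the longest dictionary word at each position,
    falling back to a single character when nothing in the dictionary matches."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words
```

With these toy tables, a correctly written word and its homophone variant yield different character and word sequences but the same pinyin sequence, which is the property the pinyin granularity relies on.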

Preferably, the segmenter is the jieba segmenter; the pinyin feature vector, the character sequence feature vector and the word sequence feature vector are each extracted using the TF-IDF technique.

Preferably, inputting the text feature vector of the text to be classified into the preset text classification model to obtain the classification result of the text to be classified specifically includes:

inputting the text feature vector of the text to be classified into the preset text classification model, and calculating the probability that the text feature vector of the text to be classified belongs to each preset text category;

obtaining the maximum probability among the probabilities, and determining the classification result of the text to be classified to be the preset text category corresponding to the maximum probability.

Preferably, preprocessing the text to be classified specifically includes:

detecting, according to a preset stop-word list, whether stop words are present in the text to be classified;

if stop words are present in the text to be classified, removing the stop words from the text to be classified;

Preprocessing the text to be classified further includes:

traversing all characters of the text to be classified and detecting whether full-width characters are present;

if full-width characters are present in the text to be classified, converting the full-width characters in the text to be classified to half-width characters.
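
The two preprocessing steps can be sketched as below, using only the standard library. The stop-word list `STOP_WORDS` is a hypothetical example, and stop words are matched per character only to keep the sketch short; a real system would more likely match them against segmented words. The conversion relies on the Unicode layout in which the full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from their ASCII counterparts, with the ideographic space U+3000 handled separately.

```python
STOP_WORDS = {"的", "了", "吗"}  # hypothetical stop-word list, for illustration only


def to_half_width(text):
    """Convert full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                # ideographic (full-width) space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:    # full-width letters, digits and symbols
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every character that appears in the stop-word list."""
    return "".join(ch for ch in text if ch not in stop_words)


def preprocess(text):
    """Remove stop words, then normalize full-width characters."""
    return to_half_width(remove_stop_words(text))
```

Chinese characters fall outside the converted range and therefore pass through unchanged.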

In a second aspect, an embodiment of the present invention further provides a text classification device, the device comprising:

a preprocessing module, configured to obtain a text to be classified and preprocess the text to be classified;

a text feature vector extraction module, configured to extract the text feature vector of the text to be classified according to the preprocessed text to be classified; wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector;

a classification result obtaining module, configured to input the text feature vector of the text to be classified into a preset text classification model and obtain the classification result of the text to be classified; wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.

In a third aspect, an embodiment of the present invention further provides electronic equipment, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor; when executing the computer program, the processor implements the text classification method according to any implementation of the first aspect.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium that includes a stored computer program, wherein, when the computer program runs, the equipment on which the computer-readable storage medium resides is controlled to execute the text classification method according to any implementation of the first aspect.

In the text classification method, device, electronic equipment and storage medium provided above, the extracted text feature vector of the text to be classified includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector. Adding the pinyin granularity allows the semantics of the text to be judged comprehensively, which avoids misclassifying the text when the text entered by the user contains wrong characters and improves the accuracy of text classification.

Brief description of the drawings

Fig. 1 is a flow diagram of a preferred embodiment of the text classification method provided by an embodiment of the present invention;

Fig. 2 is a structural schematic diagram of a preferred embodiment of the text classification device provided by an embodiment of the present invention;

Fig. 3 is a structural schematic diagram of the electronic equipment provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are evidently only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, fall within the protection scope of the present invention.

The text classification method provided by the embodiments of the present invention is suitable for scenarios in which text is classified according to predetermined category attributes. For example, to forbid merchants from selling prohibited products on an e-commerce platform, or to forbid the publication of text containing sensitive vocabulary, the text must be classified; likewise, to forbid users from publishing abusive text on a home page and maintain a healthy network environment, the text must also be classified. The embodiments illustrate the implementation of the invention by classifying text into two categories, sensitive-word text and non-sensitive-word text, but this does not limit the specific text categories of the invention.

An embodiment of the present invention provides a text classification method. Referring to Fig. 1, Fig. 1 is a flow diagram of a preferred embodiment of the text classification method provided by an embodiment of the present invention. Specifically, the method includes:

S1, obtaining a text to be classified, and preprocessing the text to be classified;

S2, extracting the text feature vector of the text to be classified according to the preprocessed text to be classified; wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector;

S3, inputting the text feature vector of the text to be classified into a preset text classification model to obtain the classification result of the text to be classified; wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.
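
Steps S1 to S3 form a three-stage pipeline, which can be sketched as a function that chains three injected callables. The name `classify_text` and its parameters are hypothetical and introduced only for this illustration; the source does not prescribe any particular interface or model type.

```python
from typing import Callable, Sequence


def classify_text(
    raw_text: str,
    preprocess: Callable[[str], str],
    extract_features: Callable[[str], Sequence[float]],
    model: Callable[[Sequence[float]], str],
) -> str:
    """Chain the three steps: preprocess, extract features, classify."""
    cleaned = preprocess(raw_text)          # S1: obtain and preprocess the text
    features = extract_features(cleaned)    # S2: build the text feature vector
    return model(features)                  # S3: map the vector to a text category
```

Passing the stages in as parameters keeps the sketch independent of any specific preprocessing rule, feature extractor or trained model.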

Specifically, the user may enter the text to be classified through a handwriting input device, a typing input device or the like, and may unintentionally enter wrong characters or deliberately enter homophones to evade sensitive words. For example, only beauty products that have obtained national certification for an anti-freckle effect may state the effect "anti-freckle" in their product publicity, yet some merchants may deliberately enter it as "qu classes", "removing & spot" or other wording whose meaning consumers can still clearly recognize, in order to evade the system's sensitive-word detection. Take the user input "removing & spot" as an example. After the text to be classified is entered, it is obtained and preprocessed to reduce its noise and refine its key information; the preprocessed text to be classified is, for example, "despeckle". The text feature vector of the text to be classified, which includes the pinyin feature vector, the character sequence feature vector and the word sequence feature vector, is then extracted from the preprocessed text and input into the preset text classification model. Through the model's calculation and analysis, the text is analyzed semantically not only at character granularity ("go" and "spot") and at word granularity ("go" and "spot"), but also at the pinyin granularity of the text "despeckle", and the classification result obtained for the text to be classified is sensitive-word text. Corresponding measures can then be taken as actually required, for example refusing to upload the submitted text containing the sensitive word, or requiring the user to provide proof of national certification before the publicity text can be published.

When the text entered by the user contains wrong characters, the word-granularity analysis of the text to be classified is affected: the text "despeckle" cannot be analyzed as a single word, and at word granularity it can only be analyzed as the word "go" and the word "spot". Classifying the text according to the prior-art scheme would therefore depart from the original meaning of the text to be classified and easily cause classification errors. The text classification method provided by the embodiment of the present invention can additionally analyze "despeckle" at pinyin granularity to obtain the classification result of the text.

In the text classification method provided by the embodiment of the present invention, the extracted text feature vector of the text to be classified includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector. Adding the pinyin granularity allows the semantics of the text to be judged comprehensively, so that wrong characters in the text entered by the user do not distort the semantic judgment and cause misclassification. The text is thus analyzed with features at multiple granularities, which improves the accuracy of text classification.

It should be noted that, in order to increase the granularity of text analysis, the text feature vector may also include feature vectors such as a stroke feature vector, a phrase feature vector and a synonym feature vector, so that the text to be classified is analyzed at more granularities and its text category is judged comprehensively, yielding a more accurate classification result.

Specifically, when the text feature vector includes a character quantity feature vector, the granularity at which the text to be classified is analyzed is increased further. The character classes present, and the number of characters in each class, differ to some extent between texts of different text categories. For example, for the text "this product dispel * * * spot effect!", adding the character quantity feature vector makes it possible to determine that this text has a high probability of being sensitive-word text.

In the text classification method provided by the embodiment of the present invention, the text feature vector further includes a character quantity feature vector, which can further improve the accuracy of text classification.

Optionally, the character quantity feature vector further includes an element indicating the total number of characters in the text to be classified.

Optionally, the preset character classes can be set according to the actual application, for example a punctuation character class, a special symbol class, a simplified Chinese character class, a traditional Chinese character class, an emoticon class, an English character class and so on; any one of these classes or any combination of them may be included.
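
A character quantity feature vector of this kind can be sketched with the standard `unicodedata` module. The choice of classes below (punctuation, symbol, CJK ideograph, Latin letter, plus the total count) is an assumption made for illustration. In particular, this sketch does not separate simplified from traditional Chinese characters, which would require a mapping table that the standard library does not provide.

```python
import unicodedata


def char_count_vector(text):
    """Return [punctuation, symbol, CJK ideograph, Latin letter, total] counts."""
    punct = symbol = cjk = latin = 0
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("P"):          # Unicode punctuation categories
            punct += 1
        elif category.startswith("S"):        # Unicode symbol categories
            symbol += 1
        elif "\u4e00" <= ch <= "\u9fff":      # CJK Unified Ideographs block
            cjk += 1
        elif ch.isascii() and ch.isalpha():   # English letters
            latin += 1
    return [punct, symbol, cjk, latin, len(text)]
```

Note that Unicode classifies some characters differently from everyday usage; the asterisk, for instance, is in a punctuation category rather than a symbol category.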

Here, the fusion mode refers to the way in which the feature vectors of the individual granularities are fused into the text feature vector, for example arranging the feature vectors of the granularities side by side in a preset order to obtain the overall text feature vector, or adding the feature vectors of the granularities together and using the sum as the text feature vector. A segmenter is a tool that splits the input text into units that conform to a certain language logic, for example the "cook" segmenter, the jieba ("stammer") segmenter or the "easy" segmenter; any segmenter may be used as long as it is applicable to the technical solution of the present invention.
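
The two fusion modes mentioned here, side-by-side arrangement and addition, can be sketched on plain Python lists. The function names `fuse_concat` and `fuse_add` are hypothetical; addition additionally assumes that all granularity vectors have the same length, which the source does not state explicitly.

```python
def fuse_concat(*vectors):
    """Arrange the vectors side by side in the given order (concatenation)."""
    fused = []
    for vector in vectors:
        fused.extend(vector)
    return fused


def fuse_add(*vectors):
    """Add the vectors element by element; all vectors must have equal length."""
    length = len(vectors[0])
    if any(len(vector) != length for vector in vectors):
        raise ValueError("element-wise addition needs equal-length vectors")
    return [sum(values) for values in zip(*vectors)]
```

Concatenation keeps each granularity in its own dimensions at the cost of a longer vector, whereas addition keeps the dimension fixed but mixes the granularities together.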

Specifically, the jieba segmenter can split the text into words quickly and accurately, and the TF-IDF technique can extract the key information in the text, thereby extracting the pinyin feature vector, the character sequence feature vector and the word sequence feature vector.

It should be noted that if the text feature vector also includes other feature vectors, those features can likewise be extracted with the TF-IDF technique as long as they are suitable for it.
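
As a rough illustration of TF-IDF over a token sequence (pinyin syllables, characters or words), the sketch below uses one common unsmoothed variant: term frequency is the count divided by the document length, and inverse document frequency is log(N / df). The source does not say which TF-IDF variant or library is used, so the formula and the function name `tfidf_vectors` are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (sorted vocabulary, list of TF-IDF vectors)."""
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears at least once.
    df = Counter(token for doc in docs for token in set(doc))
    idf = {token: math.log(n_docs / df[token]) for token in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append([counts[token] / total * idf[token] for token in vocab])
    return vocab, vectors
```

Under this variant a token that occurs in every document receives a weight of zero, so only tokens that distinguish documents contribute to the vector.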

Specifically, the text classification method provided by the embodiment of the present invention determines the classification result of the text to be classified by taking the maximum probability. After the text feature vector is input into the preset text classification model, an output is obtained that contains the probability that the text to be classified belongs to each text category, for example as a one-dimensional vector. If the text categories are [sensitive-word text category, non-sensitive-word text category] and the output is [90%, 10%], the maximum probability 90% is obtained, and the sensitive-word text category corresponding to 90% is taken as the classification result of the text to be classified.
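
Selecting the category with the maximum probability amounts to an argmax over the model output, as in this small sketch. The function name `pick_category` and the category labels are hypothetical.

```python
def pick_category(probabilities, categories):
    """Return the category with the highest probability, together with that probability."""
    if len(probabilities) != len(categories):
        raise ValueError("one probability is needed per category")
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return categories[best], probabilities[best]
```

For the example output [90%, 10%] over [sensitive-word text, non-sensitive-word text], the function returns the sensitive-word category.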

Preprocessing the text to be classified, as set out above, includes removing stop words and converting full-width characters to half-width characters.

Here, stop words are words without substantive meaning, usually function words such as conjunctions, modal particles and prepositions; the stop-word list (also called a stop-word dictionary) contains a number of stop words.

Specifically, removing stop words from the text to be classified shortens the text and reduces the noise in text analysis, which helps refine the features of the text accurately and improves the accuracy of text classification. Converting the full-width characters of the text to be classified into half-width characters reduces the hardware resources needed to store the text. It should be noted that only characters that can be converted, such as English letters, digits and symbols, are converted from full-width to half-width; Chinese characters, although they are double-byte characters, cannot be converted.

In a specific implementation, after the user enters the text to be classified, the text to be classified is obtained and preprocessed. The text feature vector of the text to be classified is extracted according to the preprocessed text; the text feature vector includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector, so that the text is analyzed comprehensively using features at multiple granularities. The text feature vector of the text to be classified is input into the preset text classification model to obtain the classification result of the text to be classified; the text classification model is learned from preset text samples and the text categories corresponding to the text samples.

An embodiment of the present invention further provides a text classification device. Referring to Fig. 2, Fig. 2 is a structural schematic diagram of a preferred embodiment of the text classification device provided by an embodiment of the present invention. Specifically, the device includes:

a preprocessing module 11, configured to obtain a text to be classified and preprocess the text to be classified;

a text feature vector extraction module 12, configured to extract the text feature vector of the text to be classified according to the preprocessed text to be classified; wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector;

a classification result obtaining module 13, configured to input the text feature vector of the text to be classified into a preset text classification model and obtain the classification result of the text to be classified; wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.

Preferably, the text feature vector extraction module 12 specifically includes:

a pinyin feature vector extraction unit, configured to obtain a pinyin sequence from the pinyin of each character in the preprocessed text to be classified and extract the pinyin feature vector from the pinyin sequence;

a character sequence feature vector extraction unit, configured to serialize each character in the preprocessed text to be classified to obtain a character sequence and extract the character sequence feature vector from the character sequence;

a word sequence feature vector extraction unit, configured to input the preprocessed text to be classified into a preset segmenter to obtain a word sequence and extract the word sequence feature vector from the word sequence;

a fusion unit, configured to fuse the pinyin feature vector, the character sequence feature vector and the word sequence feature vector according to a preset fusion mode to obtain the text feature vector of the text to be classified.

Preferably, the classification result obtaining module 13 specifically includes:

a computing unit, configured to input the text feature vector of the text to be classified into the preset text classification model and calculate the probability that the text feature vector of the text to be classified belongs to each preset text category;

a classification result determination unit, configured to obtain the maximum probability among the probabilities and determine the classification result of the text to be classified to be the preset text category corresponding to the maximum probability.

Preferably, for preprocessing the text to be classified, the preprocessing module 11 specifically includes:

a stop-word detection unit, configured to detect, according to a preset stop-word list, whether stop words are present in the text to be classified;

a stop-word removal unit, configured to remove the stop words from the text to be classified if stop words are present in the text to be classified;

For preprocessing the text to be classified, the preprocessing module 11 further includes:

a format detection unit, configured to traverse all characters of the text to be classified and detect whether full-width characters are present;

a conversion unit, configured to convert the full-width characters in the text to be classified to half-width characters if full-width characters are present in the text to be classified.

The text classification device provided by the embodiment of the present invention obtains the text to be classified and preprocesses it through the preprocessing module 11; extracts the text feature vector of the preprocessed text to be classified through the text feature vector extraction module 12, wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector; and, through the classification result obtaining module 13, inputs the text feature vector of the text to be classified into the preset text classification model to obtain the classification result of the text to be classified, wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.

The text classification device provided by the embodiment of the present invention can extract from the text to be classified a text feature vector that includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector. Adding the pinyin granularity allows the semantics of the text to be judged comprehensively, so that wrong characters in the text entered by the user do not distort the semantic judgment and cause misclassification. The text is thus analyzed with features at multiple granularities, which improves the accuracy of text classification.

It should be noted that the text classification device provided by the embodiment of the present invention is used to execute the steps of the text classification method described in the above embodiments; the working principles and beneficial effects of the two correspond one to one and are therefore not repeated.

Those skilled in the art will understand that the schematic diagram of the text classification device is only an example of the text classification device and does not limit it; the device may include more or fewer components than illustrated, may combine certain components, or may use different components. For example, the text classification device may also include input/output devices, network access devices, buses and the like.

An embodiment of the present invention further provides electronic equipment. Referring to Fig. 3, Fig. 3 is a structural schematic diagram of the electronic equipment provided by an embodiment of the present invention. Specifically, the electronic equipment includes a processor 10, a memory 20, and a computer program stored in the memory and configured to be executed by the processor; when executing the computer program, the processor implements any of the text classification methods provided above.

Specifically, the electronic equipment may have one or more processors and one or more memories, and may be a computer, a mobile phone, a tablet, a server, a cloud device or the like.

The electronic equipment of this embodiment includes a processor, a memory, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the steps of the text classification method provided by the above embodiments, for example step S1 shown in Fig. 1: obtaining a text to be classified and preprocessing the text to be classified. Alternatively, when executing the computer program, the processor implements the functions of the modules in the above device embodiments, for example the preprocessing module 11, which is configured to obtain a text to be classified and preprocess it.

Illustratively, the computer program may be divided into one or more modules/units (computer program 1, computer program 2, ... as shown in Fig. 3), which are stored in the memory and executed by the processor to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments being used to describe the execution of the computer program in the electronic equipment. For example, the computer program may be divided into the preprocessing module 11, the text feature vector extraction module 12 and the classification result obtaining module 13, whose specific functions are as described above for the corresponding modules of the device.

The processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component or the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the electronic equipment and connects the parts of the entire electronic equipment through various interfaces and lines.

The memory may be used to store the computer program and/or modules. The processor implements the various functions of the electronic equipment by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area: the program storage area may store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function); the data storage area may store data created through use of the mobile phone (such as audio data or a phone book). In addition, the memory may include high-speed random access memory and may also include non-volatile memory such as a hard disk, internal memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.

If the modules/units integrated in the electronic equipment are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. On this understanding, all or part of the processes of the text classification method provided by the above embodiments may also be completed by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, can implement the steps of the text classification method provided by any of the above embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), electrical carrier signals, telecommunication signals, software distribution media and the like. It should be noted that the content included in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.

It should be noted that above-mentioned electronic equipment may include, but it is not limited only to, processor, memory, those skilled in the art Member is appreciated that Fig. 3 structural schematic diagram is only the example of above-mentioned electronic equipment, does not constitute the restriction to electronic equipment, can To include perhaps combining certain components or different components than illustrating more or fewer components.

An embodiment of the present invention further provides a computer-readable storage medium comprising a stored computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium is located is controlled to execute any of the text classification methods provided above.

The above is a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications are also regarded as falling within the protection scope of the present invention.

Claims

1. A text classification method, characterized in that the method comprises:

obtaining a text to be classified, and pre-processing the text to be classified;

extracting a text feature vector of the text to be classified according to the pre-processed text to be classified, wherein the text feature vector comprises a pinyin feature vector, a character sequence feature vector, and a word sequence feature vector;

inputting the text feature vector of the text to be classified into a preset text classification model to obtain a classification result of the text to be classified, wherein the text classification model is obtained by learning from preset text samples and the text categories corresponding to the text samples.

2. The text classification method according to claim 1, characterized in that the text feature vector further comprises a character count feature vector, wherein the character count feature vector comprises elements indicating the number of characters in the pre-processed text to be classified that correspond to each preset character category.

3. The text classification method according to claim 2, characterized in that the character categories comprise a punctuation character category, a special character category, a simplified Chinese character category, and a traditional Chinese character category.
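The character count feature vector of claims 2 and 3 can be illustrated with a minimal sketch. The claims do not say how a character is assigned to a category, so the `SPECIAL_CHARS` set, the tiny `TRADITIONAL_ONLY` sample, and the function name `char_count_vector` are all hypothetical; characters shared by both scripts are counted as simplified here purely for illustration.

```python
import unicodedata

# Hypothetical special-character set; the claims do not enumerate it.
SPECIAL_CHARS = set("@#$%^&*+=~|\\/<>")
# Hypothetical, tiny sample of traditional-only characters; a real system
# would need a complete simplified/traditional mapping table.
TRADITIONAL_ONLY = set("國學體們這來時後")


def char_count_vector(text):
    """Return [punctuation, special, simplified, traditional] counts."""
    counts = [0, 0, 0, 0]
    for ch in text:
        if ch in SPECIAL_CHARS:
            counts[1] += 1
        elif unicodedata.category(ch).startswith("P"):
            # Any Unicode punctuation category (Po, Ps, Pe, ...)
            counts[0] += 1
        elif "\u4e00" <= ch <= "\u9fff":
            # CJK Unified Ideographs: traditional-only vs. everything else
            if ch in TRADITIONAL_ONLY:
                counts[3] += 1
            else:
                counts[2] += 1
    return counts


vector = char_count_vector("你好,這是@test!")  # -> [2, 1, 3, 1]
```

Latin letters and digits fall into none of the four categories and are simply ignored by this sketch.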

4. The text classification method according to claim 1, characterized in that extracting the text feature vector of the text to be classified according to the pre-processed text to be classified specifically comprises:

obtaining a pinyin sequence according to the pinyin of each character in the pre-processed text to be classified, and extracting the pinyin feature vector according to the pinyin sequence;

serializing each character in the pre-processed text to be classified to obtain a character sequence, and extracting the character sequence feature vector according to the character sequence;

inputting the pre-processed text to be classified into a preset segmenter to obtain a word sequence, and extracting the word sequence feature vector according to the word sequence;

performing feature fusion on the pinyin feature vector, the character sequence feature vector, and the word sequence feature vector according to a preset fusion mode to obtain the text feature vector of the text to be classified.
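The four steps of claim 4 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the `PINYIN` lookup table, the `VOCAB` word list, the toy forward-maximum-match segmenter standing in for the preset segmenter, the raw term counts used as features, and concatenation as the fusion mode are all hypothetical choices.

```python
# Hypothetical pinyin lookup; a real system would use a full dictionary.
# Characters missing from the table are passed through unchanged.
PINYIN = {"你": "ni", "好": "hao", "世": "shi", "界": "jie"}
# Hypothetical word list for a toy forward-maximum-match segmenter that
# stands in for the preset segmenter named in the claim.
VOCAB = {"你好", "世界"}


def pinyin_sequence(text):
    return [PINYIN.get(ch, ch) for ch in text]


def char_sequence(text):
    return list(text)


def word_sequence(text, max_len=4):
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in VOCAB:
                words.append(piece)
                i += size
                break
    return words


def count_vector(tokens, vocabulary):
    # Raw term counts over a fixed vocabulary (illustrative feature only).
    return [tokens.count(term) for term in vocabulary]


def fuse(*vectors):
    # Concatenation is assumed here as one possible fusion mode.
    fused = []
    for vec in vectors:
        fused.extend(vec)
    return fused


text = "你好世界"
fused = fuse(
    count_vector(pinyin_sequence(text), ["ni", "hao"]),
    count_vector(char_sequence(text), ["你", "世"]),
    count_vector(word_sequence(text), ["你好", "世界"]),
)
```

Concatenation keeps the three views in separate dimensions of the fused vector, so a downstream classifier can weight pinyin, character, and word evidence independently.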

5. The text classification method according to claim 4, characterized in that the segmenter is the Jieba segmenter, and the pinyin feature vector, the character sequence feature vector, and the word sequence feature vector are each extracted by the TF-IDF technique.
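As a rough illustration of TF-IDF extraction over a token sequence (pinyin, character, or word), the sketch below computes TF-IDF vectors in plain Python. The claim does not specify the TF-IDF variant; the length-normalized term frequency and the smoothed inverse document frequency used here are assumptions, as is the function name `tfidf_vectors`.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """documents: list of token lists. Returns (vocabulary, vectors)."""
    vocabulary = sorted({tok for doc in documents for tok in doc})
    n_docs = len(documents)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    # Smoothed IDF, ln((1 + N) / (1 + df)) + 1; one common variant,
    # assumed here because the claim does not fix the formula.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1
           for t in vocabulary}
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([counts[t] / total * idf[t] for t in vocabulary])
    return vocabulary, vectors


# Two toy pinyin sequences (hypothetical data).
vocab, vecs = tfidf_vectors([["ni", "hao"], ["ni", "men"]])
```

The same function can be applied separately to the pinyin, character, and word sequences, yielding one vector per view for later fusion.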

6. The text classification method according to claim 1, characterized in that inputting the text feature vector of the text to be classified into the preset text classification model to obtain the classification result of the text to be classified specifically comprises:

inputting the text feature vector of the text to be classified into the preset text classification model, and calculating the probability values that the text feature vector of the text to be classified belongs to each of the preset text categories;

obtaining the maximum probability value among the probability values, and determining the classification result of the text to be classified as the text category, among the preset text categories, that corresponds to the maximum probability value.
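The selection step of claim 6 amounts to taking the category with the largest probability. A minimal sketch, assuming the model has already produced one probability per preset category; the category names and probability values below are hypothetical.

```python
def pick_category(probabilities, categories):
    """Return the category with the maximum probability and that value."""
    if len(probabilities) != len(categories):
        raise ValueError("one probability is required per category")
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return categories[best], probabilities[best]


# Hypothetical preset categories and model output.
categories = ["advertisement", "chat", "abuse"]
label, confidence = pick_category([0.1, 0.7, 0.2], categories)
# label == "chat", confidence == 0.7
```

When several categories tie for the maximum, `max` returns the first one, so the order of the preset categories acts as the tie-breaker in this sketch.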

7. The text classification method according to claim 1, characterized in that pre-processing the text to be classified specifically comprises:

pre-processing the text to be classified further comprises:

if the text to be classified contains characters in full-width format, converting the full-width characters in the text to be classified into half-width characters.
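The full-width to half-width conversion can be sketched with the fixed offset between the Unicode full-width ASCII variants (U+FF01 to U+FF5E) and their ASCII counterparts (U+0021 to U+007E), plus the ideographic space U+3000. The function name `to_half_width` is hypothetical, and the claim does not state which characters the conversion covers, so this range is an assumption.

```python
def to_half_width(text):
    """Convert full-width ASCII variants and the ideographic space."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            # Ideographic (full-width) space -> ordinary space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            # Full-width ASCII variants sit 0xFEE0 above their ASCII forms
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


converted = to_half_width("ＡＢＣ１２３！")  # -> "ABC123!"
```

Chinese characters and text that is already half-width pass through unchanged.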

8. A text classification device, characterized in that the device comprises:

a text feature vector extraction module, configured to extract a text feature vector of the text to be classified according to the pre-processed text to be classified, wherein the text feature vector comprises a pinyin feature vector, a character sequence feature vector, and a word sequence feature vector;

a classification result obtaining module, configured to input the text feature vector of the text to be classified into a preset text classification model to obtain a classification result of the text to be classified, wherein the text classification model is obtained by learning from preset text samples and the text categories corresponding to the text samples.

9. An electronic device, characterized by comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements the text classification method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium is located is controlled to execute the text classification method according to any one of claims 1 to 7.