CN110321557A - Text classification method, apparatus, electronic device and storage medium

Info

Publication number
CN110321557A
Authority
CN
China
Prior art keywords
text
sorted
classification
eigenvector
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910519424.1A
Other languages
Chinese (zh)
Inventor
徐波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GUANGDONG LIWEI NETWORK TECHNOLOGY CO LTD
Multi Benefit Network Co Ltd
Guangzhou Duoyi Network Co Ltd
Original Assignee
GUANGDONG LIWEI NETWORK TECHNOLOGY CO LTD
Multi Benefit Network Co Ltd
Guangzhou Duoyi Network Co Ltd
Application filed by GUANGDONG LIWEI NETWORK TECHNOLOGY CO LTD, Multi Benefit Network Co Ltd, Guangzhou Duoyi Network Co Ltd
Priority to CN201910519424.1A
Publication of CN110321557A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification method, apparatus, electronic device, and storage medium. The method comprises: obtaining a text to be classified and pre-processing it; extracting a text feature vector of the text to be classified from the pre-processed text, wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector, and a word sequence feature vector; and inputting the text feature vector into a preset text classification model to obtain a classification result for the text to be classified, wherein the text classification model is learned from preset text samples and the text categories corresponding to those samples. The invention can improve the accuracy of text classification.

Description

Text classification method, apparatus, electronic device and storage medium
Technical field
The present invention relates to the technical field of natural language processing, and in particular to a text classification method, apparatus, electronic device, and storage medium.
Background art
With the continuous development of science and technology, humanity has entered the age of artificial intelligence, and users frequently interact with smart products so that those products can serve them. Text remains one of the main ways users interact with smart products today: the product recognizes, classifies, and otherwise processes the text the user enters.
In some application scenarios a smart product must classify text and serve the user according to the classification result. For example, to prohibit users from entering abusive speech, the text must be classified to judge whether it is abusive. However, current text classification techniques essentially classify text by the characters and words it contains. Once the text entered by a user contains misspelled characters or words, the text is easily misclassified, so classification accuracy is low.
Summary of the invention
The technical problem to be solved by the embodiments of the present invention is to provide a text classification method, apparatus, electronic device, and storage medium that can improve the accuracy of text classification.
In a first aspect, an embodiment of the present invention provides a text classification method, the method comprising:
obtaining a text to be classified, and pre-processing the text to be classified;
extracting a text feature vector of the text to be classified from the pre-processed text to be classified, wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector, and a word sequence feature vector;
inputting the text feature vector of the text to be classified into a preset text classification model to obtain a classification result for the text to be classified, wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.
Preferably, the text feature vector further includes a character count feature vector, wherein the character count feature vector includes elements indicating the number of characters in the pre-processed text to be classified that correspond to each preset character category.
Preferably, the character categories include a punctuation character category, a special symbol category, a simplified Chinese character category, and a traditional Chinese character category.
Preferably, extracting the text feature vector of the text to be classified from the pre-processed text to be classified specifically comprises:
obtaining a pinyin sequence from the pinyin of each character in the pre-processed text to be classified, and extracting the pinyin feature vector from the pinyin sequence;
serializing each character in the pre-processed text to be classified to obtain a character sequence, and extracting the character sequence feature vector from the character sequence;
inputting the pre-processed text to be classified into a preset segmenter to obtain a word sequence, and extracting the word sequence feature vector from the word sequence;
fusing the pinyin feature vector, the character sequence feature vector, and the word sequence feature vector according to a preset fusion mode to obtain the text feature vector of the text to be classified.
Preferably, the segmenter is the jieba segmenter; the pinyin feature vector, the character sequence feature vector, and the word sequence feature vector are all extracted by the TF-IDF technique.
Preferably, inputting the text feature vector of the text to be classified into the preset text classification model to obtain the classification result for the text to be classified specifically comprises:
inputting the text feature vector of the text to be classified into the preset text classification model, and calculating the probability value that the text feature vector of the text to be classified belongs to each preset text category;
obtaining the maximum probability value among the probability values, and determining the classification result of the text to be classified to be the preset text category corresponding to the maximum probability value.
Preferably, pre-processing the text to be classified specifically comprises:
detecting, according to a preset stop word list, whether stop words exist in the text to be classified;
if stop words exist in the text to be classified, removing the stop words from the text to be classified.
Pre-processing the text to be classified further comprises:
traversing all characters of the text to be classified and detecting whether full-width characters exist;
if full-width characters exist in the text to be classified, converting the full-width characters in the text to be classified into half-width characters.
In a second aspect, an embodiment of the present invention further provides a text classification apparatus, the apparatus comprising:
a pre-processing module, configured to obtain a text to be classified and pre-process the text to be classified;
a text feature vector extraction module, configured to extract a text feature vector of the text to be classified from the pre-processed text to be classified, wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector, and a word sequence feature vector;
a classification result obtaining module, configured to input the text feature vector of the text to be classified into a preset text classification model to obtain a classification result for the text to be classified, wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.
In a third aspect, an embodiment of the present invention further provides an electronic device including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the text classification method according to any one of the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium including a stored computer program, wherein, when the computer program runs, the device in which the computer-readable storage medium is located is controlled to execute the text classification method according to any one of the first aspect.
In the text classification method, apparatus, electronic device, and storage medium provided above, the extracted text feature vector of the text to be classified includes a pinyin feature vector, a character sequence feature vector, and a word sequence feature vector. Adding the pinyin granularity allows the semantics of the text to be judged comprehensively, which avoids misclassifying text that contains wrong characters and improves the accuracy of text classification.
Brief description of the drawings
Fig. 1 is a flow diagram of a preferred embodiment of the text classification method provided by an embodiment of the present invention;
Fig. 2 is a structural diagram of a preferred embodiment of the text classification apparatus provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of the electronic device provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The text classification method provided by the embodiments of the present invention is suitable for scenarios in which text is classified according to predetermined category attributes. For example, an e-commerce platform that forbids merchants from selling prohibited products, or from publishing text containing sensitive words, needs to classify text; likewise, forbidding users from publishing abusive text on a home page in order to maintain a healthy network environment also requires text classification. The embodiments illustrate the implementation using two categories, sensitive-word text and non-sensitive-word text, but this does not limit the specific text categories of the invention.
An embodiment of the present invention provides a text classification method. Referring to Fig. 1, which is a flow diagram of a preferred embodiment of the text classification method provided by an embodiment of the present invention, the method specifically comprises:
S1, obtaining a text to be classified, and pre-processing the text to be classified;
S2, extracting a text feature vector of the text to be classified from the pre-processed text to be classified, wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector, and a word sequence feature vector;
S3, inputting the text feature vector of the text to be classified into a preset text classification model to obtain a classification result for the text to be classified, wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.
Specifically, the user may enter the text to be classified through a handwriting input device, a keyboard input device, or the like, and may enter wrong characters unintentionally or deliberately enter homophones to evade sensitive-word detection. For example, only a beauty product certified by the state as effective for freckle removal may claim "freckle removal" in its promotional material, so some merchants may deliberately write the term as "qu classes", "remove&spot", or some other wording whose meaning consumers still plainly understand, in order to evade the system's sensitive-word detection. Take the input "remove&spot" as an example. After the user enters it, the text to be classified is obtained and pre-processed to reduce noise and distill its key information; the pre-processed text is, for example, "remove spot". The text feature vector of the text to be classified is then extracted from the pre-processed text; it includes a pinyin feature vector, a character sequence feature vector, and a word sequence feature vector. The text feature vector is input into the preset text classification model, which, through computation and analysis, performs semantic analysis not only at the character granularity ("remove" and "spot") and at the word granularity ("remove" and "spot") but also at the pinyin granularity, where "remove spot" can be analyzed as a whole by its pronunciation. The classification result is that the text is sensitive-word text, so a corresponding measure can be taken as required, for example refusing to upload the submitted text containing the sensitive word, or requiring the user to provide proof of state certification before the promotional text can be published.
When the text entered by the user contains wrong characters, word-granularity analysis of the text to be classified is affected: the text cannot be analyzed as the single word "freckle removal", and at the word granularity it can only be analyzed as the separate words "remove" and "spot". Classifying the text with the prior-art scheme therefore departs from the original meaning of the text and easily leads to misclassification. The text classification method provided by the embodiment of the present invention, however, can also analyze "remove spot" at the pinyin granularity to obtain the classification result of the text.
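The effect of the pinyin granularity can be illustrated with a minimal sketch. The Chinese strings below are assumptions chosen for illustration (a term for freckle removal and a homophone spelling of it), and the four-entry `PINYIN` table is a hypothetical stand-in for a full pinyin dictionary. The two spellings differ at the character level but coincide once mapped to pinyin.

```python
# Hypothetical mini pinyin table (toneless); a real system would use a full dictionary.
PINYIN = {"祛": "qu", "去": "qu", "斑": "ban", "班": "ban"}


def to_pinyin_sequence(text):
    """Map each character to its pinyin; characters missing from the table are kept as-is."""
    return [PINYIN.get(ch, ch) for ch in text]


original = "祛斑"  # assumed sensitive term
evasive = "去斑"   # assumed homophone spelling left after pre-processing

# Character granularity: the two sequences differ in their first character.
char_match = list(original) == list(evasive)
# Pinyin granularity: both map to ["qu", "ban"].
pinyin_match = to_pinyin_sequence(original) == to_pinyin_sequence(evasive)

print(char_match, pinyin_match)
```

A classifier that sees only characters or words treats the two spellings as unrelated, whereas pinyin features give them identical evidence.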
In the text classification method provided by the embodiment of the present invention, the extracted text feature vector of the text to be classified includes a pinyin feature vector, a character sequence feature vector, and a word sequence feature vector. Adding the pinyin granularity allows the semantics of the text to be judged comprehensively, so that wrong characters in the user's input do not distort the semantic judgment and the text is not misclassified. The text is thus analyzed with features at multiple granularities, which improves the accuracy of text classification.
It should be noted that, to increase the granularity of text analysis, the text feature vector may also include feature vectors such as a stroke feature vector, a phrase feature vector, and a synonym feature vector, so that the text to be classified is analyzed at more granularities and its text category is judged comprehensively, yielding a more accurate classification result.
Preferably, the text feature vector further includes a character count feature vector, wherein the character count feature vector includes elements indicating the number of characters in the pre-processed text to be classified that correspond to each preset character category.
Specifically, when the text feature vector includes the character count feature vector, the granularity at which the text to be classified is analyzed increases further. The character categories present, and the number of characters in each category, differ to some extent between texts of different text categories. For example, for the text "This product has a remove***spot effect!", the added character count granularity can indicate that the text has a relatively high probability of being sensitive-word text.
In the text classification method provided by the embodiment of the present invention, the text feature vector further includes a character count feature vector, which can further improve the accuracy of text classification.
Optionally, the character count feature vector further includes an element indicating the total number of characters in the text to be classified.
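A character count feature vector of this kind could be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the category names are hypothetical, Unicode general categories stand in for the punctuation and special symbol categories, and simplified and traditional Chinese characters are collapsed into one `cjk` category because telling them apart requires a lookup table that is not shown here.

```python
import unicodedata

# Assumed category order; the last element of the vector is the total character count.
CATEGORIES = ["punctuation", "symbol", "cjk", "other"]


def char_category(ch):
    """Assign one character to a coarse category using Unicode properties."""
    general = unicodedata.category(ch)
    if general.startswith("P"):
        return "punctuation"
    if general.startswith("S"):
        return "symbol"
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
        return "cjk"
    return "other"


def char_count_vector(text):
    """Return per-category character counts followed by the total length."""
    counts = {name: 0 for name in CATEGORIES}
    for ch in text:
        counts[char_category(ch)] += 1
    return [counts[name] for name in CATEGORIES] + [len(text)]


print(char_count_vector("斑*$!"))
```

A text padded with many punctuation marks or symbols between ideographs produces a visibly different count profile from ordinary prose, which is the signal this feature is meant to expose.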
Optionally, the preset character categories can be set according to the actual application and may include any one or any combination of categories such as a punctuation character category, a special symbol category, a simplified Chinese character category, a traditional Chinese character category, an emoticon category, and an English character category.
Preferably, the character categories include a punctuation character category, a special symbol category, a simplified Chinese character category, and a traditional Chinese character category.
Preferably, extracting the text feature vector of the text to be classified from the pre-processed text to be classified specifically comprises:
obtaining a pinyin sequence from the pinyin of each character in the pre-processed text to be classified, and extracting the pinyin feature vector from the pinyin sequence;
serializing each character in the pre-processed text to be classified to obtain a character sequence, and extracting the character sequence feature vector from the character sequence;
inputting the pre-processed text to be classified into a preset segmenter to obtain a word sequence, and extracting the word sequence feature vector from the word sequence;
fusing the pinyin feature vector, the character sequence feature vector, and the word sequence feature vector according to a preset fusion mode to obtain the text feature vector of the text to be classified.
Here, the fusion mode is the way in which the feature vectors of the individual granularities are fused into the text feature vector, for example concatenating the feature vectors of the granularities side by side in a preset order to obtain the overall text feature vector, or adding the feature vectors of the granularities together and using the sum as the text feature vector. A segmenter is a tool that splits input text into units that follow the logic of the language, for example the Paoding segmenter, the jieba segmenter, or the easily segmenter; any segmenter applicable to the technical solution of the present invention may be used.
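The two fusion modes named above, concatenation and addition, can be sketched as below. The function names are hypothetical and the vectors are plain Python lists. Addition only makes sense when every granularity produces a vector of the same length, so the sketch checks for that.

```python
def fuse_concat(vectors):
    """Concatenate per-granularity feature vectors in the given (preset) order."""
    fused = []
    for vector in vectors:
        fused.extend(vector)
    return fused


def fuse_add(vectors):
    """Add per-granularity feature vectors element by element."""
    length = len(vectors[0])
    if any(len(vector) != length for vector in vectors):
        raise ValueError("element-wise addition needs equal-length vectors")
    return [sum(values) for values in zip(*vectors)]


pinyin_vec = [0.5, 0.0]
char_vec = [0.1, 0.2]
word_vec = [0.0, 0.3]

print(fuse_concat([pinyin_vec, char_vec, word_vec]))  # length 6
print(fuse_add([pinyin_vec, char_vec, word_vec]))     # length 2
```

Concatenation keeps each granularity's dimensions separate at the cost of a longer vector; addition keeps the vector short but mixes the granularities into shared dimensions.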
Preferably, the segmenter is the jieba segmenter; the pinyin feature vector, the character sequence feature vector, and the word sequence feature vector are all extracted by the TF-IDF technique.
Specifically, the jieba segmenter divides the text into words quickly and accurately, and the TF-IDF technique extracts the key information in the text, thereby extracting the pinyin feature vector, the character sequence feature vector, and the word sequence feature vector.
It should be noted that if the text feature vector further includes other feature vectors, any feature suitable for extraction by the TF-IDF technique can be extracted with it.
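TF-IDF extraction over a token sequence can be sketched with the standard library alone. The same routine applies to pinyin, character, or word sequences, since each is simply a list of tokens. The patent does not fix a particular formula, so the smoothed inverse document frequency used here is one common variant chosen as an assumption, and the tiny corpus is invented for illustration.

```python
import math
from collections import Counter


def fit_idf(token_lists):
    """Compute a smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each token once per document
    # Assumed variant: ln((1 + N) / (1 + df)) + 1
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def tfidf_vector(tokens, idf, vocabulary):
    """Return term frequency times IDF for each vocabulary entry, in vocabulary order."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty input
    return [counts[tok] / total * idf.get(tok, 0.0) for tok in vocabulary]


# Invented toy corpus of pinyin sequences.
corpus = [["qu", "ban"], ["mei", "bai"]]
idf = fit_idf(corpus)
vocabulary = sorted(idf)  # ['bai', 'ban', 'mei', 'qu']
print(tfidf_vector(["qu", "ban"], idf, vocabulary))
```

Tokens absent from the input get a weight of zero, so the vector length is fixed by the vocabulary learned from the samples rather than by the input text.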
Preferably, inputting the text feature vector of the text to be classified into the preset text classification model to obtain the classification result for the text to be classified specifically comprises:
inputting the text feature vector of the text to be classified into the preset text classification model, and calculating the probability value that the text feature vector of the text to be classified belongs to each preset text category;
obtaining the maximum probability value among the probability values, and determining the classification result of the text to be classified to be the preset text category corresponding to the maximum probability value.
Specifically, the text classification method provided by the embodiment of the present invention determines the classification result of the text to be classified by taking the maximum probability. Inputting the text feature vector into the preset text classification model yields an output that includes the probability value of the text to be classified for each text category, for example a one-dimensional vector. If the text categories are [sensitive-word text category, non-sensitive-word text category] and the output is [90%, 10%], the maximum probability value 90% is obtained, and the sensitive-word text category corresponding to it is taken as the classification result of the text to be classified.
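Selecting the category with the maximum probability can be sketched in a few lines. The function name and category labels are hypothetical; the probabilities reproduce the [90%, 10%] example from the text.

```python
def pick_category(probabilities, categories):
    """Return the category with the highest probability, together with that probability."""
    if len(probabilities) != len(categories):
        raise ValueError("one probability is required per category")
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return categories[best_index], probabilities[best_index]


categories = ["sensitive-word text", "non-sensitive-word text"]
print(pick_category([0.9, 0.1], categories))
```

When two categories tie, `max` returns the first one in category order, so the preset order of categories also acts as the tie-breaking rule in this sketch.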
Preferably, pre-processing the text to be classified specifically comprises:
detecting, according to a preset stop word list, whether stop words exist in the text to be classified;
if stop words exist in the text to be classified, removing the stop words from the text to be classified.
Pre-processing the text to be classified further comprises:
traversing all characters of the text to be classified and detecting whether full-width characters exist;
if full-width characters exist in the text to be classified, converting the full-width characters in the text to be classified into half-width characters.
Here, a stop word is a word with no substantive meaning, usually a function word such as a conjunction, modal particle, or preposition; the stop word list (also called a stop word dictionary) contains a number of stop words.
Specifically, removing stop words from the text to be classified shortens the text and reduces noise in text analysis, which helps distill the features of the text accurately and improves classification accuracy. Converting the full-width characters of the text to be classified into half-width characters reduces the hardware resources needed to store the text. It should be noted that the conversion applies only to characters that can be converted, such as English letters, digits, and symbols; Chinese characters, although they are double-byte characters, cannot be converted.
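Both pre-processing steps can be sketched as follows. The stop word set is a hypothetical three-entry example, and stop words are removed from an already tokenized list, which is an assumption about where in the pipeline the removal happens. The width conversion relies on the Unicode layout: full-width forms U+FF01 to U+FF5E sit exactly 0xFEE0 above their ASCII counterparts, and U+3000 is the full-width space. Chinese characters fall outside that range and pass through unchanged, matching the note above.

```python
# Hypothetical stop word list; a real list would be loaded from a dictionary file.
STOP_WORDS = {"的", "了", "吗"}


def to_half_width(text):
    """Convert full-width letters, digits, and symbols to their half-width forms."""
    converted = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:               # full-width (ideographic) space
            converted.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:   # full-width ASCII variants
            converted.append(chr(code - 0xFEE0))
        else:                            # everything else, including Chinese characters
            converted.append(ch)
    return "".join(converted)


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word list."""
    return [token for token in tokens if token not in stop_words]


print(to_half_width("ＡＢＣ１２３！"))
print(remove_stop_words(["效果", "的", "好"]))
```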
In a specific implementation, after the user enters a text to be classified, the text to be classified is obtained and pre-processed. The text feature vector of the text to be classified is extracted from the pre-processed text; the text feature vector includes a pinyin feature vector, a character sequence feature vector, and a word sequence feature vector, so that the text is analyzed comprehensively with features at multiple granularities. The text feature vector of the text to be classified is input into the preset text classification model to obtain the classification result for the text to be classified; the text classification model is learned from preset text samples and the text categories corresponding to the text samples.
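The overall flow of steps S1 to S3 can be sketched as a single function that wires the stages together. Everything prefixed `toy_` is an invented stand-in so that the sketch runs on its own; in particular `toy_model` is an arbitrary rule, not a learned text classification model, and the feature extractors are placeholders for the pinyin, character, and word extractors.

```python
def classify(text, preprocess, extractors, fuse, model, categories):
    """Run pre-processing (S1), feature extraction and fusion (S2), and classification (S3)."""
    cleaned = preprocess(text)
    features = fuse([extract(cleaned) for extract in extractors])
    probabilities = model(features)
    best = max(range(len(categories)), key=lambda i: probabilities[i])
    return categories[best]


# Invented stand-ins, for illustration only.
def toy_preprocess(text):
    return text.replace("&", "")


def toy_length_feature(text):
    return [len(text)]


def toy_digit_feature(text):
    return [sum(ch.isdigit() for ch in text)]


def toy_fuse(vectors):
    return [value for vector in vectors for value in vector]


def toy_model(features):
    # Arbitrary rule: short inputs are scored as sensitive.
    score = 0.9 if features[0] <= 2 else 0.2
    return [score, 1 - score]


CATEGORIES = ["sensitive", "non-sensitive"]
result = classify("a&b", toy_preprocess, [toy_length_feature, toy_digit_feature],
                  toy_fuse, toy_model, CATEGORIES)
print(result)
```

Passing the stages in as callables mirrors the module split of the apparatus described later: each stage can be replaced without touching the others.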
In the text classification method provided by the embodiment of the present invention, the extracted text feature vector of the text to be classified includes a pinyin feature vector, a character sequence feature vector, and a word sequence feature vector. Adding the pinyin granularity allows the semantics of the text to be judged comprehensively, so that wrong characters in the user's input do not distort the semantic judgment and the text is not misclassified. The text is thus analyzed with features at multiple granularities, which improves the accuracy of text classification.
An embodiment of the present invention further provides a text classification apparatus. Referring to Fig. 2, which is a structural diagram of a preferred embodiment of the text classification apparatus provided by an embodiment of the present invention, the apparatus specifically comprises:
a pre-processing module 11, configured to obtain a text to be classified and pre-process the text to be classified;
a text feature vector extraction module 12, configured to extract a text feature vector of the text to be classified from the pre-processed text to be classified, wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector, and a word sequence feature vector;
a classification result obtaining module 13, configured to input the text feature vector of the text to be classified into a preset text classification model to obtain a classification result for the text to be classified, wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.
Preferably, the text feature vector further includes a character count feature vector, wherein the character count feature vector includes elements indicating the number of characters in the pre-processed text to be classified that correspond to each preset character category.
Preferably, the character categories include a punctuation character category, a special symbol category, a simplified Chinese character category, and a traditional Chinese character category.
Preferably, the text feature vector extraction module 12 specifically comprises:
a pinyin feature vector extraction unit, configured to obtain a pinyin sequence from the pinyin of each character in the pre-processed text to be classified, and extract the pinyin feature vector from the pinyin sequence;
a character sequence feature vector extraction unit, configured to serialize each character in the pre-processed text to be classified to obtain a character sequence, and extract the character sequence feature vector from the character sequence;
a word sequence feature vector extraction unit, configured to input the pre-processed text to be classified into a preset segmenter to obtain a word sequence, and extract the word sequence feature vector from the word sequence;
a fusion unit, configured to fuse the pinyin feature vector, the character sequence feature vector, and the word sequence feature vector according to a preset fusion mode to obtain the text feature vector of the text to be classified.
Preferably, the segmenter is the jieba segmenter; the pinyin feature vector, the character sequence feature vector, and the word sequence feature vector are all extracted by the TF-IDF technique.
Preferably, the classification result obtaining module 13 specifically comprises:
a calculation unit, configured to input the text feature vector of the text to be classified into the preset text classification model, and calculate the probability value that the text feature vector of the text to be classified belongs to each preset text category;
a classification result determination unit, configured to obtain the maximum probability value among the probability values, and determine the classification result of the text to be classified to be the preset text category corresponding to the maximum probability value.
Preferably, for pre-processing the text to be classified, the pre-processing module 11 specifically comprises:
a stop word detection unit, configured to detect, according to a preset stop word list, whether stop words exist in the text to be classified;
a stop word removal unit, configured to remove the stop words from the text to be classified if stop words exist in the text to be classified.
For pre-processing the text to be classified, the pre-processing module 11 further comprises:
a format detection unit, configured to traverse all characters of the text to be classified and detect whether full-width characters exist;
a conversion unit, configured to convert the full-width characters in the text to be classified into half-width characters if full-width characters exist in the text to be classified.
Document sorting apparatus provided in an embodiment of the present invention obtains text to be sorted by preprocessing module 11, and treats Classifying text is pre-processed;For pretreated text to be sorted, by Text eigenvector extraction module 12 extract to The Text eigenvector of classifying text;Wherein, Text eigenvector includes phonetic feature vector, word sequence signature vector sum word order Column feature vector;Module 13 is obtained by classification results, and the Text eigenvector of text to be sorted is inputted into preset text classification In model, the classification results of text to be sorted are obtained;Wherein, textual classification model is according to preset samples of text and and text The corresponding text categories study of sample.
Document sorting apparatus provided in an embodiment of the present invention includes phonetic feature in the text to be sorted that can be extracted to The Text eigenvector of amount, word sequence signature vector sum word sequence feature vector, increases language of the phonetic granularity to text of text Justice carries out comprehensive descision, and avoiding the text of user's input, there are the Semantic judgements that text is influenced when wrong word, to avoid pair Text carries out wrong classification, realizes more grain size characteristics to analyze text, improves the accuracy of text classification.
It should be noted that the text classification apparatus provided by this embodiment of the present invention is used to execute the steps of the text classification method described in the above embodiments; the working principles and beneficial effects of the two correspond one to one and are therefore not repeated here.
Those skilled in the art will understand that the schematic diagram of the text classification apparatus is merely an example and does not limit the apparatus, which may include more or fewer components than illustrated, combine certain components, or use different components; for example, the text classification apparatus may further include input/output devices, network access devices, buses, and so on.
An embodiment of the present invention further provides an electronic device. Referring to Fig. 3, which is a schematic structural diagram of the electronic device provided by an embodiment of the present invention, the electronic device includes a processor 10, a memory 20, and a computer program stored in the memory and configured to be executed by the processor; when executing the computer program, the processor implements any of the text classification methods provided above.
Specifically, the electronic device may contain one or more processors and one or more memories, and may be a computer, a mobile phone, a tablet, a server, a cloud device, or the like.
The electronic device of this embodiment includes a processor, a memory, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the steps of the text classification method provided by the above embodiments, for example step S1 shown in Fig. 1: obtaining the text to be classified and preprocessing it. Alternatively, when executing the computer program, the processor implements the functions of the modules in the above apparatus embodiments, for example the preprocessing module 11, which is configured to obtain the text to be classified and preprocess it.
Illustratively, the computer program may be divided into one or more modules/units (computer program 1, computer program 2, ... as shown in Fig. 3), which are stored in the memory and executed by the processor to complete the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments being used to describe the execution process of the computer program in the electronic device. For example, the computer program may be divided into a preprocessing module 11, a text feature vector extraction module 12 and a classification result obtaining module 13, whose specific functions are as follows:
The preprocessing module 11 is configured to obtain the text to be classified and preprocess it;
The text feature vector extraction module 12 is configured to extract the text feature vector of the text to be classified according to the preprocessed text to be classified, wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector;
The classification result obtaining module 13 is configured to input the text feature vector of the text to be classified into a preset text classification model to obtain the classification result of the text to be classified, wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.
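The role of the classification result obtaining module 13 can be sketched as follows: the model yields one probability per preset text category, and the category with the maximum probability becomes the classification result. The source does not specify the model type, so the linear softmax model, its weights, and the category names `"spam"` and `"normal"` used below are purely hypothetical stand-ins.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class LinearTextClassifier:
    """Hypothetical stand-in for the preset (already trained) classification model."""

    def __init__(self, categories, weights, biases):
        self.categories = categories         # one name per preset text category
        self.weights = weights               # one weight row per category
        self.biases = biases                 # one bias per category

    def predict_proba(self, features):
        """Probability that the feature vector belongs to each category."""
        scores = [
            sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(self.weights, self.biases)
        ]
        return softmax(scores)


def classify(model, features):
    """Return the category with the maximum probability, and that probability."""
    probs = model.predict_proba(features)
    best = max(range(len(probs)), key=probs.__getitem__)
    return model.categories[best], probs[best]


# Illustrative parameters only; real ones would be learned from labeled text samples.
model = LinearTextClassifier(
    categories=["spam", "normal"],
    weights=[[1.0, 0.0], [0.0, 1.0]],
    biases=[0.0, 0.0],
)
label, probability = classify(model, [2.0, 0.5])
```

In a full pipeline the `features` argument would be the fused text feature vector produced by module 12 for the text preprocessed by module 11.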
The processor may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the electronic device and connects the various parts of the entire electronic device through various interfaces and lines.
The memory may be used to store the computer program and/or modules; the processor implements the various functions of the electronic device by running or executing the computer program and/or modules stored in the memory and by invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created through use of the mobile phone (such as audio data or a phone book). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage device.
If the modules/units integrated in the electronic device are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes of the text classification method provided by the above embodiments may also be completed by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the text classification method provided by any of the above embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content included in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that the above electronic device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the schematic structural diagram of Fig. 3 is merely an example of the above electronic device and does not limit it; the electronic device may include more or fewer components than illustrated, combine certain components, or use different components.
An embodiment of the present invention further provides a computer-readable storage medium, which includes a stored computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium resides is controlled to execute any of the text classification methods provided above.
The above is a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications are also regarded as falling within the protection scope of the present invention.

Claims (10)

1. A text classification method, characterized in that the method comprises:
obtaining a text to be classified, and preprocessing the text to be classified;
extracting a text feature vector of the text to be classified according to the preprocessed text to be classified, wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector;
inputting the text feature vector of the text to be classified into a preset text classification model to obtain a classification result of the text to be classified, wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.
2. The text classification method according to claim 1, characterized in that the text feature vector further includes a character count feature vector, wherein the character count feature vector includes elements indicating the number of characters, in the preprocessed text to be classified, that correspond to each preset character category.
3. The text classification method according to claim 2, characterized in that the character categories include a punctuation character category, a special character category, a simplified Chinese character category and a traditional Chinese character category.
4. The text classification method according to claim 1, characterized in that extracting the text feature vector of the text to be classified according to the preprocessed text to be classified specifically comprises:
obtaining a pinyin sequence according to the pinyin of each character in the preprocessed text to be classified, and extracting the pinyin feature vector from the pinyin sequence;
serializing each character in the preprocessed text to be classified to obtain a character sequence, and extracting the character sequence feature vector from the character sequence;
inputting the preprocessed text to be classified into a preset segmenter to obtain a word sequence, and extracting the word sequence feature vector from the word sequence;
fusing the pinyin feature vector, the character sequence feature vector and the word sequence feature vector in a preset fusion manner to obtain the text feature vector of the text to be classified.
5. The text classification method according to claim 4, characterized in that the segmenter is the Jieba segmenter, and the pinyin feature vector, the character sequence feature vector and the word sequence feature vector are extracted using TF-IDF.
6. The text classification method according to claim 1, characterized in that inputting the text feature vector of the text to be classified into the preset text classification model to obtain the classification result of the text to be classified specifically comprises:
inputting the text feature vector of the text to be classified into the preset text classification model, and calculating the probability value that the text feature vector of the text to be classified belongs to each preset text category;
obtaining the maximum probability value among the probability values, and determining the classification result of the text to be classified to be the preset text category corresponding to the maximum probability value.
7. The text classification method according to claim 1, characterized in that preprocessing the text to be classified specifically comprises:
detecting, according to a preset stop word list, whether the text to be classified contains stop words;
if the text to be classified contains stop words, removing the stop words from the text to be classified;
and preprocessing the text to be classified further comprises:
traversing all characters of the text to be classified, and detecting whether any full-width characters are present;
if the text to be classified contains full-width characters, converting the full-width characters in the text to be classified into half-width characters.
8. A text classification apparatus, characterized in that the apparatus comprises:
a preprocessing module, configured to obtain a text to be classified and preprocess the text to be classified;
a text feature vector extraction module, configured to extract a text feature vector of the text to be classified according to the preprocessed text to be classified, wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector;
a classification result obtaining module, configured to input the text feature vector of the text to be classified into a preset text classification model to obtain a classification result of the text to be classified, wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.
9. An electronic device, characterized by comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the text classification method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a stored computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium resides is controlled to execute the text classification method according to any one of claims 1 to 7.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910519424.1A CN110321557A (en) 2019-06-14 2019-06-14 A kind of file classification method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110321557A true CN110321557A (en) 2019-10-11

Family

ID=68119671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910519424.1A Pending CN110321557A (en) 2019-06-14 2019-06-14 A kind of file classification method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110321557A (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013097327A1 (en) * 2011-12-29 2013-07-04 盈世信息科技(北京)有限公司 Spam filtering method
CN103605694A (en) * 2013-11-04 2014-02-26 北京奇虎科技有限公司 Device and method for detecting similar texts
US20160283582A1 (en) * 2013-11-04 2016-09-29 Beijing Qihoo Technology Company Limited Device and method for detecting similar text, and application
CN107066560A (en) * 2017-03-30 2017-08-18 东软集团股份有限公司 The method and apparatus of text classification
CN107180022A (en) * 2016-03-09 2017-09-19 阿里巴巴集团控股有限公司 object classification method and device
CN107885853A (en) * 2017-11-14 2018-04-06 同济大学 A kind of combined type file classification method based on deep learning
CN108304468A (en) * 2017-12-27 2018-07-20 中国银联股份有限公司 A kind of file classification method and document sorting apparatus
CN109271627A (en) * 2018-09-03 2019-01-25 深圳市腾讯网络信息技术有限公司 Text analyzing method, apparatus, computer equipment and storage medium
US20190095432A1 (en) * 2017-09-26 2019-03-28 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for building text classification model, and text classification method and apparatus
CN109684476A (en) * 2018-12-07 2019-04-26 中科恒运股份有限公司 A kind of file classification method, document sorting apparatus and terminal device
CN109710919A (en) * 2018-11-27 2019-05-03 杭州电子科技大学 A kind of neural network event extraction method merging attention mechanism
CN109726285A (en) * 2018-12-18 2019-05-07 广州多益网络股份有限公司 A kind of file classification method, device, storage medium and terminal device
CN109739986A (en) * 2018-12-28 2019-05-10 合肥工业大学 A kind of complaint short text classification method based on Deep integrating study
CN109783794A (en) * 2017-11-14 2019-05-21 北大方正集团有限公司 File classification method and device
CN109858034A (en) * 2019-02-25 2019-06-07 武汉大学 A kind of text sentiment classification method based on attention model and sentiment dictionary
CN109858039A (en) * 2019-03-01 2019-06-07 北京奇艺世纪科技有限公司 A kind of text information identification method and identification device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084337A (en) * 2020-09-17 2020-12-15 腾讯科技(深圳)有限公司 Training method of text classification model, and text classification method and equipment
CN112084337B (en) * 2020-09-17 2024-02-09 腾讯科技(深圳)有限公司 Training method of text classification model, text classification method and equipment
CN112151008A (en) * 2020-09-22 2020-12-29 中用科技有限公司 Voice synthesis method and system and computer equipment
CN114443840A (en) * 2021-12-27 2022-05-06 天翼云科技有限公司 Text classification method, device and equipment
CN115146619A (en) * 2022-05-12 2022-10-04 恒安嘉新(北京)科技股份公司 Abnormal short message detection method and device, computer equipment and storage medium
CN115146619B (en) * 2022-05-12 2024-10-01 恒安嘉新(北京)科技股份公司 Abnormal short message detection method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191011