CN110321557A - A kind of file classification method, device, electronic equipment and storage medium - Google Patents
- Publication number: CN110321557A
- Application number: CN201910519424.1A
- Authority: CN (China)
- Prior art keywords: text, sorted, classification, eigenvector, preset
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The invention discloses a text classification method, device, electronic equipment and storage medium. The method comprises: obtaining a text to be classified and preprocessing it; extracting a text feature vector of the text to be classified from the preprocessed text, wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector; and inputting the text feature vector into a preset text classification model to obtain a classification result for the text to be classified, wherein the text classification model is learned from preset text samples and the text categories corresponding to those samples. The invention can improve the accuracy of text classification.
Description
Technical field
The present invention relates to the technical field of natural language processing, and in particular to a text classification method, device, electronic equipment and storage medium.
Background art
With the continuous development of science and technology, mankind has entered the era of artificial intelligence, and users often need to interact with intelligent products so that those products can provide services. Text interaction remains one of the main ways in which users interact with intelligent products today: the product recognizes, classifies or otherwise processes the text that the user inputs.
In some application scenarios an intelligent product needs to classify text and provide services according to the classification result. For example, to forbid users from posting abusive speech, the text must be classified to determine whether it is abusive. However, current text classification techniques essentially classify according to the characters and words in the text. Once the text input by the user contains misspelled characters or words, the text is easily misclassified, so classification accuracy is not high.
Summary of the invention
The technical problem to be solved by the embodiments of the present invention is to provide a text classification method, device, electronic equipment and storage medium that can improve the accuracy of text classification.
In a first aspect, an embodiment of the present invention provides a text classification method, the method comprising:
obtaining a text to be classified, and preprocessing the text to be classified;
extracting a text feature vector of the text to be classified from the preprocessed text to be classified, wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector;
inputting the text feature vector of the text to be classified into a preset text classification model to obtain a classification result of the text to be classified, wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.
Preferably, the text feature vector further includes a character quantity feature vector, wherein the character quantity feature vector includes elements indicating the number of characters in the preprocessed text to be classified that correspond to each preset character category.
Preferably, the character categories include a punctuation character category, a special symbol category, a simplified Chinese character category and a traditional Chinese character category.
Preferably, extracting the text feature vector of the text to be classified from the preprocessed text to be classified specifically includes:
obtaining a pinyin sequence from the pinyin of each character in the preprocessed text to be classified, and extracting the pinyin feature vector from the pinyin sequence;
serializing each character in the preprocessed text to be classified to obtain a character sequence, and extracting the character sequence feature vector from the character sequence;
inputting the preprocessed text to be classified into a preset segmenter to obtain a word sequence, and extracting the word sequence feature vector from the word sequence;
fusing the pinyin feature vector, the character sequence feature vector and the word sequence feature vector according to a preset fusion mode to obtain the text feature vector of the text to be classified.
Preferably, the segmenter is the Jieba ("stammerer") segmenter; the pinyin feature vector, the character sequence feature vector and the word sequence feature vector are all extracted with the TF-IDF technique.
Preferably, inputting the text feature vector of the text to be classified into the preset text classification model to obtain the classification result of the text to be classified specifically includes:
inputting the text feature vector of the text to be classified into the preset text classification model, and calculating the probability value that the text feature vector belongs to each preset text category;
obtaining the maximum probability value among the probability values, and determining the classification result of the text to be classified to be the preset text category corresponding to the maximum probability value.
Preferably, preprocessing the text to be classified specifically includes:
detecting, according to a preset stop word list, whether stop words are present in the text to be classified;
if stop words are present in the text to be classified, removing the stop words from the text to be classified.
Preprocessing the text to be classified further includes:
traversing all characters of the text to be classified and detecting whether full-width characters are present;
if full-width characters are present in the text to be classified, converting the full-width characters in the text to be classified to half-width characters.
In a second aspect, an embodiment of the present invention further provides a text classification device, the device comprising:
a preprocessing module, configured to obtain a text to be classified and preprocess the text to be classified;
a text feature vector extraction module, configured to extract a text feature vector of the text to be classified from the preprocessed text to be classified, wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector;
a classification result obtaining module, configured to input the text feature vector of the text to be classified into a preset text classification model and obtain a classification result of the text to be classified, wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.
In a third aspect, an embodiment of the present invention further provides electronic equipment comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements the text classification method according to any one of the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium comprising a stored computer program, wherein, when the computer program runs, the equipment on which the computer-readable storage medium is located is controlled to execute the text classification method according to any one of the first aspect.
In the text classification method, device, electronic equipment and storage medium provided above, the text feature vector extracted from the text to be classified includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector. Adding the pinyin granularity of the text allows the semantics of the text to be judged comprehensively, which avoids misclassifying the text when the user's input contains wrong characters and improves the accuracy of text classification.
Brief description of the drawings
Fig. 1 is a flow diagram of a preferred embodiment of the text classification method provided in an embodiment of the present invention;
Fig. 2 is a structural schematic diagram of a preferred embodiment of the text classification device provided in an embodiment of the present invention;
Fig. 3 is a structural schematic diagram of the electronic equipment provided in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The text classification method provided in the embodiment of the present invention is suitable for scenarios in which text is classified according to predetermined category attributes. For example, to forbid merchants from selling prohibited products on an e-commerce platform, or to forbid the publication of text containing sensitive words, text must be classified. As another example, to forbid users from publishing abusive text on a home page and to maintain a good network environment, text must also be classified. In the embodiments of the present invention the implementation is illustrated by classifying text into two categories, sensitive word text and non-sensitive word text, but the specific text categories of the invention are not limited to these.
An embodiment of the present invention provides a text classification method. Referring to Fig. 1, Fig. 1 is a flow diagram of a preferred embodiment of the text classification method provided in an embodiment of the present invention. Specifically, the method comprises:
S1, obtaining a text to be classified, and preprocessing the text to be classified;
S2, extracting a text feature vector of the text to be classified from the preprocessed text to be classified, wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector;
S3, inputting the text feature vector of the text to be classified into a preset text classification model to obtain a classification result of the text to be classified, wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.
Specifically, the user may enter the text to be classified through a handwriting input device, a keyboard input device or the like, and may enter wrong characters unintentionally, or deliberately enter homophones to evade sensitive words. For example, only beauty products that have obtained national certification for an anti-freckle effect may state the "anti-freckle" effect in product publicity, so some merchants may deliberately write it as "qu classes", "removing & spot" or some other wording whose meaning consumers can still clearly understand, in order to evade the system's sensitive word detection. Take the user input "removing & spot" as an example. After the text is entered, the text to be classified is obtained and preprocessed to reduce its noise and refine its key information; the preprocessed text is, for example, "despeckle". The text feature vector of the text to be classified is then extracted from the preprocessed text; it includes the pinyin feature vector, the character sequence feature vector and the word sequence feature vector. The text feature vector is input into the preset text classification model, which calculates and analyses it. The model performs semantic analysis not only at character granularity ("go", "spot") and word granularity ("go" and "spot") but also at the pinyin granularity of the text, where "despeckle" can be analysed semantically from its pinyin. The classification result obtained is "sensitive word text", so that corresponding measures can be taken as required, such as preventing the user from uploading the text containing the sensitive word, or requiring the user to submit proof of national certification before the publicity text can be published.
When the text input by the user contains wrong characters, the word-granularity analysis of the text is affected: "despeckle" cannot be analysed as a single word, and at word granularity the text can only be analysed as the word "go" and the word "spot". Classifying the text according to prior-art schemes therefore departs from the original meaning of the text and easily causes misclassification. The text classification method provided in the embodiment of the present invention, however, can also analyse "despeckle" at pinyin granularity and obtain the classification result of the text.
In the text classification method provided in the embodiment of the present invention, the text feature vector extracted from the text to be classified includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector. Adding the pinyin granularity of the text allows the semantics of the text to be judged comprehensively, so that wrong characters in the user's input do not distort the semantic judgment and the text is not misclassified. The text is thus analysed with multi-granularity features, which improves the accuracy of text classification.
It should be noted that, in order to increase the granularity of text analysis, the text feature vector may also include other feature vectors such as a stroke feature vector, a phrase feature vector or a synonym feature vector. The text to be classified is then analysed at more granularities and its text category is judged comprehensively, giving a more accurate classification result.
Preferably, the text feature vector further includes a character quantity feature vector, wherein the character quantity feature vector includes elements indicating the number of characters in the preprocessed text to be classified that correspond to each preset character category.
Specifically, when the text feature vector includes the character quantity feature vector, the granularity at which the text to be classified is analysed is increased further. The categories of characters, and the number of characters in each category, differ to some extent between texts of different text categories. For example, for the text "this product dispels * * * spot effect!", adding the character quantity feature vector makes it possible to determine that this text is more likely to be sensitive word text.
In the text classification method provided in the embodiment of the present invention, the text feature vector further includes the character quantity feature vector, which can further improve the accuracy of text classification.
Optionally, the character quantity feature vector further includes an element indicating the total number of characters in the text to be classified.
Optionally, the preset character categories can be set according to the actual application, and may include any one or any combination of categories such as a punctuation character category, a special symbol category, a simplified Chinese character category, a traditional Chinese character category, an emoticon category and an English character category.
Preferably, the character categories include a punctuation character category, a special symbol category, a simplified Chinese character category and a traditional Chinese character category.
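As an illustration only, a character quantity feature vector of the kind described above might be computed as in the following Python sketch. The category names, the use of Unicode general categories to tell punctuation from special symbols, and the tiny `TRADITIONAL` lookup table are all assumptions made for this example; the patent does not specify how characters are assigned to categories.

```python
import unicodedata

# Hypothetical toy table: a real system would need a full
# simplified/traditional character mapping.
TRADITIONAL = set("這產")
CATEGORIES = ("punctuation", "special", "simplified", "traditional")


def char_category(ch):
    """Return the preset category of a character, or None if it has none."""
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
        # Assumption: any Chinese character not in the table counts as simplified.
        return "traditional" if ch in TRADITIONAL else "simplified"
    general = unicodedata.category(ch)
    if general.startswith("P"):  # Unicode punctuation, e.g. "!" or "*"
        return "punctuation"
    if general.startswith("S"):  # Unicode symbols, e.g. "$" or "+"
        return "special"
    return None


def char_count_vector(text):
    """One count per preset category, followed by the total character count."""
    counts = dict.fromkeys(CATEGORIES, 0)
    for ch in text:
        category = char_category(ch)
        if category is not None:
            counts[category] += 1
    return [counts[name] for name in CATEGORIES] + [len(text)]
```

The last element corresponds to the optional total character count; dropping it leaves one element per preset category.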
Preferably, extracting the text feature vector of the text to be classified from the preprocessed text to be classified specifically includes:
obtaining a pinyin sequence from the pinyin of each character in the preprocessed text to be classified, and extracting the pinyin feature vector from the pinyin sequence;
serializing each character in the preprocessed text to be classified to obtain a character sequence, and extracting the character sequence feature vector from the character sequence;
inputting the preprocessed text to be classified into a preset segmenter to obtain a word sequence, and extracting the word sequence feature vector from the word sequence;
fusing the pinyin feature vector, the character sequence feature vector and the word sequence feature vector according to a preset fusion mode to obtain the text feature vector of the text to be classified.
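The three sequences named in the steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `PINYIN` table, the `VOCAB` dictionary and the forward-maximum-matching segmenter are hypothetical stand-ins for a real pinyin library and a real segmenter, and the sample characters are chosen for this example rather than taken from the patent.

```python
# Hypothetical toy data covering only this example.
PINYIN = {"去": "qu", "祛": "qu", "斑": "ban"}  # character -> toneless pinyin
VOCAB = {"祛斑"}                                 # words known to the toy segmenter


def pinyin_sequence(text):
    """Map each character to its pinyin; unknown characters pass through."""
    return [PINYIN.get(ch, ch) for ch in text]


def char_sequence(text):
    """Serialize the text into a sequence of single characters."""
    return list(text)


def word_sequence(text, vocab=VOCAB, max_len=4):
    """Toy segmenter: forward maximum matching against a small dictionary."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words
```

With this toy data, a correctly written word and a homophone misspelling of it produce different character and word sequences but the same pinyin sequence, which is the property the pinyin granularity relies on.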
Here, the fusion mode refers to the way in which the feature vectors of the individual granularities are fused into the text feature vector. For example, the feature vectors of the granularities are arranged side by side in a preset order to obtain the overall text feature vector, or the feature vectors of the granularities are added together and the sum is used as the text feature vector. A segmenter is a tool that splits the input text into units that follow a certain linguistic logic, for example the Paoding ("cook's") segmenter, the Jieba ("stammerer") segmenter or the "easy" segmenter; any segmenter may be used as long as it is applicable to the technical solution of the present invention.
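The two fusion modes mentioned above, side-by-side arrangement and addition, can be sketched in a few lines of Python. The function names are hypothetical, and the requirement that added vectors share one length is an assumption of this sketch rather than something the patent states.

```python
def fuse_concat(vectors):
    """Arrange the per-granularity vectors side by side in the given order."""
    fused = []
    for vector in vectors:
        fused.extend(vector)
    return fused


def fuse_add(vectors):
    """Add the per-granularity vectors element by element."""
    length = len(vectors[0])
    if any(len(vector) != length for vector in vectors):
        raise ValueError("element-wise addition needs vectors of equal length")
    return [sum(values) for values in zip(*vectors)]
```

Concatenation keeps every granularity's features separate at the cost of a longer vector, while addition keeps the vector short but only makes sense when the granularities share a feature space.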
Preferably, the segmenter is the Jieba ("stammerer") segmenter; the pinyin feature vector, the character sequence feature vector and the word sequence feature vector are all extracted with the TF-IDF technique.
Specifically, the Jieba segmenter can divide the text into words quickly and accurately, and the TF-IDF technique can extract the key information in the text, thereby realizing the extraction of the pinyin feature vector, the character sequence feature vector and the word sequence feature vector.
It should be noted that if the text feature vector also includes other feature vectors, these can likewise be extracted with the TF-IDF technique as long as the features are suitable for TF-IDF extraction.
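A minimal pure-Python sketch of TF-IDF extraction over token sequences is given below; the same functions could be applied to a pinyin sequence, a character sequence or a word sequence. The smoothed IDF formula and the function names are assumptions of this sketch, since the patent does not state which TF-IDF variant is used.

```python
import math
from collections import Counter


def fit_idf(corpus):
    """Learn a vocabulary and smoothed IDF weights from tokenized documents."""
    n_docs = len(corpus)
    doc_freq = Counter(token for doc in corpus for token in set(doc))
    vocab = sorted(doc_freq)
    # Assumed smoothing: log((1 + N) / (1 + df)) + 1, so no weight is zero.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    return vocab, idf


def tfidf_vector(tokens, vocab, idf):
    """Term frequency times IDF for each vocabulary entry, in vocabulary order."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty input
    return [counts[t] / total * idf[t] for t in vocab]
```

Tokens that occur in few training documents receive larger weights than tokens that occur in many, which is how the key information in a text is emphasized.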
Preferably, inputting the text feature vector of the text to be classified into the preset text classification model to obtain the classification result of the text to be classified specifically includes:
inputting the text feature vector of the text to be classified into the preset text classification model, and calculating the probability value that the text feature vector belongs to each preset text category;
obtaining the maximum probability value among the probability values, and determining the classification result of the text to be classified to be the preset text category corresponding to the maximum probability value.
Specifically, the text classification method provided in the embodiment of the present invention determines the classification result of the text to be classified on the basis of the maximum probability. After the text feature vector is input into the preset text classification model, an output result is obtained that contains the probability value that the text to be classified belongs to each text category. For example, the output is a one-dimensional vector: if the text categories are [sensitive word text category, non-sensitive word text category] and the output is [90%, 10%], the maximum probability value 90% is obtained, and the sensitive word text category corresponding to it is taken as the classification result of the text to be classified.
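The maximum-probability decision in this example can be written as a short Python sketch. The function name `pick_category` is hypothetical, and the probabilities are assumed to have already been produced by some trained model.

```python
def pick_category(probabilities, categories):
    """Return the category with the largest probability, and that probability."""
    if len(probabilities) != len(categories):
        raise ValueError("one probability is needed per category")
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return categories[best], probabilities[best]


# The two-category example from the text: [90%, 10%].
label, prob = pick_category(
    [0.9, 0.1], ["sensitive word text", "non-sensitive word text"]
)
```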
Preferably, preprocessing the text to be classified specifically includes:
detecting, according to a preset stop word list, whether stop words are present in the text to be classified;
if stop words are present in the text to be classified, removing the stop words from the text to be classified.
Preprocessing the text to be classified further includes:
traversing all characters of the text to be classified and detecting whether full-width characters are present;
if full-width characters are present in the text to be classified, converting the full-width characters in the text to be classified to half-width characters.
Here, stop words are words with no substantive meaning, usually function words such as conjunctions, modal particles and prepositions. The stop word list (also called a stop word dictionary) contains a number of stop words.
Specifically, removing stop words from the text to be classified shortens the text and reduces the noise in text analysis, which helps to refine the features of the text accurately and improves the accuracy of text classification. Converting the full-width characters of the text to be classified into half-width characters reduces the hardware resources needed to store the text. It should be noted that this conversion applies only to characters that can be converted, such as English letters, digits and symbols; Chinese characters, although they are double-byte characters, cannot be converted.
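The two preprocessing operations can be sketched in Python as follows. The sketch assumes the usual Unicode layout, in which full-width ASCII variants occupy U+FF01 to U+FF5E at a fixed offset of 0xFEE0 from their half-width forms and the ideographic space is U+3000. It also assumes that stop words are removed from an already tokenized text; the function names are hypothetical.

```python
def to_half_width(text):
    """Convert full-width letters, digits and symbols to half-width forms."""
    converted = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:               # ideographic (full-width) space
            converted.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:   # full-width ASCII variants
            converted.append(chr(code - 0xFEE0))
        else:                            # Chinese characters etc. stay unchanged
            converted.append(ch)
    return "".join(converted)


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the preset stop word list."""
    return [token for token in tokens if token not in stop_words]
```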
In a specific implementation, after the user inputs the text to be classified, the text to be classified is obtained and preprocessed. The text feature vector of the text to be classified is extracted from the preprocessed text, wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector, so that the text is analysed comprehensively through features of multiple granularities. The text feature vector of the text to be classified is input into the preset text classification model to obtain the classification result of the text to be classified, wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.
In the text classification method provided in the embodiment of the present invention, the text feature vector extracted from the text to be classified includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector. Adding the pinyin granularity of the text allows the semantics of the text to be judged comprehensively, so that wrong characters in the user's input do not distort the semantic judgment and the text is not misclassified. The text is thus analysed with multi-granularity features, which improves the accuracy of text classification.
An embodiment of the present invention further provides a text classification device. Referring to Fig. 2, Fig. 2 is a structural schematic diagram of a preferred embodiment of the text classification device provided in an embodiment of the present invention. Specifically, the device comprises:
a preprocessing module 11, configured to obtain a text to be classified and preprocess the text to be classified;
a text feature vector extraction module 12, configured to extract a text feature vector of the text to be classified from the preprocessed text to be classified, wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector;
a classification result obtaining module 13, configured to input the text feature vector of the text to be classified into a preset text classification model and obtain a classification result of the text to be classified, wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.
Preferably, the text feature vector further includes a character quantity feature vector, wherein the character quantity feature vector includes elements indicating the number of characters in the preprocessed text to be classified that correspond to each preset character category.
Preferably, the character categories include a punctuation character category, a special symbol category, a simplified Chinese character category and a traditional Chinese character category.
Preferably, the text feature vector extraction module 12 specifically includes:
a pinyin feature vector extraction unit, configured to obtain a pinyin sequence from the pinyin of each character in the preprocessed text to be classified, and to extract the pinyin feature vector from the pinyin sequence;
a character sequence feature vector extraction unit, configured to serialize each character in the preprocessed text to be classified to obtain a character sequence, and to extract the character sequence feature vector from the character sequence;
a word sequence feature vector extraction unit, configured to input the preprocessed text to be classified into a preset segmenter to obtain a word sequence, and to extract the word sequence feature vector from the word sequence;
a fusion unit, configured to fuse the pinyin feature vector, the character sequence feature vector and the word sequence feature vector according to a preset fusion mode to obtain the text feature vector of the text to be classified.
Preferably, the segmenter is the Jieba ("stammerer") segmenter; the pinyin feature vector, the character sequence feature vector and the word sequence feature vector are all extracted with the TF-IDF technique.
Preferably, the classification result obtaining module 13 specifically includes:
a calculation unit, configured to input the text feature vector of the text to be classified into the preset text classification model, and to calculate the probability value that the text feature vector belongs to each preset text category;
a classification result determination unit, configured to obtain the maximum probability value among the probability values, and to determine the classification result of the text to be classified to be the preset text category corresponding to the maximum probability value.
Preferably, for preprocessing the text to be classified, the preprocessing module 11 specifically includes:
a stop word detection unit, configured to detect, according to a preset stop word list, whether stop words are present in the text to be classified;
a stop word removal unit, configured to remove the stop words from the text to be classified if stop words are present in it.
For preprocessing the text to be classified, the preprocessing module 11 further includes:
a format detection unit, configured to traverse all characters of the text to be classified and detect whether full-width characters are present;
a conversion unit, configured to convert the full-width characters in the text to be classified to half-width characters if full-width characters are present in it.
The text classification device provided in the embodiment of the present invention obtains the text to be classified and preprocesses it through the preprocessing module 11. For the preprocessed text to be classified, the text feature vector extraction module 12 extracts the text feature vector, which includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector. The classification result obtaining module 13 inputs the text feature vector of the text to be classified into the preset text classification model and obtains the classification result of the text to be classified, wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.
The text classification apparatus provided by the embodiment of the present invention can extract, from the text to be classified, a text feature vector that includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector. Adding the pinyin granularity of the text to the comprehensive judgment of its semantics keeps wrongly written characters in the user's input from distorting that judgment and so avoids misclassifying the text. The text is thus analyzed with multi-granularity features, which improves the accuracy of text classification.
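The benefit of the pinyin granularity can be illustrated with homophone typos: two texts that differ in their characters can still share one pinyin sequence. The sketch below uses a tiny hand-written lookup table, `TOY_PINYIN`, which is hypothetical and covers only the characters in the example; a real system would use a full pinyin dictionary or library.

```python
# Hypothetical toy lookup (toneless pinyin); not a complete dictionary.
TOY_PINYIN = {
    "账": "zhang", "帐": "zhang",
    "号": "hao",
    "登": "deng",
    "录": "lu", "陆": "lu",
}

def pinyin_sequence(text, table=TOY_PINYIN):
    """Map each character to its pinyin; unknown characters pass through unchanged."""
    return [table.get(ch, ch) for ch in text]

correct = "账号登录"
typo = "帐号登陆"   # same pronunciation, two wrongly written characters

same_pinyin = pinyin_sequence(correct) == pinyin_sequence(typo)
same_chars = list(correct) == list(typo)
```

Here the character sequences differ while the pinyin sequences match, so a pinyin-level feature vector is unaffected by this kind of typo.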
It should be noted that the text classification apparatus provided by the embodiment of the present invention is used to execute the steps of the text classification method described in the above embodiments; the working principles and beneficial effects of the two correspond one to one and are therefore not repeated here.
Those skilled in the art will understand that the schematic diagram of the text classification apparatus is merely an example of the apparatus and does not limit it; the apparatus may include more or fewer components than illustrated, may combine certain components, or may use different components. For example, the text classification apparatus may also include input/output devices, network access devices, buses and the like.
The embodiment of the present invention also provides an electronic device. Referring to Fig. 3, which is a structural schematic diagram of the electronic device provided by the embodiment of the present invention, the electronic device includes a processor 10, a memory 20 and a computer program stored in the memory and configured to be executed by the processor; when executing the computer program, the processor implements any of the text classification methods provided above.
Specifically, the electronic device may have one or more processors and one or more memories, and the electronic device may be a computer, a mobile phone, a tablet, a server, a cloud device or the like.
The electronic device of this embodiment includes a processor, a memory and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements the steps of the text classification method provided by the above embodiments, for example step S1 shown in Fig. 1: obtaining the text to be classified and pre-processing it. Alternatively, when executing the computer program, the processor implements the functions of the modules in the above apparatus embodiments, for example the preprocessing module 11, which obtains the text to be classified and pre-processes it.
Illustratively, the computer program may be divided into one or more modules/units (computer program 1, computer program 2, ... as shown in Fig. 3), which are stored in the memory and executed by the processor to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program in the electronic device. For example, the computer program may be divided into a preprocessing module 11, a text feature vector extraction module 12 and a classification result obtaining module 13, whose specific functions are as follows:
The preprocessing module 11, configured to obtain the text to be classified and pre-process it;
The text feature vector extraction module 12, configured to extract the text feature vector of the text to be classified from the pre-processed text to be classified, where the text feature vector includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector;
The classification result obtaining module 13, configured to input the text feature vector of the text to be classified into the preset text classification model to obtain the classification result of the text to be classified, where the text classification model is learned from preset text samples and the text categories corresponding to those text samples.
The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the electronic device and connects the various parts of the entire electronic device through various interfaces and lines.
The memory may be used to store the computer program and/or modules; the processor implements the various functions of the electronic device by running or executing the computer program and/or modules stored in the memory and by calling data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created through use of the mobile phone (such as audio data or a phone book). In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device or another volatile solid-state storage device.
If the modules/units integrated in the electronic device are implemented as software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes of the text classification method provided by the above embodiments may also be completed by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the text classification method provided by any of the above embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, certain intermediate forms, or the like. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium and the like. It should be noted that the content covered by the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in certain jurisdictions, under legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
It should be noted that the above electronic device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the structural schematic diagram of Fig. 3 is merely an example of the electronic device and does not limit it; the electronic device may include more or fewer components than illustrated, may combine certain components, or may use different components.
The embodiment of the present invention also provides a computer-readable storage medium that includes a stored computer program, where, when the computer program runs, the device on which the computer-readable storage medium resides is controlled to execute any of the text classification methods provided above.
The above is a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also regarded as falling within the protection scope of the present invention.
Claims (10)
1. A text classification method, characterized in that the method comprises:
Obtaining a text to be classified, and pre-processing the text to be classified;
Extracting a text feature vector of the text to be classified from the pre-processed text to be classified, wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector;
Inputting the text feature vector of the text to be classified into a preset text classification model to obtain a classification result of the text to be classified, wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.
2. The text classification method according to claim 1, characterized in that the text feature vector further includes a character count feature vector, wherein the character count feature vector includes elements indicating the number of characters in the pre-processed text to be classified that correspond to each preset character category.
3. The text classification method according to claim 2, characterized in that the character categories include a punctuation character category, a special character category, a simplified Chinese character category and a traditional Chinese character category.
4. The text classification method according to claim 1, characterized in that extracting the text feature vector of the text to be classified from the pre-processed text to be classified specifically comprises:
Obtaining a pinyin sequence from the pinyin of each character in the pre-processed text to be classified, and extracting the pinyin feature vector from the pinyin sequence;
Serializing each character in the pre-processed text to be classified to obtain a character sequence, and extracting the character sequence feature vector from the character sequence;
Inputting the pre-processed text to be classified into a preset segmenter to obtain a word sequence, and extracting the word sequence feature vector from the word sequence;
Fusing the pinyin feature vector, the character sequence feature vector and the word sequence feature vector according to a preset fusion mode to obtain the text feature vector of the text to be classified.
5. The text classification method according to claim 4, characterized in that the segmenter is the jieba segmenter, and the pinyin feature vector, the character sequence feature vector and the word sequence feature vector are each extracted using the TF-IDF technique.
6. The text classification method according to claim 1, characterized in that inputting the text feature vector of the text to be classified into the preset text classification model to obtain the classification result of the text to be classified specifically comprises:
Inputting the text feature vector of the text to be classified into the preset text classification model, and calculating, for each preset text category, the probability that the text feature vector of the text to be classified belongs to that category;
Obtaining the maximum of the probability values, and determining, as the classification result of the text to be classified, the preset text category corresponding to that maximum probability value.
7. The text classification method according to claim 1, characterized in that pre-processing the text to be classified specifically comprises:
Detecting, according to a preset stop word list, whether stop words are present in the text to be classified;
If stop words are present in the text to be classified, removing the stop words from the text to be classified;
Pre-processing the text to be classified further comprises:
Traversing all characters of the text to be classified, and detecting whether any character is in full-width format;
If full-width characters are present in the text to be classified, converting the full-width characters in the text to be classified into half-width characters.
8. A text classification apparatus, characterized in that the apparatus comprises:
A preprocessing module, configured to obtain a text to be classified and pre-process the text to be classified;
A text feature vector extraction module, configured to extract a text feature vector of the text to be classified from the pre-processed text to be classified, wherein the text feature vector includes a pinyin feature vector, a character sequence feature vector and a word sequence feature vector;
A classification result obtaining module, configured to input the text feature vector of the text to be classified into a preset text classification model to obtain a classification result of the text to be classified, wherein the text classification model is learned from preset text samples and the text categories corresponding to the text samples.
9. An electronic device, characterized by comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the text classification method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a stored computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium resides is controlled to execute the text classification method according to any one of claims 1 to 7.
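As an illustration of the character count feature vector recited in claims 2 and 3, the following sketch counts characters in four categories. The claims do not say how each category is detected, so the use of Unicode general categories for punctuation ("P*") and special characters ("S*") is an assumption, and the sets `TOY_SIMPLIFIED` and `TOY_TRADITIONAL` are hypothetical stand-ins for a real simplified/traditional mapping table.

```python
import unicodedata

# Hypothetical toy sets covering only the example characters.
TOY_SIMPLIFIED = set("账号录")
TOY_TRADITIONAL = set("賬號錄")

def char_count_vector(text):
    """Return counts as [punctuation, special, simplified, traditional]."""
    counts = {"punctuation": 0, "special": 0, "simplified": 0, "traditional": 0}
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("P"):       # Unicode punctuation classes
            counts["punctuation"] += 1
        elif category.startswith("S"):     # Unicode symbol classes
            counts["special"] += 1
        elif ch in TOY_SIMPLIFIED:
            counts["simplified"] += 1
        elif ch in TOY_TRADITIONAL:
            counts["traditional"] += 1
    return [counts["punctuation"], counts["special"],
            counts["simplified"], counts["traditional"]]

vector = char_count_vector("账号$$!!賬")
```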
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910519424.1A CN110321557A (en) | 2019-06-14 | 2019-06-14 | A kind of file classification method, device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110321557A (en) | 2019-10-11 |
Family
ID=68119671
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910519424.1A Pending CN110321557A (en) | 2019-06-14 | 2019-06-14 | A kind of file classification method, device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110321557A (en) |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013097327A1 (en) * | 2011-12-29 | 2013-07-04 | 盈世信息科技(北京)有限公司 | Spam filtering method |
CN103605694A (en) * | 2013-11-04 | 2014-02-26 | 北京奇虎科技有限公司 | Device and method for detecting similar texts |
US20160283582A1 (en) * | 2013-11-04 | 2016-09-29 | Beijing Qihoo Technology Company Limited | Device and method for detecting similar text, and application |
CN107066560A (en) * | 2017-03-30 | 2017-08-18 | 东软集团股份有限公司 | The method and apparatus of text classification |
CN107180022A (en) * | 2016-03-09 | 2017-09-19 | 阿里巴巴集团控股有限公司 | object classification method and device |
CN107885853A (en) * | 2017-11-14 | 2018-04-06 | 同济大学 | A kind of combined type file classification method based on deep learning |
CN108304468A (en) * | 2017-12-27 | 2018-07-20 | 中国银联股份有限公司 | A kind of file classification method and document sorting apparatus |
CN109271627A (en) * | 2018-09-03 | 2019-01-25 | 深圳市腾讯网络信息技术有限公司 | Text analyzing method, apparatus, computer equipment and storage medium |
US20190095432A1 (en) * | 2017-09-26 | 2019-03-28 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for building text classification model, and text classification method and apparatus |
CN109684476A (en) * | 2018-12-07 | 2019-04-26 | 中科恒运股份有限公司 | A kind of file classification method, document sorting apparatus and terminal device |
CN109710919A (en) * | 2018-11-27 | 2019-05-03 | 杭州电子科技大学 | A kind of neural network event extraction method merging attention mechanism |
CN109726285A (en) * | 2018-12-18 | 2019-05-07 | 广州多益网络股份有限公司 | A kind of file classification method, device, storage medium and terminal device |
CN109739986A (en) * | 2018-12-28 | 2019-05-10 | 合肥工业大学 | A kind of complaint short text classification method based on Deep integrating study |
CN109783794A (en) * | 2017-11-14 | 2019-05-21 | 北大方正集团有限公司 | File classification method and device |
CN109858034A (en) * | 2019-02-25 | 2019-06-07 | 武汉大学 | A kind of text sentiment classification method based on attention model and sentiment dictionary |
CN109858039A (en) * | 2019-03-01 | 2019-06-07 | 北京奇艺世纪科技有限公司 | A kind of text information identification method and identification device |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112084337A (en) * | 2020-09-17 | 2020-12-15 | 腾讯科技(深圳)有限公司 | Training method of text classification model, and text classification method and equipment |
CN112084337B (en) * | 2020-09-17 | 2024-02-09 | 腾讯科技(深圳)有限公司 | Training method of text classification model, text classification method and equipment |
CN112151008A (en) * | 2020-09-22 | 2020-12-29 | 中用科技有限公司 | Voice synthesis method and system and computer equipment |
CN114443840A (en) * | 2021-12-27 | 2022-05-06 | 天翼云科技有限公司 | Text classification method, device and equipment |
CN115146619A (en) * | 2022-05-12 | 2022-10-04 | 恒安嘉新(北京)科技股份公司 | Abnormal short message detection method and device, computer equipment and storage medium |
CN115146619B (en) * | 2022-05-12 | 2024-10-01 | 恒安嘉新(北京)科技股份公司 | Abnormal short message detection method, device, computer equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110444198B (en) | Retrieval method, retrieval device, computer equipment and storage medium | |
CN110321557A (en) | A kind of file classification method, device, electronic equipment and storage medium | |
CN109685056B (en) | Method and device for acquiring document information | |
CN112784578B (en) | Legal element extraction method and device and electronic equipment | |
CN107491435B (en) | Method and device for automatically identifying user emotion based on computer | |
CN111783471B (en) | Semantic recognition method, device, equipment and storage medium for natural language | |
Ketcham et al. | Segmentation of overlapping Isan Dhamma character on palm leaf manuscript’s with neural network | |
CN110610003B (en) | Method and system for assisting text annotation | |
US20230073602A1 (en) | System of and method for automatically detecting sarcasm of a batch of text | |
CN113722492A (en) | Intention identification method and device | |
Sheshikala et al. | Natural language processing and machine learning classifier used for detecting the author of the sentence | |
CN115544240B (en) | Text sensitive information identification method and device, electronic equipment and storage medium | |
US11295175B1 (en) | Automatic document separation | |
CN109189965A (en) | Pictograph search method and system | |
CN112464927B (en) | Information extraction method, device and system | |
CN112052424B (en) | Content auditing method and device | |
ALBayari et al. | Cyberbullying classification methods for Arabic: A systematic review | |
CN113761377A (en) | Attention mechanism multi-feature fusion-based false information detection method and device, electronic equipment and storage medium | |
CN115840808A (en) | Scientific and technological project consultation method, device, server and computer-readable storage medium | |
CN112528653A (en) | Short text entity identification method and system | |
CN114218945A (en) | Entity identification method, device, server and storage medium | |
CN109101487A (en) | Conversational character differentiating method, device, terminal device and storage medium | |
CN116842515A (en) | Source code classification model robustness enhancement method, system and processor | |
CN117278675A (en) | Outbound method, device, equipment and medium based on intention classification | |
Yasin et al. | Transformer-Based Neural Machine Translation for Post-OCR Error Correction in Cursive Text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191011 |