CN107894980A - A multilingual sentence-pair text corpus classification method and classifier - Google Patents
A multilingual sentence-pair text corpus classification method and classifier
- Publication number
- CN107894980A CN107894980A CN201711276465.XA CN201711276465A CN107894980A CN 107894980 A CN107894980 A CN 107894980A CN 201711276465 A CN201711276465 A CN 201711276465A CN 107894980 A CN107894980 A CN 107894980A
- Authority
- CN
- China
- Prior art keywords
- languages
- word
- convolutional neural
- chinese
- english
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Probability & Statistics with Applications (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a multilingual sentence-pair text corpus classification method and classifier. The classification method includes: inputting data to be predicted and separating it by language; randomly shuffling the words of each language; feeding the shuffled words of each language into the convolutional neural network model of the corresponding language and predicting to obtain a prediction result; cross-validating the prediction results of the multiple languages; and outputting a final judgment according to the cross-validation result. The classification method of the present invention improves classification accuracy.
Description
Technical field
The present invention relates to the field of information technology, and in particular to a multilingual sentence-pair text corpus classification method and classifier.
Background art
A text corpus is a basic resource that carries linguistic knowledge on an electronic computer. Complete corpora are used for language-model construction, lexicography, text classification, and so on. Text classification is the automatic labeling, by computer, of a set of texts (or other entities or objects) according to a certain taxonomy or standard.
In essence, the text classification problem is no different from other classification problems: the method matches certain features of the data to be classified. A perfect match is of course impossible, so an optimal match must be selected (according to some evaluation criterion) in order to complete the classification.
However, current classification methods struggle to achieve accurate classification. Some existing platforms hold a large number of bilingual sentence pairs, most of which carry no category label; even among the labeled pairs, accurately labeled sentences make up only a relatively small fraction. Yet data retrieval, content distribution, routing, and the like on these platforms all depend on accurate corpus category labels. To make better use of such corpus platforms, a practical classification method with high classification accuracy is therefore needed.
It should be noted that the above introduction to the technical background is intended only to facilitate a clear and complete explanation of the technical solution of the present application and to aid the understanding of those skilled in the art. The above technical solutions should not be regarded as known to those skilled in the art merely because they are set forth in the background section of this application.
Summary of the invention
In view of the above drawbacks of the prior art, the technical problem to be solved by the present invention is to provide a multilingual sentence-pair text corpus classification method and classifier that improve classification accuracy.
To achieve the above object, the invention provides a multilingual sentence-pair text corpus classification method, including:
inputting data to be predicted and separating it by language;
randomly shuffling the words of each language;
feeding the shuffled words of each language into the convolutional neural network model of the corresponding language and predicting to obtain a prediction result;
cross-validating the prediction results of the multiple languages;
outputting a final judgment according to the cross-validation result.
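The five steps above can be sketched end to end. The sketch below is illustrative only: the helper name `classify_sentence_pair` and the toy lambda "models" are assumptions standing in for the trained per-language CNN models the method describes, and the strict-agreement voting rule is one possible reading of the cross-validation step.

```python
import random

def classify_sentence_pair(pair, models, n_shuffles=5):
    """Illustrative pipeline: separate by language, shuffle, predict per
    language, then cross-validate the collected predictions."""
    votes = []
    for lang, words in pair.items():          # step 1: language separation
        model = models[lang]                  # stand-in for a per-language CNN
        for _ in range(n_shuffles):
            shuffled = words[:]               # step 2: random word shuffle
            random.shuffle(shuffled)
            votes.append(model(shuffled))     # step 3: per-language prediction
    # steps 4-5: cross-validate; unanimous agreement yields that label,
    # anything else is treated as unreliable
    return votes[0] if len(set(votes)) == 1 else "unknown"

# toy stand-ins for trained CNN models
models = {"zh": lambda ws: "medicine", "en": lambda ws: "medicine"}
pair = {"zh": ["药", "物"], "en": ["drug", "therapy"]}
print(classify_sentence_pair(pair, models))  # -> medicine
```

If the Chinese and English stand-in models disagreed, the same call would return "unknown", mirroring the "unclassifiable" outcome described later in the specification.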
Further, after the step of randomly shuffling the words of each language, and before the step of feeding the shuffled words of each language into the convolutional neural network model of the corresponding language and predicting to obtain a prediction result, the method also includes:
judging whether the number of random shuffles has reached a preset count; if so, cross-validating the prediction results of the individual shuffles to obtain a combined prediction label over the preset count of shuffles;
the final judgment is made according to the combined prediction label of the preset count. In this embodiment, because training a convolutional neural network requires inputs of identical vector dimension, some long sentences must be truncated, which may lose sentence information. The number of shuffles is therefore at least two, and the preset count can be set according to the number of languages, the complexity of the data to be predicted, and the like. Shuffling repeatedly and only then truncating, with the subsequent pipeline run each time, prevents the truncation from discarding key sentence information, and at the same time screens out poor sentence pairs whose categories are ambiguous and which would harm training. Moreover, the final judgment is based on separate predictions over several randomly shuffled copies of the text: if the repeated predictions agree, the prediction can naturally be considered more accurate; if the disagreement among the repeated predictions exceeds a preset value, the sample can be assigned to an "unclassifiable" or "unknown" category. The preset count is chosen case by case so that the trained classifier meets its accuracy requirement.
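The shuffle-before-truncate loop described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, `predict` stands in for a trained CNN, and the 60% agreement threshold is an illustrative choice, not a value fixed by the specification.

```python
import random
from collections import Counter

def predict_with_shuffles(words, predict, n_shuffles=5, max_len=50,
                          agreement=0.6):
    """Shuffle the word order before each truncation so that no fixed
    prefix monopolizes the model's view of the sentence; then vote."""
    labels = []
    for _ in range(n_shuffles):
        shuffled = words[:]
        random.shuffle(shuffled)
        labels.append(predict(shuffled[:max_len]))  # truncate after shuffling
    label, count = Counter(labels).most_common(1)[0]
    # if the shuffled copies disagree too much, refuse to classify
    return label if count / n_shuffles >= agreement else "unknown"

predict = lambda ws: "law"                      # toy stand-in for the CNN
print(predict_with_shuffles(["court"] * 80, predict))  # -> law
```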
Further, the preset count is 5. In this embodiment the preset count may vary with circumstances; for example, when the data to be predicted is a Chinese-English bilingual pair, the preset count may be set to 5.
Further, the languages include Chinese and English, and the convolutional neural network models include a Chinese convolutional neural network model and an English convolutional neural network model. In this embodiment, a model is trained in advance for each of the two languages; at prediction time both models predict, the two classifiers run in parallel, and their outputs are then cross-validated, which improves classification accuracy.
Further, after the step of inputting the data to be predicted and separating it by language, and before the step of randomly shuffling the words of each language, the method also includes:
first applying Chinese word segmentation to the separated Chinese-language text and then filtering Chinese stop words; likewise, first applying whitespace tokenization to the separated English-language text and then filtering English stop words. In this embodiment, whether for text vectorization or for word-based operations such as stop-word filtering, the text must first be tokenized, for instance with a suitable segmenter. Stop words are words whose frequency of use is too high and which contribute very little to the information of a sentence; they are of almost no help to the classification task and dilute the more discriminative words, so they are filtered out before training. Concretely, the stop-word lexicon can be loaded into a set, and stop-word filtering applied to the text with a suitable filter.
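For the English side, the tokenize-then-filter step can be sketched with whitespace tokenization and a set-based stop-word filter, as below; the Chinese side would substitute a segmenter such as jieba, since Chinese has no whitespace word boundaries. The stop-word list here is a tiny illustrative subset, not the lexicon the patent assumes.

```python
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "and"}  # illustrative subset

def tokenize_en(text):
    # English words are delimited by whitespace
    return text.lower().split()

def filter_stop_words(tokens, stop_words=STOP_WORDS):
    # stop words occur too often to discriminate between categories
    return [t for t in tokens if t not in stop_words]

tokens = filter_stop_words(tokenize_en("The dosage of the drug is fixed"))
print(tokens)  # -> ['dosage', 'drug', 'fixed']
```

Keeping the stop words in a set makes each membership test constant-time, which matters when filtering a large corpus.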
Further, the convolutional neural network models are obtained by preprocessing samples and training convolutional neural networks on them. In this embodiment the models are trained in advance in this way; in addition, while a platform uses a model, partial information can be saved in order to keep optimizing the model. Specifically, the input to the CNN is a two-dimensional vector whose size may be 50 * (dictionary size). Because the dictionary is very large, these vectors are both huge and very sparse (containing a large number of 0s), and a CNN handles such high-dimensional, sparse vectors poorly; an embedding layer therefore reduces them to 50*200 or another suitable size (chosen according to the available computing power) before the second layer learns higher-level features. This layer acts in effect as a pre-training stage that extracts primary text features. Furthermore, after training was completed, further tests on new samples showed that the predictions for many samples defied common-sense judgment, and that these samples in fact belonged to none of the known categories. These samples were removed, and a new "unknown" category was added specifically to receive samples that belong to no known class. Since such a sample belongs to no category, it may be hypothesized that when it is fed into the CNN, the standard deviation of the output probability distribution should be relatively small. Experiments confirmed this hypothesis, and a standard-deviation threshold (0.05) was determined, by which some unclassified samples are assigned to the unknown category, improving the reliability of the classification results.
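The standard-deviation test above can be sketched directly on a softmax output vector. The threshold 0.05 comes from the specification; the function names and the toy class list are assumptions for illustration.

```python
def stdev(probs):
    # population standard deviation of the output probability distribution
    mean = sum(probs) / len(probs)
    return (sum((p - mean) ** 2 for p in probs) / len(probs)) ** 0.5

def assign_label(probs, classes, threshold=0.05):
    """A near-uniform output distribution (small standard deviation) is
    taken to mean the sample fits none of the known classes."""
    if stdev(probs) < threshold:
        return "unknown"
    return classes[probs.index(max(probs))]

classes = ["medicine", "law", "finance"]
print(assign_label([0.90, 0.05, 0.05], classes))  # -> medicine
print(assign_label([0.34, 0.33, 0.33], classes))  # -> unknown
```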
Further, the training of the convolutional neural network models includes:
separating the original text into Chinese and English;
first applying Chinese word segmentation to the Chinese text obtained by the separation and then filtering Chinese stop words; likewise, first applying whitespace tokenization to the separated English text and then filtering English stop words;
selecting high-quality samples, based on a term dictionary, from the filtered Chinese and English text;
training convolutional neural networks on the high-quality samples to obtain the Chinese convolutional neural network model and the English convolutional neural network model.
In this embodiment, because raw terminology entries are too coarse-grained, each term is first segmented, breaking it into minimum-granularity words, and a dictionary is generated (for example with the Gensim library). The sample data is then projected onto the term dictionary, generating for each sentence a vector of ("word", occurrence frequency) two-tuples. Summing the second element of the tuples in the vector gives the number of term words contained in the whole sentence; on this basis, the ratio of term words to the sentence's total word count is computed and used as the degree to which the sentence belongs to the category. Concretely, platforms such as the Sogou cell dictionary and tmxmall offer terminology data of very good quality, from which samples of relatively high quality can be selected out of the original samples. For a passage of corpus (say one labeled "medicine"), the medical terminology data lets us count the frequency of medical terms appearing in the passage and the percentage of the passage's total word count that they account for; combining these two indicators, we can judge how strongly the passage really belongs to "medicine". Reasonable thresholds are found by experiment (each category must be tested separately), so that high-quality corpus data can be selected.
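The two indicators described above (term-word count and term-word ratio) can be sketched as a per-sentence quality judgment. The function names, the toy term dictionary, and the example thresholds are all assumptions; the real thresholds are tuned per category by experiment, as the text says.

```python
def term_stats(tokens, term_dict):
    """Count in-dictionary term words and their share of the sentence."""
    hits = sum(1 for t in tokens if t in term_dict)
    return hits, hits / len(tokens) if tokens else 0.0

def is_high_quality(tokens, term_dict, min_hits=2, min_ratio=0.2):
    # a sentence qualifies only if both indicators clear their thresholds
    hits, ratio = term_stats(tokens, term_dict)
    return hits >= min_hits and ratio > min_ratio

medical_terms = {"dose", "antibiotic", "symptom"}      # toy term dictionary
sample = ["the", "antibiotic", "dose", "reduces", "symptom", "severity"]
print(is_high_quality(sample, medical_terms))  # -> True
```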
Further, before the high-quality samples are used to train the convolutional neural networks, they must first be vectorized into vectors of identical length. In this embodiment, vectorization is performed not only during training but also during prediction; variable-length text must be turned into vectors of identical length before it can be fed into the convolutional neural network (CNN) for training. Specifically, because the sentences in the corpus are short and a single sentence carries little information, which hampers feature extraction and category discrimination, short sentences are first combined into longer segments of similar length (40 to 50 words). A bag-of-words model based on word frequency then turns each segment into a word vector (each element of the vector corresponds to one word, and its value is the frequency of that word in the segment, so the vector length equals the dictionary size; since a given word rarely occurs more than once in a sentence, all frequencies default to 1). Finally, before input to the neural network, a length normalization is applied (padding with 0 or truncating so that every generated vector has length exactly 50).
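The pad-or-truncate normalization above can be sketched as follows; here each word is mapped to its dictionary index, with 0 reserved for padding. The helper name and the toy dictionary are assumptions for illustration, and the fixed length of 50 follows the specification.

```python
def vectorize_segment(tokens, dictionary, seq_len=50):
    """Pad (with 0) or truncate token sequences to a fixed length, mapping
    each word to its dictionary index; 0 is reserved for padding."""
    ids = [dictionary.get(t, 0) for t in tokens[:seq_len]]  # truncate
    return ids + [0] * (seq_len - len(ids))                 # pad

dictionary = {"drug": 1, "dose": 2, "trial": 3}   # toy dictionary, 0 = pad
vec = vectorize_segment(["drug", "dose", "trial"], dictionary)
print(len(vec), vec[:4])  # -> 50 [1, 2, 3, 0]
```

In practice a framework utility (for example a pad-sequences helper) does the same job; the point is only that every segment reaches the network with the same shape.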
Further, the formation of the term dictionary includes the steps of:
first applying term segmentation to the original Chinese terms and then filtering stop words;
generating the term dictionary from the original Chinese terms obtained by the filtering. In this embodiment, the formation of the term dictionary helps obtain high-quality text, which improves the quality of the trained convolutional neural network models and hence the classification accuracy.
The invention also discloses a multilingual sentence-pair text corpus classifier, including:
a language separation module for separating the input data to be predicted by language;
a random shuffling module for randomly shuffling the words of each language;
a convolutional neural network prediction module for feeding the shuffled words of each language into the convolutional neural network model of the corresponding language and predicting to obtain a prediction result;
a cross-validation module for cross-validating the prediction results of the multiple languages;
an output module for outputting a final judgment according to the cross-validation result.
In the present invention, a convolutional neural network model is trained in advance for each language. A multilingual sentence pair contains prior knowledge about how the languages relate to one another, and this extra information benefits the classification, raising its accuracy. In addition, before the data to be predicted is fed into the convolutional neural network models, its languages must first be separated, and the separated data is then predicted by the classifier trained for the corresponding language; this fully exploits the characteristics of bilingual sentence pairs, and cross-validating between the multiple classifiers improves the classification accuracy. Finally, because samples newly fed into the classifier must first be truncated, and to prevent the truncation from discarding key sentence information, the word order of each sentence is randomly shuffled every time before it is fed into the classifier, after which the category judgment is carried out.
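The cross-validation between the per-language classifiers can be reduced to a simple agreement rule, sketched below. The specification does not fix the exact rule, so this strict-agreement version (disagreement means "unknown") is an assumption, and the function name is hypothetical.

```python
def cross_validate(pred_zh, pred_en):
    """Because the two halves of a bilingual sentence pair share one topic,
    the Chinese and English classifiers should agree; disagreement is
    treated as an unreliable prediction."""
    return pred_zh if pred_zh == pred_en else "unknown"

print(cross_validate("medicine", "medicine"))  # -> medicine
print(cross_validate("medicine", "law"))       # -> unknown
```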
With reference to the following description and drawings, specific embodiments of the application are disclosed in detail, indicating ways in which the principles of the application may be employed. It should be understood that the embodiments of the application are not thereby limited in scope; within the spirit and terms of the appended claims, the embodiments of the application include many changes, modifications, and equivalents.
Features described and/or illustrated for one embodiment may be used in the same or a similar way in one or more other embodiments, combined with features of other embodiments, or substituted for features of other embodiments.
It should be emphasized that the term "comprises/comprising", when used herein, refers to the presence of a feature, integer, step, or component, but does not exclude the presence or addition of one or more other features, integers, steps, or components.
Brief description of the drawings
The accompanying drawings are included to provide a further understanding of the embodiments of the application and constitute a part of the specification; they illustrate embodiments of the application and, together with the written description, serve to explain the principles of the application. It is evident that the drawings described below show only some embodiments of the application, and that those of ordinary skill in the art may derive other drawings from them without inventive effort. In the drawings:
Fig. 1 is a flowchart of a multilingual sentence-pair text corpus classification method of the present invention;
Fig. 2 is a training flowchart of the convolutional neural network models of the present invention;
Fig. 3 is a schematic diagram of a multilingual sentence-pair text corpus classifier of the present invention.
Detailed description of the embodiments
To help those skilled in the art better understand the technical solutions of the application, the technical solutions in the embodiments of the application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained from them by those of ordinary skill in the art without inventive effort shall fall within the scope of protection of the application.
Fig. 1 is a flowchart of a multilingual sentence-pair text corpus classification method of the present invention. With reference to Fig. 1, the present invention provides a multilingual sentence-pair text corpus classification method, including:
S1: inputting data to be predicted and separating it by language;
S2: randomly shuffling the words of each language;
S3: feeding the shuffled words of each language into the convolutional neural network model of the corresponding language and predicting to obtain a prediction result;
S4: cross-validating the prediction results of the multiple languages;
S5: outputting a final judgment according to the cross-validation result.
In the present invention, a convolutional neural network model is trained in advance for each language. A multilingual sentence pair contains prior knowledge about how the languages relate to one another, and this extra information benefits the classification, raising its accuracy. In addition, before the data to be predicted is fed into the convolutional neural network models, its languages must first be separated, and the separated data is then predicted by the classifier trained for the corresponding language; this fully exploits the characteristics of bilingual sentence pairs, and cross-validating between the multiple classifiers improves the classification accuracy. Finally, because samples newly fed into the classifier must first be truncated, and to prevent the truncation from discarding key sentence information, the word order of each sentence is randomly shuffled every time before it is fed into the classifier, after which the category judgment is carried out.
Optionally in this embodiment, after the step of randomly shuffling the words of each language, and before the step of feeding the shuffled words of each language into the convolutional neural network model of the corresponding language and predicting to obtain a prediction result, the method also includes:
judging whether the number of random shuffles has reached a preset count; if so, cross-validating the prediction results of the individual shuffles to obtain a combined prediction label over the preset count of shuffles;
the final judgment is made according to the combined prediction label of the preset count. In this embodiment, because training a convolutional neural network requires inputs of identical vector dimension, some long sentences must be truncated, which may lose sentence information. The number of shuffles is therefore at least two, and the preset count can be set according to the number of languages, the complexity of the data to be predicted, and the like. Shuffling repeatedly and only then truncating, with the subsequent pipeline run each time, prevents the truncation from discarding key sentence information, and at the same time screens out poor sentences whose categories are ambiguous and which would harm training. Moreover, the final judgment is based on separate predictions over several randomly shuffled copies of the text: if the repeated predictions agree, the prediction can naturally be considered more accurate; if the disagreement among the repeated predictions exceeds a preset value, the sample can be assigned to an "unclassifiable" or "unknown" category. The preset count is chosen case by case so that the trained classifier meets its accuracy requirement.
Optionally in this embodiment, the preset count is 5. The preset count may vary with circumstances; for example, when the data to be predicted is a Chinese-English bilingual pair, the preset count may be set to 5.
Optionally in this embodiment, the languages include Chinese and English, and the convolutional neural network models include a Chinese convolutional neural network model and an English convolutional neural network model, each trained in advance for its language. At prediction time both models predict, the two classifiers run in parallel, and their outputs are then cross-validated, which improves classification accuracy.
Optionally in this embodiment, after the step of inputting the data to be predicted and separating it by language, and before the step of randomly shuffling the words of each language, the method also includes:
first applying Chinese word segmentation to the separated Chinese-language text and then filtering Chinese stop words; likewise, first applying whitespace tokenization to the separated English-language text and then filtering English stop words. In this embodiment, whether for text vectorization or for word-based operations such as stop-word filtering, the text must first be tokenized, for instance with a suitable segmenter. Stop words are words whose frequency of use is too high and which contribute very little to the information of a sentence; they are of almost no help to the classification task and dilute the more discriminative words, so they are filtered out before training.
Optionally in this embodiment, the convolutional neural network models are obtained by preprocessing samples and training convolutional neural networks on them. In addition, while a platform uses a model, partial information can be saved in order to keep optimizing the model. Specifically, the input to the CNN is a two-dimensional vector whose size may be 50 * (dictionary size). Because the dictionary is very large, these vectors are both huge and very sparse (containing a large number of 0s), and a CNN handles such high-dimensional, sparse vectors poorly; an embedding layer therefore reduces them to 50*200 or another suitable size (chosen according to the available computing power) before the second layer learns higher-level features. This layer acts in effect as a pre-training stage that extracts primary text features. Furthermore, after training was completed, further tests on new samples showed that the predictions for many samples defied common-sense judgment, and that these samples in fact belonged to none of the known categories. These samples were removed, and a new "unknown" category was added specifically to receive samples that belong to no known class. Since such a sample belongs to no category, it may be hypothesized that when it is fed into the CNN, the standard deviation of the output probability distribution should be relatively small. Experiments confirmed that this is indeed the case, and a standard-deviation threshold (0.05) was found, by which some unclassified samples are assigned to the unknown category, improving the reliability of the classification results.
Therefore, with comprehensive reference to these two parameters, it is finally decided to retain those sentences whose term word count is at least x and whose term word ratio exceeds y. Because term quality and category features differ from category to category, (x, y) also differs for each category and must be determined experimentally according to the original sample count and the number of samples to be selected. The concrete values of x and y can be set case by case.
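The per-category (x, y) filter just described can be sketched as a batch selection over tokenized sentences. The function name and the example thresholds are assumptions; in practice x and y would be tuned separately for each category, as the text says.

```python
def select_corpus(sentences, term_dict, x=2, y=0.2):
    """Keep sentences whose term-word count is at least x and whose
    term-word ratio exceeds y; (x, y) is tuned per category by experiment."""
    kept = []
    for tokens in sentences:
        hits = sum(1 for t in tokens if t in term_dict)
        ratio = hits / len(tokens) if tokens else 0.0
        if hits >= x and ratio > y:
            kept.append(tokens)
    return kept

terms = {"contract", "plaintiff", "verdict"}          # toy legal term dictionary
sents = [["the", "plaintiff", "signed", "a", "contract"],
         ["nice", "weather", "today"]]
print(len(select_corpus(sents, terms)))  # -> 1
```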
Fig. 2 is the training flow chart of convolutional neural networks model of the present invention, with reference to figure 2, is understood with reference to Fig. 1, the present embodiment
Optionally, the training step of convolutional neural networks model includes:
S31:Chinese and English lock out operation is carried out to urtext;
S32:First perform Chinese word segmentation on the Chinese words obtained by the separation operation, then apply Chinese stop-word filtering; at the same time, first perform English space-based segmentation on the separated English words, then apply English stop-word filtering;
S33:Based on a glossary, select high-quality samples from the Chinese words and English words obtained by the filtering;
S34:Train on the high-quality samples with convolutional neural networks to obtain a Chinese convolutional neural network model and an English convolutional neural network model. In the present embodiment, because the raw terminology entries are too coarse-grained, each term is first segmented and broken down into minimum-granularity words, and a dictionary is generated from them (for example, using the Gensim library). The sample data is then projected onto the glossary: each sentence is mapped to a vector of ("word", occurrence frequency) two-tuples. Summing the second element of the tuples in the vector gives the number of term (dictionary) words contained in the whole sentence; on this basis, the ratio of term (dictionary) words to the total word count of the sentence is computed and used as the degree to which the sentence belongs to the category.
Specifically, for example, high-quality terminology data is available on the Sogou cell dictionaries and the tmxmall platform, and such terminology data can be used to select relatively high-quality samples from the original samples. For a piece of corpus (assumed to be labeled "medicine"), the medical terminology data lets us count the frequency of medical terms appearing in the passage and the percentage of the passage's total word count that the medical terms account for; combining these two indicators, we can judge the degree to which the passage truly belongs to "medicine". A reasonable threshold is determined by experiment (each category must be tested separately), so that high-quality corpus data can be selected.
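The term-count and term-ratio selection described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function names and the example threshold of 0.2 are assumptions (the patent tunes the threshold per category by experiment).

```python
# Sketch of step S33: score each segmented sentence by (a) the number of
# glossary words it contains and (b) the ratio of glossary words to total
# words, then keep sentences whose ratio clears a per-category threshold.
def term_score(tokens, term_dict):
    """Return (term_count, term_ratio) for one segmented sentence."""
    if not tokens:
        return 0, 0.0
    count = sum(1 for t in tokens if t in term_dict)
    return count, count / len(tokens)

def select_high_quality(samples, term_dict, min_ratio=0.2):
    """Keep segmented sentences whose term ratio meets the threshold."""
    kept = []
    for tokens in samples:
        _, ratio = term_score(tokens, term_dict)
        if ratio >= min_ratio:
            kept.append(tokens)
    return kept
```

For a "medicine" category, `term_dict` would hold the minimum-granularity words produced by segmenting a medical terminology list.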
Optionally in the present embodiment, before training on the high-quality samples with convolutional neural networks, the high-quality samples must first be vectorized into vectors of identical length. In the present embodiment, vectorization is performed not only in the training stage but also in the prediction stage: variable-length text must be converted into equal-length vectors before it can be fed into the convolutional neural network (CNN) for training. Specifically, because the sentences in the corpus data are short and a single sentence carries little information, which hinders feature extraction and class discrimination, short sentences are combined into longer segments of similar length (40 to 50 words). Next, a frequency-based bag-of-words model is used to turn each segment into a word vector (each element of the vector corresponds to a word, and its value is the frequency with which that word occurs in the segment; it can be seen that the vector length equals the dictionary size. Further, since the probability of a word actually occurring more than once in a sentence is very small, all frequencies default to 1). Finally, before input to the neural network, a length normalization is applied (padding with zeros or truncating to produce vectors of length exactly 50).
Optionally in the present embodiment, the formation of the glossary includes the steps of:
first performing term segmentation on the original Chinese terminology, then applying stop-word filtering;
generating the glossary from the original Chinese terminology obtained by the filtering. In the present embodiment, the formation of the glossary helps obtain high-quality text, which improves the quality of the trained convolutional neural network models and, in turn, the classification precision.
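The two glossary-formation steps can be sketched as below. The patent mentions Gensim for dictionary generation but does not name a segmenter; here the tokenizer is passed in as a function and a plain `set` stands in for the dictionary, so the sketch is dependency-free and the details are illustrative assumptions.

```python
# Sketch of glossary construction: segment each raw term into
# minimum-granularity words, drop stop words, and collect the rest.
def build_glossary(raw_terms, segment, stopwords):
    """raw_terms: iterable of term strings; segment: tokenizer function."""
    glossary = set()
    for term in raw_terms:
        for word in segment(term):
            if word and word not in stopwords:
                glossary.add(word)
    return glossary
```

For Chinese terminology, `segment` would be a Chinese word segmenter (e.g. a jieba-style tokenizer); for the English side, whitespace splitting suffices.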
Fig. 3 is a schematic diagram of a multilingual sentence-pair text-corpus classifier of the present invention, understood with reference to Fig. 2 and Fig. 1. The present invention also discloses a multilingual sentence-pair text-corpus classifier 100, comprising:
a language separation module 10, for performing language separation on the input data to be predicted;
a random shuffling module 20, for randomly shuffling the words corresponding to each language;
a convolutional neural network model prediction module 30, for loading the randomly shuffled words of each language into the convolutional neural network model of the corresponding language, and predicting to obtain a prediction result;
a cross-validation module 40, for cross-validating the prediction results corresponding to the multiple languages;
an output module 50, for outputting the final judgment according to the cross-validation result.
The classifier of the present invention can apply the classification method described above. Convolutional neural network models are trained in advance for the different languages separately; a multilingual sentence pair contains interrelated prior knowledge across languages, and this extra information benefits the classification, so the classification precision is higher. Furthermore, before the convolutional neural network models are loaded, the languages in the data to be predicted must first be separated, and the separated data is predicted separately by the classifier trained for each language; this makes full use of the characteristics of bilingual sentence pairs, and cross-validating among the multiple classifiers improves the classification precision. In addition, because samples newly input to the classifier must first be truncated, and to avoid the truncation operation losing key information of a sentence, the word order of the sentence is randomly shuffled before every input to the classifier, and the classification judgment is then made.
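The prediction flow just described (separate languages, shuffle repeatedly, predict per language, combine the results) can be sketched end to end as follows. The models here are stand-in callables rather than trained CNNs, the majority vote is one plausible reading of "cross-validating" the per-language results, and the default of 5 shuffles follows claim 3; all other details are assumptions.

```python
# End-to-end sketch of the classifier's prediction flow: shuffle each side's
# word order several times (so truncation does not always drop the same
# words), run each shuffle through that language's model, and combine the
# per-language predictions by majority vote.
import random
from collections import Counter

def predict_pair(cn_words, en_words, cn_model, en_model,
                 n_shuffles=5, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    votes = Counter()
    for _ in range(n_shuffles):
        cn, en = cn_words[:], en_words[:]
        rng.shuffle(cn)
        rng.shuffle(en)
        votes[cn_model(cn)] += 1  # Chinese-side prediction
        votes[en_model(en)] += 1  # English-side prediction
    # combined label over all rounds and both languages
    return votes.most_common(1)[0][0]
```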
The preferred embodiments of the present invention are described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and variations according to the concept of the present invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain, on the basis of the prior art, through logical analysis, reasoning, or limited experimentation under the concept of the present invention shall fall within the scope of protection defined by the claims.
Claims (10)
1. A multilingual sentence-pair text-corpus classification method, characterized by comprising:
inputting data to be predicted and performing language separation;
randomly shuffling the words corresponding to each language;
loading the randomly shuffled words of each language into the convolutional neural network model of the corresponding language, and predicting to obtain a prediction result;
cross-validating the prediction results corresponding to the multiple languages;
outputting a final judgment according to the cross-validation result.
2. The multilingual sentence-pair text-corpus classification method as claimed in claim 1, characterized in that, after the step of randomly shuffling the words corresponding to each language, and before the step of loading the randomly shuffled words of each language into the convolutional neural network model of the corresponding language and predicting to obtain a prediction result, the method further comprises:
judging whether the number of random shuffles has reached a preset number; if so, cross-validating the prediction results of the individual rounds to obtain a combined prediction label over the preset number of rounds;
the final judgment being made according to the combined prediction label of the preset number of rounds.
3. The multilingual sentence-pair text-corpus classification method as claimed in claim 2, characterized in that the preset number is 5.
4. The multilingual sentence-pair text-corpus classification method as claimed in claim 1, characterized in that the languages include a Chinese language and an English language, and the convolutional neural network models include a Chinese convolutional neural network model and an English convolutional neural network model.
5. The multilingual sentence-pair text-corpus classification method as claimed in claim 4, characterized in that, after the step of inputting the data to be predicted and performing language separation, and before the step of randomly shuffling the words corresponding to each language, the method further comprises:
first performing Chinese word segmentation on the separated Chinese-language words, then applying Chinese stop-word filtering; at the same time, first performing English space-based segmentation on the separated English-language words, then applying English stop-word filtering.
6. The multilingual sentence-pair text-corpus classification method as claimed in claim 1, characterized in that the convolutional neural network models are obtained by preprocessing samples and training with convolutional neural networks.
7. The multilingual sentence-pair text-corpus classification method as claimed in claim 6, characterized in that the training of the convolutional neural network models comprises:
performing a Chinese-English separation operation on the original text;
first performing Chinese word segmentation on the Chinese words obtained by the separation operation, then applying Chinese stop-word filtering; at the same time, first performing English space-based segmentation on the separated English words, then applying English stop-word filtering;
selecting high-quality samples, based on a glossary, from the Chinese words and English words obtained by the filtering;
training on the high-quality samples with convolutional neural networks to obtain a Chinese convolutional neural network model and an English convolutional neural network model.
8. The multilingual sentence-pair text-corpus classification method as claimed in claim 7, characterized in that, before training on the high-quality samples with convolutional neural networks, the high-quality samples must first be vectorized into vectors of identical length.
9. The multilingual sentence-pair text-corpus classification method as claimed in claim 7, characterized in that the formation of the glossary includes the steps of:
first performing term segmentation on the original Chinese terminology, then applying stop-word filtering;
generating the glossary from the original Chinese terminology obtained by the filtering.
10. A multilingual sentence-pair text-corpus classifier, characterized by comprising:
a language separation module, for performing language separation on the input data to be predicted;
a random shuffling module, for randomly shuffling the words corresponding to each language;
a convolutional neural network model prediction module, for loading the randomly shuffled words of each language into the convolutional neural network model of the corresponding language, and predicting to obtain a prediction result;
a cross-validation module, for cross-validating the prediction results corresponding to the multiple languages;
an output module, for outputting a final judgment according to the cross-validation result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711276465.XA CN107894980A (en) | 2017-12-06 | 2017-12-06 | A kind of multiple statement is to corpus of text sorting technique and grader |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107894980A true CN107894980A (en) | 2018-04-10 |
Family
ID=61806110
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711276465.XA Pending CN107894980A (en) | 2017-12-06 | 2017-12-06 | A kind of multiple statement is to corpus of text sorting technique and grader |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107894980A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109189932A (en) * | 2018-09-06 | 2019-01-11 | 北京京东尚科信息技术有限公司 | File classification method and device, computer readable storage medium |
CN109885686A (en) * | 2019-02-20 | 2019-06-14 | 延边大学 | A kind of multilingual file classification method merging subject information and BiLSTM-CNN |
CN110032714A (en) * | 2019-02-25 | 2019-07-19 | 阿里巴巴集团控股有限公司 | A kind of corpus labeling feedback method and device |
CN112084334A (en) * | 2020-09-04 | 2020-12-15 | 中国平安财产保险股份有限公司 | Corpus label classification method and device, computer equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110098999A1 (en) * | 2009-10-22 | 2011-04-28 | National Research Council Of Canada | Text categorization based on co-classification learning from multilingual corpora |
CN103488623A (en) * | 2013-09-04 | 2014-01-01 | 中国科学院计算技术研究所 | Multilingual text data sorting treatment method |
- 2017-12-06: CN patent application CN201711276465.XA filed (status: Pending)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110098999A1 (en) * | 2009-10-22 | 2011-04-28 | National Research Council Of Canada | Text categorization based on co-classification learning from multilingual corpora |
CN103488623A (en) * | 2013-09-04 | 2014-01-01 | 中国科学院计算技术研究所 | Multilingual text data sorting treatment method |
Non-Patent Citations (2)
Title |
---|
CSDN_CSDN__AI: "Solving Large-Scale Text Classification with Deep Learning (CNN, RNN, Attention): Survey and Practice", CSDN Blog * |
Liu Li: "Research on Feature Description and Classifier Construction Methods for Chinese Text Classification", China Doctoral and Master's Dissertations Full-Text Database, Information Science and Technology * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109189932A (en) * | 2018-09-06 | 2019-01-11 | 北京京东尚科信息技术有限公司 | File classification method and device, computer readable storage medium |
CN109189932B (en) * | 2018-09-06 | 2021-02-26 | 北京京东尚科信息技术有限公司 | Text classification method and device and computer-readable storage medium |
CN109885686A (en) * | 2019-02-20 | 2019-06-14 | 延边大学 | A kind of multilingual file classification method merging subject information and BiLSTM-CNN |
CN110032714A (en) * | 2019-02-25 | 2019-07-19 | 阿里巴巴集团控股有限公司 | A kind of corpus labeling feedback method and device |
CN112084334A (en) * | 2020-09-04 | 2020-12-15 | 中国平安财产保险股份有限公司 | Corpus label classification method and device, computer equipment and storage medium |
CN112084334B (en) * | 2020-09-04 | 2023-11-21 | 中国平安财产保险股份有限公司 | Label classification method and device for corpus, computer equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106202177B (en) | A kind of file classification method and device | |
CN107894980A (en) | A kind of multiple statement is to corpus of text sorting technique and grader | |
CN102682124B (en) | Emotion classifying method and device for text | |
CN108376151A (en) | Question classification method, device, computer equipment and storage medium | |
CN107025284A (en) | The recognition methods of network comment text emotion tendency and convolutional neural networks model | |
CN106489149A (en) | A kind of data mask method based on data mining and mass-rent and system | |
CN109299271A (en) | Training sample generation, text data, public sentiment event category method and relevant device | |
CN108536756A (en) | Mood sorting technique and system based on bilingual information | |
CN108256104A (en) | Internet site compressive classification method based on multidimensional characteristic | |
CN111221939A (en) | Grading method and device and electronic equipment | |
CN110188047A (en) | A kind of repeated defects report detection method based on binary channels convolutional neural networks | |
CN109120632A (en) | Network flow abnormity detection method based on online feature selection | |
CN109271627A (en) | Text analyzing method, apparatus, computer equipment and storage medium | |
CN109960727A (en) | For the individual privacy information automatic testing method and system of non-structured text | |
CN109902202A (en) | A kind of video classification methods and device | |
CN110472203A (en) | A kind of duplicate checking detection method, device, equipment and the storage medium of article | |
CN109284374A (en) | For determining the method, apparatus, equipment and computer readable storage medium of entity class | |
CN112199496A (en) | Power grid equipment defect text classification method based on multi-head attention mechanism and RCNN (Rich coupled neural network) | |
CN108549723A (en) | A kind of text concept sorting technique, device and server | |
CN109145108A (en) | Classifier training method, classification method, device and computer equipment is laminated in text | |
CN114580418A (en) | Knowledge map system for police physical training | |
CN112966708A (en) | Chinese crowdsourcing test report clustering method based on semantic similarity | |
Jabbar et al. | Supervised learning approach for surface-mount device production | |
CN113297842A (en) | Text data enhancement method | |
CN112613321A (en) | Method and system for extracting entity attribute information in text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20180410 |