CN107894980A - Multilingual sentence-pair text corpus classification method and classifier - Google Patents

Multilingual sentence-pair text corpus classification method and classifier

Info

Publication number
CN107894980A
CN107894980A (application CN201711276465.XA)
Authority
CN
China
Prior art keywords
languages
word
convolutional neural network
Chinese
English
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711276465.XA
Other languages
Chinese (zh)
Inventor
陈件
张井
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201711276465.XA priority Critical patent/CN107894980A/en
Publication of CN107894980A publication Critical patent/CN107894980A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a multilingual sentence-pair text corpus classification method and classifier. The classification method comprises: inputting data to be predicted and separating it by language; randomly shuffling the words of each language; feeding the randomly shuffled words of each language into the convolutional neural network model of the corresponding language, and predicting to obtain a prediction result; cross-verifying the prediction results of the multiple languages; and outputting a final judgment according to the cross-verification result. The classification method of the invention can improve classification precision.

Description

Multilingual sentence-pair text corpus classification method and classifier
Technical field
The present invention relates to the field of information technology, and in particular to a multilingual sentence-pair text corpus classification method and classifier.
Background technology
A text corpus is a basic resource that carries linguistic knowledge on an electronic computer. Complete corpora are used for language model construction, lexicography, text classification, and so on. Text classification is the automatic labelling of a text set (or other entities or objects) by a computer according to a given taxonomy or standard.
In essence, a text classification problem is no different from any other classification problem: the method matches certain features of the data to be classified. A perfect match is of course impossible, so the best match (according to some evaluation criterion) must be selected to complete the classification.
Current classification methods, however, struggle to classify accurately. Some existing platforms hold a large number of bilingual sentence pairs, most of which have no category label; even among the labelled pairs, only a small fraction is labelled accurately. Yet data retrieval, content distribution, routing, and other platform functions all depend on accurate corpus category labels. To make better use of the various corpus platforms, a practical classification method with high classification precision is therefore needed.
It should be noted that the above introduction to the technical background is intended only to facilitate a clear and complete explanation of the technical solution of the present application and to aid understanding by those skilled in the art. The above technical solutions should not be considered known to those skilled in the art merely because they are set forth in the background section of the present application.
Summary of the invention
In view of the above drawbacks of the prior art, the technical problem to be solved by the present invention is to provide a multilingual sentence-pair text corpus classification method and classifier that improve classification precision.
To achieve the above object, the invention provides a multilingual sentence-pair text corpus classification method, comprising:
inputting data to be predicted and separating it by language;
randomly shuffling the words of each language;
feeding the randomly shuffled words of each language into the convolutional neural network model of the corresponding language, and predicting to obtain a prediction result;
cross-verifying the prediction results of the multiple languages;
outputting a final judgment according to the cross-verification result.
Further, after the step of randomly shuffling the words of each language and before the step of feeding the shuffled words into the convolutional neural network model of the corresponding language and predicting, the method further comprises:
judging whether the number of random shuffles has reached a preset count; if so, cross-verifying the per-shuffle prediction results to obtain a combined prediction label over the preset count of shuffles;
making the final judgment according to that combined prediction label. In this embodiment, training a convolutional neural network model requires input vectors of identical dimension, so some long sentences must be truncated, which risks losing sentence information. The number of shuffles is therefore at least two, and the preset count can be set according to the number of languages, the complexity of the data to be predicted, and so on. Shuffling repeatedly before each truncation, followed by the subsequent steps, prevents truncation from discarding key sentence information, and also screens out sentences that are of poor quality, ambiguously classified, or unhelpful for training. The final judgment is based on predictions over multiple randomly shuffled copies of the text: if the repeated predictions agree, the prediction can naturally be considered more accurate; if their degree of disagreement exceeds a preset value, the sample can be assigned to an "unclassifiable" or "unknown" category. The preset count is chosen case by case so that the trained classifier meets the required precision.
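As a concrete illustration of the repeated-shuffle step, the sketch below combines one prediction per shuffle by majority vote, falling back to an "unknown" label when the votes disagree too much. It is a minimal stand-in under stated assumptions, not the patent's implementation: `predict_fn`, the toy labels, and the `max_disagreement` threshold are all illustrative.

```python
import random
from collections import Counter

def predict_with_shuffles(tokens, predict_fn, n_repeats=5, max_disagreement=0.4, seed=0):
    """Shuffle the token order n_repeats times, predict each time,
    and combine the per-shuffle labels by majority vote.
    If the winning label gets too small a share of the votes,
    return the placeholder label 'unknown' instead.
    predict_fn stands in for a trained per-language CNN classifier."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_repeats):
        shuffled = tokens[:]      # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        votes.append(predict_fn(shuffled))
    label, count = Counter(votes).most_common(1)[0]
    if 1 - count / n_repeats > max_disagreement:
        return "unknown"
    return label

# toy stand-in classifier: the label depends only on the token multiset,
# so shuffling never changes it and all votes agree
toy = lambda toks: "medicine" if "aspirin" in toks else "other"
print(predict_with_shuffles(["take", "aspirin", "daily"], toy))  # medicine
```

Because the toy classifier is permutation-invariant, all five votes always agree; with a real CNN over truncated inputs, the vote split is exactly what the disagreement threshold is meant to catch.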
Further, the preset count is 5. In this embodiment the preset count can vary with circumstances; for example, when the data to be predicted is a Chinese-English bilingual pair, the preset count may be set to about 5.
Further, the languages include Chinese and English, and the convolutional neural network models include a Chinese convolutional neural network model and an English convolutional neural network model. In this embodiment, a Chinese model and an English model are trained in advance; at prediction time each makes its own prediction, and running the two classifiers in parallel and then cross-verifying improves classification precision.
Further, after the step of inputting the data to be predicted and separating it by language, and before the step of randomly shuffling the words of each language, the method further comprises:
first applying Chinese word segmentation to the separated Chinese words and then Chinese stop-word filtering; and, in parallel, first applying space-based tokenisation to the separated English words and then English stop-word filtering. In this embodiment, whether for text vectorisation or for word-based operations such as stop-word filtering, the text must first be segmented into words, for example with a suitable segmenter. Stop-words are words used so frequently that they contribute almost nothing to the meaning of a sentence; they are of almost no help to the classification task and dilute the more discriminative words, so they are filtered out before training. Concretely, the stop-word list can be placed in a set and the text filtered against it with a suitable filtering routine.
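The per-language tokenisation and stop-word filtering described above can be sketched as follows. The greedy longest-match segmenter is a toy stand-in (the patent names no particular segmenter; a real system would use a tool such as jieba), and the word lists are illustrative assumptions.

```python
import re

def tokenize_english(text):
    """Space/punctuation tokenisation for the English half of a sentence pair."""
    return re.findall(r"[A-Za-z']+", text.lower())

def segment_chinese(text, dictionary):
    """Greedy longest-match segmentation over a small word dictionary --
    a stand-in for a real Chinese segmenter."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + 4), i, -1):  # try longest match first
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

def filter_stopwords(tokens, stopwords):
    """Drop high-frequency words that carry little class information."""
    return [t for t in tokens if t not in stopwords]

en_stop = {"the", "a", "of", "is"}          # illustrative stop-word sets
zh_dict = {"神经", "网络", "分类"}           # illustrative segmentation dictionary
zh_stop = {"的"}

print(filter_stopwords(tokenize_english("The network is a classifier"), en_stop))
print(filter_stopwords(segment_chinese("神经网络的分类", zh_dict), zh_stop))
```

The same two-stage pipeline (segment, then filter) runs at both training and prediction time, so the two stages are kept as separate functions.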
Further, the convolutional neural network model is obtained by preprocessing samples and training a convolutional neural network. In this embodiment the model is trained in advance from preprocessed samples by a convolutional neural network; in addition, while a suitable platform uses the model, partial information can be saved in order to optimise the model further. Concretely, the input to the CNN is a two-dimensional vector whose size may be 50 × (dictionary size). Because the dictionary is very large, these two-dimensional vectors are also very large and very sparse (containing a great many zeros), and a CNN handles such high-dimensional sparse vectors poorly; an embedding layer therefore reduces them to 50 × 200 or another suitable size (chosen according to the available computing power) before the second layer is trained to learn higher-level features. This layer acts in effect as a pre-training step that extracts primary text features. Moreover, after the model had been trained, further tests on new samples showed that the predictions for many samples failed a common-sense check: those samples in fact belonged to none of the known categories. These samples were removed, and a new "unknown" category was added specifically to receive samples that belong to no known class. Since such a sample belongs to no category, it was hypothesised that when it is fed into the CNN the standard deviation of the output probability distribution should be relatively small. Experiments confirmed the hypothesis, and a standard-deviation threshold (0.05) was determined, by which unclassifiable samples are assigned to the unknown category, improving the reliability of the classification results.
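The standard-deviation test for the "unknown" category can be written directly from the description above. The 0.05 threshold is the value the text reports; the example probability vectors are illustrative.

```python
import math

def is_unknown(probs, threshold=0.05):
    """Flag a prediction as 'unknown class' when the CNN's output
    distribution is too flat: the standard deviation of the class
    probabilities falls below the empirically chosen threshold (0.05)."""
    n = len(probs)
    mean = sum(probs) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in probs) / n)
    return std < threshold

print(is_unknown([0.26, 0.24, 0.25, 0.25]))  # near-uniform output -> True
print(is_unknown([0.90, 0.05, 0.03, 0.02]))  # confident output    -> False
```

A near-uniform output has a standard deviation close to zero, while a confident prediction concentrates mass on one class and pushes the deviation well above the threshold.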
Further, the training of the convolutional neural network models comprises the steps of:
separating the original text into Chinese and English;
applying Chinese word segmentation and then Chinese stop-word filtering to the Chinese words obtained by the separation; and, in parallel, applying space-based tokenisation and then English stop-word filtering to the separated English words;
selecting high-quality samples from the filtered Chinese and English words on the basis of a term dictionary;
training a Chinese convolutional neural network model and an English convolutional neural network model on the high-quality samples with a convolutional neural network. In this embodiment, because raw terminology entries are too coarse-grained, each term is first segmented into its smallest-granularity words and a dictionary is generated (for example with the Gensim library). The sample data is then projected onto the term dictionary, so that each sentence generates a vector of ("word", occurrence frequency) pairs. Summing the second component of these pairs gives the number of terms (vocabulary items) the whole sentence contains; from this, the ratio of term words to the sentence's total word count is obtained and taken as the degree to which the sentence belongs to the category in question. Concretely, the Sogou cell dictionary and the tmxmall platform, for example, hold terminology data of very good quality, with which samples of relatively high quality can be selected from the original samples. For a passage of corpus text (say one labelled "medicine"), the medical terminology data lets us count the frequency of medical terms appearing in the passage and the percentage of the passage's total word count they account for; combining these two indicators measures how strongly the passage really belongs to "medicine". A reasonable threshold is found by experimental testing (each category must be tested separately), and the high-quality corpus data is selected accordingly.
Further, before the convolutional neural network is trained on the high-quality samples, the samples must first be vectorised into vectors of identical length. In this embodiment vectorisation is performed not only at training time but also at prediction time: variable-length text must be turned into equal-length vectors before it can be fed into the convolutional neural network (CNN) for training. Concretely, because individual sentences in the corpus are short and carry little information, which hinders feature extraction and class discrimination, short sentences are first combined into longer segments of similar length (40 to 50 words). Each segment is then converted into a word vector with a frequency-based bag-of-words (Bag of Words) model: each vector component corresponds to one word, and the number is the frequency of that word in the segment, so the vector length equals the dictionary size; since a word rarely occurs more than once in a sentence, all frequencies default to 1. Finally, before the vector enters the neural network, its length is normalised (by zero-padding or truncation) so that every generated vector has length exactly 50.
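The fixed-length vectorisation step can be sketched as follows. The tiny corpus is illustrative, and the index-sequence form is one plausible reading of the text's mixed bag-of-words/fixed-length description, under the stated pad-with-0-or-truncate rule.

```python
def build_vocab(corpus):
    """Assign each distinct word an integer index; 0 is reserved for padding."""
    vocab = {}
    for tokens in corpus:
        for t in tokens:
            vocab.setdefault(t, len(vocab) + 1)
    return vocab

def vectorize(tokens, vocab, length=50):
    """Map tokens to indices, then truncate or zero-pad to a fixed length,
    matching the pad-with-0-or-truncate normalisation described above."""
    ids = [vocab.get(t, 0) for t in tokens][:length]
    return ids + [0] * (length - len(ids))

corpus = [["deep", "neural", "network"], ["neural", "text", "classifier"]]
vocab = build_vocab(corpus)
v = vectorize(["neural", "network", "classifier"], vocab)
print(len(v), v[:3])  # 50 [2, 3, 5]
```

Every output vector has length exactly 50 regardless of the input length, which is the property the CNN's fixed input dimension requires.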
Further, the formation of the term dictionary comprises the steps of:
first applying term segmentation to the original Chinese terminology and then stop-word filtering;
generating the term dictionary from the filtered original Chinese terminology. In this embodiment, the term dictionary helps obtain high-quality text, which improves the quality of the trained convolutional neural network models and in turn the classification precision.
The invention also discloses a multilingual sentence-pair text corpus classifier, comprising:
a language separation module for separating the input data to be predicted by language;
a random shuffle module for randomly shuffling the words of each language;
a convolutional neural network prediction module for feeding the randomly shuffled words of each language into the convolutional neural network model of the corresponding language and predicting to obtain a prediction result;
a cross-verification module for cross-verifying the prediction results of the multiple languages;
an output module for outputting the final judgment according to the cross-verification result.
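The five modules above can be wired together roughly as below. The per-language models are plain callables standing in for the trained CNNs, and the agree-or-unknown rule is a simplified stand-in for the cross-verification module; all names and toy labels are illustrative.

```python
import random

class SentencePairClassifier:
    """Skeleton wiring of the five claimed modules for a Chinese-English pair."""

    def __init__(self, zh_model, en_model):
        self.models = {"zh": zh_model, "en": en_model}

    def separate(self, pair):
        # language separation module: a sentence pair arrives as (zh, en)
        return {"zh": pair[0], "en": pair[1]}

    def predict(self, pair):
        parts = self.separate(pair)
        labels = {}
        for lang, tokens in parts.items():
            shuffled = tokens[:]
            random.shuffle(shuffled)        # random shuffle module
            labels[lang] = self.models[lang](shuffled)
        # cross-verification module: accept only when both languages agree
        zh, en = labels["zh"], labels["en"]
        return zh if zh == en else "unknown"  # output module

# toy permutation-invariant stand-ins for the two trained CNNs
zh = lambda toks: "medicine" if "药" in toks else "other"
en = lambda toks: "medicine" if "drug" in toks else "other"
clf = SentencePairClassifier(zh, en)
print(clf.predict((["这", "种", "药"], ["this", "drug"])))  # medicine
```

When the two per-language predictions disagree, the pair falls into the "unknown" bucket rather than being forced into a category, which is the precision gain the cross-verification step is claimed to provide.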
In the present invention a convolutional neural network model is trained in advance for each language. A multilingual sentence pair contains correlated prior knowledge across its languages, and this extra information benefits the classification, making it more precise. Before the models are applied, the languages in the data to be predicted must first be separated; the separated data is then fed to the classifier trained for each language, which makes full use of the characteristics of bilingual sentence pairs, and cross-verifying among the classifiers improves the precision of the classification. Furthermore, because each sample newly fed to the classifier must first be truncated, and truncation risks discarding key sentence information, the word order of each sentence is randomly shuffled before every input to the classifier, and only then is the classification judgment made.
The following description and accompanying drawings disclose particular embodiments of the present application in detail and indicate ways in which its principles may be employed. It should be understood that the embodiments of the present application are not thereby limited in scope; within the spirit and terms of the appended claims, they include many changes, modifications, and equivalents.
Features described and/or illustrated for one embodiment may be used in the same or a similar manner in one or more other embodiments, combined with features of other embodiments, or substituted for them.
It should be emphasised that the term "comprises/comprising", when used herein, denotes the presence of a feature, integer, step, or component, but does not exclude the presence or addition of one or more other features, integers, steps, or components.
Brief description of the drawings
The accompanying drawings are included to provide a further understanding of the embodiments of the present application and constitute a part of the specification; they illustrate the embodiments and, together with the written description, explain the principles of the application. Evidently, the drawings described below show only some embodiments of the application, and those of ordinary skill in the art can derive other drawings from them without inventive effort. In the drawings:
Fig. 1 is a flow chart of a multilingual sentence-pair text corpus classification method of the present invention;
Fig. 2 is a training flow chart of the convolutional neural network models of the present invention;
Fig. 3 is a schematic diagram of a multilingual sentence-pair text corpus classifier of the present invention.
Embodiment
To help those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings of those embodiments. The described embodiments are evidently only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art from the embodiments in the present application without inventive effort shall fall within the scope of protection of the present application.
Fig. 1 is a flow chart of a multilingual sentence-pair text corpus classification method of the present invention. With reference to Fig. 1, the present invention provides a multilingual sentence-pair text corpus classification method, comprising:
S1: inputting data to be predicted and separating it by language;
S2: randomly shuffling the words of each language;
S3: feeding the randomly shuffled words of each language into the convolutional neural network model of the corresponding language, and predicting to obtain a prediction result;
S4: cross-verifying the prediction results of the multiple languages;
S5: outputting a final judgment according to the cross-verification result.
In the present invention a convolutional neural network model is trained in advance for each language. A multilingual sentence pair contains correlated prior knowledge across its languages, and this extra information benefits the classification, making it more precise. Before the models are applied, the languages in the data to be predicted must first be separated; the separated data is then fed to the classifier trained for each language, which makes full use of the characteristics of bilingual sentence pairs, and cross-verifying among the classifiers improves the precision of the classification. Furthermore, because each sample newly fed to the classifier must first be truncated, and truncation risks discarding key sentence information, the word order of each sentence is randomly shuffled before every input to the classifier, and only then is the classification judgment made.
Optionally in this embodiment, after the step of randomly shuffling the words of each language and before the step of feeding the shuffled words into the convolutional neural network model of the corresponding language and predicting, the method further comprises:
judging whether the number of random shuffles has reached the preset count; if so, cross-verifying the per-shuffle prediction results to obtain a combined prediction label over the preset count of shuffles;
making the final judgment according to that combined prediction label. In this embodiment, training a convolutional neural network model requires input vectors of identical dimension, so some long sentences must be truncated, which risks losing sentence information. The number of shuffles is therefore at least two, and the preset count can be set according to the number of languages, the complexity of the data to be predicted, and so on. Shuffling repeatedly before each truncation, followed by the subsequent steps, prevents truncation from discarding key sentence information, and also screens out sentences that are of poor quality, ambiguously classified, or unhelpful for training. The final judgment is based on predictions over multiple randomly shuffled copies of the text: if the repeated predictions agree, the prediction can naturally be considered more accurate; if their degree of disagreement exceeds a preset value, the sample can be assigned to an "unclassifiable" or "unknown" category. The preset count is chosen case by case so that the trained classifier meets the required precision.
Optionally in this embodiment, the preset count is 5. The preset count can vary with circumstances; for example, when the data to be predicted is a Chinese-English bilingual pair, the preset count may be set to about 5.
Optionally in this embodiment, the languages include Chinese and English, and the convolutional neural network models include a Chinese convolutional neural network model and an English convolutional neural network model. A Chinese model and an English model are trained in advance; at prediction time each makes its own prediction, and running the two classifiers in parallel and then cross-verifying improves classification precision.
Optionally in this embodiment, after the step of inputting the data to be predicted and separating it by language, and before the step of randomly shuffling the words of each language, the method further comprises:
first applying Chinese word segmentation to the separated Chinese words and then Chinese stop-word filtering; and, in parallel, first applying space-based tokenisation to the separated English words and then English stop-word filtering. In this embodiment, whether for text vectorisation or for word-based operations such as stop-word filtering, the text must first be segmented into words, for example with a suitable segmenter. Stop-words are words used so frequently that they contribute almost nothing to the meaning of a sentence; they are of almost no help to the classification task and dilute the more discriminative words, so they are filtered out before training.
Optionally in this embodiment, the convolutional neural network model is obtained by preprocessing samples and training a convolutional neural network. The model is trained in advance from preprocessed samples by a convolutional neural network; in addition, while a suitable platform uses the model, partial information can be saved in order to optimise the model further. Concretely, the input to the CNN is a two-dimensional vector whose size may be 50 × (dictionary size). Because the dictionary is very large, these two-dimensional vectors are also very large and very sparse (containing a great many zeros), and a CNN handles such high-dimensional sparse vectors poorly; an embedding layer therefore reduces them to 50 × 200 or another suitable size (chosen according to the available computing power) before the second layer is trained to learn higher-level features. This layer acts in effect as a pre-training step that extracts primary text features. Moreover, after the model had been trained, further tests on new samples showed that the predictions for many samples failed a common-sense check: those samples in fact belonged to none of the known categories. These samples were removed, and a new "unknown" category was added specifically to receive samples that belong to no known class. Since such a sample belongs to no category, it was hypothesised that when it is fed into the CNN the standard deviation of the output probability distribution should be relatively small. Experiments showed that this was indeed the case, and a standard-deviation threshold (0.05) was successfully found, by which unclassifiable samples are assigned to the unknown category, improving the reliability of the classification results.
The two parameters are therefore referenced together, and it is finally decided to screen and retain those sentences whose term vocabulary count is at least x and whose term vocabulary ratio exceeds y. Because term quality and category features differ between categories, (x, y) also differs per category and must be determined experimentally from the original sample count and the number of samples to be selected; the concrete values of x and y can be set case by case.
Fig. 2 is a training flow chart of the convolutional neural network models of the present invention. With reference to Fig. 2 and Fig. 1, optionally in this embodiment the training of the convolutional neural network models comprises the steps of:
S31: separating the original text into Chinese and English;
S32: applying Chinese word segmentation and then Chinese stop-word filtering to the Chinese words obtained by the separation; and, in parallel, applying space-based tokenisation and then English stop-word filtering to the separated English words;
S33: selecting high-quality samples from the filtered Chinese and English words on the basis of the term dictionary;
S34: training a Chinese convolutional neural network model and an English convolutional neural network model on the high-quality samples with a convolutional neural network. In this embodiment, because raw terminology entries are too coarse-grained, each term is first segmented into its smallest-granularity words and a dictionary is generated (for example with the Gensim library). The sample data is then projected onto the term dictionary, so that each sentence generates a vector of ("word", occurrence frequency) pairs. Summing the second component of these pairs gives the number of terms (vocabulary items) the whole sentence contains; from this, the ratio of term words to the sentence's total word count is obtained and taken as the degree to which the sentence belongs to the category in question. Concretely, the Sogou cell dictionary and the tmxmall platform, for example, hold terminology data of very good quality, with which samples of relatively high quality can be selected from the original samples. For a passage of corpus text (say one labelled "medicine"), the medical terminology data lets us count the frequency of medical terms appearing in the passage and the percentage of the passage's total word count they account for; combining these two indicators measures how strongly the passage really belongs to "medicine". A reasonable threshold is found by experimental testing (each category must be tested separately), and the high-quality corpus data is selected accordingly.
In this embodiment, optionally, before the convolutional neural networks are trained on the high-quality samples, the high-quality samples need to be vectorized into vectors of identical length. In this embodiment, vectorization is performed not only in the training stage but also in the prediction stage: texts of variable length must be converted into vectors of identical length before they can be fed into the convolutional neural network (CNN) for training. Specifically, because the sentences in the corpus data are short, a single sentence carries little information, which is unfavorable for extracting features and distinguishing categories; short sentences are therefore combined into longer segments of similar length (40 to 50 words). Then, using a frequency-based bag-of-words (Bag of Words) model, each segment is converted into a term vector (each element of the vector corresponds to a word, and the number is the frequency with which that word appears in the segment; the vector length thus matches the dictionary size; furthermore, since the probability of a word actually appearing multiple times in one sentence is very small, all frequencies default to 1). Finally, before the vector is fed into the neural network, a length normalization is performed (by padding with 0 or truncating, vectors of length exactly 50 are generated).
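A minimal sketch of this preprocessing, under stated assumptions: the embodiment fixes the final length at 50 by zero-padding or truncation and merges short sentences into 40-to-50-word segments; the function names are hypothetical, and encoding each word as a dictionary index is one common reading of the bag-of-words description above (with all frequencies defaulting to 1).

```python
def merge_short_sentences(sentences, min_len=40, max_len=50):
    """Concatenate short sentences (lists of words) into segments of
    roughly 40-50 words, as the embodiment describes."""
    segments, current = [], []
    for sent in sentences:
        current.extend(sent)
        if len(current) >= min_len:
            segments.append(current[:max_len])
            current = current[max_len:]
    if current:  # keep the trailing remainder as its own segment
        segments.append(current)
    return segments

def vectorize(segment, vocab_index, length=50):
    """Map each in-vocabulary word to its dictionary index (its frequency
    is taken as 1, repeats being rare), then pad with 0 or truncate so
    every vector has length exactly `length`."""
    vec = [vocab_index[w] for w in segment if w in vocab_index]
    return vec[:length] + [0] * max(0, length - len(vec))
```

The fixed-length output is what lets segments of different sizes share one CNN input layer.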
In this embodiment, optionally, the formation of the term dictionary includes the steps of:
first performing term word segmentation on the original Chinese terms, then performing stop-word filtering;
generating the term dictionary from the original Chinese terms obtained by the filtering. In this embodiment, the formation of the term dictionary helps to obtain high-quality text, which improves the quality of the trained convolutional neural network models and in turn the classification precision.
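These two steps can be sketched as follows; the segmentation callback and the illustrative stop-word set are assumptions for the sketch, not part of the patent (a real system would plug in a Chinese segmenter such as jieba and a full stop-word list).

```python
import re

STOP_WORDS = {"的", "了", "和"}  # illustrative stop words only

def build_term_dictionary(raw_terms, segment=None):
    """Build the term dictionary: segment each raw term into
    minimal-granularity words, drop stop words, collect the rest.

    `segment` is the word-segmentation function; by default terms are
    split on non-word boundaries as a stand-in for a real segmenter.
    """
    if segment is None:
        segment = lambda term: re.findall(r"\w+", term)
    dictionary = set()
    for term in raw_terms:
        for word in segment(term):
            if word not in STOP_WORDS:
                dictionary.add(word)
    return dictionary
```

The resulting set is what steps S33/S34 project sample sentences onto when scoring term density.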
Fig. 3 is a schematic diagram of a multilingual sentence-pair text corpus classifier of the present invention, which can be understood with reference to Fig. 2 and Fig. 1. The present invention also discloses a multilingual sentence-pair text corpus classifier 100, including:
a language separation module 10, for performing language separation on input data to be predicted;
a random shuffling module 20, for randomly shuffling the words corresponding to each language;
a convolutional neural network model prediction module 30, for loading the randomly shuffled words of each language into the convolutional neural network model of the corresponding language, and predicting to obtain a prediction result;
a cross-verification module 40, for cross-verifying the prediction results corresponding to the multiple languages;
an output module 50, for outputting a final judgment result according to the cross-verification result.
The classifier of the present invention can apply the classification method described above. Convolutional neural network models are trained in advance for the different languages respectively. Multilingual sentences contain prior knowledge of the correlations between languages, and this extra information benefits the classification, so the classification precision is higher. In addition, before the convolutional neural network models are loaded, the languages in the data to be predicted must first be separated, and the separated data are predicted separately by the classifiers trained for the corresponding languages; this makes full use of the characteristics of bilingual sentence pairs, and cross-verification between the multiple classifiers improves the classification precision. Furthermore, because samples newly input to the classifier must first be truncated, and in order to prevent the truncation from losing key information of a sentence, the word order of each sentence is randomly shuffled each time before the sentence is input to the classifier, and the classification judgment is then performed.
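The module pipeline above, combined with the repeated shuffling of claim 2 (a preset number of rounds, claim 3's value being 5), might be sketched like this; the stand-in model callables and the majority-vote rule used for cross-verification are assumptions for illustration, not the patent's stated implementation.

```python
import random
from collections import Counter

def classify_sentence_pair(zh_words, en_words, zh_model, en_model, n_shuffles=5):
    """Sketch of classifier 100: repeat the shuffle-and-predict step a
    preset number of times, querying each language-specific model per
    round, then cross-verify by majority vote over all collected votes
    to form the integrated prediction label.

    zh_model / en_model stand in for the trained CNNs: any callable
    mapping a word list to a category label.
    """
    votes = []
    for _ in range(n_shuffles):
        zh, en = zh_words[:], en_words[:]
        random.shuffle(zh)  # shuffling guards against truncation dropping key words
        random.shuffle(en)
        votes.append(zh_model(zh))
        votes.append(en_model(en))
    # cross-verification: the integrated prediction label is the majority vote
    return Counter(votes).most_common(1)[0][0]
```

Any vote-aggregation rule could replace the majority vote; the essential point from the description is that both languages' predictions check each other.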
The preferred embodiments of the present invention have been described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and variations according to the concept of the present invention without creative effort. Therefore, any technical solution obtainable by a person skilled in the art through logical analysis, reasoning, or limited experimentation on the basis of the prior art under the concept of the present invention shall fall within the protection scope defined by the claims.

Claims (10)

1. A multilingual sentence-pair text corpus classification method, characterized by comprising:
inputting data to be predicted and performing language separation;
randomly shuffling the words corresponding to each language;
loading the randomly shuffled words of each language into a convolutional neural network model of the corresponding language, and predicting to obtain a prediction result;
cross-verifying the prediction results corresponding to the multiple languages;
outputting a final judgment result according to the cross-verification result.
2. The multilingual sentence-pair text corpus classification method according to claim 1, characterized in that after the step of randomly shuffling the words corresponding to each language, and before the step of loading the randomly shuffled words of each language into the convolutional neural network model of the corresponding language and predicting to obtain a prediction result, the method further comprises:
judging whether the number of random shuffles has reached a preset number, and if so, cross-verifying the corresponding prediction results round by round to obtain an integrated prediction label over the preset number of shuffles;
wherein the final judgment result is obtained by judging according to the integrated prediction label of the preset number of shuffles.
3. The multilingual sentence-pair text corpus classification method according to claim 2, characterized in that the preset number is 5.
4. The multilingual sentence-pair text corpus classification method according to claim 1, characterized in that the languages include a Chinese language and an English language, and the convolutional neural network models include a Chinese convolutional neural network model and an English convolutional neural network model.
5. The multilingual sentence-pair text corpus classification method according to claim 4, characterized in that after the step of inputting the data to be predicted and performing language separation, and before the step of randomly shuffling the words corresponding to each language, the method further comprises:
first performing Chinese word segmentation on the separated Chinese-language words and then performing Chinese stop-word filtering; meanwhile, first performing space-based English tokenization on the separated English-language words and then performing English stop-word filtering.
6. The multilingual sentence-pair text corpus classification method according to claim 1, characterized in that the convolutional neural network models are obtained by preprocessing samples and training with convolutional neural networks.
7. The multilingual sentence-pair text corpus classification method according to claim 6, characterized in that the training steps of the convolutional neural network models include:
performing a Chinese-English separation operation on the original text;
first performing Chinese word segmentation on the Chinese words obtained by the separation operation and then performing Chinese stop-word filtering; meanwhile, first performing space-based English tokenization on the separated English words and then performing English stop-word filtering;
selecting high-quality samples, based on a term dictionary, from the Chinese words and English words obtained by the filtering;
training on the high-quality samples with convolutional neural networks to obtain a Chinese convolutional neural network model and an English convolutional neural network model.
8. The multilingual sentence-pair text corpus classification method according to claim 7, characterized in that before the convolutional neural networks are trained on the high-quality samples, the high-quality samples need to be vectorized into vectors of identical length.
9. The multilingual sentence-pair text corpus classification method according to claim 7, characterized in that the formation of the term dictionary includes the steps of:
first performing term word segmentation on the original Chinese terms and then performing stop-word filtering;
generating the term dictionary from the original Chinese terms obtained by the filtering.
10. A multilingual sentence-pair text corpus classifier, characterized by comprising:
a language separation module, for performing language separation on input data to be predicted;
a random shuffling module, for randomly shuffling the words corresponding to each language;
a convolutional neural network model prediction module, for loading the randomly shuffled words of each language into the convolutional neural network model of the corresponding language, and predicting to obtain a prediction result;
a cross-verification module, for cross-verifying the prediction results corresponding to the multiple languages;
an output module, for outputting a final judgment result according to the cross-verification result.
CN201711276465.XA 2017-12-06 2017-12-06 Multilingual sentence-pair text corpus classification method and classifier Pending CN107894980A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711276465.XA CN107894980A (en) 2017-12-06 2017-12-06 Multilingual sentence-pair text corpus classification method and classifier

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711276465.XA CN107894980A (en) 2017-12-06 2017-12-06 Multilingual sentence-pair text corpus classification method and classifier

Publications (1)

Publication Number Publication Date
CN107894980A true CN107894980A (en) 2018-04-10

Family

ID=61806110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711276465.XA Pending CN107894980A (en) 2017-12-06 2017-12-06 A kind of multiple statement is to corpus of text sorting technique and grader

Country Status (1)

Country Link
CN (1) CN107894980A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109189932A (en) * 2018-09-06 2019-01-11 北京京东尚科信息技术有限公司 Text classification method and device, computer-readable storage medium
CN109885686A (en) * 2019-02-20 2019-06-14 延边大学 Multilingual text classification method fusing topic information and BiLSTM-CNN
CN110032714A (en) * 2019-02-25 2019-07-19 阿里巴巴集团控股有限公司 Corpus annotation feedback method and device
CN112084334A (en) * 2020-09-04 2020-12-15 中国平安财产保险股份有限公司 Corpus label classification method and device, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110098999A1 (en) * 2009-10-22 2011-04-28 National Research Council Of Canada Text categorization based on co-classification learning from multilingual corpora
CN103488623A (en) * 2013-09-04 2014-01-01 中国科学院计算技术研究所 Multilingual text data classification processing method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110098999A1 (en) * 2009-10-22 2011-04-28 National Research Council Of Canada Text categorization based on co-classification learning from multilingual corpora
CN103488623A (en) * 2013-09-04 2014-01-01 中国科学院计算技术研究所 Multilingual text data classification processing method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CSDN_CSDN__AI: "Solving Large-Scale Text Classification with Deep Learning (CNN, RNN, Attention): Survey and Practice", CSDN Blog *
Liu Li: "Research on Feature Description and Classifier Construction Methods in Chinese Text Classification", China Masters' and Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109189932A (en) * 2018-09-06 2019-01-11 北京京东尚科信息技术有限公司 Text classification method and device, computer-readable storage medium
CN109189932B (en) * 2018-09-06 2021-02-26 北京京东尚科信息技术有限公司 Text classification method and device and computer-readable storage medium
CN109885686A (en) * 2019-02-20 2019-06-14 延边大学 Multilingual text classification method fusing topic information and BiLSTM-CNN
CN110032714A (en) * 2019-02-25 2019-07-19 阿里巴巴集团控股有限公司 Corpus annotation feedback method and device
CN112084334A (en) * 2020-09-04 2020-12-15 中国平安财产保险股份有限公司 Corpus label classification method and device, computer equipment and storage medium
CN112084334B (en) * 2020-09-04 2023-11-21 中国平安财产保险股份有限公司 Label classification method and device for corpus, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN106202177B Text classification method and device
CN107894980A Multilingual sentence-pair text corpus classification method and classifier
CN102682124B Sentiment classification method and device for text
CN108376151A Question classification method, device, computer equipment and storage medium
CN107025284A Method for recognizing sentiment tendency of online comment text, and convolutional neural network model
CN106489149A Data annotation method and system based on data mining and crowdsourcing
CN109299271A Training sample generation, text data and public-opinion event classification methods, and related device
CN108536756A Emotion classification method and system based on bilingual information
CN108256104A Comprehensive classification method for Internet websites based on multidimensional features
CN111221939A Scoring method and device, and electronic equipment
CN110188047A Duplicate defect report detection method based on dual-channel convolutional neural networks
CN109120632A Network traffic anomaly detection method based on online feature selection
CN109271627A Text analysis method, apparatus, computer equipment and storage medium
CN109960727A Automatic detection method and system for personal privacy information in unstructured text
CN109902202A Video classification method and device
CN110472203A Duplicate-checking detection method, device, equipment and storage medium for articles
CN109284374A Method, apparatus, equipment and computer-readable storage medium for determining entity category
CN112199496A Power grid equipment defect text classification method based on multi-head attention mechanism and RCNN
CN108549723A Text concept classification method, device and server
CN109145108A Stacked classifier training method, classification method, device and computer equipment for text
CN114580418A Knowledge graph system for police physical training
CN112966708A Chinese crowdsourcing test report clustering method based on semantic similarity
Jabbar et al. Supervised learning approach for surface-mount device production
CN113297842A Text data augmentation method
CN112613321A Method and system for extracting entity attribute information in text

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180410

RJ01 Rejection of invention patent application after publication