CN110196910A - Method and device for corpus classification - Google Patents
Method and device for corpus classification
- Publication number: CN110196910A (application CN201910468030.8A)
- Authority
- CN
- China
- Prior art keywords
- vector
- candidate
- translation
- corpus
- feature words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the present application provide a method and device for corpus classification, belonging to the field of data analysis. The method includes: extracting from the text corpus of each given class the feature words of that corpus; translating the feature words into a target language, and forming a translation vector for each given class from the resulting translations and the vector features of the feature words; and extracting candidate words from a candidate corpus to form a candidate vector, matching the candidate vector against the translation vector of each given class, and determining the target category of the candidate corpus according to the resulting matching degrees. The application analyzes the keywords in text corpora of a known language and, after translation, matches them against a candidate corpus in an unknown language to predict the category of the unknown corpus. Corpora can thus be classified even without a translator for the corresponding language, improving the efficiency of information processing.
Description
Technical field
The present application relates to the field of data processing, and in particular to a method and device for classifying a corpus in an unknown language.
Background technique
With the explosive growth of information on the Internet, information now spreads through media across many countries. The vast majority of network data exists in the form of text. How to use natural language processing to classify this text, so that users can find useful information more accurately and quickly, has become an important research problem in the field of artificial intelligence. At present, corpora such as web pages and news are classified mainly by machine learning models: manually labeled samples are built for each category, a classification model is obtained by training on those samples, and candidate corpora are then classified online.

In a multilingual environment, however, samples must be established for each language separately, which requires manual labeling and separate training rules per language. When there are many target languages, the construction cost is very high, greatly reducing the efficiency of information processing.
Summary of the invention
The purpose of the embodiments of the present application is to provide a method and device for corpus classification, to meet the need to classify corpora in a multilingual environment.
To solve the above technical problem, the embodiments of the present application are implemented as follows.
According to a first aspect of the embodiments of the present application, a method of corpus classification is provided. The method comprises:

extracting from the text corpus of each given class the feature words of that corpus;

translating the feature words into a target language, and forming a translation vector for each given class from the resulting translations and the vector features of the feature words, the translation vector describing the feature attributes of the feature words of each given class in the target language;

extracting candidate words from a candidate corpus to form a candidate vector, matching the candidate vector against the translation vector of each given class, and obtaining the matching degree between the candidate vector and the translation vector of each given class;

determining the target category of the candidate corpus according to the matching degrees.
In one embodiment of the application, the method further comprises:

extracting the weight probability of each feature word in the translation vector;

performing iterative training with the vector features of the feature words as sample features to obtain a language model;

matching the language model, as the translation vector of each given class, against the candidate vector.
In one embodiment of the application, extracting from the text corpus of each given class comprises:

segmenting the text corpus and counting the keywords obtained from segmentation;

looking up the synonyms or related words of the keywords and computing the vector features of the keywords;

setting a weight for each keyword according to its vector features, and screening by weight to obtain the feature words of the text corpus.
In one embodiment of the application, when forming the translation vector of each given class, the vector features of the feature words are extracted and the vector features of the translations are obtained; the translations are then combined in association with the feature words to form the translation vector.
In one embodiment of the application, when forming the translation vector of each given class, if a feature word has more than one translation in the target language, each translation is combined in association with the feature word, with the weight in the feature word's vector features divided among the translations, forming multiple translation vectors.
In one embodiment of the application, extracting candidate words from the candidate corpus to form the candidate vector comprises:

analyzing the candidate corpus and extracting its candidate words;

extracting the feature attributes of each candidate word and the weights of those feature attributes, obtaining the vector features of each candidate word;

fitting the vector features of the candidate words to form the candidate vector of the candidate corpus.
In one embodiment of the application, matching the candidate vector against the translation vector of each given class comprises:

extracting the vector features of the candidate words in the candidate vector;

matching those vector features against the vector features of each translation vector;

screening out the given classes whose matching degree exceeds a set threshold;

taking the given classes that exceed the threshold as the target categories of the candidate corpus.
According to a second aspect of the embodiments of the present application, a device for corpus classification is provided. The device comprises:

an extraction module, configured to extract from the text corpus of each given class the feature words of that corpus;

a translation module, configured to translate the feature words into a target language and form a translation vector for each given class from the resulting translations and the vector features of the feature words, the translation vector describing the feature attributes of the feature words of each given class in the target language;

a matching module, configured to extract candidate words from a candidate corpus to form a candidate vector, match the candidate vector against the translation vector of each given class, and obtain the matching degree between the candidate vector and the translation vector of each given class;

a division module, configured to determine the target category of the candidate corpus according to the matching degrees.
In one embodiment of the application, the device further includes a model unit, which comprises:

an extraction unit, configured to extract the weight probability of each feature word in the translation vector;

a training unit, configured to perform iterative training with the vector features of the feature words as sample features to obtain the language model;

a matching unit, configured to match the language model, as the translation vector of each given class, against the candidate vector.
In one embodiment of the application, the extraction module comprises:

a segmentation unit, configured to segment the text corpus and count the keywords obtained from segmentation;

an association unit, configured to look up the synonyms or related words of the keywords and compute the vector features of the keywords;

a screening unit, configured to set a weight for each keyword according to its vector features and screen by weight to obtain the feature words of the text corpus.
In one embodiment of the application, the translation module comprises:

an association unit, configured to extract the vector features of the feature words, obtain the vector features of the translations, and combine the translations in association with the feature words to form the translation vector.
In one embodiment of the application, in the translation module, when a feature word has more than one translation in the target language, each translation is combined in association with the feature word, with the weight in the feature word's vector features divided among the translations, forming multiple translation vectors.
In one embodiment of the application, the matching module comprises:

an analysis unit, configured to analyze the candidate corpus and extract its candidate words;

a weight-assigning unit, configured to extract the feature attributes of each candidate word and the weights of those feature attributes, obtaining the vector features of each candidate word;

a fitting unit, configured to fit the vector features of the candidate words to form the candidate vector of the candidate corpus.
As can be seen from the technical solutions provided by the above embodiments, the embodiments of the present application extract from the text corpus of each given class the feature words of that corpus; translate the feature words into a target language and form a translation vector for each given class from the resulting translations and the vector features of the feature words; extract candidate words from a candidate corpus to form a candidate vector, match the candidate vector against the translation vector of each given class, and obtain the matching degree between the candidate vector and the translation vector of each given class; and determine the target category of the candidate corpus according to the matching degrees. This scheme analyzes the keywords in text corpora of a known language and, after translation, matches them against a candidate corpus in an unknown language to predict the category of the unknown corpus; corpora can thus be classified even without a translator for the corresponding language, improving the efficiency of information processing.
Brief description of the drawings

To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments recorded in this specification; those of ordinary skill in the art may obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the corpus classification method of one embodiment of the application;

Fig. 2 is a structural schematic diagram of the corpus classification device of one embodiment of the application.
Specific embodiments

To help those skilled in the art better understand the technical solutions in this specification, the technical solutions in the embodiments of the application are described below clearly and completely in conjunction with the drawings. Obviously, the described embodiments are only some of the embodiments of this specification, not all of them. Based on the embodiments in this specification, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the scope of protection of this specification.
The embodiments of the present application provide a method and device for corpus classification.

A method of corpus classification provided by the embodiments of the present application is introduced first.
At present, the vast majority of network data exists in the form of text, and text data often belongs to different languages. When the prior art classifies corpora such as web pages and news, preset manually labeled samples are trained repeatedly, and candidate corpora are classified only after a large amount of manual processing and training yields a classification model. In a multilingual environment, however, labeling samples for each language separately is often impractical and reduces development efficiency. The present invention analyzes the keywords in text corpora of a known language, trains the translation vector of each given class, analyzes the candidate corpus, extracts a candidate vector, and matches it against the translation vector of each category, thereby determining the target category of the candidate corpus. Corpora can thus be classified even without a translator for the corresponding language, improving the efficiency of information processing.
Fig. 1 is a flowchart of the corpus classification method of one embodiment of the application. As shown in Fig. 1, the method may include the following steps.
In step 101, the feature words of each text corpus are extracted from the text corpus of each given class.
In the present embodiment, extraction from the text corpus of each given class proceeds as follows.

Step 101a: segment the text corpus and count the keywords obtained from segmentation.

The full text of the text corpus is segmented after semantic analysis. Among the notional words in the segmentation result, particularly nouns and verbs, the word frequency and the position of each word in the text corpus are counted; notional words whose frequency exceeds a set threshold, and/or which appear in key positions such as the title, first paragraph, or last paragraph, are taken as the keywords of the text corpus.
Step 101b: look up the synonyms or related words of the keywords and compute the vector features of the keywords.

The synonyms or near-synonyms of each keyword are looked up, and the vector features of the keywords are computed. In the present embodiment, feature attributes such as the word frequency (including synonyms and/or near-synonyms), length, part of speech, position mark, whether the word starts a sentence, and whether it is in bold are extracted, and these feature attributes serve as the vector features of the keyword.
Step 101c: set a weight for each keyword according to its vector features, and screen by weight to obtain the feature words of the text corpus.

Because a text corpus contains too many keywords, they must be screened by weight, extracting from the keywords the feature words that represent the characteristics of the given class, so that the extracted feature words accurately represent the category characteristics of that class.

Specifically, a weight is set for each vector attribute of a keyword, the attributes of each keyword are summed with those weights, and the keywords are screened by the resulting weighted sum. In this implementation, each keyword is screened against a set weight threshold: keywords whose weighted sum exceeds the threshold are taken as feature words.
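A minimal sketch of steps 101a to 101c, assuming the corpus is already segmented into tokens; the specific feature set (relative frequency, length, title position), the weight values, and the threshold are illustrative assumptions, not values from the patent.

```python
from collections import Counter

def extract_feature_words(tokens, title_tokens, weights, threshold):
    """Score each keyword by a weighted sum of its vector features and
    keep those whose score exceeds the threshold (step 101c)."""
    freq = Counter(tokens)  # step 101a: count keywords after segmentation
    selected = []
    for word, tf in freq.items():
        # Step 101b: a toy vector of feature attributes for the keyword.
        features = {
            "word_freq": tf / len(tokens),                     # relative frequency
            "length": len(word),                               # word length
            "in_title": 1.0 if word in title_tokens else 0.0,  # key-position mark
        }
        score = sum(weights[name] * value for name, value in features.items())
        if score > threshold:
            selected.append(word)
    return selected

tokens = ["match", "team", "goal", "team", "score", "goal", "team"]
weights = {"word_freq": 1.0, "length": 0.0, "in_title": 0.5}
feature_words = extract_feature_words(tokens, {"team"}, weights, 0.4)
```

Here only "team" passes the screen: it is both frequent and in a key position (the title), matching the idea that high-frequency words in key positions become feature words.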
Step 102: translate the feature words into a target language, and form a translation vector for each given class from the resulting translations and the vector features of the feature words; the translation vector describes the feature attributes of the feature words of each given class in the target language.

A preset translation lexicon is called to translate each feature word into the target language and obtain its translation. Feature attributes of the translation, such as word frequency, length, part of speech, position mark, whether it starts a sentence, and whether it is in bold, are extracted and counted together with the corresponding weights to obtain the vector features of the translation, which are combined with the vector features of the untranslated feature words to form the translation vector.
In the present embodiment, a translation vector contains two parts: one part is the feature words in the text of the given class, which represent the characteristics of the class; the other part is the translations of the feature words, which represent the language attributes of those characteristics. Normally, since a translation is obtained by translating a feature word, the vector features of each feature word and of its translation are equal in the text corpus of each given class. In the special case where a feature word has more than one translation, the feature word forms a separate translation vector with each translation: the weights in the vector features of the translations are divided equally and combined with the vector features of the feature word to form multiple translation vectors, so that each weight in a translation vector is the average of the keyword weight and the corresponding translation weight.
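The multiple-translation case can be sketched as follows: when a feature word has k candidate translations, one (word, translation) pair is formed per candidate and the feature word's weight is divided equally among them. The data shapes and the example French translations are assumptions for illustration.

```python
def build_translation_vectors(feature_weights, translations):
    """feature_weights: {feature word: weight in its vector features};
    translations: {feature word: list of candidate translations}.
    Returns one translation-vector entry per (word, translation) pair."""
    vectors = []
    for word, weight in feature_weights.items():
        candidates = translations.get(word, [])
        for t in candidates:
            vectors.append({
                "word": word,
                "translation": t,
                "weight": weight / len(candidates),  # weight divided equally
            })
    return vectors

# "goal" has two hypothetical French translations, so its weight 0.8 splits into 0.4 each.
vecs = build_translation_vectors({"goal": 0.8}, {"goal": ["but", "objectif"]})
```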
In other embodiments, the vector features in the translation vector are trained together with the vector features of the feature words to obtain a language model, and the model structure replaces the translation vector of each given class. Specifically:

Step 102a: extract the weight probability of each feature word in the translation vector.

For the translation vector of each given class, the vector features of the feature words and translations in each translation vector are extracted, and the weight parameters in the vector features are normalized into weight probabilities.
Step 102b: perform iterative training with the vector features of the feature words as sample features to obtain the language model.

After the vector features of the feature words in each given class are likewise normalized, an SVM (support vector machine) is trained on the vector features of the feature words and of the translation vector in each given class. The difference between the vector features of each feature word in the text corpus of the category and the vector features of the translation vector serves as a positive sample; the difference between the vector features of non-feature words in that corpus and the vector features of the translation vector serves as a negative sample. Iterative training on the sample features of the text corpus of each given type yields a language model that divides the translations under that language into categories, judging the probability that a translation belongs to a given class.
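Step 102b names an SVM; to keep this sketch self-contained (no ML library), a simple perceptron stands in for the SVM, trained on the same kind of positive/negative difference-vector samples the text describes. The sample values are invented for illustration.

```python
def train_linear_model(positives, negatives, epochs=20, lr=0.1):
    """Train a linear separator (perceptron) on difference vectors:
    positives = feature-word minus translation-vector features,
    negatives = non-feature-word minus translation-vector features."""
    w = [0.0] * len(positives[0])
    b = 0.0
    samples = [(x, 1) for x in positives] + [(x, -1) for x in negatives]
    for _ in range(epochs):
        for x, y in samples:
            pred = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * pred <= 0:  # misclassified: nudge the separator
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

pos = [[0.1, 0.2], [0.2, 0.1]]      # small differences: likely same class
neg = [[-0.3, -0.2], [-0.1, -0.4]]  # large negative differences
w, b = train_linear_model(pos, neg)
pos_score = sum(wi * xi for wi, xi in zip(w, [0.15, 0.15])) + b
neg_score = sum(wi * xi for wi, xi in zip(w, [-0.2, -0.3])) + b
```

A positive score means the model judges the input as belonging to the class; a real implementation would use an SVM with probability output, as the patent suggests.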
Step 102c: match the language model, as the translation vector of each given class, against the candidate vector.

In the subsequent steps, candidate words are extracted from the candidate corpus to form a candidate vector, the candidate vector is matched in the language model to obtain the probability that it belongs to each given class, and the class with the highest probability is taken as the target category of the candidate corpus.
Step 103: extract candidate words from the candidate corpus to form a candidate vector, match the candidate vector against the translation vector of each given class, and obtain the matching degree between the candidate vector and the translation vector of each given class.
Converting the candidate corpus into a candidate vector includes the following steps.

Step 103a: analyze the candidate corpus and extract its candidate words.

The candidate corpus in the target language is segmented and the segmentation result is screened: among the notional words, particularly nouns and verbs, the word frequency and position of each word are considered, and notional words whose frequency exceeds a set threshold, and/or which appear in key positions such as the title, first paragraph, or last paragraph, are taken as the candidate words of the candidate corpus.
Step 103b: extract the feature attributes of each candidate word and the weights of those feature attributes, obtaining the vector features of each candidate word.

In the present embodiment, feature attributes such as word frequency (including synonyms and/or near-synonyms), length, part of speech, position mark, whether the word starts a sentence, and whether it is in bold are extracted and associated with their corresponding weights, yielding the vector features of each candidate word.
Step 103c: fit the vector features of the candidate words to form the candidate vector of the candidate corpus.

In the present embodiment, the vector features of the candidate words are normalized to form the candidate vector of the candidate corpus.
When matching the candidate vector against the translation vector of each given class, the vector features of the candidate words in the candidate vector are extracted and matched against each translation vector. If the matching degree between the candidate vector and some translation vector is high, the candidate vector has feature words close to those of the translation vector, with similar feature attributes, so the probability that the candidate corpus belongs to the same category as the text corpus of that given class is high. Conversely, if the matching degree is low, the feature words of the candidate vector differ greatly from those of the translation vector, and the probability that the candidate corpus belongs to the same category as that given class is low.
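The matching itself can be sketched with cosine similarity as the matching degree; the patent does not fix a similarity measure, so cosine is an assumption, and the class names and vectors are invented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def match_degrees(candidate, class_translation_vectors):
    """Matching degree of a candidate vector against each class's translation vector."""
    return {name: cosine(candidate, vec)
            for name, vec in class_translation_vectors.items()}

degrees = match_degrees([1.0, 0.0],
                        {"sports": [0.9, 0.1], "finance": [0.0, 1.0]})
```

A high degree (here for "sports") means the candidate shares feature attributes with that class's translation vector; an orthogonal vector (here "finance") scores zero.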
In other embodiments, the trained language model is called, the candidate vector of the candidate corpus is matched in the language model, and the matching score of the candidate vector against each translation vector is judged, for selection in the subsequent step.
Step 104: determine the target category of the candidate corpus according to the matching degrees.

The given classes whose matching degree after matching exceeds a set threshold are chosen as target categories, and the candidate corpus belongs to those target categories.

In other embodiments, if the matching degree between the candidate vector and the translation vectors of more than one given class exceeds the set threshold, the candidate corpus belongs to more than one given class.
This scheme analyzes the keywords in text corpora of a known language and, after translation, matches them against a candidate corpus in an unknown language to predict the category of the unknown corpus; corpora can thus be classified even without a translator for the corresponding language, improving the efficiency of information processing.
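Step 104's threshold screening, including the multi-category case, can be sketched as follows; the threshold and matching degrees are illustrative.

```python
def target_categories(degrees, threshold):
    """Every given class whose matching degree exceeds the set threshold
    becomes a target category; more than one may qualify."""
    return sorted(name for name, degree in degrees.items() if degree > threshold)

cats = target_categories({"sports": 0.92, "finance": 0.15, "tech": 0.81}, 0.7)
```

Here both "sports" and "tech" exceed the threshold, so the candidate corpus is assigned to both classes.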
In another alternative embodiment, corpus classification proceeds as follows.

Step 201: extract from the text corpus of each given class the feature words of that corpus.

In the document collection corresponding to the text corpus of a given class, each notional word is obtained together with its synonyms, prefix-related words, common-substring related words, and semantically related words; the corresponding set is denoted S.
Step 202: translate the feature words into a target language, and form a translation vector for each given class from the resulting translations and the vector features of the feature words; the translation vector describes the feature attributes of the feature words of each given class in the target language.

In the present embodiment, the set S is translated into the target language, and the set of translations is denoted D. The keywords in S that match the text corpus of the given class are extracted as feature words, which are trained using a topic model or word embedding to generate the feature attributes of the feature words; the feature attributes of the corresponding translations are generated in the same way.
The weight factor of a feature word's feature attributes combines its TF/IDF, page aggregation degree, and semantic aggregation degree:

1) TF/IDF: the TF/IDF value of the feature word, denoted g1. TF/IDF is the product of the term frequency and the inverse document frequency of the feature word, where TF is the frequency of the feature word and IDF is the inverse of the number of documents in the collection in which the feature word occurs.

2) Page aggregation degree: the number of other feature words occurring within a sentence-level sliding window of size M (a positive integer), denoted g2.

3) Semantic aggregation degree: the number of other feature words occurring within the neighborhood N of the word in a vector space, denoted g3.

The weight of the feature word is then G = a1*g1 + a2*g2 + a3*g3, where a1, a2, a3 are given coefficients.
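The weight G = a1*g1 + a2*g2 + a3*g3 can be sketched directly; g1 follows the text's description of TF/IDF (term frequency times the inverted document count), g2 and g3 are assumed to be pre-counted, and the coefficient values are illustrative.

```python
def feature_word_weight(tf, docs_with_word, g2, g3, a1=0.6, a2=0.2, a3=0.2):
    """tf: term frequency of the feature word;
    docs_with_word: number of documents in the collection containing it;
    g2: page aggregation degree (co-occurrences within the sliding window M);
    g3: semantic aggregation degree (neighbors within the vector-space field N)."""
    g1 = tf * (1.0 / docs_with_word)  # TF times the inverted document count
    return a1 * g1 + a2 * g2 + a3 * g3

# A feature word seen 3 times, appearing in 10 of the collection's documents,
# with 2 window co-occurrences and 1 vector-space neighbor:
G = feature_word_weight(tf=3, docs_with_word=10, g2=2, g3=1)
```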
Similarly, the weight factor of the feature attributes of a feature word's translation combines TF/IDF, page aggregation degree, and semantic aggregation degree:

1) TF/IDF: the TF/IDF value of the feature word's translation, denoted h1. TF/IDF is the product of the term frequency and the inverse document frequency.

2) Page aggregation degree: the number of translations of other feature words occurring within a sentence-level sliding window of size M, denoted h2.

3) Semantic aggregation degree: the number of translations of other feature words occurring within the neighborhood N in a vector space, denoted h3.

The weight of the feature word's translation is then H = b1*h1 + b2*h2 + b3*h3, where b1, b2, b3 are given coefficients.
In summary, the translation vector corresponding to a text corpus document of a given class is calculated as follows:
where V_doc is the translation vector corresponding to the text corpus doc of the given class, n is the number of feature words in the text corpus doc, i = 1, 2, 3, ..., n, V_wi is the characteristic attribute of feature word w_i and its corresponding translation, G_wi is the weight of feature word w_i, and H_wi is the weight of the translation of feature word w_i.
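The aggregation formula itself is not reproduced here; one plausible reading of the definitions of V_wi, G_wi, and H_wi is a weighted sum over the feature-word vectors, sketched below. The weighted-sum form is an assumption for illustration, not the patent's stated formula.

```python
import numpy as np

def translation_vector(word_vectors, word_weights, translation_weights):
    """Aggregate per-feature-word vectors V_wi into a document-level
    translation vector V_doc, weighting each word by G_wi + H_wi.
    The weighted-sum form is an assumption; the source states only
    that V_doc is formed from V_wi, G_wi, and H_wi."""
    v_doc = np.zeros_like(np.asarray(word_vectors[0], dtype=float))
    for v_wi, g_wi, h_wi in zip(word_vectors, word_weights, translation_weights):
        v_doc += (g_wi + h_wi) * np.asarray(v_wi, dtype=float)
    return v_doc

v_doc = translation_vector(
    word_vectors=[[1.0, 0.0], [0.0, 1.0]],  # V_wi (toy 2-d features)
    word_weights=[2.0, 1.0],                # G_wi
    translation_weights=[0.5, 0.5],         # H_wi
)
```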
Further, from the text corpus of a given class, a certain number of content words (verbs, nouns, and the like) are chosen as label words and used to calculate the feature vectors of positive samples, while a certain number of other words (function words, interjections, and the like) are chosen as non-label words and used to calculate the feature vectors of negative samples. The feature vectors of the positive and negative samples are each subtracted from the translation vector and then normalized, and a regression model is trained on the results.
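The sampling and training step above can be sketched as follows: content-word vectors yield positive samples and function-word vectors negative samples, each difference from the translation vector is L2-normalized, and a logistic-regression model is fitted by gradient descent. The vectors, the choice of logistic regression, and the hyperparameters are all hypothetical stand-ins for values the patent does not specify.

```python
import numpy as np

def build_samples(translation_vector, word_vectors):
    """Subtract each word vector from the translation vector and
    L2-normalize the result (one reading of the 'subtract then
    normalize' sample construction)."""
    diffs = np.asarray(translation_vector, dtype=float) - np.asarray(word_vectors, dtype=float)
    norms = np.linalg.norm(diffs, axis=1, keepdims=True)
    return diffs / np.where(norms == 0, 1.0, norms)

def train_logistic(X, y, lr=0.5, steps=500):
    """Minimal logistic regression via batch gradient descent (illustrative)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        grad = p - y                             # gradient of log-loss w.r.t. logits
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

v_doc = np.array([2.5, 1.5])                       # toy translation vector
pos = build_samples(v_doc, [[2.0, 1.0], [2.0, 1.2]])   # content words -> label 1
neg = build_samples(v_doc, [[3.5, 2.5], [4.0, 2.0]])   # function words -> label 0
X = np.vstack([pos, neg])
y = np.array([1.0, 1.0, 0.0, 0.0])
w, b = train_logistic(X, y)
```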
Step 203: extract the corresponding candidate words from the candidate corpus to form a candidate vector, match the candidate vector against the translation vector corresponding to each given class, and obtain the matching degree between the candidate vector and the translation vector of each given class.
As in step 202, in this embodiment the candidate corpus is segmented, and the keywords in the segmentation results that match entries in set D are extracted as candidate words; the original text of each candidate word before translation is also obtained for the subsequent composition of the candidate vector.
As above, the weight corresponding to each candidate word and its original text in the candidate corpus is calculated, and the candidate vector corresponding to the candidate corpus is then formed.
The candidate vector is analyzed with the regression model obtained in step 202, yielding the matching degree between the candidate vector and the translation vector of each given class.
Step 204: determine the target category to which the candidate corpus belongs according to the matching degree.
The present invention analyzes the keywords in text corpora of known languages to train a translation vector for each given class, analyzes the candidate corpus, extracts a candidate vector, and matches it against the translation vector of each class, thereby determining the target category corresponding to the candidate corpus. Corpora can thus be classified even without a translator for the corresponding language, improving the efficiency of information processing.
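Steps 203 and 204 together can be sketched as a nearest-class match: the candidate vector is compared against each class's translation vector and the best-matching class above a threshold is returned. Cosine similarity and the threshold value are illustrative choices; the source requires only some matching degree and a given threshold.

```python
import numpy as np

def classify_candidate(candidate_vector, class_translation_vectors, threshold=0.6):
    """Match a candidate vector against each given class's translation
    vector (cosine similarity here, as one possible matching degree) and
    return the best class above the threshold, or None if no class qualifies."""
    c = np.asarray(candidate_vector, dtype=float)
    best_class, best_score = None, threshold
    for label, vec in class_translation_vectors.items():
        v = np.asarray(vec, dtype=float)
        score = c @ v / (np.linalg.norm(c) * np.linalg.norm(v))
        if score > best_score:
            best_class, best_score = label, score
    return best_class

classes = {"sports": [0.9, 0.1], "finance": [0.1, 0.9]}  # toy translation vectors
target = classify_candidate([0.8, 0.2], classes)
```

Returning None when no class clears the threshold mirrors the threshold screening described in claim 7.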
Fig. 2 is a structural schematic diagram of a corpus classification device according to one embodiment of the present application. Referring to Fig. 2, in one software implementation the corpus classification device 800 in the figure may include: an extraction module 801, a translation module 802, a matching module 803, and a division module 804, wherein:
the extraction module 801 is used for extracting from the text corpus of each given class to obtain the feature words corresponding to the text corpus;
the translation module 802 is used for translating the feature words into the target language and forming the translation vector corresponding to each given class according to the resulting translations and the vector features corresponding to the feature words, the translation vector being used to describe the characteristic attributes of the feature words under each given class in the target language;
the matching module 803 is used for extracting the corresponding candidate words from the candidate corpus to form a candidate vector, matching the candidate vector against the translation vector corresponding to each given class, and obtaining the matching degree between the candidate vector and the translation vector of each given class;
the division module 804 is used for determining the target category to which the candidate corpus belongs according to the matching degree.
The extraction module 801 includes:
a segmentation unit, for segmenting the text corpus and counting the keywords resulting from segmentation;
an association unit, for searching for the near-synonyms or associated words corresponding to the keywords and counting the vector features corresponding to the keywords;
a screening unit, for setting the weight corresponding to each keyword according to the vector features and screening according to the weights to obtain the feature words corresponding to the text corpus.
The corpus classification device 800 further includes a model unit, which specifically includes:
an extraction unit, for extracting the weight probability corresponding to each feature word in the translation vector;
a training unit, for performing iterative training with the vector features corresponding to the feature words as sample features to obtain the language model;
a matching unit, for matching the language model, as the translation vector corresponding to each given class, against the candidate vector.
The translation module 802 specifically includes:
an association unit, for extracting the vector features corresponding to the feature words, obtaining the vector features corresponding to the translations, and associating and combining the translations with the feature words to form the translation vector.
In the translation module 802, when a feature word has more than one corresponding translation in the target language, each translation is associated and combined with the feature word separately, each taking the weight in the vector features corresponding to the feature word, so as to form multiple corresponding translation vectors.
The matching module 803 specifically includes:
an analysis unit, for analyzing the candidate corpus and extracting the candidate words therein;
a weight-assigning unit, for extracting the characteristic attributes corresponding to the candidate words and the weights corresponding to those characteristic attributes, to obtain the vector features corresponding to each candidate word;
a fitting unit, for fitting the vector features corresponding to the candidate words to form the candidate vector corresponding to the candidate corpus.
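The four modules of device 800 can be sketched as a minimal pipeline of plain classes. All method bodies here are illustrative stand-ins (the source specifies the modules' responsibilities, not their implementations), and the word-overlap matching degree is a hypothetical simplification of the vector matching described above.

```python
class ExtractionModule:  # module 801
    def extract(self, corpora_by_class):
        """Segment each class's text corpus (whitespace split, as a stand-in)
        and return its feature words."""
        return {label: sorted(set(text.split()))
                for label, text in corpora_by_class.items()}

class TranslationModule:  # module 802
    def build_translation_vectors(self, feature_words_by_class, translate):
        """Translate feature words into the target language and form one
        'translation vector' per class (here simply the translated bag of words)."""
        return {label: [translate(w) for w in words]
                for label, words in feature_words_by_class.items()}

class MatchingModule:  # module 803
    def match(self, candidate_words, translation_vectors):
        """Matching degree: fraction of candidate words found in each
        class's translation vector (a hypothetical simplification)."""
        return {label: sum(w in vec for w in candidate_words) / max(len(candidate_words), 1)
                for label, vec in translation_vectors.items()}

class DivisionModule:  # module 804
    def divide(self, matching_degrees):
        """Pick the target category with the highest matching degree."""
        return max(matching_degrees, key=matching_degrees.get)

# Wiring the pipeline end to end on toy data:
ex, tr, ma, dv = ExtractionModule(), TranslationModule(), MatchingModule(), DivisionModule()
words = ex.extract({"sports": "goal match team", "finance": "stock bond market"})
vectors = tr.build_translation_vectors(words, translate=str.upper)  # toy "translation"
category = dv.divide(ma.match(["GOAL", "TEAM"], vectors))
```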
In short, the foregoing is merely a preferred embodiment of this specification and is not intended to limit its protection scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of this specification shall be included within the protection scope of this specification.
The systems, devices, modules, or units illustrated in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassette, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, commodity, or device that includes that element.
All the embodiments in this specification are described in a progressive manner; the same or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, its description is relatively simple, and the relevant parts may refer to the description of the method embodiment.
Claims (13)
1. A method of corpus classification, characterized in that the method includes:
extracting from the text corpus of each given class to obtain the feature words corresponding to the text corpus;
translating the feature words into the target language and forming the translation vector corresponding to each given class according to the resulting translations and the vector features corresponding to the feature words, the translation vector being used to describe the characteristic attributes of the feature words under each given class in the target language;
extracting the corresponding candidate words from the candidate corpus to form a candidate vector, matching the candidate vector against the translation vector corresponding to each given class, and obtaining the matching degree between the candidate vector and the translation vector of each given class;
determining the target category to which the candidate corpus belongs according to the matching degree.
2. The method according to claim 1, characterized in that, when extracting from the text corpus of each given class:
the text corpus is segmented, and the keywords resulting from segmentation are counted;
the near-synonyms or associated words corresponding to the keywords are searched, and the vector features corresponding to the keywords are counted;
the weight corresponding to each keyword is set according to the vector features, and screening is performed according to the weights to obtain the feature words corresponding to the text corpus.
3. The method according to claim 1 or 2, characterized in that the method further includes:
extracting the weight probability corresponding to each feature word in the translation vector;
performing iterative training with the vector features corresponding to the feature words as sample features to obtain the language model;
matching the language model, as the translation vector corresponding to each given class, against the candidate vector.
4. The method according to claim 1, characterized in that, when forming the translation vector corresponding to each given class, the vector features corresponding to the feature words are extracted, the vector features corresponding to the translations are obtained, and the translations are associated and combined with the feature words to form the translation vector.
5. The method according to claim 4, characterized in that, when forming the translation vector corresponding to each given class, if a feature word has more than one corresponding translation in the target language, each translation is associated and combined with the feature word separately, each taking the weight in the vector features corresponding to the feature word, so as to form multiple corresponding translation vectors.
6. The method according to claim 1, characterized in that, when extracting the corresponding candidate words from the candidate corpus to form the candidate vector:
the candidate corpus is analyzed, and the candidate words therein are extracted;
the characteristic attributes corresponding to the candidate words and the weights corresponding to those characteristic attributes are extracted, to obtain the vector features corresponding to each candidate word;
the vector features corresponding to the candidate words are fitted to form the candidate vector corresponding to the candidate corpus.
7. The method according to claim 1, characterized in that, when matching the candidate vector against the translation vector corresponding to each given class:
the vector features corresponding to the candidate words in the candidate vector are extracted;
the vector features corresponding to the candidate words are matched against the vector features corresponding to each translation vector;
the given classes whose matching degree exceeds a given threshold are selected according to the resulting matching degrees;
the given class exceeding the given threshold is taken as the target category to which the candidate corpus belongs.
8. A device of corpus classification, characterized in that the device includes:
an extraction module, for extracting from the text corpus of each given class to obtain the feature words corresponding to the text corpus;
a translation module, for translating the feature words into the target language and forming the translation vector corresponding to each given class according to the resulting translations and the vector features corresponding to the feature words, the translation vector being used to describe the characteristic attributes of the feature words under each given class in the target language;
a matching module, for extracting the corresponding candidate words from the candidate corpus to form a candidate vector, matching the candidate vector against the translation vector corresponding to each given class, and obtaining the matching degree between the candidate vector and the translation vector of each given class;
a division module, for determining the target category to which the candidate corpus belongs according to the matching degree.
9. The device according to claim 8, characterized in that the extraction module specifically includes:
a segmentation unit, for segmenting the text corpus and counting the keywords resulting from segmentation;
an association unit, for searching for the near-synonyms or associated words corresponding to the keywords and counting the vector features corresponding to the keywords;
a screening unit, for setting the weight corresponding to each keyword according to the vector features and screening according to the weights to obtain the feature words corresponding to the text corpus.
10. The device according to claim 8 or 9, characterized in that the device further includes a model unit, which specifically includes:
an extraction unit, for extracting the weight probability corresponding to each feature word in the translation vector;
a training unit, for performing iterative training with the vector features corresponding to the feature words as sample features to obtain the language model;
a matching unit, for matching the language model, as the translation vector corresponding to each given class, against the candidate vector.
11. The device according to claim 8, characterized in that the translation module specifically includes:
an association unit, for extracting the vector features corresponding to the feature words, obtaining the vector features corresponding to the translations, and associating and combining the translations with the feature words to form the translation vector.
12. The device according to claim 11, characterized in that, in the translation module, when a feature word has more than one corresponding translation in the target language, each translation is associated and combined with the feature word separately, each taking the weight in the vector features corresponding to the feature word, so as to form multiple corresponding translation vectors.
13. The device according to claim 8, characterized in that the matching module specifically includes:
an analysis unit, for analyzing the candidate corpus and extracting the candidate words therein;
a weight-assigning unit, for extracting the characteristic attributes corresponding to the candidate words and the weights corresponding to those characteristic attributes, to obtain the vector features corresponding to each candidate word;
a fitting unit, for fitting the vector features corresponding to the candidate words to form the candidate vector corresponding to the candidate corpus.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910468030.8A CN110196910B (en) | 2019-05-30 | 2019-05-30 | Corpus classification method and apparatus |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110196910A true CN110196910A (en) | 2019-09-03 |
CN110196910B CN110196910B (en) | 2022-02-15 |
Family
ID=67753486
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910468030.8A Active CN110196910B (en) | 2019-05-30 | 2019-05-30 | Corpus classification method and apparatus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110196910B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101667177A (en) * | 2009-09-23 | 2010-03-10 | 清华大学 | Method and device for aligning bilingual text |
WO2011100862A1 (en) * | 2010-02-22 | 2011-08-25 | Yahoo! Inc. | Bootstrapping text classifiers by language adaptation |
CN103902619A (en) * | 2012-12-28 | 2014-07-02 | 中国移动通信集团公司 | Internet public opinion monitoring method and system |
US20180165278A1 (en) * | 2016-12-12 | 2018-06-14 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for translating based on artificial intelligence |
CN108460396A (en) * | 2017-09-20 | 2018-08-28 | 腾讯科技(深圳)有限公司 | The negative method of sampling and device |
CN108510977A (en) * | 2018-03-21 | 2018-09-07 | 清华大学 | Language Identification and computer equipment |
CN108536756A (en) * | 2018-03-16 | 2018-09-14 | 苏州大学 | Mood sorting technique and system based on bilingual information |
Non-Patent Citations (3)
Title |
---|
Zhang Fengming et al.: "Data Mining Technology for Weaponry and Equipment", 30 June 2017, National Defense Industry Press * |
Zhu Zhu: "Research on Bilingual Event Extraction Methods", China Master's Theses Full-text Database, Information Science and Technology * |
Wei Xiangqing et al.: "Thirty Years of Foreign-Language Dictionary Compilation and Publication in China: Review and Reflection", 28 February 2011, Shanghai Lexicographical Publishing House * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111522927A (en) * | 2020-04-15 | 2020-08-11 | 北京百度网讯科技有限公司 | Entity query method and device based on knowledge graph |
CN112307210A (en) * | 2020-11-06 | 2021-02-02 | 中冶赛迪工程技术股份有限公司 | Document tag prediction method, system, medium and electronic device |
CN112417153A (en) * | 2020-11-20 | 2021-02-26 | 虎博网络技术(上海)有限公司 | Text classification method and device, terminal equipment and readable storage medium |
CN112417153B (en) * | 2020-11-20 | 2023-07-04 | 虎博网络技术(上海)有限公司 | Text classification method, apparatus, terminal device and readable storage medium |
CN112836045A (en) * | 2020-12-25 | 2021-05-25 | 中科恒运股份有限公司 | Data processing method and device based on text data set and terminal equipment |
Also Published As
Publication number | Publication date |
---|---|
CN110196910B (en) | 2022-02-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106997382B (en) | Innovative creative tag automatic labeling method and system based on big data | |
CN110196910A (en) | A kind of method and device of corpus classification | |
Song et al. | A comparative study on text representation schemes in text categorization | |
CN104881458B (en) | A kind of mask method and device of Web page subject | |
CN108197109A (en) | A kind of multilingual analysis method and device based on natural language processing | |
Zagibalov et al. | Unsupervised classification of sentiment and objectivity in Chinese text | |
CN110134792B (en) | Text recognition method and device, electronic equipment and storage medium | |
Banik et al. | Evaluation of naïve bayes and support vector machines on bangla textual movie reviews | |
CN109558587B (en) | Method for classifying public opinion tendency recognition aiming at category distribution imbalance | |
Wu et al. | News filtering and summarization on the web | |
Abdelali et al. | Arabic dialect identification in the wild | |
CN105843796A (en) | Microblog emotional tendency analysis method and device | |
CN110888983B (en) | Positive and negative emotion analysis method, terminal equipment and storage medium | |
CN112395395A (en) | Text keyword extraction method, device, equipment and storage medium | |
CN108228612B (en) | Method and device for extracting network event keywords and emotional tendency | |
Patel et al. | Dynamic lexicon generation for natural scene images | |
CN109284389A (en) | A kind of information processing method of text data, device | |
CN110688540B (en) | Cheating account screening method, device, equipment and medium | |
KR101543680B1 (en) | Entity searching and opinion mining system of hybrid-based using internet and method thereof | |
Wang et al. | Multi‐label emotion recognition of weblog sentence based on Bayesian networks | |
CN112527963B (en) | Dictionary-based multi-label emotion classification method and device, equipment and storage medium | |
CN113486143A (en) | User portrait generation method based on multi-level text representation and model fusion | |
CN110688461B (en) | Online text education resource label generation method integrating multi-source knowledge | |
CN108427769B (en) | Character interest tag extraction method based on social network | |
CN107577667B (en) | Entity word processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20220728 Address after: No.16 and 17, unit 1, North District, Kailin center, No.51 Jinshui East Road, Zhengzhou area (Zhengdong), Henan pilot Free Trade Zone, Zhengzhou City, Henan Province, 450000 Patentee after: Zhengzhou Apas Technology Co.,Ltd. Address before: E301-27, building 1, No.1, hagongda Road, Tangjiawan Town, Zhuhai City, Guangdong Province Patentee before: ZHUHAI TIANYAN TECHNOLOGY Co.,Ltd. |