CN110196910A - Method and device for corpus classification - Google Patents

Method and device for corpus classification

Info

Publication number
CN110196910A
CN110196910A (application CN201910468030.8A)
Authority
CN
China
Prior art keywords
vector
candidate
translation
corpus
feature words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910468030.8A
Other languages
Chinese (zh)
Other versions
CN110196910B (en)
Inventor
孙健
周桐
李涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Apas Technology Co ltd
Original Assignee
Zhuhai Tianyan Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Tianyan Technology Co Ltd filed Critical Zhuhai Tianyan Technology Co Ltd
Priority to CN201910468030.8A priority Critical patent/CN110196910B/en
Publication of CN110196910A publication Critical patent/CN110196910A/en
Application granted granted Critical
Publication of CN110196910B publication Critical patent/CN110196910B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application provide a method and device for corpus classification, belonging to the field of data analysis. The method includes: extracting feature words from the text corpus of each given category; translating the feature words into a target language and forming a translation vector for each given category from the resulting translations and the vector features of the feature words; extracting candidate words from a candidate corpus to form a candidate vector, matching the candidate vector against the translation vector of each given category, and determining the target category of the candidate corpus according to the resulting matching degrees. By analyzing the keywords in text corpora of a known language and matching their translations against a candidate corpus in an unknown language, the application predicts the category of the unknown corpus, so that corpora can be classified even without a translator for the corresponding language, improving the efficiency of information processing.

Description

Method and device for corpus classification
Technical field
This application relates to the field of data processing, and in particular to a method and device for classifying corpora in unknown languages.
Background technique
With the explosive growth of information on the Internet, information now spreads through media across many countries. The vast majority of network data exists in the form of text. How to use natural language processing techniques to classify this text so that users can find useful information more accurately and quickly has become an important research problem in the field of artificial intelligence. At present, corpora such as web pages and news are mainly classified with machine learning models: manually labeled samples of each category are built and used to train a classification model, which is then applied to classify candidate corpora online.
However, in a multilingual environment, samples must be established for each language, each language must be labeled manually, and training rules must be constructed separately for every language. When there are many target languages, the construction cost becomes very high, which greatly reduces the efficiency of information processing.
Summary of the invention
The purpose of the embodiments of the present application is to provide a method and device for corpus classification, so as to meet the need to classify corpora in a multilingual environment.
To solve the above technical problem, the embodiments of the present application are implemented as follows.
According to a first aspect of the embodiments of the present application, a method of corpus classification is provided. The method includes:
extracting from the text corpus of each given category to obtain the feature words corresponding to the text corpus;
translating the feature words into a target language, and forming a translation vector for each given category from the resulting translations and the vector features corresponding to the feature words, the translation vector describing the characteristic attributes of the feature words of each given category in the target language;
extracting candidate words from a candidate corpus to form a candidate vector, and matching the candidate vector against the translation vector of each given category to obtain the matching degree between the candidate vector and the translation vector of each given category;
determining the target category of the candidate corpus according to the matching degree.
In one embodiment of the application, the method further includes:
extracting the weight probability corresponding to each feature word in the translation vector;
performing iterative training with the vector features corresponding to the feature words as sample features to obtain a language model;
using the language model, in place of the translation vector of each given category, to match against the candidate vector.
In one embodiment of the application, when extracting from the text corpus of each given category,
the text corpus is segmented and the keywords obtained after segmentation are counted;
near-synonyms or associated words of the keywords are looked up, and the vector features corresponding to the keywords are computed;
weights are set for the keywords according to the vector features, screening is performed according to the weights, and the feature words corresponding to the text corpus are obtained.
In one embodiment of the application, when forming the translation vector of each given category,
the vector features corresponding to the feature words are extracted, the vector features corresponding to the translations are obtained, and the translations are associated and combined with the feature words to form the translation vector.
In one embodiment of the application, when forming the translation vector of each given category,
if a feature word has more than one translation in the target language, each translation is associated and combined with the feature word, the weight in the vector features of the feature word is divided evenly among the translations, and multiple translation vectors are formed.
In one embodiment of the application, when extracting candidate words from the candidate corpus to form the candidate vector,
the candidate corpus is analyzed and the candidate words in it are extracted;
the characteristic attributes corresponding to the candidate words and the weights corresponding to those attributes are extracted to obtain the vector features corresponding to each candidate word;
the vector features corresponding to the candidate words are fitted to form the candidate vector corresponding to the candidate corpus.
In one embodiment of the application, when matching the candidate vector against the translation vector of each given category,
the vector features corresponding to the candidate words in the candidate vector are extracted;
the vector features corresponding to the candidate words are matched against the vector features corresponding to each translation vector;
the given categories whose matching degree exceeds a set threshold are screened out;
and the given categories exceeding the set threshold are taken as the target categories of the candidate corpus.
According to a second aspect of the embodiments of the present application, a device for corpus classification is provided. The device includes:
an extraction module, configured to extract from the text corpus of each given category to obtain the feature words corresponding to the text corpus;
a translation module, configured to translate the feature words into a target language and form a translation vector for each given category from the resulting translations and the vector features corresponding to the feature words, the translation vector describing the characteristic attributes of the feature words of each given category in the target language;
a matching module, configured to extract candidate words from a candidate corpus to form a candidate vector and match the candidate vector against the translation vector of each given category to obtain the matching degree between the candidate vector and the translation vector of each given category;
a division module, configured to determine the target category of the candidate corpus according to the matching degree.
In one embodiment of the application, the device further includes a model unit, which specifically includes:
an extraction unit, configured to extract the weight probability corresponding to each feature word in the translation vector;
a training unit, configured to perform iterative training with the vector features corresponding to the feature words as sample features to obtain the language model;
a matching unit, configured to use the language model, in place of the translation vector of each given category, to match against the candidate vector.
In one embodiment of the application, the extraction module specifically includes:
a segmentation unit, configured to segment the text corpus and count the keywords obtained after segmentation;
an association unit, configured to look up near-synonyms or associated words of the keywords and compute the vector features corresponding to the keywords;
a screening unit, configured to set weights for the keywords according to the vector features, perform screening according to the weights, and obtain the feature words corresponding to the text corpus.
In one embodiment of the application, the translation module specifically includes:
an association unit, configured to extract the vector features corresponding to the feature words, obtain the vector features corresponding to the translations, and associate and combine the translations with the feature words to form the translation vector.
In one embodiment of the application, in the translation module, when a feature word has more than one translation in the target language, each translation is associated and combined with the feature word, the weight in the vector features of the feature word is divided evenly, and multiple translation vectors are formed.
In one embodiment of the application, the matching module specifically includes:
an analysis unit, configured to analyze the candidate corpus and extract the candidate words in it;
a weighting unit, configured to extract the characteristic attributes corresponding to the candidate words and the weights corresponding to those attributes, and obtain the vector features corresponding to each candidate word;
a fitting unit, configured to fit the vector features corresponding to the candidate words to form the candidate vector corresponding to the candidate corpus.
As can be seen from the technical solutions provided by the above embodiments, the embodiments of the present application extract from the text corpus of each given category to obtain the corresponding feature words; translate the feature words into a target language and form a translation vector for each given category from the resulting translations and the vector features of the feature words; extract candidate words from a candidate corpus to form a candidate vector, match the candidate vector against the translation vector of each given category, and obtain the matching degree between the candidate vector and each translation vector; and determine the target category of the candidate corpus according to the matching degree. This scheme analyzes the keywords in text corpora of a known language and matches their translations against a candidate corpus in an unknown language to predict the category of the unknown corpus, so that corpora can be classified even without a translator for the corresponding language, improving the efficiency of information processing.
Detailed description of the invention
In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in this specification; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of the corpus classification method of one embodiment of the application;
Fig. 2 is a structural schematic diagram of the corpus classification device of one embodiment of the application.
Specific embodiment
In order to enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of the application are described clearly and completely below with reference to the drawings in the embodiments of the application. Obviously, the described embodiments are only a part of the embodiments of this specification, not all of them. Based on the embodiments in this specification, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the scope of protection of this specification.
The embodiments of the present application provide a method and device for corpus classification.
The corpus classification method provided by the embodiments of the present application is introduced first.
At present, the vast majority of network data exists in the form of text, and such text often belongs to different languages. When the prior art classifies corpora such as web pages and news, preset manually labeled samples are trained repeatedly; only after a large amount of manual processing and training yields a classification model can candidate corpora be classified. In a multilingual environment, however, labeling samples for each language separately is often impractical and reduces development efficiency. The present invention analyzes the keywords in the text corpora of a known language, trains a translation vector for each given category, analyzes the candidate corpus, extracts a candidate vector, and matches it against the translation vector of each category, thereby determining the target category of the candidate corpus. Corpora can thus be classified even without a translator for the corresponding language, improving the efficiency of information processing.
Fig. 1 is a flowchart of the corpus classification method of one embodiment of the application. As shown in Fig. 1, the method may include the following steps.
In step 101, extraction is performed from the text corpus of each given category to obtain the feature words corresponding to the text corpus.
In the present embodiment, extraction from the text corpus of each given category proceeds as follows.
Step 101a: the text corpus is segmented, and the keywords obtained after segmentation are counted.
The full text of the text corpus is semantically analyzed and segmented. After segmentation, the content words in the result, especially nouns and verbs, are counted together with their word frequency and their position in the text corpus. Content words whose word frequency exceeds a set threshold, and/or which appear in key positions such as the title, the first paragraph, or the last paragraph, are taken as the keywords of the text corpus.
Step 101b: near-synonyms or associated words of the keywords are looked up, and the vector features corresponding to the keywords are computed.
Synonyms or near-synonyms of these keywords are looked up, and the vector features of the keywords are computed. In the present embodiment, characteristic attributes such as the word frequency of each keyword (including its synonyms and/or near-synonyms), its length, part of speech, position mark, whether it appears at the beginning, and whether it is in bold are extracted, and these characteristic attributes are used as the vector features of the keyword.
Step 101c: weights are set for the keywords according to the vector features, screening is performed according to the weights, and the feature words corresponding to the text corpus are obtained.
Since there are too many keywords in the text corpus, they must be screened by weight so that the feature words that represent the characteristics of the given category are extracted from the keywords, i.e. so that the extracted feature words accurately represent the category.
Specifically, a weight is set for each vector attribute of a keyword, the attributes of each keyword are weighted and summed, and screening is performed on the resulting values. In this implementation, the keywords are screened against a set weight threshold: the keywords whose weighted sum exceeds the weight threshold are taken as feature words.
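For illustration, a minimal sketch of steps 101a to 101c is given below. It assumes a generic segment(text) helper that returns (word, part-of-speech) pairs (any segmenter, such as a Chinese word segmenter, could fill that role); the attribute set, coefficients, and thresholds are illustrative values, not those fixed by the patent.

```python
from collections import Counter

def extract_feature_words(corpus_docs, segment, freq_threshold=3,
                          weight_threshold=1.5, w_freq=1.0, w_title=1.0):
    """Steps 101a-101c: segment, collect keyword attributes, screen by weighted sum.
    corpus_docs: list of {"title": str, "body": str} for one given category."""
    feature_words = set()
    for doc in corpus_docs:
        tokens = segment(doc["body"])                       # [(word, pos), ...]
        content = [(w, p) for w, p in tokens if p.startswith(("n", "v"))]
        freq = Counter(w for w, _ in content)
        title_words = {w for w, _ in segment(doc["title"])}
        for word, count in freq.items():
            # step 101a: keep content words by frequency or key position
            if count < freq_threshold and word not in title_words:
                continue
            # steps 101b/101c: weighted sum of simple attributes, then screen
            score = (w_freq * count
                     + w_title * (1.0 if word in title_words else 0.0)
                     + 0.1 * len(word))
            if score > weight_threshold:
                feature_words.add(word)
    return feature_words
```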
Step 102: the feature words are translated into the target language, and a translation vector is formed for each given category from the resulting translations and the vector features corresponding to the feature words. The translation vector describes the characteristic attributes of the feature words of each given category in the target language.
A preset translation lexicon is called to translate each feature word into the target language and obtain its translation. Characteristic attributes of the translation such as word frequency, length, part of speech, position mark, whether it appears at the beginning, and whether it is in bold are extracted and combined with the corresponding weights to obtain the vector features of the translation, which are then combined with the vector features of the untranslated feature words to form the translation vector.
In the present embodiment, the translation vector contains two parts: one part is the feature words in the text of the given category, which represent the characteristics of the category; the other part is the translations of those feature words, which represent the language attributes of the category's characteristics. Normally, since a translation is obtained by translating a feature word, the vector features of each feature word equal the vector features of its translation. In particular, if a feature word has more than one translation, the feature word and each translation form a separate translation vector; in that case the weights in the vector features of the translations are divided evenly and combined with the vector features of the feature word to form multiple translation vectors, and each weight in a translation vector is the average of the keyword weight and the corresponding translation weight.
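The following sketch shows how such a translation vector could be assembled. It assumes a hypothetical translate(word, lang) lookup into a preset translation lexicon, and it collapses the multi-translation case into a single weighted vector for brevity, whereas the embodiment above forms one translation vector per translation variant.

```python
def build_translation_vector(feature_weights, translate, target_lang):
    """feature_weights: {feature_word: weight} for one given category.
    Returns {translated_term: weight}; multiple translations share the weight evenly."""
    vector = {}
    for word, weight in feature_weights.items():
        translations = translate(word, target_lang) or [word]
        share = weight / len(translations)          # divide the weight among variants
        for t in translations:
            # average of the feature-word weight and its (shared) translation weight
            vector[t] = vector.get(t, 0.0) + (weight + share) / 2
    return vector
```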
In other embodiments, the vector features in the translation vector are trained together with the vector features of the feature words to obtain a language model, and this model structure replaces the translation vector of each given category. Specifically:
Step 102a: the weight probability corresponding to each feature word in the translation vector is extracted.
For the translation vector of each given category, the vector features of the feature words and translations in the translation vector are extracted, and the weight parameters in the vector features are normalized into weight probabilities.
Step 102b: iterative training is performed with the vector features corresponding to the feature words as sample features to obtain the language model.
After the vector features of the feature words of each given category are likewise normalized, an SVM (support vector machine) is used to train on the vector features of the feature words of each given category and the vector features of the translation vector. The difference between the vector features of each feature word in the text corpus of the category and the vector features of the translation vector is taken as a positive sample; the difference between the vector features of the non-feature words in the text corpus of the category and the vector features of the translation vector is taken as a negative sample. Iterative training is performed on the sample features of the text corpus of each given category to obtain the language model, which divides translations in that language into categories and judges the probability that a translation belongs to a given category.
Step 102c: the language model is used, in place of the translation vector of each given category, to match against the candidate vector.
In a subsequent step, candidate words are extracted from the candidate corpus to form a candidate vector, the candidate vector is matched in the language model, the probability that the candidate vector belongs to each given category is obtained, and the category with the highest probability is taken as the target category of the candidate corpus.
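A minimal sketch of this SVM-based language model follows. It assumes fixed-length, already-normalized attribute vectors stored as NumPy arrays, and uses scikit-learn's SVC as a stand-in for the training procedure; these library choices are assumptions, the patent only specifies an SVM trained on positive and negative difference samples.

```python
import numpy as np
from sklearn.svm import SVC

def train_language_model(feature_vecs, non_feature_vecs, translation_vec):
    """feature_vecs / non_feature_vecs: (n, d) arrays of per-word attribute vectors;
    translation_vec: (d,) attribute vector of the category's translation vector."""
    pos = feature_vecs - translation_vec          # differences as positive samples
    neg = non_feature_vecs - translation_vec      # differences as negative samples
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    return SVC(kernel="rbf", probability=True).fit(X, y)

def category_probability(model, candidate_vecs, translation_vec):
    """Step 102c: probability that the candidate words belong to this category."""
    diffs = candidate_vecs - translation_vec
    return model.predict_proba(diffs)[:, 1].mean()
```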
Step 103: candidate words are extracted from the candidate corpus to form a candidate vector, the candidate vector is matched against the translation vector of each given category, and the matching degree between the candidate vector and the translation vector of each given category is obtained.
Converting the candidate corpus into a candidate vector includes the following steps.
Step 103a: the candidate corpus is analyzed and the candidate words in it are extracted.
The candidate corpus in the target language is segmented and the segmentation result is screened: the content words in the result, especially nouns and verbs, are counted together with their word frequency and their position in the text. Content words whose word frequency exceeds a set threshold, and/or which appear in key positions such as the title, the first paragraph, or the last paragraph, are taken as the candidate words of the candidate corpus.
Step 103b: the characteristic attributes corresponding to the candidate words and the weights corresponding to those attributes are extracted, and the vector features corresponding to each candidate word are obtained.
In the present embodiment, characteristic attributes of these candidate words such as word frequency (including synonyms and/or near-synonyms), length, part of speech, position mark, whether they appear at the beginning, and whether they are in bold are extracted and associated with the corresponding weights to obtain the vector features of each candidate word.
Step 103c: the vector features corresponding to the candidate words are fitted to form the candidate vector corresponding to the candidate corpus.
In the present embodiment, the vector features of the candidate words are normalized to form the candidate vector of the candidate corpus.
When matching the candidate vector against the translation vector of each given category, the vector features of the candidate words in the candidate vector are extracted and matched against the translation vector of each given category. If the matching degree between the candidate vector and a translation vector is high, the candidate vector shares more similar feature words with that translation vector and the characteristic attributes of those feature words are also more similar, i.e. the probability that the candidate corpus and the text corpus of that given category belong to the same category is high. Conversely, if the matching degree between the candidate vector and a translation vector is low, the feature words of the candidate vector differ greatly from those of the translation vector, i.e. the probability that the candidate corpus and the text corpus of that given category belong to the same category is low.
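The sketch below illustrates the matching and the subsequent threshold selection of step 104, with both the candidate vector and each category's translation vector represented as {term: weight} dictionaries. Cosine similarity and the threshold value are illustrative choices; the patent only speaks of a "matching degree" without fixing a formula.

```python
import math

def cosine_match(candidate_vec, translation_vec):
    shared = set(candidate_vec) & set(translation_vec)
    dot = sum(candidate_vec[t] * translation_vec[t] for t in shared)
    norm_c = math.sqrt(sum(v * v for v in candidate_vec.values()))
    norm_t = math.sqrt(sum(v * v for v in translation_vec.values()))
    return dot / (norm_c * norm_t) if norm_c and norm_t else 0.0

def classify(candidate_vec, translation_vectors, threshold=0.2):
    """translation_vectors: {category: {term: weight}}. Returns the categories
    whose matching degree exceeds the set threshold, plus all scores."""
    scores = {cat: cosine_match(candidate_vec, vec)
              for cat, vec in translation_vectors.items()}
    return [cat for cat, s in scores.items() if s > threshold], scores
```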
In other embodiments, the candidate vector corresponding to the candidate corpus is matched in the trained language model, and the matching score between the candidate vector and each translation vector is judged so that a selection can be made in a subsequent step.
Step 104: the target category of the candidate corpus is determined according to the matching degree.
The given categories whose matching degree exceeds a set threshold after matching are chosen as target categories, and the candidate corpus belongs to those target categories.
In other embodiments, if the matching degree between the candidate vector and the translation vectors of more than one given category exceeds the set threshold, the candidate corpus belongs to more than one given category.
This scheme analyzes the keywords in text corpora of a known language and matches their translations against a candidate corpus in an unknown language to predict the category of the unknown corpus, so that corpora can be classified even without a translator for the corresponding language, improving the efficiency of information processing.
In another alternative embodiment, corpus classification proceeds as follows.
Step 201: extraction is performed from the text corpus of each given category to obtain the feature words corresponding to the text corpus.
From the document set corresponding to the text corpus of a given category, each content word together with its synonyms, prefix-related words, common-substring-related words, and semantically related words is obtained, and the corresponding set is denoted S.
Step 202: the feature words are translated into the target language, and a translation vector is formed for each given category from the resulting translations and the vector features corresponding to the feature words; the translation vector describes the characteristic attributes of the feature words of each given category in the target language.
In the present embodiment, the set S is translated into the target language, and the set of translations is denoted D. The keywords in S that match the text corpus of the given category are extracted as feature words. The feature words are trained with a topic model or word embeddings to generate the characteristic attributes of the feature words; the characteristic attributes of the translations of the feature words are generated in the same way.
When computing the weight factor of the characteristic attributes of a feature word, TF/IDF, a page aggregation value, and a semantic aggregation value are combined:
1) TF/IDF: the TF/IDF value of the feature word is computed and denoted g1.
TF/IDF is the product of the term frequency and the inverse document frequency of the feature word, where TF is the term frequency of the feature word and IDF is the reciprocal of the number of documents in the document set in which the feature word appears.
2) Page aggregation value: the number of other feature words appearing within a sentence-level sliding window of size M (a positive integer), denoted g2.
3) Semantic aggregation value: the number of other feature words appearing within an N-neighborhood of the vector space, denoted g3.
The weight of the feature word is then G = a1*g1 + a2*g2 + a3*g3, where a1, a2, a3 are given coefficients.
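As shown in the sketch below, g1, g2, g3, and G can be computed directly from tokenized documents, sentences, and word embeddings. The window size, neighborhood radius, and coefficients are illustrative assumptions; only the combination G = a1*g1 + a2*g2 + a3*g3 is taken from the description above.

```python
import math

def feature_word_weight(word, docs, sentences, embeddings, feature_words,
                        M=5, N=0.5, a=(1.0, 0.5, 0.5)):
    """docs, sentences: lists of token lists; embeddings: {word: vector};
    feature_words: set of feature words of the given category."""
    # g1: term frequency times the reciprocal of the document frequency
    tf = sum(doc.count(word) for doc in docs)
    df = sum(1 for doc in docs if word in doc)
    g1 = tf * (1.0 / df) if df else 0.0
    # g2: other feature words inside a sentence-level sliding window of size M
    g2 = 0
    for sent in sentences:
        if word in sent:
            i = sent.index(word)
            window = sent[max(0, i - M): i + M + 1]
            g2 += sum(1 for w in window if w != word and w in feature_words)
    # g3: other feature words within an N-neighborhood of the embedding space
    g3 = sum(1 for w in feature_words
             if w != word and math.dist(embeddings[word], embeddings[w]) < N)
    a1, a2, a3 = a
    return a1 * g1 + a2 * g2 + a3 * g3
```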
Similarly, when computing the weight factor of the characteristic attributes of the translation of a feature word, TF/IDF, the page aggregation value, and the semantic aggregation value are combined:
1) TF/IDF: the TF/IDF value of the translation of the feature word is computed and denoted h1.
TF/IDF is the product of the term frequency and the inverse document frequency of the translation.
2) Page aggregation value: the number of translations of other feature words appearing within a sentence-level sliding window of size M, denoted h2.
3) Semantic aggregation value: the number of translations of other feature words appearing within an N-neighborhood of the vector space, denoted h3.
The weight of the translation of the feature word is H = b1*h1 + b2*h2 + b3*h3, where b1, b2, b3 are given coefficients.
So in conclusion calculating translation vector corresponding to the corpus of text shelves of given class are as follows:
Wherein, VdocIt is the corresponding translation vector of corpus of text doc of given class, n is the feature in corpus of text doc Word quantity, i=1,2,3 ..., n, VwiIt is characterized word wiCharacteristic attribute corresponding with Feature Words translation, GwiIt is Feature Words wiIt is corresponding Weight, HwiIt is characterized word wiThe weight of corresponding translation.
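The formula itself is not reproduced in the text; one plausible form consistent with these symbol definitions, stated here as an assumption rather than the exact published equation, is a weighted sum over the feature words:
V_doc = Σ_{i=1}^{n} (G_wi + H_wi) * V_wi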
Further, from the text corpus of the given category, a certain number of content words (verbs, nouns, etc.) are chosen as label words and the feature vectors of the positive samples are computed; a certain number of other words (function words, interjections, etc.) are chosen as non-label words and the feature vectors of the negative samples are computed. The feature vectors of the positive samples and of the negative samples are each subtracted from the translation vector and then normalized, and a regression model is trained on the result.
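A sketch of this regression-model variant is given below. The positive and negative samples are differenced against the category's translation vector and normalized as described; the choice of logistic regression and of scikit-learn is an assumption, since the patent only says "regression model".

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize

def train_category_regressor(label_word_vecs, non_label_word_vecs, translation_vec):
    """label_word_vecs / non_label_word_vecs: (n, d) arrays of word feature vectors;
    translation_vec: (d,) translation vector of the given category."""
    pos = normalize(label_word_vecs - translation_vec)        # positive samples
    neg = normalize(non_label_word_vecs - translation_vec)    # negative samples
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    return LogisticRegression(max_iter=1000).fit(X, y)

def match_degree(model, candidate_vecs, translation_vec):
    """Average probability that the candidate words fit this category (step 203)."""
    diffs = normalize(candidate_vecs - translation_vec)
    return model.predict_proba(diffs)[:, 1].mean()
```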
Step 203: candidate words are extracted from the candidate corpus to form a candidate vector, the candidate vector is matched against the translation vector of each given category, and the matching degree between the candidate vector and the translation vector of each given category is obtained.
As in step 202, in the present embodiment the candidate corpus is segmented, the keywords in the segmentation result that match the set D are extracted as candidate words, and the original (untranslated) term corresponding to each candidate word is obtained, after which the candidate vector is formed.
As above, the weight of each candidate word and of its original term in the candidate corpus is computed, and the candidate vector of the candidate corpus is formed.
The candidate vector is analyzed against the regression model obtained in step 202 to obtain the matching degree between the candidate vector and the translation vector of each given category.
Step 204: the target category of the candidate corpus is determined according to the matching degree.
The present invention analyzes the keywords in text corpora of a known language, trains a translation vector for each given category, analyzes the candidate corpus, extracts a candidate vector, and matches it against the translation vector of each category, thereby determining the target category of the candidate corpus. Corpora can thus be classified even without a translator for the corresponding language, improving the efficiency of information processing.
Fig. 2 is a structural schematic diagram of the corpus classification device of one embodiment of the application. Referring to Fig. 2, in one software implementation the corpus classification device 800 shown in the figure may include an extraction module 801, a translation module 802, a matching module 803, and a division module 804, where:
the extraction module 801 is configured to extract from the text corpus of each given category to obtain the feature words corresponding to the text corpus;
the translation module 802 is configured to translate the feature words into a target language and form a translation vector for each given category from the resulting translations and the vector features corresponding to the feature words, the translation vector describing the characteristic attributes of the feature words of each given category in the target language;
the matching module 803 is configured to extract candidate words from a candidate corpus to form a candidate vector and match the candidate vector against the translation vector of each given category to obtain the matching degree between the candidate vector and the translation vector of each given category;
the division module 804 is configured to determine the target category of the candidate corpus according to the matching degree.
The extraction module 801 includes:
a segmentation unit, configured to segment the text corpus and count the keywords obtained after segmentation;
an association unit, configured to look up near-synonyms or associated words of the keywords and compute the vector features corresponding to the keywords;
a screening unit, configured to set weights for the keywords according to the vector features, perform screening according to the weights, and obtain the feature words corresponding to the text corpus.
The corpus classification device 800 further includes a model unit, which specifically includes:
an extraction unit, configured to extract the weight probability corresponding to each feature word in the translation vector;
a training unit, configured to perform iterative training with the vector features corresponding to the feature words as sample features to obtain the language model;
a matching unit, configured to use the language model, in place of the translation vector of each given category, to match against the candidate vector.
The translation module 802 specifically includes:
an association unit, configured to extract the vector features corresponding to the feature words, obtain the vector features corresponding to the translations, and associate and combine the translations with the feature words to form the translation vector.
In the translation module 802, when a feature word has more than one translation in the target language, each translation is associated and combined with the feature word, the weight in the vector features of the feature word is divided evenly, and multiple translation vectors are formed.
The matching module 803 specifically includes:
an analysis unit, configured to analyze the candidate corpus and extract the candidate words in it;
a weighting unit, configured to extract the characteristic attributes corresponding to the candidate words and the weights corresponding to those attributes, and obtain the vector features corresponding to each candidate word;
a fitting unit, configured to fit the vector features corresponding to the candidate words to form the candidate vector corresponding to the candidate corpus.
In short, the above are only preferred embodiments of this specification and are not intended to limit its scope of protection. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this specification shall be included in its scope of protection.
The system, device, module, or unit described in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementing device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
All the embodiments in this specification are described in a progressive manner; the same or similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the others. In particular, since the system embodiment is basically similar to the method embodiment, its description is relatively simple, and reference may be made to the description of the method embodiment for the relevant parts.

Claims (13)

1. A method of corpus classification, characterized in that the method includes:
extracting from the text corpus of each given category to obtain the feature words corresponding to the text corpus;
translating the feature words into a target language, and forming a translation vector for each given category from the resulting translations and the vector features corresponding to the feature words, the translation vector describing the characteristic attributes of the feature words of each given category in the target language;
extracting candidate words from a candidate corpus to form a candidate vector, and matching the candidate vector against the translation vector of each given category to obtain the matching degree between the candidate vector and the translation vector of each given category;
determining the target category of the candidate corpus according to the matching degree.
2. The method according to claim 1, characterized in that, when extracting from the text corpus of each given category:
the text corpus is segmented and the keywords obtained after segmentation are counted;
near-synonyms or associated words of the keywords are looked up, and the vector features corresponding to the keywords are computed;
weights are set for the keywords according to the vector features, screening is performed according to the weights, and the feature words corresponding to the text corpus are obtained.
3. The method according to claim 1 or 2, characterized in that the method further includes:
extracting the weight probability corresponding to each feature word in the translation vector;
performing iterative training with the vector features corresponding to the feature words as sample features to obtain a language model;
using the language model, in place of the translation vector of each given category, to match against the candidate vector.
4. The method according to claim 1, characterized in that, when forming the translation vector of each given category:
the vector features corresponding to the feature words are extracted, the vector features corresponding to the translations are obtained, and the translations are associated and combined with the feature words to form the translation vector.
5. The method according to claim 4, characterized in that, when forming the translation vector of each given category:
if a feature word has more than one translation in the target language, each translation is associated and combined with the feature word, the weight in the vector features of the feature word is divided evenly, and multiple translation vectors are formed.
6. The method according to claim 1, characterized in that, when extracting candidate words from the candidate corpus to form the candidate vector:
the candidate corpus is analyzed and the candidate words in it are extracted;
the characteristic attributes corresponding to the candidate words and the weights corresponding to those attributes are extracted to obtain the vector features corresponding to each candidate word;
the vector features corresponding to the candidate words are fitted to form the candidate vector corresponding to the candidate corpus.
7. The method according to claim 1, characterized in that, when matching the candidate vector against the translation vector of each given category:
the vector features corresponding to the candidate words in the candidate vector are extracted;
the vector features corresponding to the candidate words are matched against the vector features corresponding to each translation vector;
the given categories whose matching degree exceeds a set threshold are screened out;
and the given categories exceeding the set threshold are taken as the target categories of the candidate corpus.
8. A device for corpus classification, characterized in that the device includes:
an extraction module, configured to extract from the text corpus of each given category to obtain the feature words corresponding to the text corpus;
a translation module, configured to translate the feature words into a target language and form a translation vector for each given category from the resulting translations and the vector features corresponding to the feature words, the translation vector describing the characteristic attributes of the feature words of each given category in the target language;
a matching module, configured to extract candidate words from a candidate corpus to form a candidate vector and match the candidate vector against the translation vector of each given category to obtain the matching degree between the candidate vector and the translation vector of each given category;
a division module, configured to determine the target category of the candidate corpus according to the matching degree.
9. The device according to claim 8, characterized in that the extraction module specifically includes:
a segmentation unit, configured to segment the text corpus and count the keywords obtained after segmentation;
an association unit, configured to look up near-synonyms or associated words of the keywords and compute the vector features corresponding to the keywords;
a screening unit, configured to set weights for the keywords according to the vector features, perform screening according to the weights, and obtain the feature words corresponding to the text corpus.
10. The device according to claim 8 or 9, characterized in that the device further includes a model unit, which specifically includes:
an extraction unit, configured to extract the weight probability corresponding to each feature word in the translation vector;
a training unit, configured to perform iterative training with the vector features corresponding to the feature words as sample features to obtain the language model;
a matching unit, configured to use the language model, in place of the translation vector of each given category, to match against the candidate vector.
11. The device according to claim 8, characterized in that the translation module specifically includes:
an association unit, configured to extract the vector features corresponding to the feature words, obtain the vector features corresponding to the translations, and associate and combine the translations with the feature words to form the translation vector.
12. The device according to claim 11, characterized in that, in the translation module, when a feature word has more than one translation in the target language, each translation is associated and combined with the feature word, the weight in the vector features of the feature word is divided evenly, and multiple translation vectors are formed.
13. The device according to claim 8, characterized in that the matching module specifically includes:
an analysis unit, configured to analyze the candidate corpus and extract the candidate words in it;
a weighting unit, configured to extract the characteristic attributes corresponding to the candidate words and the weights corresponding to those attributes, and obtain the vector features corresponding to each candidate word;
a fitting unit, configured to fit the vector features corresponding to the candidate words to form the candidate vector corresponding to the candidate corpus.
CN201910468030.8A 2019-05-30 2019-05-30 Corpus classification method and apparatus Active CN110196910B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910468030.8A CN110196910B (en) 2019-05-30 2019-05-30 Corpus classification method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910468030.8A CN110196910B (en) 2019-05-30 2019-05-30 Corpus classification method and apparatus

Publications (2)

Publication Number Publication Date
CN110196910A true CN110196910A (en) 2019-09-03
CN110196910B CN110196910B (en) 2022-02-15

Family

ID=67753486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910468030.8A Active CN110196910B (en) 2019-05-30 2019-05-30 Corpus classification method and apparatus

Country Status (1)

Country Link
CN (1) CN110196910B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522927A (en) * 2020-04-15 2020-08-11 北京百度网讯科技有限公司 Entity query method and device based on knowledge graph
CN112307210A (en) * 2020-11-06 2021-02-02 中冶赛迪工程技术股份有限公司 Document tag prediction method, system, medium and electronic device
CN112417153A (en) * 2020-11-20 2021-02-26 虎博网络技术(上海)有限公司 Text classification method and device, terminal equipment and readable storage medium
CN112836045A (en) * 2020-12-25 2021-05-25 中科恒运股份有限公司 Data processing method and device based on text data set and terminal equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101667177A (en) * 2009-09-23 2010-03-10 清华大学 Method and device for aligning bilingual text
WO2011100862A1 (en) * 2010-02-22 2011-08-25 Yahoo! Inc. Bootstrapping text classifiers by language adaptation
CN103902619A (en) * 2012-12-28 2014-07-02 中国移动通信集团公司 Internet public opinion monitoring method and system
US20180165278A1 (en) * 2016-12-12 2018-06-14 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for translating based on artificial intelligence
CN108460396A (en) * 2017-09-20 2018-08-28 腾讯科技(深圳)有限公司 The negative method of sampling and device
CN108510977A (en) * 2018-03-21 2018-09-07 清华大学 Language Identification and computer equipment
CN108536756A (en) * 2018-03-16 2018-09-14 苏州大学 Mood sorting technique and system based on bilingual information

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101667177A (en) * 2009-09-23 2010-03-10 清华大学 Method and device for aligning bilingual text
WO2011100862A1 (en) * 2010-02-22 2011-08-25 Yahoo! Inc. Bootstrapping text classifiers by language adaptation
CN103902619A (en) * 2012-12-28 2014-07-02 中国移动通信集团公司 Internet public opinion monitoring method and system
US20180165278A1 (en) * 2016-12-12 2018-06-14 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for translating based on artificial intelligence
CN108460396A (en) * 2017-09-20 2018-08-28 腾讯科技(深圳)有限公司 The negative method of sampling and device
CN108536756A (en) * 2018-03-16 2018-09-14 苏州大学 Mood sorting technique and system based on bilingual information
CN108510977A (en) * 2018-03-21 2018-09-07 清华大学 Language Identification and computer equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
张凤鸣等: "《武器装备数据挖掘技术》", 30 June 2017, 国防工业出版社 *
朱珠: "基于双语的事件抽取方法研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
魏向清 等: "《中国外语类辞书编纂出版30年回顾与反思》", 28 February 2011, 上海辞书出版社 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522927A (en) * 2020-04-15 2020-08-11 北京百度网讯科技有限公司 Entity query method and device based on knowledge graph
CN112307210A (en) * 2020-11-06 2021-02-02 中冶赛迪工程技术股份有限公司 Document tag prediction method, system, medium and electronic device
CN112417153A (en) * 2020-11-20 2021-02-26 虎博网络技术(上海)有限公司 Text classification method and device, terminal equipment and readable storage medium
CN112417153B (en) * 2020-11-20 2023-07-04 虎博网络技术(上海)有限公司 Text classification method, apparatus, terminal device and readable storage medium
CN112836045A (en) * 2020-12-25 2021-05-25 中科恒运股份有限公司 Data processing method and device based on text data set and terminal equipment

Also Published As

Publication number Publication date
CN110196910B (en) 2022-02-15

Similar Documents

Publication Publication Date Title
CN106997382B (en) Innovative creative tag automatic labeling method and system based on big data
CN110196910A (en) A kind of method and device of corpus classification
Song et al. A comparative study on text representation schemes in text categorization
CN104881458B (en) A kind of mask method and device of Web page subject
CN108197109A (en) A kind of multilingual analysis method and device based on natural language processing
Zagibalov et al. Unsupervised classification of sentiment and objectivity in Chinese text
CN110134792B (en) Text recognition method and device, electronic equipment and storage medium
Banik et al. Evaluation of naïve bayes and support vector machines on bangla textual movie reviews
CN109558587B (en) Method for classifying public opinion tendency recognition aiming at category distribution imbalance
Wu et al. News filtering and summarization on the web
Abdelali et al. Arabic dialect identification in the wild
CN105843796A (en) Microblog emotional tendency analysis method and device
CN110888983B (en) Positive and negative emotion analysis method, terminal equipment and storage medium
CN112395395A (en) Text keyword extraction method, device, equipment and storage medium
CN108228612B (en) Method and device for extracting network event keywords and emotional tendency
Patel et al. Dynamic lexicon generation for natural scene images
CN109284389A (en) A kind of information processing method of text data, device
CN110688540B (en) Cheating account screening method, device, equipment and medium
KR101543680B1 (en) Entity searching and opinion mining system of hybrid-based using internet and method thereof
Wang et al. Multi‐label emotion recognition of weblog sentence based on Bayesian networks
CN112527963B (en) Dictionary-based multi-label emotion classification method and device, equipment and storage medium
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN110688461B (en) Online text education resource label generation method integrating multi-source knowledge
CN108427769B (en) Character interest tag extraction method based on social network
CN107577667B (en) Entity word processing method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220728

Address after: No.16 and 17, unit 1, North District, Kailin center, No.51 Jinshui East Road, Zhengzhou area (Zhengdong), Henan pilot Free Trade Zone, Zhengzhou City, Henan Province, 450000

Patentee after: Zhengzhou Apas Technology Co.,Ltd.

Address before: E301-27, building 1, No.1, hagongda Road, Tangjiawan Town, Zhuhai City, Guangdong Province

Patentee before: ZHUHAI TIANYAN TECHNOLOGY Co.,Ltd.