CN108319682B - Method, device, equipment and medium for correcting classifier and constructing classification corpus - Google Patents


Info

Publication number
CN108319682B
CN108319682B (granted publication of application CN201810097359.3A)
Authority
CN
China
Prior art keywords
text
classifier
category
corrected
vector
Prior art date
Legal status
Active
Application number
CN201810097359.3A
Other languages
Chinese (zh)
Other versions
CN108319682A (en
Inventor
张忠辉
鲁彬
李堪兵
Current Assignee
Tianwen Digital Media Technology Beijing Co ltd
Original Assignee
Tianwen Digital Media Technology Beijing Co ltd
Priority date
Filing date
Publication date
Application filed by Tianwen Digital Media Technology Beijing Co ltd filed Critical Tianwen Digital Media Technology Beijing Co ltd
Priority to CN201810097359.3A
Publication of CN108319682A
Application granted
Publication of CN108319682B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, apparatus, device and medium for correcting a classifier and constructing a classification corpus. The classifier correction method comprises the following steps: obtaining category center vectors corresponding to each of at least two text categories of the classifier; obtaining a corrected text of a set text category and the text feature vector of the corrected text; correcting the category center vector of each text category in the classifier according to the similarity between the text feature vector and the current category center vector of each text category, and according to the text category of the corrected text; and returning to the operation of obtaining a corrected text and its text feature vector until the correction-finishing condition is met, thereby obtaining the corrected classifier. The method gives misclassified texts a larger influence on the category center vectors and reduces the error rate of text classification.

Description

Method, device, equipment and medium for correcting classifier and constructing classification corpus
Technical Field
The embodiment of the invention relates to the field of text classification, in particular to a method, a device, equipment and a medium for correcting a classifier and constructing a classification corpus.
Background
With the development of electronic technology and the popularization of the internet, people's reading habits are quietly changing: the traditional mode of reading print media is gradually giving way to digital reading. Electronic news therefore occupies an increasingly important position in the news field.
Automatic text classification of electronic news, i.e., dividing electronic news into categories such as politics, economics, military, entertainment and sports according to the news topic, helps readers filter news of interest. It is also of great practical significance for news topic-selection systems and public opinion monitoring.
At present there are many classification algorithms to choose from. In the commonly used center-vector method for text classification, every text has the same influence on the center vector of its text category; that is, correctly and incorrectly classified samples influence the center vector equally. As a result, correct samples may be under-utilized and incorrect samples over-utilized, which increases the error rate of text classification.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a medium for correcting a classifier and constructing a classified corpus, so as to optimize a classifier algorithm and a construction method of a text classified corpus in the prior art and reduce the error rate of text classification.
In a first aspect, an embodiment of the present invention provides a method for correcting a text classifier, including:
acquiring category central vectors respectively corresponding to at least two text categories of a classifier, wherein the category central vectors are obtained by calculation according to at least two category texts corresponding to the text categories;
acquiring a corrected text of a set text type and a text feature vector of the corrected text;
according to the similarity between the text feature vector and the category center vector of each current text category of the classifier and the text category of the corrected text, correcting the category center vector corresponding to each text category in the classifier;
and returning to execute the operation of obtaining a corrected text of a set text type and the text feature vector of the corrected text until the condition of finishing correction is met, and obtaining the corrected classifier.
In a second aspect, an embodiment of the present invention further provides a method for constructing a classified corpus, including:
pre-classifying at least two texts according to seed vocabularies corresponding to at least two text categories of a pre-specified set field, and constructing an initial classification corpus;
training to obtain an initial classifier serving as a classifier to be corrected according to seed vocabularies corresponding to at least two text categories of the set field;
using the currently stored text in the initial classification corpus as a corrected text, and correcting the current classifier to be corrected by adopting the correction method of the text classifier in any embodiment of the invention to obtain a text classifier;
classifying the texts in the initial classification corpus by using the text classifier, and deleting from the initial classification corpus the texts whose classification results from the text classifier are inconsistent with the pre-classification results;
and after taking the text classifier as a new classifier to be corrected, returning to the operation of using the texts in the initial classification corpus as corrected texts and correcting the current classifier to be corrected by the text classifier correction method of any embodiment of the invention to obtain a text classifier, until a preset text denoising condition is met, and using the current initial classification corpus as the classification corpus of the set field.
In a third aspect, an embodiment of the present invention further provides a correction apparatus for a text classifier, including:
the classification center vector acquisition module is used for acquiring classification center vectors respectively corresponding to at least two text classifications of the classifier, and the classification center vectors are obtained by calculation according to at least two classification texts corresponding to the text classifications;
the correction text acquisition module is used for acquiring a correction text of a set text type and a text feature vector of the correction text;
a classifier correction module, configured to correct a class center vector corresponding to each of the text classes in the classifier according to a similarity between the text feature vector and a class center vector of each of the text classes of the classifier and a text class of the corrected text;
and the circular operation module is used for returning and executing the operation of obtaining the corrected text of a set text type and the text characteristic vector of the corrected text until the condition of finishing correction is met, so as to obtain the corrected classifier.
In a fourth aspect, an embodiment of the present invention further provides a device for constructing a classified corpus, including:
the initial classification corpus establishing module is used for pre-classifying at least two texts according to seed vocabularies corresponding to at least two text categories of a pre-specified set field and establishing an initial classification corpus;
the to-be-corrected classifier training module is used for training to obtain an initial classifier serving as the to-be-corrected classifier according to seed vocabularies corresponding to at least two text categories in the set field;
a text classifier generating module, configured to use a currently stored text in the initial classification corpus as a corrected text, and correct the current classifier to be corrected by using the correction method of the text classifier according to any embodiment of the present invention, so as to obtain a text classifier;
the initial classification corpus updating module is used for classifying the texts in the initial classification corpus by using the text classifier and deleting the texts with the classification results of the text classifier inconsistent with the pre-classification results from the initial classification corpus;
and the circular operation module is used for returning to the operation of using the texts in the initial classification corpus as corrected texts after the text classifier is taken as a new classifier to be corrected, until the preset text denoising condition is met, and for using the current initial classification corpus as the classification corpus of the set field.
In a fifth aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the method for modifying a text classifier according to any embodiment of the present invention.
In a sixth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for modifying a text classifier according to any embodiment of the present invention.
In a seventh aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the program, implements the method for constructing a corpus according to any embodiment of the present invention.
In an eighth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for constructing a classified corpus according to any embodiment of the present invention.
The embodiment of the invention provides a method, a device, equipment and a medium for correcting a classifier and constructing a classification corpus.
Drawings
Fig. 1 is a flowchart of a method for modifying a text classifier according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a method for constructing a classified corpus according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a modification apparatus of a text classifier in a third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an apparatus for constructing a classified corpus according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. A process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Example one
The embodiment of the present invention provides a method for correcting a text classifier, which is applicable to correcting a class center vector of a text classifier so as to improve the classification accuracy of the text classifier.
As shown in fig. 1, the method of this embodiment specifically includes:
s110, category center vectors respectively corresponding to at least two text categories of the classifier are obtained, and the category center vectors are obtained through calculation according to at least two category texts corresponding to the text categories.
A text classifier comprises a plurality of text categories, each corresponding to a category center vector. When the classifier classifies a text, the distance between the text and each category center vector is determined, and the text category whose center vector is closest to the text is taken as the text's category. The more accurate the category center vectors, the higher the accuracy of the classifier. This embodiment uses a text training set to correct the category center vectors of a text classifier so as to further improve its classification accuracy. Therefore, the original category center vector of each text category of the classifier is first obtained (i.e., the original category center vector of each text category in the text training set is obtained).
As an optional implementation manner of this embodiment, the category center vector may be calculated according to at least two category texts corresponding to the text categories, specifically: the category center vector is obtained by summing and normalizing the text feature vectors of at least two categories of text corresponding to the text categories.
For each text category, the feature vectors of the pre-classified category texts corresponding to that category are obtained and summed to produce a sum feature vector, which is then normalized to obtain the category center vector of the text category, i.e., the initial category center vector to be corrected.
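As an illustrative sketch (not from the patent), the summation-and-normalization step can be written in Python with NumPy; the function name and the toy vectors are hypothetical:

```python
import numpy as np

def category_center_vector(feature_vectors):
    """Sum the text feature vectors of one category's texts, then L2-normalize
    the sum to obtain the initial category center vector."""
    total = np.sum(np.asarray(list(feature_vectors), dtype=float), axis=0)
    norm = np.linalg.norm(total)
    return total / norm if norm > 0 else total

# Two toy 3-dimensional text feature vectors belonging to one category
center = category_center_vector([np.array([1.0, 0.0, 1.0]),
                                 np.array([3.0, 0.0, 1.0])])
```

Normalizing the sum keeps every category center on the unit sphere, so centers computed from categories with different numbers of texts remain comparable.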
S120, acquiring a corrected text of a set text type and a text feature vector of the corrected text.
Each text corresponding to the classifier to be corrected (i.e., each text in the classifier's training text set) affects the category center vector of each text category, so the degree of each text's effect on the center vectors must be determined before the correction operation is performed.
Each text corresponding to the classifier to be corrected is a corrected text; after each corrected text of each text category and its text feature vector are acquired in turn, the category center vectors are corrected.
As a specific implementation manner of this embodiment, the text feature vector of a corrected text may be obtained as follows: the text feature vector is generated after feature extraction, feature selection and feature weighting are performed on the corrected text. That is, the text feature vectors of all corrected texts can be obtained through these three steps, which are explained below using news texts.
(1) Feature extraction
The classifier will include several kinds of features: word features, url (uniform resource locator) features, and news column features.
The word features are generated by a word segmentation algorithm, typically one provided by a Natural Language Processing (NLP) tool.
url features are typically entered externally by engineers, and are also used in constructing the initial training set of the text classifier. A url feature usually takes the form of a url prefix: news whose url carries a certain prefix is treated as news of a certain category. For example, news with the prefix http://sports.qq.com/ is treated as sports news, news with the prefix http://art.people.com.cn/ as art news, news with the prefix http://invent.people.com.cn/ as finance news, and news with the prefix http://japan.people.com.cn/ as Japan news.
News column features take the following form: home page > XX > text. For example, news corresponding to the column feature "home page > entertainment scrolling news > text" is entertainment news, and news corresponding to "home page > sports > other > text" is sports news.
(2) Feature selection
After determining which features to use, the next step is to decide how to use them, i.e., feature selection. On one hand, feature selection reduces the dimensionality of the text feature vector, lowering the space-time cost of computation; on the other hand, it removes features with weak discriminating power under certain conditions, which can improve generalization.
Feature selection here comprises two steps. The first is stop-word removal, i.e., excluding common function words. The second is to rank the features from high to low by a statistical value and then remove the lower-ranked features. The statistical value can be the information gain, mutual information, the chi-square statistic, and so on; experimental comparison shows that the information gain gives the best overall performance. The basic principle and method of information gain calculation are as follows.
First, the concept of entropy needs to be introduced: entropy is a measure of uncertainty. For the text classification problem, entropy measures the category uncertainty of texts, which can be expressed through the probabilities with which texts fall into the various categories.
For the entire text training set, assume the distribution of the n text categories (i.e., the weight of each text category) is p1, p2, …, pn. Then the entropy of the text training set is:
H = -(p1*log2(p1) + p2*log2(p2) + … + pn*log2(pn))
Assume all texts fall into two categories, sports and non-sports, with 60 sports texts and 60 non-sports texts. The entropy of the whole text set is then: -0.5*log2(0.5)-0.5*log2(0.5)=1.
Calculating the information gain of a feature actually means calculating its information gain for text classification. The text training set is first divided into two parts according to whether the word appears. The entropy of each part is then computed, and the two entropies are summed, weighted by each part's number of texts. This weighted sum is smaller than the entropy of the undivided training set, and the difference is the information gain.
For the above example, suppose the word feature "yaoming" appears in 40 of the 60 sports texts and in none of the non-sports texts. The training set can then be divided into two parts according to whether "yaoming" appears:
In the first part, the 40 texts in which "yaoming" appears are all sports, so its entropy is:
-1.0*log2(1.0)-0*log2(0)=0
Note: 0*log2(0)=0 by the definition of entropy.
In the second part, of the 80 texts in which "yaoming" does not appear, 20 are sports and 60 are not, so the entropy is:
-0.25*log2(0.25)-0.75*log2(0.75)=0.8112781244591328
the weighted sum of the entropy values of the two parts is:
1/3*0+2/3*0.8112781244591328=0.5408520829727552
therefore, the information gain of the term "yaoming" is calculated as follows:
1-0.5408520829727552=0.4591479170272448
It can be seen that when a word occurs mainly in some text categories and rarely in others, its information gain is larger. Words with larger information gain are generally more important to classification, so feature selection should keep the features with the highest information gain values.
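A minimal Python sketch of the entropy and information-gain computation above, reproducing the "yaoming" example; the function names are illustrative, not from the patent:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a label list, with 0*log2(0) taken as 0 (empty counts never occur)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_present):
    """Entropy of the whole set minus the size-weighted entropy of the
    two parts induced by whether the feature appears in each text."""
    n = len(labels)
    with_f = [l for l, p in zip(labels, feature_present) if p]
    without = [l for l, p in zip(labels, feature_present) if not p]
    weighted = (len(with_f) / n) * entropy(with_f) + (len(without) / n) * entropy(without)
    return entropy(labels) - weighted

# 120 texts: 60 sports, 60 non-sports; "yaoming" appears in 40 sports texts only.
labels = ["sports"] * 60 + ["other"] * 60
present = [True] * 40 + [False] * 80
gain = information_gain(labels, present)
```

Running this reproduces the value derived in the text, approximately 0.4591.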
(3) Feature weighting
The greater the weight of a feature, the greater its influence on the classifier. A common weighting method is the tf-idf value. There are several variants of tf-idf; this embodiment uses the form:
tfidf_ij = log(tf_ij + 1.0) * log(N / (df_i + 1.0))
where tf_ij is the frequency of feature term_i in text d_j, N is the total number of texts in the text training set, and df_i is the document frequency of term_i, i.e., the number of texts in which the feature appears. tfidf_ij thus represents the weight of feature term_i in text d_j, i.e., the weight of the i-th dimension in the corresponding text feature vector.
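The tf-idf variant above can be sketched in Python as follows; the function name and the example figures are hypothetical:

```python
import math

def tfidf(tf_ij, df_i, n_texts):
    """tf-idf variant used above: log(tf + 1) * log(N / (df + 1))."""
    return math.log(tf_ij + 1.0) * math.log(n_texts / (df_i + 1.0))

# A feature occurring 3 times in a text, and appearing in 10 of 1000 texts overall
w = tfidf(3, 10, 1000)
```

The +1.0 terms keep both logarithms finite for unseen features, and the log on the term frequency dampens the influence of very frequent words.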
S130, according to the similarity between the text feature vector and the category center vector of each current text category of the classifier and the text category of the corrected text, correcting the category center vector corresponding to each text category in the classifier.
Each category center vector is corrected according to the similarity between the text feature vector and each category center vector and according to the text category of the corrected text. In other words, different texts may influence the center vectors to different degrees (different weights): some texts carry a larger weight, others a smaller one. In particular, misclassified samples in the text training set have a larger influence on the center vector weights, which reduces the text classification error rate.
As an optional implementation manner of this embodiment, modifying, according to the similarity between the text feature vector and the category center vector of each current text category of the classifier and the text category of the modified text, the category center vector respectively corresponding to each text category in the classifier may specifically be:
respectively calculating cosine similarity between the text feature vector and the category center vector of each current text category; acquiring a first cosine similarity of the text feature vector and a category center vector of a text category to which the modified text belongs, and a second cosine similarity of the text feature vector and a category center vector of a second text category, wherein the second cosine similarity is the maximum value of the cosine similarities of the text feature vector and the category center vectors of the remaining text categories; and if the ratio of the first cosine similarity to the second cosine similarity meets a set condition, correcting the category center vector of the text category to which the corrected text belongs and the category center vector of the second text category by using the ratio.
Specifically, the ratio of the first cosine similarity to the second cosine similarity satisfies a set condition, which is: the ratio is smaller than the first external parameter.
Specifically, the modifying the category center vector of the text category to which the modified text belongs and the category center vector of the second text category by using the ratio is that:
O1=O1+η*(β-S1/S2)*D,O2=O2-η*(β-S1/S2)*D;
wherein O1 is a category center vector of a text category to which the corrected text belongs, O2 is a category center vector of the second text category, β is a first external parameter, η is a second external parameter, S1 is the first cosine similarity, S2 is the second cosine similarity, and D is a text feature vector of the corrected text.
Specifically, the correction procedure of the center vector of each category is as follows:
For each text category, the cosine similarity between the text feature vector D of the corrected text and the category center vector is calculated. Cosine similarity evaluates the similarity of two vectors by calculating the cosine of the angle between them.
If the text category to which the modified text belongs in the text training set is the first text category, the cosine similarity between the text feature vector D and the category center vector O1 of the first text category is referred to as a first cosine similarity S1. The maximum cosine similarity value in the cosine similarities of the text feature vector D and the category center vectors of other text categories is referred to as a second cosine similarity S2, and the corresponding category center vector is O2. If the text training set only comprises two text categories, the cosine similarity of the text feature vector D and the category center vector of the other text category is called as second cosine similarity.
If the ratio of the first cosine similarity S1 to the second cosine similarity S2 meets the set condition (specifically, the ratio S1/S2 is smaller than the first external parameter β), the category center vectors of the two text categories are corrected and updated using the following correction formula:
O1=O1+η*(β-S1/S2)*D,O2=O2-η*(β-S1/S2)*D。
The first external parameter β may specifically be 1.2, a preferred experimental value. In this way, misclassified texts have a larger influence on the category center vector weights, which helps reduce the classification error rate. Moreover, even when a text is classified correctly (i.e., S1/S2 > 1.0), if its similarity to the center vector of its own category is not large enough (S1/S2 < 1.2), it can still influence the center vector weights. Setting β to 1.2 therefore helps fully exploit the information of correctly classified texts while limiting the excessive influence of wrongly classified texts on the weights, so that overfitting can be avoided.
The second external parameter η, a step size, is also set empirically in advance, and can be set to 0.01.
On the basis of the technical scheme, the method further comprises the following steps: if the ratio is determined to be larger than or equal to a first external parameter, the class center vector of the text class to which the modified text belongs and the class center vector of the second text class are not modified; and returning to execute the operation of acquiring the corrected text of a set text type and the text feature vector of the corrected text.
That is, if the ratio of a corrected text's first cosine similarity S1 to its second cosine similarity S2 is greater than or equal to the first external parameter β (for example, S1/S2 ≥ 1.2), the corrected text is classified correctly and its similarity to the center vector of its own category is large enough, so it need not correct the corresponding category center vectors.
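The whole correction step, including the no-correction branch, can be sketched in Python with NumPy; the function names, toy category names and vectors are hypothetical, and β and η take the example values 1.2 and 0.01 from the text:

```python
import numpy as np

def cos_sim(a, b):
    # Cosine of the angle between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def correct_centers(centers, true_cat, d, beta=1.2, eta=0.01):
    """One correction step for one corrected text.

    centers: dict mapping category name -> center vector (np.ndarray)
    true_cat: labelled text category of the corrected text
    d: text feature vector of the corrected text
    Returns True if a correction was applied, False otherwise.
    """
    s1 = cos_sim(d, centers[true_cat])                    # first cosine similarity
    other = max((c for c in centers if c != true_cat),
                key=lambda c: cos_sim(d, centers[c]))     # second text category
    s2 = cos_sim(d, centers[other])                       # second cosine similarity
    if s1 / s2 >= beta:   # classified correctly and confidently enough
        return False
    step = eta * (beta - s1 / s2) * d
    centers[true_cat] = centers[true_cat] + step   # O1 = O1 + eta*(beta - S1/S2)*D
    centers[other] = centers[other] - step         # O2 = O2 - eta*(beta - S1/S2)*D
    return True

# Toy example: a text labelled "a" that actually lies closer to "b",
# i.e. a misclassified sample; it pulls both centers relatively strongly.
centers = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}
changed = correct_centers(centers, "a", np.array([0.6, 0.8]))
```

Because (β - S1/S2) grows as the text drifts away from its labelled category, misclassified texts produce larger steps than marginally correct ones, matching the behavior described above.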
And S140, returning to execute the operation of obtaining the corrected text of a set text type and the text feature vector of the corrected text until the correction finishing condition is met, and obtaining the corrected classifier.
Steps S120 and S130 are executed repeatedly, each text in the text training library correcting the category center vector of each text category in turn, until the category center vectors converge. The converged vectors serve as the corrected category center vectors of the classifier's text categories and are used for classification prediction of texts to be classified.
When the corrected classifier classifies a text to be classified, feature extraction, feature selection and feature weighting are first performed on the text to obtain its text feature vector. The cosine similarity between this feature vector and each corrected category center vector is then computed, and the text category whose center vector has the largest cosine similarity is selected as the predicted category of the text.
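The prediction rule can be sketched as follows; the function name and the toy center vectors are hypothetical:

```python
import numpy as np

def predict_category(centers, d):
    """Return the category whose corrected center vector has the largest
    cosine similarity with the feature vector d of the text to classify."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(centers, key=lambda name: cos(centers[name], d))

# Toy corrected center vectors and a text feature vector to classify
centers = {"sports": np.array([0.9, 0.1]), "economy": np.array([0.1, 0.9])}
pred = predict_category(centers, np.array([0.8, 0.3]))
```

Since every center is normalized, maximizing cosine similarity here is equivalent to the minimum-distance rule mentioned at the start of this embodiment.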
In the text classifier correction method provided by this embodiment, each text in the text training library is used to correct the category center vectors, and different texts influence the center vectors to different degrees: misclassified texts have a larger influence on the center vector weights, which reduces the error rate of text classification. Moreover, even a sample that is correctly classified in the text training set can still influence the center vector weights if its similarity to the center vector of its own category is not large enough. This helps fully exploit the information of correct texts while preventing the weights from being excessively influenced by wrong texts, thereby avoiding overfitting.
Example two
The embodiment provides a method for constructing a classification corpus, which is applicable to automatically constructing and purifying a classification corpus based on a small number of seed vocabularies. As shown in fig. 2, the method of this embodiment specifically includes:
S210, pre-classifying the at least two texts according to seed vocabularies corresponding to the at least two text categories of the pre-specified set field, and constructing an initial classification corpus.
For each text category of the set field, a plurality of seed vocabularies are manually specified, and the seed vocabularies are used for pre-classifying the text of the set field, so that an initial classification corpus is constructed.
Specifically, the set field may be a news field, and correspondingly, the classified corpus may be a news classified corpus.
For text classification in the news field, it suffices to specify several seed vocabularies for each text category; these vocabularies are matched against the url, title, or news column of news texts to obtain an initial training set, namely the initial news classification corpus.
For example, the seed vocabularies "finance", "currency", and "inflation" are assigned to the economy category; a news text whose url contains "finance" is then put into the economy category. Likewise, an article whose title hits a seed vocabulary, for instance "Central bank slightly raises monetary tool rates, releasing a continued deleveraging signal", is also put into the economy category. Similarly, when the news column hits a keyword, the news under that column is put into the corresponding category: for example, news under "home > entertainment scrolling news > body" is put into the entertainment category, and news under "home > sports > other > body" into the sports category.
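The seed-vocabulary pre-classification can be sketched as follows; the field names (`url`, `title`, `column`), the function name, and the seed lists are illustrative assumptions:

```python
def preclassify(news, seed_words):
    """Assign a news item to the first category whose seed vocabulary
    hits its url, title, or column; unmatched items are left out of
    the initial classification corpus."""
    for category, seeds in seed_words.items():
        for word in seeds:
            if any(word in news.get(field, "")
                   for field in ("url", "title", "column")):
                return category
    return None

# hypothetical seed vocabularies per category
seeds = {
    "economy": ["finance", "currency", "inflation"],
    "entertainment": ["entertainment"],
    "sports": ["sports"],
}
hit = preclassify({"url": "http://news.example.com/finance/123",
                   "title": "", "column": ""}, seeds)
```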
S220, training to obtain an initial classifier serving as a classifier to be corrected according to seed vocabularies corresponding to the at least two text categories of the set field.
The initial classification corpus is noisy: some texts in it may be classified incorrectly. An initial classifier is trained on the initial classification corpus constructed from the seed vocabularies corresponding to the at least two text categories of the set field; this initial classifier serves as the classifier to be corrected and is then further correction-trained using the texts in the initial classification corpus.
And S230, using the currently stored text in the initial classification corpus as a correction text, and correcting the current classifier to be corrected by adopting the correction method of the text classifier according to any embodiment of the invention to obtain the text classifier.
Each text in the initial classification corpus is used as a corrected text, and the category center vector of each text category of the current classifier to be corrected is corrected with the correction method of the text classifier described in the foregoing embodiment to obtain the text classifier; the detailed correction method is described above and is not repeated here.
S240, classifying the texts in the initial news classification corpus by using the text classifier, and deleting from the initial classification corpus the texts whose classification result from the text classifier is inconsistent with the pre-classification result.
The texts in the initial news classification corpus are classified with the corrected text classifier, and the texts whose classifier result is inconsistent with the pre-classification result are removed from the initial classification corpus, thereby removing noise from the corpus.
Specifically, classifying the texts in the initial news classification corpus by using a text classifier includes:
acquiring text feature vectors of target texts in an initial classification corpus;
respectively calculating cosine similarity of the text feature vector of the target text and the category center vector of each text category corresponding to the text classifier;
and taking the text category where the category center vector matched with the cosine similarity maximum value is located as the classification result of the target text.
S250, after the text classifier is used as a new classifier to be corrected, the operation of using the texts in the initial classification corpus as corrected texts and correcting the current classifier to be corrected with the correction method of the text classifier according to any embodiment of the invention to obtain the text classifier is executed again, until a preset text denoising condition is met; the current initial classification corpus is then used as the classification corpus of the set field.
And repeating the steps S230 and S240, correcting the updated classifier to be corrected until a preset text denoising condition is met, and taking the current denoised initial classification corpus as a classification corpus in a set field.
The text denoising condition may be that the result of classifying the texts in the updated initial classification corpus with the corrected text classifier is completely consistent with the pre-classification result, so that no noise needs to be deleted; or that the error between that classification result and the pre-classification result is smaller than a set error, for example smaller than 1% or 0.5%.
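The retrain-and-denoise loop of steps S230 through S250 can be sketched as follows; `train_classifier`, `classify`, and the 1% threshold are assumed stand-ins, not prescribed by the patent:

```python
def build_corpus(initial_corpus, train_classifier, classify, max_noise=0.01):
    """Iteratively retrain and denoise: drop every text whose classifier
    label disagrees with its pre-classification label, until the noise
    ratio falls below max_noise (1% here, an illustrative threshold).

    initial_corpus: list of (feature_vector, pre_label) pairs;
    train_classifier and classify are assumed helper callables."""
    corpus = list(initial_corpus)
    while True:
        clf = train_classifier(corpus)
        kept = [(v, y) for v, y in corpus if classify(clf, v) == y]
        noise_ratio = 1 - len(kept) / len(corpus)
        corpus = kept
        if noise_ratio < max_noise:
            return corpus, clf

# Toy run with a stand-in classifier (assumed, for illustration only):
# label "a" if the first feature dominates, else "b".
demo = [((1, 0), "a"), ((0.9, 0.1), "a"), ((0, 1), "b"), ((1, 0), "b")]
cleaned, _ = build_corpus(demo,
                          lambda c: None,
                          lambda clf, v: "a" if v[0] >= v[1] else "b")
# the mislabeled ((1, 0), "b") text is deleted, leaving three texts
```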
According to the method for constructing the classification corpus provided by this embodiment, a classification corpus can be constructed automatically from a small number of seed words and purified through denoising. The construction method only requires specifying a few seed vocabularies, which reduces the workload of manually labeling corpora and overcomes the drawbacks of a static corpus, which is prone to becoming outdated and cannot classify texts on new content well.
EXAMPLE III
The embodiment provides a correction device for a text classifier, which is applicable to the situation that the classification center vector of the text classifier is corrected so as to improve the classification accuracy of the text classifier, and the correction device can be implemented in a software and/or hardware manner and can be generally integrated in a processor. As shown in fig. 3, the apparatus includes: a category center vector obtaining module 310, a modified text obtaining module 320, a classifier modification module 330, and a loop operation module 340, wherein:
a category center vector obtaining module 310, configured to obtain category center vectors respectively corresponding to at least two text categories of the classifier, where the category center vectors are obtained by calculation according to at least two category texts corresponding to the text categories;
a modified text obtaining module 320, configured to obtain a modified text of a set text type and a text feature vector of the modified text;
a classifier modification module 330, configured to modify, according to a similarity between the text feature vector and a category center vector of each current text category of the classifier and a text category of the modified text, a category center vector corresponding to each text category in the classifier;
and the loop operation module 340 is configured to return to execute the operation of obtaining the corrected text of the set text type and the text feature vector of the corrected text until a condition for finishing correction is met, so as to obtain a corrected classifier.
According to the correction device of the text classifier provided by this embodiment, the category center vectors are corrected according to the similarity between the text feature vector of the corrected text and the current category center vectors, together with the text category of the corrected text. As a result, each text influences the center vector of its category to a different degree; in particular, a wrongly classified text exerts a larger correction on the center vectors, which reduces the error rate of text classification.
Specifically, the classifier modification module 330 includes: a cosine similarity calculation unit, a first and second cosine similarity acquisition unit and a correction unit, wherein,
the cosine similarity calculation unit is used for calculating the cosine similarity between the text feature vector and the category center vector of each current text category;
a first and second cosine similarity obtaining unit, configured to obtain a first cosine similarity between the text feature vector and a category center vector of a text category to which the modified text belongs, and a second cosine similarity between the text feature vector and a category center vector of a second text category, where the second cosine similarity is a maximum value among cosine similarities between the text feature vector and category center vectors of other text categories;
and the correcting unit is used for correcting the category center vector of the text category to which the corrected text belongs and the category center vector of the second text category by using the ratio if the ratio of the first cosine similarity to the second cosine similarity meets a set condition.
The correction unit is specifically configured to, if the ratio of the first cosine similarity to the second cosine similarity is smaller than a first external parameter, correct the category center vector of the text category to which the corrected text belongs and the category center vector of the second text category by using the following formula:
O1=O1+η*(β-S1/S2)*D,O2=O2-η*(β-S1/S2)*D;
wherein O1 is a category center vector of a text category to which the corrected text belongs, O2 is a category center vector of the second text category, β is a first external parameter, η is a second external parameter, S1 is the first cosine similarity, S2 is the second cosine similarity, and D is a text feature vector of the corrected text.
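The update rule above can be written directly in code; the function name and the demonstration numbers are illustrative:

```python
import numpy as np

def correct_centers(O1, O2, D, S1, S2, beta, eta):
    """Apply the correction O1 = O1 + eta*(beta - S1/S2)*D and
    O2 = O2 - eta*(beta - S1/S2)*D when S1/S2 < beta; otherwise the
    centers are left unchanged. beta (first external parameter) and
    eta (second external parameter) must be supplied by the caller."""
    ratio = S1 / S2
    if ratio >= beta:  # confident classification: no correction
        return O1, O2
    step = eta * (beta - ratio)
    # pull the own-class center toward D, push the competing center away
    return O1 + step * D, O2 - step * D

# illustrative numbers: S1/S2 = 0.5 < beta = 0.95, so step = 0.1 * 0.45
new_O1, new_O2 = correct_centers(np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                                 np.array([1.0, 1.0]), S1=0.5, S2=1.0,
                                 beta=0.95, eta=0.1)
```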
On the basis of the above technical solution, the classifier modification module 330 further includes: a correction ignore unit and a loop unit, wherein,
a correction ignoring unit, configured to not correct the category center vector of the text category to which the corrected text belongs and the category center vector of the second text category if it is determined that the ratio is greater than or equal to a first external parameter;
and the circulating unit is used for returning and executing the operation of acquiring the corrected text of a set text type and the text feature vector of the corrected text.
On the basis of the foregoing technical solution, the category center vector obtaining module 310 is specifically configured to obtain category center vectors corresponding to at least two text categories of the classifier, where each category center vector is obtained by summing and normalizing the text feature vectors of the at least two category texts corresponding to that text category.
The modified text obtaining module 320 is specifically configured to obtain a modified text of a set text type, and generate a text feature vector of the modified text after performing feature extraction, feature selection, and feature weighting on the modified text.
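The feature pipeline this module describes can be sketched as follows, under the assumption of TF-IDF style weighting (the patent does not fix a particular weighting scheme, and the function name is illustrative):

```python
import math
from collections import Counter

def build_feature_vector(doc_tokens, corpus_tokens, vocab):
    """Minimal sketch: term-frequency weighting scaled by inverse
    document frequency over a selected vocabulary (feature selection),
    then L2-normalised so cosine comparisons are scale-free.
    Tokenisation (feature extraction) is assumed already done."""
    n_docs = len(corpus_tokens)
    tf = Counter(doc_tokens)
    vec = []
    for term in vocab:
        df = sum(1 for doc in corpus_tokens if term in doc)
        idf = math.log((n_docs + 1) / (df + 1)) + 1  # smoothed idf
        vec.append(tf[term] * idf)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

vec = build_feature_vector(["currency", "currency"],
                           [["currency", "rate"], ["sports", "ball"]],
                           ["currency", "sports"])
```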
The correction device of the text classifier can execute the correction method of the text classifier provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executed correction method of the text classifier.
Example four
The embodiment provides a device for constructing a classification corpus, which is applicable to automatically constructing and purifying a classification corpus based on a small number of seed vocabularies. As shown in fig. 4, the apparatus includes: an initial classification corpus construction module 410, a to-be-corrected classifier training module 420, a text classifier generation module 430, an initial classification corpus update module 440, and a loop operation module 450, wherein,
an initial classification corpus construction module 410, configured to perform pre-classification on at least two texts according to seed vocabularies corresponding to at least two text categories of a pre-specified set field, and construct an initial classification corpus;
a to-be-corrected classifier training module 420, configured to train to obtain an initial classifier as the to-be-corrected classifier according to seed vocabularies corresponding to the at least two text categories in the set field;
a text classifier generating module 430, configured to use a currently stored text in the initial classification corpus as a corrected text, and correct the current classifier to be corrected by using the correction method of the text classifier according to any embodiment of the present invention, so as to obtain a text classifier;
an initial classification corpus updating module 440, configured to classify the text in the initial classification corpus by using the text classifier, and delete the text with the classification result of the text classifier inconsistent with the pre-classification result from the initial classification corpus;
and a loop operation module 450, configured to return to execute the operation of using the text in the initial classification corpus as a corrected text after the text classifier is used as a new classifier to be corrected, correct the current correction classifier by using the correction method of the text classifier according to any embodiment of the present invention, so as to obtain an operation of the text classifier until a preset text denoising condition is met, and use the current initial classification corpus as the classification corpus in the set field.
The device for constructing the classification corpus provided by this embodiment can automatically construct a classification corpus from a small number of seed words and purifies the corpus through denoising. Only a few seed vocabularies need to be specified, which reduces the workload of manually labeling corpora and overcomes the drawbacks of a static corpus, which is prone to becoming outdated and cannot classify texts on new content well.
Specifically, the initial classification corpus updating module 440 includes: a text feature vector obtaining unit, a cosine similarity calculating unit and a classifying unit, wherein,
a text feature vector obtaining unit, configured to obtain a text feature vector of a target text in the initial classification corpus;
the cosine similarity calculation unit is used for calculating cosine similarity between the text feature vector of the target text and the category center vector of each text category corresponding to the text classifier;
and the classification unit is used for taking the text category where the category center vector matched with the cosine similarity maximum value is located as the classification result of the target text.
Specifically, the classification corpus is a news classification corpus.
The device for constructing the classified corpus can execute the method for constructing the classified corpus provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executed method for constructing the classified corpus.
EXAMPLE five
Fig. 5 is a schematic structural diagram of a computer apparatus according to a fifth embodiment of the present invention, as shown in fig. 5, the computer apparatus includes a processor 510, a memory 520, an input device 530, and an output device 540; the number of the processors 510 in the computer device may be one or more, and one processor 510 is taken as an example in fig. 5; the processor 510, the memory 520, the input device 530 and the output device 540 in the computer apparatus may be connected by a bus or other means, and the connection by the bus is exemplified in fig. 5.
The memory 520 is a computer readable storage medium, and can be used to store software programs, computer executable programs, and modules, such as program instructions/modules corresponding to the method for modifying a text classifier in any embodiment of the present invention (for example, the class center vector obtaining module 310, the modified text obtaining module 320, the classifier modifying module 330, and the loop operation module 340 in the device for modifying a text classifier), and program instructions/modules corresponding to the method for constructing a classified corpus in any embodiment of the present invention (for example, the initial classified corpus constructing module 410, the training module 420 for the classifier to be modified, the text classifier generating module 430, the initial classified corpus updating module 440, and the loop operation module 450 in the device for constructing a classified corpus). The processor 510 executes various functional applications and data processing of the computer device by executing software programs, instructions and modules stored in the memory 520, that is, implements the operations for the computer device described above.
The memory 520 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the device, and the like. Further, the memory 520 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 520 may further include memory located remotely from processor 510, which may be connected to a computer device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 530 may be used to receive input information and generate key signal inputs related to user settings and function control of the computer apparatus. The output device 540 may include a display device such as a display screen.
EXAMPLE six
The sixth embodiment of the present invention further provides a storage medium containing computer-executable instructions, where a computer program is stored on the storage medium, and when the computer program is executed by a processor, the method for correcting a text classifier provided in any embodiment of the present invention is implemented, where the method includes:
acquiring category central vectors respectively corresponding to at least two text categories of a classifier, wherein the category central vectors are obtained by calculation according to at least two category texts corresponding to the text categories;
acquiring a corrected text of a set text type and a text feature vector of the corrected text;
according to the similarity between the text feature vector and the category center vector of each current text category of the classifier and the text category of the corrected text, correcting the category center vector corresponding to each text category in the classifier;
and returning to execute the operation of obtaining a corrected text of a set text type and the text feature vector of the corrected text until the condition of finishing correction is met, and obtaining the corrected classifier.
Alternatively, when being executed by a processor, the program implements a method for constructing a classification corpus according to any embodiment of the present invention, where the method includes:
presorting at least two texts according to seed vocabularies corresponding to at least two text categories of a pre-specified set field, and constructing an initial classification corpus;
training to obtain an initial classifier serving as a classifier to be corrected according to seed vocabularies corresponding to at least two text categories of the set field;
using the currently stored text in the initial classification corpus as a corrected text, and correcting the current classifier to be corrected by adopting the correction method of the text classifier in any embodiment of the invention to obtain a text classifier;
classifying the texts in the initial news classification corpus by using the text classifier, and deleting the texts with the classification results of the text classifier inconsistent with the pre-classification results from the initial classification corpus;
and after the text classifier is used as a new classifier to be corrected, returning to execute the operation of using the text in the initial classification corpus as a corrected text, correcting the current corrected classifier by adopting the correction method of the text classifier in any embodiment of the invention to obtain the operation of the text classifier until the preset text denoising condition is met, and using the current initial classification corpus as the classification corpus of the set field.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes instructions for causing a computer device to execute the method according to the embodiments of the present invention.
It should be noted that, in the embodiment of the correction apparatus for a text classifier and the construction apparatus for a classified corpus, each included unit and module are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be realized; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (14)

1. A method for modifying a text classifier, comprising:
acquiring category central vectors respectively corresponding to at least two text categories of a classifier, wherein the category central vectors are obtained by calculation according to at least two category texts corresponding to the text categories;
acquiring a corrected text of a set text type and a text feature vector of the corrected text;
according to the similarity between the text feature vector and the category center vector of each current text category of the classifier and the text category of the corrected text, correcting the category center vector corresponding to each text category in the classifier;
returning to execute the operation of obtaining a corrected text of a set text type and a text feature vector of the corrected text until a correction finishing condition is met, and obtaining a corrected classifier;
wherein, according to the similarity between the text feature vector and the category center vector of each current text category of the classifier and the text category of the corrected text, correcting the category center vector corresponding to each text category in the classifier respectively comprises: respectively calculating cosine similarity of the text feature vector and the category center vector of each current text category;
acquiring a first cosine similarity of the text feature vector and a category center vector of a text category to which the modified text belongs, and a second cosine similarity of the text feature vector and a category center vector of a second text category, wherein the second cosine similarity is the maximum value of the cosine similarities of the text feature vector and the category center vectors of the remaining text categories;
and if the ratio of the first cosine similarity to the second cosine similarity meets a set condition, correcting the category center vector of the text category to which the corrected text belongs and the category center vector of the second text category by using the ratio.
2. The method of claim 1, wherein a ratio of the first cosine similarity to the second cosine similarity satisfies a predetermined condition, comprising:
the ratio is less than a first external parameter;
modifying the category center vector of the text category to which the modified text belongs and the category center vector of the second text category by using the ratio, wherein the modifying comprises the following steps:
O1=O1+η*(β-S1/S2)*D,O2=O2-η*(β-S1/S2)*D;
wherein O1 is a category center vector of a text category to which the corrected text belongs, O2 is a category center vector of the second text category, β is a first external parameter, η is a second external parameter, S1 is the first cosine similarity, S2 is the second cosine similarity, and D is a text feature vector of the corrected text.
3. The method of claim 2, wherein modifying the class center vector in the classifier corresponding to each of the text classes according to the similarity between the text feature vector and the class center vector of each of the current text classes of the classifier and the text class of the modified text, further comprises:
if the ratio is determined to be larger than or equal to a first external parameter, the class center vector of the text class to which the modified text belongs and the class center vector of the second text class are not modified;
and returning to execute the operation of acquiring the corrected text of a set text type and the text feature vector of the corrected text.
4. The method of claim 1, wherein the category center vector is computed from at least two categories of text corresponding to categories of text, comprising:
the category center vector is obtained by summing and normalizing text feature vectors of at least two categories of texts corresponding to the text categories.
5. The method of claim 1, wherein obtaining the text feature vector of the revised text comprises:
and performing feature extraction, feature selection and feature weighting on the corrected text to generate a text feature vector of the corrected text.
6. A method for constructing a classification corpus is characterized by comprising the following steps:
presorting at least two texts according to seed vocabularies corresponding to at least two text categories of a pre-specified set field, and constructing an initial classification corpus;
training to obtain an initial classifier serving as a classifier to be corrected according to seed vocabularies corresponding to at least two text categories of the set field;
using the text currently stored in the initial classification corpus as a correction text, and correcting the current classifier to be corrected by adopting the method according to any one of claims 1 to 5 to obtain a text classifier;
classifying the texts in the initial news classification corpus by using the text classifier, and deleting the texts with the classification results of the text classifier inconsistent with the pre-classification results from the initial classification corpus;
and after the text classifier is used as a new classifier to be corrected, returning to execute the operation of using the text in the initial classification corpus as a corrected text, correcting the current correction classifier by adopting the method of any one of claims 1 to 5 to obtain the operation of the text classifier until the preset text denoising condition is met, and using the current initial classification corpus as the classification corpus of the set field.
7. The method of claim 6, wherein classifying the text in the initial news classification corpus using the text classifier comprises:
acquiring text feature vectors of target texts in the initial classification corpus;
respectively calculating cosine similarity of the text feature vector of the target text and the category center vector of each text category corresponding to the text classifier;
and taking the text category where the category center vector matched with the cosine similarity maximum value is located as the classification result of the target text.
8. The method of claim 6, wherein the classified corpus is a news classified corpus.
9. A correction device for a text classifier, comprising:
the classification center vector acquisition module is used for acquiring classification center vectors respectively corresponding to at least two text classifications of the classifier, and the classification center vectors are obtained by calculation according to at least two classification texts corresponding to the text classifications;
the correction text acquisition module is used for acquiring a correction text of a set text type and a text feature vector of the correction text;
a classifier correction module, configured to correct a class center vector corresponding to each of the text classes in the classifier according to a similarity between the text feature vector and a class center vector of each of the text classes of the classifier and a text class of the corrected text;
and the circular operation module is used for returning and executing the operation of obtaining the corrected text of a set text type and the text characteristic vector of the corrected text until the condition of finishing correction is met, so as to obtain the corrected classifier.
10. An apparatus for constructing a classification corpus, comprising:
an initial classification corpus establishing module, configured to pre-classify at least two texts according to seed vocabularies corresponding to at least two text categories of a pre-specified set field, and to establish an initial classification corpus;
a to-be-corrected classifier training module, configured to train an initial classifier, as the classifier to be corrected, according to the seed vocabularies corresponding to the at least two text categories of the set field;
a text classifier generating module, configured to use the texts currently stored in the initial classification corpus as corrected texts, and to correct the current classifier to be corrected by using the method according to any one of claims 1 to 5, so as to obtain a text classifier;
an initial classification corpus updating module, configured to classify the texts in the initial classification corpus by using the text classifier, and to delete, from the initial classification corpus, texts whose classification results from the text classifier are inconsistent with their pre-classification results;
and a loop operation module, configured to, after taking the text classifier as a new classifier to be corrected, return to and execute the operations of using the texts in the initial classification corpus as corrected texts and correcting the current classifier to be corrected by using the method according to any one of claims 1 to 5 to obtain a text classifier, until a preset text denoising condition is met, and to use the current initial classification corpus as the classification corpus of the set field.
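The iterative denoising loop of this claim can be sketched as follows. This is a minimal illustration under stated assumptions: `train_classifier` and `classify` are hypothetical stand-ins for the classifier correction of claims 1 to 5, and the preset text denoising condition is taken to be "no text was deleted in a round"; the claim excerpt does not fix either choice.

```python
def build_classification_corpus(initial_corpus, train_classifier, classify,
                                max_rounds=10):
    """Iteratively denoise a pre-classified corpus: train a classifier on the
    current corpus, delete texts whose predicted category disagrees with their
    pre-assigned category, and repeat until no text is deleted (the assumed
    denoising condition) or max_rounds is reached."""
    corpus = dict(initial_corpus)  # text -> pre-assigned category
    for _ in range(max_rounds):
        clf = train_classifier(corpus)
        kept = {t: cat for t, cat in corpus.items() if classify(clf, t) == cat}
        if len(kept) == len(corpus):
            break  # nothing removed this round: the corpus is stable
        corpus = kept
    return corpus
```

Each round thus tightens the corpus: texts the retrained classifier no longer agrees with are treated as noise and removed before the next round.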
11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-5 when executing the program.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-5.
13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 6-8 when executing the program.
14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 6-8.
CN201810097359.3A 2018-01-31 2018-01-31 Method, device, equipment and medium for correcting classifier and constructing classification corpus Active CN108319682B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810097359.3A CN108319682B (en) 2018-01-31 2018-01-31 Method, device, equipment and medium for correcting classifier and constructing classification corpus


Publications (2)

Publication Number Publication Date
CN108319682A CN108319682A (en) 2018-07-24
CN108319682B true CN108319682B (en) 2021-12-28

Family

ID=62888500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810097359.3A Active CN108319682B (en) 2018-01-31 2018-01-31 Method, device, equipment and medium for correcting classifier and constructing classification corpus

Country Status (1)

Country Link
CN (1) CN108319682B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046634B (en) * 2018-12-04 2021-04-27 创新先进技术有限公司 Interpretation method and device of clustering result
CN112988954B (en) * 2021-05-17 2021-09-21 腾讯科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101226521A (en) * 2008-02-18 2008-07-23 南京大学 Machine learning method for ambiguity data object estimation modeling
JP2016206986A (en) * 2015-04-23 2016-12-08 日本電信電話株式会社 Clustering device, method, and program
CN107480426A (en) * 2017-07-20 2017-12-15 广州慧扬健康科技有限公司 From iteration case history archive cluster analysis system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on text classification and user clustering in online medical communities based on LDA model feature selection; Wu Jiang et al.; Journal of the China Society for Scientific and Technical Information (情报学报); 2017-11-24; full text *


Similar Documents

Publication Publication Date Title
US11544474B2 (en) Generation of text from structured data
CN110069709B (en) Intention recognition method, device, computer readable medium and electronic equipment
She An iterative algorithm for fitting nonconvex penalized generalized linear models with grouped predictors
CN109697289B (en) Improved active learning method for named entity recognition
CN107180084B (en) Word bank updating method and device
CN108776709B (en) Computer-readable storage medium and dictionary updating method
CN110472005B (en) Unsupervised keyword extraction method
WO2015165372A1 (en) Method and apparatus for classifying object based on social networking service, and storage medium
CN113268995A (en) Chinese academy keyword extraction method, device and storage medium
CN103605691B (en) Device and method used for processing issued contents in social network
CN112256822A (en) Text search method and device, computer equipment and storage medium
CN109492217B (en) Word segmentation method based on machine learning and terminal equipment
CN104536979B (en) The generation method and device of topic model, the acquisition methods and device of theme distribution
CN111241813B (en) Corpus expansion method, apparatus, device and medium
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
Rohart et al. Selection of fixed effects in high dimensional linear mixed models using a multicycle ECM algorithm
CN107491434A (en) Text snippet automatic generation method and device based on semantic dependency
CN101404033A (en) Automatic generation method and system for noumenon hierarchical structure
WO2015192798A1 (en) Topic mining method and device
CN108287848B (en) Method and system for semantic parsing
CN108319682B (en) Method, device, equipment and medium for correcting classifier and constructing classification corpus
CN107766419B (en) Threshold denoising-based TextRank document summarization method and device
CN111522953B (en) Marginal attack method and device for naive Bayes classifier and storage medium
CN112632261A (en) Intelligent question and answer method, device, equipment and storage medium
CN112765357A (en) Text classification method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant