CN110969172A - Text classification method and related equipment - Google Patents

Text classification method and related equipment Download PDF

Info

Publication number
CN110969172A
CN110969172A (application CN201811142322.4A)
Authority
CN
China
Prior art keywords
text
target
cluster
classified
central
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811142322.4A
Other languages
Chinese (zh)
Inventor
徐乐乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Douyu Network Technology Co Ltd
Original Assignee
Wuhan Douyu Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Douyu Network Technology Co Ltd filed Critical Wuhan Douyu Network Technology Co Ltd
Priority to CN201811142322.4A priority Critical patent/CN110969172A/en
Publication of CN110969172A publication Critical patent/CN110969172A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a text classification method and related equipment, which are used for improving the speed and quality of text classification. The method comprises the following steps: dividing a training text set into a first cluster set; determining a central text of each cluster in the first cluster set to obtain a first text set; calculating the similarity of each text in the second text set and each center text in the first text set; allocating the first target texts in the second text set to the first target clusters to obtain a second cluster set; determining the central text of each cluster in the second cluster set to obtain a third text set; calculating the similarity between the text to be classified and each central text in the third text set; calculating the weight values of the text to be classified and the N text classifications based on the similarity of the text to be classified and each central text in the third text set; and determining the target text classification as the text classification of the text to be classified, wherein the target text classification is the text classification with the highest weight value with the text to be classified in the N text classifications.

Description

Text classification method and related equipment
Technical Field
The invention relates to the field of big data, in particular to a text classification method and related equipment.
Background
In order to improve the relevance and interestingness of robot conversation, an important approach, besides optimizing the conversation algorithm itself, is to supply high-quality corpus texts: enlarging the robot's conversation corpus can significantly improve the relevance and interestingness of its dialogue.
A corpus is the training data of a language understanding model. Because both intent recognition and entity extraction are supervised models, the training data they require is annotated data. The raw corpus required by a chat robot's language understanding model consists of the utterances (queries) that humans address to the robot; these queries are labeled to form the training corpus.
Disclosure of Invention
The embodiment of the invention provides a text classification method and related equipment, which are used for improving the speed and the quality of text classification.
A first aspect of an embodiment of the present invention provides a text classification method, including:
dividing a training text set into a first cluster set, wherein the training text set comprises N text classifications, and N is a positive integer greater than or equal to 2;
determining a central text of each cluster in the first cluster set to obtain a first text set;
calculating the similarity of each text in a second text set and each central text in the first text set, wherein the second text set is a set of other texts in the training text set except the central text in the first text set;
allocating a first target text in the second text set to a first target cluster to obtain a second cluster set, where the first target text is any one text in the second text set, and the first target cluster is a cluster corresponding to a central text in the first text set, where the central text has the greatest similarity to the first target text;
determining the central text of each cluster in the second cluster set to obtain a third text set;
calculating the similarity between the text to be classified and each central text in the third text set;
calculating the weight values of the text to be classified and the N text classifications based on the similarity of the text to be classified and each central text in the third text set;
and determining a target text classification as the text classification of the text to be classified, wherein the target text classification is the text classification with the highest weight value with the text to be classified in the N text classifications.
Optionally, after determining the target text classification as the text classification of the text to be classified, the method further includes:
calculating the similarity between the text to be classified and a second target text in the target text classification, wherein the second target text is any one text in the target text classification;
determining similarity between the second target text and other texts in a cluster corresponding to the second target text;
and when the similarity between the text to be classified and the central text in the target text classification is smaller than the similarity between the second target text and other texts in the cluster corresponding to the second target text, deleting the text to be classified from the target text classification.
Optionally, the determining the central text of each cluster in the second cluster set to obtain a third text set includes:
calculating the sum of Euclidean distances between each text in a second target cluster and other texts in the second target cluster, wherein the second target cluster is any one cluster in the second cluster set;
and determining the text with the minimum sum of Euclidean distances to the other texts in the second target cluster as the central text of the second target cluster, so as to obtain the third text set.
Optionally, the calculating the similarity between the text to be classified and each central text in the third text set includes:
calculating the similarity between the text to be classified and each central text in the third text set by the following formula:
Sim(d, d_i) = ( Σ_{j=1..n} X_j · x_ij ) / ( sqrt( Σ_{j=1..n} X_j^2 ) · sqrt( Σ_{j=1..n} x_ij^2 ) )

wherein S is the training text set, N is the number of text classifications in the training text set S, M is the total number of texts in the training text set S, d_i is the ith central text in the third text set, x_ij is the weight of the jth dimension of d_i, n is the number of dimensions of the ith central text, d is the feature vector of the text to be classified, and X_j is the weight of the jth dimension of d.
Optionally, the calculating the weight values of the text to be classified and the N text classifications based on the similarity between the text to be classified and each center text in the third text set includes:
calculating the weight values of the text to be classified and the N text classifications according to the following formula:
W(d, C_j) = Σ_i Sim(d, d_i) · g(d, C_j)

wherein Sim(d, d_i) is the similarity between d and d_i, the sum runs over the central texts d_i nearest to the text d to be classified, and g(d, C_j) is the category attribute function of the jth category in the training text set S:

g(d, C_j) = u(d_i, C_j) if d_i ∈ C_j; g(d, C_j) = 0 otherwise;

wherein u(d_i, C_j) is the importance function of the ith central text d_i, computed from the Euclidean distance between d_i and the central vector C'_j of class C_j and the cosine similarity between d_i and C'_j.
A second aspect of the embodiments of the present invention provides a text classification device, including:
the device comprises a dividing unit, a calculating unit and a processing unit, wherein the dividing unit is used for dividing a training text set into a first cluster set, the training text set comprises N text classifications, and N is a positive integer greater than or equal to 2;
a determining unit, configured to determine a central text of each cluster in the first cluster set, to obtain a first text set;
a calculating unit, configured to calculate a similarity between each text in a second text set and each central text in the first text set, where the second text set is a set of other texts in the training text set except the central text in the first text set;
an allocating unit, configured to allocate a first target text in the second text set to a first target cluster to obtain a second cluster set, where the first target text is any one text in the second text set, and the first target cluster is a cluster in the first text set corresponding to a central text with the largest similarity to the first target text;
the determining unit is further configured to determine a central text of each cluster in the second cluster set to obtain a third text set;
the calculating unit is further configured to calculate a similarity between the text to be classified and each central text in the third text set;
the calculating unit is further configured to calculate weight values of the text to be classified and the N text classifications based on the similarity between the text to be classified and each center text in the third text set;
the determining unit is further configured to determine a target text classification as the text classification of the text to be classified, where the target text classification is a text classification with a highest weight value with respect to the text to be classified among the N text classifications.
Optionally, the calculating unit is further configured to calculate a similarity between the text to be classified and a second target text in the target text classification, where the second target text is any one text in the target text classification;
the calculation unit is further configured to determine similarity between the second target text and other texts in a cluster corresponding to the second target text;
the device further comprises: a deletion unit;
the deleting unit is configured to delete the text to be classified from the target text classification when the similarity between the text to be classified and the center text in the target text classification is smaller than the similarity between the second target text and other texts in the cluster corresponding to the second target text.
Optionally, the determining unit is specifically configured to:
calculating the sum of Euclidean distances between each text in a second target cluster and other texts in the second target cluster, wherein the second target cluster is any one cluster in the second cluster set;
and determining the text with the minimum sum of Euclidean distances to the other texts in the second target cluster as the central text of the second target cluster, so as to obtain the third text set.
Optionally, the computing unit is specifically configured to:
calculating the similarity between the text to be classified and each central text in the third text set by the following formula:
Sim(d, d_i) = ( Σ_{j=1..n} X_j · x_ij ) / ( sqrt( Σ_{j=1..n} X_j^2 ) · sqrt( Σ_{j=1..n} x_ij^2 ) )

wherein S is the training text set, N is the number of text classifications in the training text set S, M is the total number of texts in the training text set S, d_i is the ith central text in the third text set, x_ij is the weight of the jth dimension of d_i, n is the number of dimensions of the ith central text, d is the feature vector of the text to be classified, and X_j is the weight of the jth dimension of d.
Optionally, the computing unit is further specifically configured to:
calculating the weight values of the text to be classified and the N text classifications according to the following formula:
W(d, C_j) = Σ_i Sim(d, d_i) · g(d, C_j)

wherein Sim(d, d_i) is the similarity between d and d_i, the sum runs over the central texts d_i nearest to the text d to be classified, and g(d, C_j) is the category attribute function of the jth category in the training text set S:

g(d, C_j) = u(d_i, C_j) if d_i ∈ C_j; g(d, C_j) = 0 otherwise;

wherein u(d_i, C_j) is the importance function of the ith central text d_i, computed from the Euclidean distance between d_i and the central vector C'_j of class C_j and the cosine similarity between d_i and C'_j.
A third aspect of the present invention provides an electronic device, comprising a memory and a processor, wherein the processor is configured to implement the steps of the text classification method according to any one of the above items when executing a computer management class program stored in the memory.
A fourth aspect of the present invention provides a computer-readable storage medium having a computer management-like program stored thereon, characterized in that: the computer management program, when executed by a processor, performs the steps of the method for classifying text as described in any of the above.
In summary, in the embodiments provided by the present invention, the similarity between the text to be classified and the central text of each cluster in the training text set can be calculated, and then the weight value between the text to be classified and each text classification in the training text set is determined according to the similarity, and the text classification with the largest weight value is used as the text classification of the text to be classified, so that the text to be classified can be accurately assigned to the text classification with the highest similarity, and the speed and quality of text classification can be improved.
Drawings
Fig. 1 is a schematic flowchart of a text classification method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of an embodiment of a text classification apparatus according to an embodiment of the present invention;
fig. 3 is a schematic hardware structure diagram of a text classification apparatus according to an embodiment of the present invention;
fig. 4 is a schematic diagram of an embodiment of an electronic device according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an embodiment of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a text classification method and related equipment, which are used for improving the speed and the quality of text classification.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
The following describes a text classification method from the viewpoint of a text classification device, which may be a server or a service unit in the server.
Referring to fig. 1, fig. 1 is a schematic diagram of an embodiment of a text classification method according to an embodiment of the present invention, including:
101. the training text set is divided into a first cluster set.
In this embodiment, the text classification device may divide the training text set into a first cluster set, where the training text set includes N text classifications and N is a positive integer greater than or equal to 2. Specifically, the training text set may, for example, be divided into 2 × N clusters; other numbers of clusters are equally possible, and the division is not specifically limited here.
102. And determining the central text of each cluster in the first cluster set to obtain a first text set.
In this embodiment, after the text classification device divides the training texts into 2 × N clusters, one text may be randomly selected as the central text of each cluster, so as to obtain the first text set.
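As an illustration, a minimal Python sketch of this initialization, assuming every text has already been converted into a numpy feature vector (all function and variable names here are illustrative, not part of the claimed method):

```python
import random
import numpy as np

def init_clusters(vectors, n_classes, seed=0):
    """Steps 101-102 sketch: split the vectorized training texts into
    2*N clusters and pick a random member of each cluster as its
    central text."""
    rng = random.Random(seed)
    k = 2 * n_classes
    shuffled = list(vectors)
    rng.shuffle(shuffled)
    clusters = [shuffled[i::k] for i in range(k)]   # k roughly equal groups
    centers = [rng.choice(cluster) for cluster in clusters]
    return clusters, centers

# Example: 10 random 5-dimensional text vectors, N = 2 classifications.
clusters, centers = init_clusters([np.random.rand(5) for _ in range(10)], n_classes=2)
```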
103. The similarity of each text in the second text set with each center text in the first text set is calculated.
In this embodiment, after determining the first text set, the text classification device may calculate a similarity between each text in a second text set and each central text in the first text set, where the second text set is a set of texts in the training text set except for the central text in the first text set. Specifically, the similarity between each text in the second text set and each center text in the first text set can be calculated by the following formula:
Sim(d, d_i) = ( Σ_{j=1..n} X_j · x_ij ) / ( sqrt( Σ_{j=1..n} X_j^2 ) · sqrt( Σ_{j=1..n} x_ij^2 ) )

wherein S is the training text set, N is the number of text classifications in the training text set S, M is the total number of texts in the training text set S, d_i is the ith central text in the first text set, x_ij is the weight of the jth dimension of d_i, n is the number of dimensions of the central text, d is the feature vector of a text in the second text set, and X_j is the weight of the jth dimension of d.
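Read as written, Sim(d, d_i) is the cosine of the angle between the two weight vectors, which can be computed directly (a minimal sketch; the helper name is illustrative):

```python
import numpy as np

def cosine_sim(d, d_i):
    """Sim(d, d_i): dot product of the two weight vectors divided by
    the product of their Euclidean norms."""
    denom = np.linalg.norm(d) * np.linalg.norm(d_i)
    return float(np.dot(d, d_i) / denom) if denom else 0.0

print(cosine_sim(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # 1.0
```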
104. And allocating the first target texts in the second text set to the target clusters to obtain a second cluster set.
In this embodiment, after calculating the similarity between each text in the second text set and each center text in the first text set, the text classification device may allocate the first target text in the second text set to a target cluster to obtain a second cluster set, where the first target text is any one text in the second text set, and the target cluster is the cluster corresponding to the center text in the first text set with the highest similarity to the first target text. For example, the A text is any one of the texts in the second text set, the training text set includes 4 clusters, B1, B2, B3 and B4, the similarity between the A text and the center text of the B1 cluster is 0.35, the similarity between the A text and the center text of the B2 cluster is 0.49, the similarity between the A text and the center text of the B3 cluster is 0.31, and the similarity between the A text and the center text of the B4 cluster is 0.69; the A text is then allocated to the B4 cluster.
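A minimal sketch of this assignment step (names illustrative; `centers` are the central texts from step 102 and `texts` is the second text set):

```python
import numpy as np

def assign_texts(texts, centers):
    """Step 104 sketch: move every non-center text into the cluster whose
    central text it is most similar to (cosine similarity as above)."""
    def sim(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom else 0.0

    new_clusters = [[c] for c in centers]  # each cluster keeps its center
    for t in texts:
        best = max(range(len(centers)), key=lambda i: sim(t, centers[i]))
        new_clusters[best].append(t)
    return new_clusters
```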
105. And determining the central text of each cluster in the second cluster set to obtain a third text set.
In this embodiment, the text classification device may determine a central text of each cluster in the second cluster set to obtain a third text set. Specifically, the sum of the euclidean distances between each text in the second target cluster and other texts in the second target cluster is calculated, and the second target cluster is any one cluster in the second cluster set;
and determining the text with the minimum sum of Euclidean distances to the other texts in the second target cluster as the central text of the second target cluster, so as to obtain the third text set. That is, a central text is re-determined for each cluster in the second cluster set: for each cluster, the sum of the Euclidean distances between each text and the other texts in that cluster is calculated, and the text with the minimum sum is taken as the new central point of the cluster. Each text in the third text set is therefore a central text reselected for a cluster in the second cluster set.
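A minimal sketch of this medoid selection (function name illustrative):

```python
import numpy as np

def recenter(cluster):
    """Step 105 sketch: the new central text is the cluster member whose
    summed Euclidean distance to all other members is smallest (a medoid)."""
    def summed_distance(i):
        return sum(np.linalg.norm(cluster[i] - other)
                   for j, other in enumerate(cluster) if j != i)
    return cluster[min(range(len(cluster)), key=summed_distance)]
```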
106. And calculating the similarity between the text to be classified and each central text in the third text set.
In this embodiment, after obtaining the third text set, the text classification device may calculate a similarity between the text to be classified and each center text in the third text set, specifically, the similarity is calculated by the following formula:
Sim(d, d_i) = ( Σ_{j=1..n} X_j · x_ij ) / ( sqrt( Σ_{j=1..n} X_j^2 ) · sqrt( Σ_{j=1..n} x_ij^2 ) )

wherein S is the training text set, N is the number of text classifications in the training text set S, M is the total number of texts in the training text set S, d_i is the ith central text in the third text set, x_ij is the weight of the jth dimension of d_i, n is the number of dimensions of the ith central text, d is the feature vector of the text to be classified, and X_j is the weight of the jth dimension of d.
107. And calculating the weight values of the text to be classified and the N text classifications based on the similarity of the text to be classified and each central text in the third text set.
In this embodiment, the text classification device may calculate the weight values of the text to be classified and the N text classifications based on the similarity between the text to be classified and each center text in the third text set:
W(d, C_j) = Σ_i Sim(d, d_i) · g(d, C_j)

wherein W(d, C_j) is the weight value of the text d to be classified with respect to the jth text classification, Sim(d, d_i) is the similarity between d and d_i, the sum runs over the central texts d_i nearest to d, and g(d, C_j) is the category attribute function of the jth text classification in the training text set S:

g(d, C_j) = u(d_i, C_j) if d_i ∈ C_j; g(d, C_j) = 0 otherwise;

wherein u(d_i, C_j) is the importance function of the ith central text d_i, computed from the Euclidean distance between d_i and the central vector C'_j of the jth text classification C_j in the training text set S and the cosine similarity between d_i and C'_j.
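Putting steps 106 to 108 together, the following sketch makes two explicit assumptions: the sum runs over the k most similar central texts (k = 2 in the worked example further below), and the importance function u(d_i, C_j), whose closed form was carried by the original formula image, is passed in as a callable:

```python
import numpy as np

def classify(d, centers, center_labels, u, k=2):
    """Steps 106-108 sketch: score every classification C_j by summing
    Sim(d, d_i) * g(d, C_j) over the k central texts most similar to d,
    with g(d, C_j) = u(d_i, C_j) when d_i belongs to C_j and 0 otherwise,
    then return the classification with the highest weight value."""
    def sim(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom else 0.0

    nearest = sorted(range(len(centers)),
                     key=lambda i: sim(d, centers[i]), reverse=True)[:k]
    weights = {label: 0.0 for label in set(center_labels)}
    for i in nearest:
        label = center_labels[i]   # g is zero for every other classification
        weights[label] += sim(d, centers[i]) * u(centers[i], label)
    return max(weights, key=weights.get)

# Toy usage with a constant importance function (u is an assumption here).
centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
labels = ["A", "B"]
print(classify(np.array([0.9, 0.1]), centers, labels, u=lambda c, l: 1.0))  # "A"
```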
108. And determining the target text classification as the text classification of the text to be classified.
In this embodiment, after the weighted values of the text to be classified and the N text classifications are obtained through calculation, the target text classification may be determined as the text classification of the text to be classified, where the target text classification is the text classification with the highest weighted value with respect to the text to be classified among the N text classifications.
It should be noted that after the text classification of the text to be classified is determined, the similarity between the text to be classified and a second target text in the target text classification can be calculated, and the second target text is any one text in the target text classification;
determining the similarity between the second target text and other texts in the cluster corresponding to the second target text;
and when the similarity between the text to be classified and the central text in the target text classification is smaller than the similarity between the second target text and other texts in the cluster corresponding to the second target text, deleting the text to be classified from the target text classification.
It should be noted that, the similarity between each text in the target text classification and the text to be classified may be calculated by the above formula for calculating the similarity, and is not limited specifically.
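A minimal sketch of this clipping rule (names illustrative; `members` are the texts already in the cluster of the target text classification):

```python
import numpy as np

def passes_clipping(candidate, center, members):
    """Keep the newly classified text only if its similarity to the
    cluster's central text is not smaller than the similarity of the
    texts already in the cluster to that same central text."""
    def sim(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom else 0.0

    threshold = min(sim(m, center) for m in members)
    return sim(candidate, center) >= threshold
```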
For ease of understanding, the following detailed description is made with reference to examples:
suppose that the training text set S has A, B two types of corpus text classification, wherein the A type corpus has a1 text and a2 text; the B-type corpus includes B1 text, B2 text and B3 text.
The set C of texts to be classified comprises the c1 text and the c2 text.
All texts are vectorized through word2vec to obtain their feature vectors. How the classification proceeds is explained below:
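For example, the vectorization could be done with gensim's word2vec implementation (an assumed tooling choice; the patent only names word2vec), representing each text as the mean of its word vectors:

```python
import numpy as np
from gensim.models import Word2Vec

corpus = [["hello", "world"], ["good", "morning", "world"]]  # toy tokenized texts
model = Word2Vec(corpus, vector_size=50, min_count=1, seed=0)

def vectorize(tokens):
    """Represent a text as the mean of its word vectors."""
    return np.mean([model.wv[w] for w in tokens if w in model.wv], axis=0)

a1_vec = vectorize(corpus[0])
```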
1. First, divide the A and B corpus texts into 4 clusters.
2. Randomly select a point as the center point of each cluster, O1, O2, O3 and O4, corresponding to the texts a1, a2, b1 and b3.
3. Calculate the similarity between the remaining non-center texts in the training text set S and the 4 center texts. The only remaining non-center text in S is b2, and its similarity to each center point is:
Sim(a1,b2)=0.32;
Sim(a2,b2)=0.21;
Sim(b1,b2)=0.78;
Sim(b3,b2)=0.65。
The text b2 is therefore assigned to the cluster corresponding to b1.
4. Because the center point of each cluster selected before is randomly selected, a center point needs to be found out again for each cluster, and a point with the minimum sum of Euclidean distances between one point and other points in the cluster is taken as a new center point, then:
The center point of the a1 cluster is a1;
the center point of the a2 cluster is a2;
the center point of the b3 cluster is b3.
the following description will take the selection process of the center point of the b1 cluster as an example:
When b2 is taken as the central point, sum(d(b1, b2) + d(b3, b2)) = 0.79;
when b3 is taken as the central point, sum(d(b1, b3) + d(b2, b3)) = 1.02;
when b1 is taken as the central point, sum(d(b2, b1) + d(b3, b1)) = 1.31.
The final center point of the b1 cluster is therefore b2.
5. According to the above formula, calculate the similarity between each of c1 and c2 in the set C of texts to be classified and the center point of each cluster:
sim(c1, a1) = 0.65, sim(c1, a2) = 0.41, sim(c1, b3) = 0.53, sim(c1, b2) = 0.35;
sim(c2, a1) = 0.24, sim(c2, a2) = 0.31, sim(c2, b3) = 0.56, sim(c2, b2) = 0.85.
Then calculate the weights of the texts to be classified with respect to categories A and B in the training text set:
the nearest 2 points of c1 are a1 and b3, thus:
W(c1,A)=sim(c1,a1)g(c1,A)+sim(c1,b3)g(c1,A);
W(c1,B)=sim(c1,a1)g(c1,B)+sim(c1,b3)g(c1,B);
The central vectors of classes A and B are A' and B' respectively, thus:
sim(c1, a1)g(c1, A) = 0.65 * u(a1, A') = 0.65 * 0.98 = 0.637;
sim(c1, b3)g(c1, A) = 0.53 * u(b3, A') = 0.53 * 0 = 0;
thus W(c1, A) = 0.637;
sim(c1, a1)g(c1, B) = 0.65 * u(a1, B') = 0.65 * 0 = 0;
sim(c1, b3)g(c1, B) = 0.53 * u(b3, B') = 0.53 * 0.78 = 0.4134;
thus W(c1, B) = 0.4134, and W(c1, A) > W(c1, B).
finally, the c1 sample was assigned to class A. The same may result in c2 belonging to class B.
In addition, after the texts to be classified have been classified, text clipping can be performed so that each classification in the training text set stays as precise as possible. That is, suppose b1 and b2 are points already in a cluster of class B and c2 is the point to be newly added: if the similarity between c2 and the center point of that cluster is smaller than the similarities between b1, b2 and the same center point, c2 is not added to class B.
It should be noted that, the formula for calculating the similarity and the weight values of the text to be classified and each category in the foregoing examples may refer to the formula in the foregoing step, and the foregoing step has already been described in detail, and is not repeated herein.
It should be noted that the numerical values given above are merely examples and do not represent limitations of the respective values.
In summary, it can be seen that the technical method provided in the embodiment of the present invention calculates the similarity between the text to be classified and the central text of each cluster in the training text set, determines the weight value between the text to be classified and each text classification in the training text set according to that similarity, and takes the text classification with the largest weight value as the text classification of the text to be classified, so that the text to be classified is accurately assigned to the most similar classification and the speed and quality of text classification are improved.
The text classification method in the embodiment of the present invention is described above, and a text classification device in the embodiment of the present invention is described below.
Referring to fig. 2, an embodiment of a text classification device according to an embodiment of the present invention is applied to a live broadcast platform, and the text classification device includes:
a dividing unit 201, configured to divide a training text set into a first cluster set, where the training text set includes N text classifications, where N is a positive integer greater than or equal to 2;
a determining unit 202, configured to determine a central text of each cluster in the first cluster set, so as to obtain a first text set;
a calculating unit 203, configured to calculate a similarity between each text in a second text set and each central text in the first text set, where the second text set is a set of texts in the training text set except for the central text in the first text set;
an allocating unit 204, configured to allocate a first target text in the second text set to a first target cluster to obtain a second cluster set, where the first target text is any one text in the second text set, and the first target cluster is a cluster in the first text set corresponding to a central text with the largest similarity to the first target text;
the determining unit 202 is further configured to determine a central text of each cluster in the second cluster set to obtain a third text set;
the calculating unit 203 is further configured to calculate a similarity between the text to be classified and each central text in the third text set;
the calculating unit 203 is further configured to calculate weight values of the text to be classified and the N text classifications based on the similarity between the text to be classified and each central text in the third text set;
the determining unit 202 is further configured to determine a target text classification as the text classification of the text to be classified, where the target text classification is a text classification with a highest weight value with respect to the text to be classified among the N text classifications.
Optionally, the calculating unit 203 is further configured to calculate a similarity between the text to be classified and a second target text in the target text classification, where the second target text is any one text in the target text classification;
the calculating unit 203 is further configured to determine similarity between the second target text and other texts in a cluster corresponding to the second target text;
the device further comprises: a deletion unit 205;
the deleting unit 205 is configured to delete the text to be classified from the target text classification when the similarity between the text to be classified and the center text in the target text classification is smaller than the similarity between the second target text and other texts in the cluster corresponding to the second target text.
Optionally, the determining unit 202 is specifically configured to:
calculating the sum of Euclidean distances between each text in a second target cluster and other texts in the second target cluster, wherein the second target cluster is any one cluster in the second cluster set;
and determining the text with the minimum sum of Euclidean distances to the other texts in the second target cluster as the central text of the second target cluster, so as to obtain the third text set.
Optionally, the calculating unit 203 is specifically configured to:
calculating the similarity between the text to be classified and each central text in the third text set by the following formula:
Sim(d, d_i) = ( Σ_{j=1..n} X_j · x_ij ) / ( sqrt( Σ_{j=1..n} X_j^2 ) · sqrt( Σ_{j=1..n} x_ij^2 ) )

wherein S is the training text set, N is the number of text classifications in the training text set S, M is the total number of texts in the training text set S, d_i is the ith central text in the third text set, x_ij is the weight of the jth dimension of d_i, n is the number of dimensions of the ith central text, d is the feature vector of the text to be classified, and X_j is the weight of the jth dimension of d.
Optionally, the calculating unit 203 is further specifically configured to:
calculating the weight values of the text to be classified and the N text classifications according to the following formula:
W(d, C_j) = Σ_i Sim(d, d_i) · g(d, C_j)

wherein Sim(d, d_i) is the similarity between d and d_i, the sum runs over the central texts d_i nearest to the text d to be classified, and g(d, C_j) is the category attribute function of the jth category in the training text set S:

g(d, C_j) = u(d_i, C_j) if d_i ∈ C_j; g(d, C_j) = 0 otherwise;

wherein u(d_i, C_j) is the importance function of the ith central text d_i, computed from the Euclidean distance between d_i and the central vector C'_j of class C_j and the cosine similarity between d_i and C'_j.
Fig. 2 above describes the text classification apparatus in the embodiment of the present invention from the perspective of a modular functional entity, and the following describes the text classification apparatus in the embodiment of the present invention in detail from the perspective of hardware processing, referring to fig. 3, an embodiment of a text classification apparatus 300 in the embodiment of the present invention includes:
an input device 301, an output device 302, a processor 303 and a memory 304 (where the number of processors 303 may be one or more; one processor 303 is taken as an example in fig. 3). In some embodiments of the present invention, the input device 301, the output device 302, the processor 303 and the memory 304 may be connected by a bus or other means; connection by a bus is taken as an example in fig. 3.
Wherein, by calling the operation instruction stored in the memory 304, the processor 303 is configured to perform the following steps:
dividing a training text set into a first cluster set, wherein the training text set comprises N text classifications, and N is a positive integer greater than or equal to 2;
determining a central text of each cluster in the first cluster set to obtain a first text set;
calculating the similarity of each text in a second text set and each central text in the first text set, wherein the second text set is a set of other texts in the training text set except the central text in the first text set;
allocating a first target text in the second text set to a first target cluster to obtain a second cluster set, where the first target text is any one text in the second text set, and the first target cluster is a cluster corresponding to a central text in the first text set, where the central text has the greatest similarity to the first target text;
determining the central text of each cluster in the second cluster set to obtain a third text set;
calculating the similarity between the text to be classified and each central text in the third text set;
calculating the weight values of the text to be classified and the N text classifications based on the similarity of the text to be classified and each central text in the third text set;
and determining a target text classification as the text classification of the text to be classified, wherein the target text classification is the text classification with the highest weight value with the text to be classified in the N text classifications.
The processor 303 is also configured to perform any of the methods in the corresponding embodiments of fig. 1 by calling the operation instructions stored in the memory 304.
Referring to fig. 4, fig. 4 is a schematic view of an embodiment of an electronic device according to an embodiment of the invention.
As shown in fig. 4, an embodiment of the present invention provides an electronic device, which includes a memory 410, a processor 420, and a computer program 411 stored in the memory 410 and runnable on the processor 420; when the processor 420 executes the computer program 411, the following steps are implemented:
dividing a training text set into a first cluster set, wherein the training text set comprises N text classifications, and N is a positive integer greater than or equal to 2;
determining a central text of each cluster in the first cluster set to obtain a first text set;
calculating the similarity of each text in a second text set and each central text in the first text set, wherein the second text set is a set of other texts in the training text set except the central text in the first text set;
allocating a first target text in the second text set to a first target cluster to obtain a second cluster set, where the first target text is any one text in the second text set, and the first target cluster is a cluster corresponding to a central text in the first text set, where the central text has the greatest similarity to the first target text;
determining the central text of each cluster in the second cluster set to obtain a third text set;
calculating the similarity between the text to be classified and each central text in the third text set;
calculating the weight values of the text to be classified and the N text classifications based on the similarity of the text to be classified and each central text in the third text set;
and determining a target text classification as the text classification of the text to be classified, wherein the target text classification is the text classification with the highest weight value with the text to be classified in the N text classifications.
In a specific implementation, when the processor 420 executes the computer program 411, any of the embodiments corresponding to fig. 1 may be implemented.
Since the electronic device described in this embodiment is a device used for implementing a text classification apparatus in the embodiment of the present invention, based on the method described in the embodiment of the present invention, those skilled in the art can understand the specific implementation manner of the electronic device of this embodiment and various variations thereof, so that how to implement the method in the embodiment of the present invention by the electronic device is not described in detail herein, and as long as the device used for implementing the method in the embodiment of the present invention by the person skilled in the art belongs to the intended scope of the present invention.
Referring to fig. 5, fig. 5 is a schematic diagram illustrating an embodiment of a computer-readable storage medium according to the present invention.
As shown in fig. 5, the present embodiment provides a computer-readable storage medium 500 having a computer program 511 stored thereon, the computer program 511 implementing the following steps when executed by a processor:
dividing a training text set into a first cluster set, wherein the training text set comprises N text classifications, and N is a positive integer greater than or equal to 2;
determining a central text of each cluster in the first cluster set to obtain a first text set;
calculating the similarity of each text in a second text set and each central text in the first text set, wherein the second text set is a set of other texts in the training text set except the central text in the first text set;
allocating a first target text in the second text set to a first target cluster to obtain a second cluster set, where the first target text is any one text in the second text set, and the first target cluster is a cluster corresponding to a central text in the first text set, where the central text has the greatest similarity to the first target text;
determining the central text of each cluster in the second cluster set to obtain a third text set;
calculating the similarity between the text to be classified and each central text in the third text set;
calculating the weight values of the text to be classified and the N text classifications based on the similarity of the text to be classified and each central text in the third text set;
and determining a target text classification as the text classification of the text to be classified, wherein the target text classification is the text classification with the highest weight value with the text to be classified in the N text classifications.
In a specific implementation, the computer program 511 may implement any of the embodiments corresponding to fig. 1 when executed by a processor.
It should be noted that, in the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to relevant descriptions of other embodiments for parts that are not described in detail in a certain embodiment.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Embodiments of the present invention further provide a computer program product, where the computer program product includes computer software instructions, and when the computer software instructions are executed on a processing device, the processing device executes the flow of the text classification method in the embodiment corresponding to fig. 1.
The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that a computer can store or a data storage device, such as a server, a data center, etc., that is integrated with one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for classifying text, comprising:
dividing a training text set into a first cluster set, wherein the training text set comprises N text classifications, and N is a positive integer greater than or equal to 2;
determining a central text of each cluster in the first cluster set to obtain a first text set;
calculating the similarity of each text in a second text set and each central text in the first text set, wherein the second text set is a set of other texts in the training text set except the central text in the first text set;
allocating a first target text in the second text set to a first target cluster to obtain a second cluster set, where the first target text is any one text in the second text set, and the first target cluster is a cluster corresponding to a central text in the first text set, where the central text has the greatest similarity to the first target text;
determining the central text of each cluster in the second cluster set to obtain a third text set;
calculating the similarity between the text to be classified and each central text in the third text set;
calculating the weight values of the text to be classified and the N text classifications based on the similarity of the text to be classified and each central text in the third text set;
and determining a target text classification as the text classification of the text to be classified, wherein the target text classification is the text classification with the highest weight value with the text to be classified in the N text classifications.
2. The method of claim 1, wherein after determining the target text classification as the text classification of the text to be classified, the method further comprises:
calculating the similarity between the text to be classified and a second target text in the target text classification, wherein the second target text is any one text in the target text classification;
determining similarity between the second target text and other texts in a cluster corresponding to the second target text;
and when the similarity between the text to be classified and the central text in the target text classification is smaller than the similarity between the second target text and other texts in the cluster corresponding to the second target text, deleting the text to be classified from the target text classification.
3. The method according to any one of claims 1 or 2, wherein the determining the center text of each cluster in the second set of clusters, resulting in a third set of texts comprises:
calculating the sum of Euclidean distances between each text in a second target cluster and other texts in the second target cluster, wherein the second target cluster is any one cluster in the second cluster set;
and determining the text with the minimum sum of Euclidean distances to the other texts in the second target cluster as the central text of the second target cluster, so as to obtain the third text set.
4. The method of claim 1, wherein calculating the similarity between the text to be classified and each center text in the third text set comprises:
calculating the similarity between the text to be classified and each central text in the third text set by the following formula:
Sim(d, d_i) = ( Σ_{j=1..n} X_j · x_ij ) / ( sqrt( Σ_{j=1..n} X_j^2 ) · sqrt( Σ_{j=1..n} x_ij^2 ) )

wherein S is the training text set, N is the number of text classifications in the training text set S, M is the total number of texts in the training text set S, d_i is the ith central text in the third text set, x_ij is the weight of the jth dimension of d_i, n is the number of dimensions of the ith central text, d is the feature vector of the text to be classified, and X_j is the weight of the jth dimension of d.
5. The method according to claim 1, wherein the calculating the weight values of the text to be classified and the N text classifications based on the similarity of the text to be classified and each center text in the third text set comprises:
calculating the weight values of the text to be classified and the N text classifications according to the following formula:
W(d, C_j) = Σ_i Sim(d, d_i) · g(d, C_j)

wherein Sim(d, d_i) is the similarity between d and d_i, the sum runs over the central texts d_i nearest to the text d to be classified, and g(d, C_j) is the category attribute function of the jth category in the training text set S:

g(d, C_j) = u(d_i, C_j) if d_i ∈ C_j; g(d, C_j) = 0 otherwise;

wherein u(d_i, C_j) is the importance function of the ith central text d_i, computed from the Euclidean distance between d_i and the central vector C'_j of class C_j and the cosine similarity between d_i and C'_j.
6. An apparatus for classifying text, comprising:
the device comprises a dividing unit, a calculating unit and a processing unit, wherein the dividing unit is used for dividing a training text set into a first cluster set, the training text set comprises N text classifications, and N is a positive integer greater than or equal to 2;
a determining unit, configured to determine a central text of each cluster in the first cluster set, to obtain a first text set;
a calculating unit, configured to calculate a similarity between each text in a second text set and each central text in the first text set, where the second text set is a set of other texts in the training text set except the central text in the first text set;
an allocating unit, configured to allocate a first target text in the second text set to a first target cluster to obtain a second cluster set, where the first target text is any one text in the second text set, and the first target cluster is a cluster in the first text set corresponding to a central text with the largest similarity to the first target text;
the determining unit is further configured to determine a central text of each cluster in the second cluster set to obtain a third text set;
the calculating unit is further configured to calculate a similarity between the text to be classified and each central text in the third text set;
the calculating unit is further configured to calculate weight values of the text to be classified and the N text classifications based on the similarity between the text to be classified and each center text in the third text set;
the determining unit is further configured to determine a target text classification as the text classification of the text to be classified, where the target text classification is a text classification with a highest weight value with respect to the text to be classified among the N text classifications.
7. The apparatus according to claim 6, wherein the calculating unit is further configured to calculate a similarity between the text to be classified and a second target text in the target text classification, where the second target text is any one text in the target text classification;
the calculating unit is further configured to determine the similarity between the second target text and other texts in a cluster corresponding to the second target text;
the device further comprises: a deletion unit;
the deleting unit is configured to delete the text to be classified from the target text classification when the similarity between the text to be classified and the center text in the target text classification is smaller than the similarity between the second target text and other texts in the cluster corresponding to the second target text.
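The deletion rule of claim 7, sketched below; the claim does not state how the similarities between the second target text and its cluster neighbours are aggregated, so taking their minimum is an assumption:

```python
def should_delete(sim_to_center, neighbour_sims):
    """Deletion rule (sketch): remove the newly classified text from the
    target text classification when its similarity to the center text is
    smaller than the similarity the second target text has to the other
    texts in its cluster."""
    return sim_to_center < min(neighbour_sims)
```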
8. The apparatus according to claim 6 or 7, wherein the determining unit is specifically configured to:
calculating the sum of Euclidean distances between each text in a second target cluster and other texts in the second target cluster, wherein the second target cluster is any one cluster in the second cluster set;
and determining, as the central text of the second target cluster, the text whose sum of Euclidean distances to the other texts in the second target cluster is the smallest, so as to obtain the third text set.
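A minimal sketch of the medoid selection recited in claim 8, assuming each text is a NumPy feature vector; `central_text` is an illustrative name:

```python
import numpy as np

def central_text(cluster):
    """Claim 8 (sketch): the central text of a cluster is the member whose
    sum of Euclidean distances to the other members is minimal (a medoid).
    The zero self-distance does not affect the argmin."""
    sums = [sum(np.linalg.norm(a - b) for b in cluster) for a in cluster]
    return cluster[int(np.argmin(sums))]
```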
9. An electronic device comprising a memory and a processor, wherein the processor is configured to implement the steps of the text classification method according to any one of claims 1 to 5 when executing a computer program stored in the memory.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the text classification method according to any one of claims 1 to 5.
CN201811142322.4A 2018-09-28 2018-09-28 Text classification method and related equipment Pending CN110969172A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811142322.4A CN110969172A (en) 2018-09-28 2018-09-28 Text classification method and related equipment

Publications (1)

Publication Number Publication Date
CN110969172A (en) 2020-04-07

Family

ID=70027006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811142322.4A Pending CN110969172A (en) 2018-09-28 2018-09-28 Text classification method and related equipment

Country Status (1)

Country Link
CN (1) CN110969172A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7289911B1 (en) * 2000-08-23 2007-10-30 David Roth Rigney System, methods, and computer program product for analyzing microarray data
CN103345528A (en) * 2013-07-24 2013-10-09 南京邮电大学 Text classification method based on correlation analysis and KNN
CN106557485A (en) * 2015-09-25 2017-04-05 北京国双科技有限公司 A kind of method and device for choosing text classification training set
CN105426426A (en) * 2015-11-04 2016-03-23 北京工业大学 KNN text classification method based on improved K-Medoids
CN106021578A (en) * 2016-06-01 2016-10-12 南京邮电大学 Improved text classification algorithm based on integration of cluster and membership degree
CN107609121A (en) * 2017-09-14 2018-01-19 深圳市玛腾科技有限公司 Newsletter archive sorting technique based on LDA and word2vec algorithms
CN108363810A (en) * 2018-03-09 2018-08-03 南京工业大学 Text classification method and device

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111984789A (en) * 2020-08-26 2020-11-24 普信恒业科技发展(北京)有限公司 Corpus classification method and device and server
CN111984789B (en) * 2020-08-26 2024-01-30 普信恒业科技发展(北京)有限公司 Corpus classification method, corpus classification device and server
CN112182206A (en) * 2020-09-01 2021-01-05 中国联合网络通信集团有限公司 Text clustering method and device
CN112182206B (en) * 2020-09-01 2023-06-09 中国联合网络通信集团有限公司 Text clustering method and device
CN112836043A (en) * 2020-10-13 2021-05-25 讯飞智元信息科技有限公司 Long text clustering method and device based on pre-training language model
CN112988954A (en) * 2021-05-17 2021-06-18 腾讯科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium
CN113553430A (en) * 2021-07-20 2021-10-26 中国工商银行股份有限公司 Data classification method, device and equipment
CN113849653A (en) * 2021-10-14 2021-12-28 鼎富智能科技有限公司 Text classification method and device

Similar Documents

Publication Publication Date Title
CN110969172A (en) Text classification method and related equipment
CN108804641B (en) Text similarity calculation method, device, equipment and storage medium
CN104574192B (en) Method and device for identifying same user in multiple social networks
CN108021708B (en) Content recommendation method and device and computer readable storage medium
CN109918498B (en) Problem warehousing method and device
CN111260220B (en) Group control equipment identification method and device, electronic equipment and storage medium
CN113255370A (en) Industry type recommendation method, device, equipment and medium based on semantic similarity
CN109885831B (en) Keyword extraction method, device, equipment and computer readable storage medium
WO2014177050A1 (en) Method and device for aggregating documents
CN110457704A (en) Determination method, apparatus, storage medium and the electronic device of aiming field
CN110968802B (en) Analysis method and analysis device for user characteristics and readable storage medium
CN115456043A (en) Classification model processing method, intent recognition method, device and computer equipment
CN109376362A (en) A kind of the determination method and relevant device of corrected text
CN110309293A (en) Text recommended method and device
CN113239150B (en) Text matching method, system and equipment
CN110019400B (en) Data storage method, electronic device and storage medium
CN115374775A (en) Method, device and equipment for determining text similarity and storage medium
CN108073567B (en) Feature word extraction processing method, system and server
CN114281983A (en) Text classification method and system of hierarchical structure, electronic device and storage medium
CN111931035B (en) Service recommendation method, device and equipment
CN110705889A (en) Enterprise screening method, device, equipment and storage medium
CN111881293A (en) Risk content identification method and device, server and storage medium
CN114860667B (en) File classification method, device, electronic equipment and computer readable storage medium
CN109871540A (en) A kind of calculation method and relevant device of text similarity
CN107402984B (en) A kind of classification method and device based on theme

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned (effective date of abandoning: 20240927)