CN114706985A - Text classification method and device, electronic equipment and storage medium - Google Patents

Text classification method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN114706985A
CN114706985A (application CN202210432887.6A)
Authority
CN
China
Prior art keywords
text
target
sample
field
classified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210432887.6A
Other languages
Chinese (zh)
Inventor
刘羲
舒畅
陈又新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210432887.6A priority Critical patent/CN114706985A/en
Publication of CN114706985A publication Critical patent/CN114706985A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of artificial intelligence and discloses a text classification method, which comprises the following steps: splitting the sample set corresponding to each field into a plurality of positive sample pairs and a plurality of negative sample pairs; determining a target loss function according to the similarity difference between the positive sample pairs and the negative sample pairs, and training a text classification model corresponding to each field by minimizing the target loss function; identifying the target field corresponding to a text to be classified, judging whether the sample set corresponding to the target field is evenly distributed across text categories, and if not, calculating a weight for each text category of the target field; inputting the text to be classified into the text classification model corresponding to the target field to obtain a predicted value for each text category of the target field, and determining the target text category of the text to be classified based on the predicted values and the weights. The invention also provides a text classification device, an electronic device and a storage medium, and improves the accuracy of text classification.

Description

Text classification method and device, electronic equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a text classification method and device, electronic equipment and a storage medium.
Background
Text classification is an important topic in natural language processing and is widely applied in scenarios such as search engines, intelligent question answering, knowledge retrieval and dialogue systems.
At present, the mainstream approach is to classify texts with a text classification model, which is usually trained as follows: labelled samples are fed into the model, and the model is optimized by minimizing the difference between the labels and the predicted values. However, this approach does not take the correlation between samples into account, which limits the classification accuracy of the model. A text classification method that improves accuracy is therefore needed.
Disclosure of Invention
In view of the above, there is a need to provide a text classification method, apparatus, electronic device and storage medium, aiming to improve the accuracy of text classification.
The text classification method provided by the invention comprises the following steps:
acquiring a sample set which carries text category information and corresponds to each field, and splitting each sample set into a plurality of positive sample pairs and a plurality of negative sample pairs according to the text category information;
determining a target loss function according to the similarity difference between the positive sample pair and the negative sample pair, and training an initial classification model corresponding to each field by minimizing the target loss function to obtain a text classification model corresponding to each field;
receiving a text to be classified, identifying a target field corresponding to the text to be classified, judging whether the distribution of a sample set corresponding to the target field on text categories is balanced or not, and if not, calculating the weight corresponding to each text category of the target field;
inputting the text to be classified into a text classification model corresponding to the target field to obtain a predicted value of each text category of the text to be classified in the target field, determining a target value of each text category of the text to be classified in the target field based on the predicted value and the weight, and determining a target text category corresponding to the text to be classified based on the target value.
Optionally, the determining a target loss function according to the similarity difference between the positive sample pair and the negative sample pair includes:
selecting a sample set of a field, inputting the selected sample set into a coding network of an initial classification model to execute coding processing, and obtaining a coding vector of each sample in the selected sample set;
calculating a first similarity value of each positive sample pair and a second similarity value of each negative sample pair corresponding to the selected sample set based on the coding vector;
determining a first loss function corresponding to the selected sample set based on a difference between the first similarity value and a second similarity value;
inputting the coding vector into a classification network of the initial classification model to perform classification processing, obtaining a probability value of each text category of each sample in the selected sample set in the corresponding field, and determining a second loss function corresponding to the selected sample set based on the probability value;
and determining a target loss function corresponding to the selected sample set based on the first loss function and the second loss function.
Optionally, the identifying a target field corresponding to the text to be classified includes:
performing word segmentation processing on the text to be classified to obtain a word set;
matching each word in the word set with a word library corresponding to each field respectively to obtain a matched word set corresponding to each field;
and taking the field with the largest number of matched words in its matched word set as the target field corresponding to the text to be classified.
Optionally, the splitting each sample set into a plurality of positive sample pairs and a plurality of negative sample pairs according to the text category information includes:
combining each sample in each sample set with samples of the same text type in pairs to obtain a plurality of positive sample pairs corresponding to each sample set;
and combining each sample in each sample set with samples of different text types in pairs to obtain a plurality of negative sample pairs corresponding to each sample set.
Optionally, the determining whether the distribution of the sample set corresponding to the target field in the text category is balanced includes:
counting the number of samples of the sample set corresponding to the target field in each text category;
calculating a sample average value corresponding to the target field based on the number of samples;
calculating the variance of a sample set corresponding to the target field based on the number of samples and the average value of the samples;
and determining whether the distribution of the sample set corresponding to the target field on the text category is balanced or not according to the variance.
Optionally, after the determining whether the distribution of the sample set corresponding to the target field in the text category is balanced, the method further includes:
if the distribution is judged to be balanced, inputting the text to be classified into the text classification model corresponding to the target field to obtain a predicted value of the text to be classified in each text category, and determining the target text category corresponding to the text to be classified based on the predicted value.
Optionally, the first loss function is:
L = log(1 + Σ_{i=1}^{m} Σ_{j=1}^{n} exp(S_i - S_j))
wherein L is the first loss function corresponding to the selected sample set, S_i is the second similarity value of the i-th negative sample pair corresponding to the selected sample set, S_j is the first similarity value of the j-th positive sample pair corresponding to the selected sample set, m is the total number of negative sample pairs in the selected sample set, and n is the total number of positive sample pairs in the selected sample set.
In order to solve the above problem, the present invention further provides a text classification apparatus, including:
the splitting module is used for acquiring a sample set which carries text category information and corresponds to each field, and splitting each sample set into a plurality of positive sample pairs and a plurality of negative sample pairs according to the text category information;
the training module is used for determining a target loss function according to the similarity difference between the positive sample pair and the negative sample pair, and training an initial classification model corresponding to each field by minimizing the target loss function to obtain a text classification model corresponding to each field;
the judging module is used for receiving a text to be classified, identifying a target field corresponding to the text to be classified, judging whether the distribution of a sample set corresponding to the target field on the text category is balanced or not, and if not, calculating the weight corresponding to each text category of the target field;
the classification module is used for inputting the text to be classified into the text classification model corresponding to the target field to obtain a predicted value of each text category of the text to be classified in the target field, determining a target value of each text category of the text to be classified in the target field based on the predicted value and the weight, and determining a target text category corresponding to the text to be classified based on the target value.
In order to solve the above problem, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a text classification program executable by the at least one processor to enable the at least one processor to perform the text classification method described above.
In order to solve the above problem, the present invention also provides a computer-readable storage medium having a text classification program stored thereon, the text classification program being executable by one or more processors to implement the above text classification method.
Compared with the prior art, the invention first splits the sample set corresponding to each field into a plurality of positive sample pairs and a plurality of negative sample pairs; secondly, it determines a target loss function according to the similarity difference between the positive sample pairs and the negative sample pairs, and trains the text classification model corresponding to each field by minimizing the target loss function; then, it identifies the target field corresponding to the text to be classified, judges whether the sample set corresponding to the target field is evenly distributed across text categories, and if not, calculates a weight for each text category of the target field; finally, it inputs the text to be classified into the text classification model corresponding to the target field and determines the target text category of the text to be classified based on the predicted values output by the text classification model and the weights. Because the target loss function is determined according to the similarity difference between the positive sample pairs and the negative sample pairs, the model can learn the similarity within positive sample pairs and the dissimilarity within negative sample pairs, the interaction between samples is deepened, and the classification accuracy of the model is improved. The invention therefore improves the accuracy of text classification.
Drawings
Fig. 1 is a schematic flowchart of a text classification method according to an embodiment of the present invention;
fig. 2 is a schematic block diagram of a text classification apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device implementing a text classification method according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The embodiments of the present application can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the descriptions involving "first", "second", etc. in the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with one another, provided that such a combination can be realized by a person skilled in the art; when technical solutions are contradictory or a combination cannot be realized, the combination should be considered not to exist and falls outside the protection scope of the present invention.
The invention provides a text classification method. Fig. 1 is a schematic flow chart of a text classification method according to an embodiment of the present invention. The method may be performed by an electronic device, which may be implemented by software and/or hardware.
In this embodiment, the text classification method includes the following steps S1 to S4:
s1, obtaining a sample set carrying text category information corresponding to each field, and splitting each sample set into a plurality of positive sample pairs and a plurality of negative sample pairs according to the text category information.
In this embodiment, a corresponding sample set is collected in advance for each field, where the fields may include medical treatment, sports, finance, education, travel and the like, and the samples in each sample set carry text category labels; for example, the text categories in the medical field may include internal medicine, surgery, emergency department, fitness department, brain department and the like.
The splitting each sample set into a plurality of positive sample pairs and a plurality of negative sample pairs according to the text category information comprises the following steps A11-A12:
a11, combining each sample in each sample set with samples of the same text type in pairs to obtain a plurality of positive sample pairs corresponding to each sample set;
for example, in the medical field, if there are 500, 1000, 1200, 1500 and 1600 samples in the text categories of medical department, surgical department, emergency department, fitness department and brain department, respectively, the number of positive sample pairs is increased
Figure BDA0003608807660000051
And A12, combining each sample in each sample set with samples of different text types in pairs respectively to obtain a plurality of negative sample pairs corresponding to each sample set.
For example, sample 1 of the internal medicine text category in the medical field can be combined with any one of the 5300 samples (5300 = 1000 + 1200 + 1500 + 1600) belonging to the surgery, emergency department, fitness department and brain department text categories to form a negative sample pair, so the number of negative sample pairs corresponding to sample 1 is 5300. The total number of negative sample pairs corresponding to the medical field is the number of cross-category pairs:
500×5300 + 1000×(1200+1500+1600) + 1200×(1500+1600) + 1500×1600 = 2,650,000 + 4,300,000 + 3,720,000 + 2,400,000 = 13,070,000.
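By way of illustration only, the splitting of one field's sample set into positive and negative sample pairs could be sketched in Python as follows (a minimal sketch, not the implementation of the embodiment):

    from itertools import combinations
    from collections import defaultdict

    def split_pairs(samples):
        # samples: list of (text, category) tuples belonging to one field.
        by_category = defaultdict(list)
        for text, category in samples:
            by_category[category].append(text)

        # Positive pairs: two samples sharing the same text category.
        positive = [pair
                    for texts in by_category.values()
                    for pair in combinations(texts, 2)]

        # Negative pairs: two samples drawn from different text categories.
        categories = list(by_category)
        negative = [(a, b)
                    for i, c1 in enumerate(categories)
                    for c2 in categories[i + 1:]
                    for a in by_category[c1]
                    for b in by_category[c2]]
        return positive, negative

Applied to the example above, this reproduces the 3,747,100 positive pairs and 13,070,000 negative pairs counted for the medical field.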
S2, determining a target loss function according to the similarity difference between the positive sample pair and the negative sample pair, and training an initial classification model corresponding to each field by minimizing the target loss function to obtain a text classification model corresponding to each field.
In this embodiment, a first loss function is constructed with the objective of enlarging the gap between the similarity of the positive sample pairs and that of the negative sample pairs, and a second loss function is constructed with the objective of reducing the difference between the predicted value and the label value of each sample; a target loss function is then determined based on the first and second loss functions, and the model is trained by minimizing the target loss function. In this way the model learns the similarity within positive sample pairs and the dissimilarity within negative sample pairs, the interaction between samples is deepened, and the predicted values are drawn closer to the label values, so that the trained text classification model is more accurate.
The determining a target loss function according to the similarity difference between the positive sample pair and the negative sample pair comprises the following steps B11-B15:
b11, selecting a sample set of a field, inputting the selected sample set into a coding network of an initial classification model to execute coding processing, and obtaining a coding vector of each sample in the selected sample set;
in this embodiment, the initial classification model may be a transform model, and the initial classification model includes a coding network and a classification network, where the coding network is configured to code an input text (including character coding, position coding, and semantic coding) to obtain a coding vector of the input text.
B12, calculating a first similarity value of each positive sample pair and a second similarity value of each negative sample pair corresponding to the selected sample set based on the coding vector;
in this embodiment, a cosine similarity, a manhattan distance, an euclidean distance, and a explicit distance algorithm may be used to calculate a first similarity value between two samples in the positive sample pair and a second similarity value between two samples in the negative sample pair.
B13, determining a first loss function corresponding to the selected sample set based on the difference value of the first similarity value and the second similarity value;
the first loss function is:
L = log(1 + Σ_{i=1}^{m} Σ_{j=1}^{n} exp(S_i - S_j))
wherein L is the first loss function corresponding to the selected sample set, S_i is the second similarity value of the i-th negative sample pair corresponding to the selected sample set, S_j is the first similarity value of the j-th positive sample pair corresponding to the selected sample set, m is the total number of negative sample pairs in the selected sample set, and n is the total number of positive sample pairs in the selected sample set.
The outermost log(1 + x) in the first loss function is a smooth approximation of the max function; as L decreases, S_i decreases and S_j increases, so that the second similarity value of any negative sample pair in the sample set becomes smaller than the first similarity value of any positive sample pair, that is, the gap between the similarity values of the negative sample pairs and those of the positive sample pairs is enlarged.
B14, inputting the coding vector into the classification network of the initial classification model to perform classification processing, obtaining a probability value of each text category of each sample in the selected sample set in the corresponding field, and determining a second loss function corresponding to the selected sample set based on the probability values;
the classification network is used for predicting a probability value of the input sample in each text category of the corresponding field, determining a true value of the input sample in each text category according to a label of the input sample, and determining a second loss function corresponding to the selected sample set based on the predicted probability value and the true value.
The second loss function is:
Y = -Σ_{i=1}^{a} Σ_{j=1}^{b} y_{ij} · log(p_{ij})
wherein Y is the second loss function corresponding to the selected sample set, y_{ij} is the true value of the i-th sample in the selected sample set on the j-th text category of the corresponding field, p_{ij} is the predicted probability value of the i-th sample in the selected sample set on the j-th text category of the corresponding field, a is the total number of samples in the selected sample set, and b is the total number of text categories in the field corresponding to the selected sample set.
And B15, determining a target loss function corresponding to the selected sample set based on the first loss function and the second loss function.
In this embodiment, the first and second loss functions are summed to obtain a target loss function. In other embodiments, corresponding weights may be set for the first loss function and the second loss function, respectively, and the target loss function may be determined by weighted summation.
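Assuming the reconstructed log-sum-exp form of the first loss function and a standard cross-entropy for the second loss function, the target loss of this embodiment could be sketched as follows (the weights w1 and w2 cover the weighted-summation variant; setting both to 1 gives the plain summation described above):

    import numpy as np

    def first_loss(neg_sims, pos_sims):
        # L = log(1 + sum_i sum_j exp(S_i - S_j)); S_i: negative-pair similarity,
        # S_j: positive-pair similarity. Minimizing L pushes every negative-pair
        # similarity below every positive-pair similarity.
        diffs = np.subtract.outer(np.asarray(neg_sims), np.asarray(pos_sims))  # shape (m, n)
        return float(np.log1p(np.exp(diffs).sum()))

    def second_loss(true_onehot, pred_probs, eps=1e-12):
        # Cross-entropy between the true values and the predicted probability values.
        return float(-np.sum(np.asarray(true_onehot) * np.log(np.asarray(pred_probs) + eps)))

    def target_loss(neg_sims, pos_sims, true_onehot, pred_probs, w1=1.0, w2=1.0):
        return w1 * first_loss(neg_sims, pos_sims) + w2 * second_loss(true_onehot, pred_probs)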
S3, receiving a text to be classified, identifying a target field corresponding to the text to be classified, judging whether the distribution of a sample set corresponding to the target field on the text type is balanced or not, and if not, calculating the weight corresponding to each text type of the target field.
In this embodiment, steps S1-S2 are used to train a corresponding text classification model for each field. However, because the sample distribution differs from field to field, the classification accuracy of the resulting models also differs: for a field whose samples are evenly distributed across text categories the text classification model is more accurate, while for a field whose sample distribution across text categories is unbalanced the classification accuracy is lower.
In order to improve text classification accuracy in a field with an unbalanced sample distribution, in this embodiment a weight parameter is set for each text category of the field, the optimal value of each weight parameter is solved with the Powell method, and the product of the model's output value for each text category and the value of the corresponding weight parameter is taken as the target value of the input text on that category, which alleviates the inaccurate text classification of the model in such a field. The Powell method searches iteratively, step by step, for the parameter values that minimize the loss; since solving parameter values with the Powell method is prior art, it is not described in detail here.
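As an illustrative sketch of how the per-category weight parameters might be solved with the Powell method (the use of scipy.optimize, the validation data and the error-rate objective are assumptions chosen for illustration, not the embodiment's prescribed setup):

    import numpy as np
    from scipy.optimize import minimize

    def fit_category_weights(val_probs, val_labels):
        # val_probs: (N, C) predicted values of the model on a validation set;
        # val_labels: (N,) true category indices. Returns one weight per category.
        val_probs = np.asarray(val_probs, dtype=float)
        val_labels = np.asarray(val_labels)
        num_classes = val_probs.shape[1]

        def objective(weights):
            target_values = val_probs * weights               # re-weight each category
            preds = target_values.argmax(axis=1)
            return 1.0 - float((preds == val_labels).mean())  # error rate to minimize

        result = minimize(objective, x0=np.ones(num_classes), method="Powell")
        return result.x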
The step of identifying the target domain corresponding to the text to be classified comprises the following steps C11-C13:
c11, performing word segmentation processing on the text to be classified to obtain a word set;
in this embodiment, the word segmentation process may be performed on the text to be classified by using a forward maximum matching method, a reverse maximum matching method, or a least segmentation method.
C12, matching each word in the word set with a word library corresponding to each field respectively to obtain a matching word set corresponding to each field;
in this embodiment, a corresponding word library is configured for each field in advance.
And C13, taking the field with the maximum number of the matched words in the matched word set as the target field corresponding to the text to be classified.
For example, if the number of matching words in the matching word set corresponding to the medical field is the largest, the medical field is taken as the target field corresponding to the text to be classified.
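A minimal sketch of this field-identification step; the word libraries and the use of the jieba tokenizer are assumptions chosen for illustration:

    import jieba  # any forward/reverse maximum-matching or minimum-segmentation tokenizer also works

    # Hypothetical word libraries configured in advance for each field.
    WORD_LIBRARIES = {
        "medical": {"内科", "外科", "急诊", "手术", "药物"},
        "finance": {"股票", "基金", "利率", "贷款"},
    }

    def identify_target_field(text):
        words = set(jieba.lcut(text))
        # Match the word set against each field's word library.
        match_counts = {field: len(words & library)
                        for field, library in WORD_LIBRARIES.items()}
        # The field with the largest number of matched words is the target field.
        return max(match_counts, key=match_counts.get)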
The step of judging whether the distribution of the sample set corresponding to the target field on the text categories is balanced or not comprises the following steps D11-D14:
d11, counting the number of samples of the sample set corresponding to the target field in each text category;
specifically, the number of samples of each text category can be counted according to the label information of the samples.
D12, calculating a sample average value corresponding to the target field based on the sample number;
d13, calculating the variance of the sample set corresponding to the target field based on the sample number and the sample average value;
d14, determining whether the distribution of the sample set corresponding to the target field on the text category is balanced or not according to the variance.
In this embodiment, if the variance is smaller than a predetermined variance threshold, the distribution of the sample set corresponding to the target field on the text category is balanced; and if the variance is larger than the variance threshold, the distribution of the sample set corresponding to the target field on the text type is unbalanced.
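The variance-based balance check of steps D11-D14 could be sketched as follows (the variance threshold is a hypothetical preset value):

    import numpy as np

    def is_balanced(samples_per_category, variance_threshold=1e4):
        counts = np.asarray(samples_per_category, dtype=float)  # D11: samples per text category
        mean = counts.mean()                                    # D12: sample average value
        variance = ((counts - mean) ** 2).mean()                # D13: variance of the sample set
        return bool(variance < variance_threshold)              # D14: balanced if below threshold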
S4, inputting the text to be classified into the text classification model corresponding to the target field to obtain a predicted value of each text category of the text to be classified in the target field, determining a target value of each text category of the text to be classified in the target field based on the predicted value and the weight, and determining a target text category corresponding to the text to be classified based on the target value.
In this embodiment, the product of the predicted value and the corresponding weight of each text category is used as a target value of the text to be classified in each text category, and the text category with the largest target value is used as a target text category corresponding to the text to be classified.
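A sketch of this final decision step (predicted values multiplied element-wise by the category weights, then taking the category with the largest target value; the category names and numbers are placeholders):

    import numpy as np

    def classify(predicted_values, category_weights, category_names):
        target_values = np.asarray(predicted_values) * np.asarray(category_weights)
        return category_names[int(target_values.argmax())]

    # Example with three hypothetical text categories of the target field.
    print(classify([0.2, 0.5, 0.3], [1.4, 0.8, 1.0], ["内科", "外科", "急诊"]))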
After the determining whether the distribution of the sample set corresponding to the target domain on the text category is balanced, the method further includes:
if the distribution is judged to be balanced, inputting the text to be classified into the text classification model corresponding to the target field to obtain a predicted value of the text to be classified in each text category, and determining the target text category corresponding to the text to be classified based on the predicted value.
As can be seen from the foregoing embodiments, in the text classification method provided by the present invention, first, a sample set corresponding to each field is divided into a plurality of positive sample pairs and a plurality of negative sample pairs; secondly, determining a target loss function according to the similarity difference between the positive sample pair and the negative sample pair, and training a text classification model corresponding to each field by minimizing the target loss function; then, identifying a target field corresponding to the text to be classified, judging whether the distribution of a sample set corresponding to the target field on the text categories is balanced or not, and if not, calculating the weight corresponding to each text category of the target field; and finally, inputting the text to be classified into a text classification model corresponding to the target field, and determining the target text category corresponding to the text to be classified based on the predicted value output by the text classification model and the weight. According to the method, the target loss function is determined according to the similarity difference between the positive sample pair and the negative sample pair, so that the model can learn the similarity between the positive sample pair and the dissimilarity between the negative sample pair, the interaction between the samples is deepened, and the classification accuracy of the model is improved. Therefore, the invention improves the accuracy of text classification.
Fig. 2 is a schematic block diagram of a text classification apparatus according to an embodiment of the present invention.
The text classification apparatus 100 according to the present invention may be installed in an electronic device. According to the implemented functions, the text classification apparatus 100 may include a splitting module 110, a training module 120, a determining module 130, and a classifying module 140. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the splitting module 110 is configured to obtain a sample set carrying text category information corresponding to each field, and split each sample set into a plurality of positive sample pairs and a plurality of negative sample pairs according to the text category information.
The splitting each sample set into a plurality of positive sample pairs and a plurality of negative sample pairs according to the text category information comprises the following steps A21-A22:
a21, combining each sample in each sample set with samples of the same text type in pairs to obtain a plurality of positive sample pairs corresponding to each sample set;
and A22, combining each sample in each sample set with samples of different text types in pairs respectively to obtain a plurality of negative sample pairs corresponding to each sample set.
A training module 120, configured to determine a target loss function according to a similarity difference between the positive sample pair and the negative sample pair, and train an initial classification model corresponding to each field by minimizing the target loss function to obtain a text classification model corresponding to each field.
The determining a target loss function according to the similarity difference between the positive sample pair and the negative sample pair comprises the following steps B21-B25:
b21, selecting a sample set of a field, inputting the selected sample set into a coding network of an initial classification model to execute coding processing, and obtaining a coding vector of each sample in the selected sample set;
b22, calculating a first similarity value of each positive sample pair and a second similarity value of each negative sample pair corresponding to the selected sample set based on the coding vector;
b23, determining a first loss function corresponding to the selected sample set based on the difference value of the first similarity value and the second similarity value;
the first loss function is:
L = log(1 + Σ_{i=1}^{m} Σ_{j=1}^{n} exp(S_i - S_j))
wherein L is the first loss function corresponding to the selected sample set, S_i is the second similarity value of the i-th negative sample pair corresponding to the selected sample set, S_j is the first similarity value of the j-th positive sample pair corresponding to the selected sample set, m is the total number of negative sample pairs in the selected sample set, and n is the total number of positive sample pairs in the selected sample set.
B24, inputting the coding vector into the classification network of the initial classification model to perform classification processing, obtaining a probability value of each text category of each sample in the selected sample set in the corresponding field, and determining a second loss function corresponding to the selected sample set based on the probability values;
and B25, determining a target loss function corresponding to the selected sample set based on the first loss function and the second loss function.
The determining module 130 is configured to receive a text to be classified, identify a target field corresponding to the text to be classified, determine whether distribution of a sample set corresponding to the target field on text categories is balanced, and if not, calculate a weight corresponding to each text category of the target field.
The step of identifying the target domain corresponding to the text to be classified comprises the following steps C21-C23:
c21, performing word segmentation processing on the text to be classified to obtain a word set;
c22, matching each word in the word set with a word library corresponding to each field respectively to obtain a matching word set corresponding to each field;
and C23, taking the field with the maximum number of the matched words in the matched word set as the target field corresponding to the text to be classified.
The step of judging whether the distribution of the sample set corresponding to the target field on the text categories is balanced or not comprises the following steps D21-D24:
d21, counting the number of samples of the sample set corresponding to the target field in each text category;
d22, calculating a sample average value corresponding to the target field based on the sample number;
d23, calculating the variance of the sample set corresponding to the target field based on the sample number and the sample average value;
d24, determining whether the distribution of the sample set corresponding to the target field on the text category is balanced or not according to the variance.
The classification module 140 is configured to input the text to be classified into the text classification model corresponding to the target field, obtain a predicted value of each text category of the text to be classified in the target field, determine a target value of each text category of the text to be classified in the target field based on the predicted value and the weight, and determine a target text category corresponding to the text to be classified based on the target value.
After determining whether the distribution of the sample set corresponding to the target domain over the text categories is balanced, the classification module 140 is further configured to:
if the distribution is judged to be balanced, input the text to be classified into the text classification model corresponding to the target field to obtain a predicted value of the text to be classified in each text category, and determine the target text category corresponding to the text to be classified based on the predicted value.
Fig. 3 is a schematic structural diagram of an electronic device for implementing a text classification method according to an embodiment of the present invention.
The electronic device 1 is a device capable of automatically performing numerical calculation and/or information processing in accordance with instructions that are set or stored in advance. The electronic device 1 may be a computer, a single network server, a server group composed of a plurality of network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers.
In the present embodiment, the electronic device 1 includes, but is not limited to, a memory 11, a processor 12 and a network interface 13 that are communicatively connected to each other through a system bus, wherein the memory 11 stores a text classification program 10 executable by the processor 12. Although fig. 3 shows only the electronic device 1 with the components 11-13 and the text classification program 10, those skilled in the art will appreciate that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, which may include fewer or more components than shown, combine certain components, or arrange the components differently.
The storage 11 includes a memory and at least one type of readable storage medium. The memory provides cache for the operation of the electronic equipment 1; the readable storage medium may be a non-volatile storage medium such as flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1; in other embodiments, the non-volatile storage medium may also be an external storage device of the electronic device 1, such as a plug-in hard disk provided on the electronic device 1, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. In this embodiment, the readable storage medium of the memory 11 is generally used for storing an operating system and various types of application software installed in the electronic device 1, for example, codes of the text classification program 10 in an embodiment of the present invention, and the like. Further, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 12 is generally configured to control the overall operation of the electronic device 1, such as performing control and processing related to data interaction or communication with other devices. In this embodiment, the processor 12 is configured to run the program code stored in the memory 11 or process data, for example, run the text classification program 10.
The network interface 13 may comprise a wireless network interface or a wired network interface, and the network interface 13 is used for establishing a communication connection between the electronic device 1 and a client (not shown in the figure).
Optionally, the electronic device 1 may further include a user interface, the user interface may include a Display (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface may further include a standard wired interface and a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The memory 11 in the electronic device 1 stores a text classification program 10 that is a combination of instructions that, when executed in the processor 12, implements the text classification method described above.
Specifically, the processor 12 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1 for a specific implementation method of the text classification program 10, which is not described herein again.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile, and may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, and a read-only memory (ROM).
The computer readable storage medium has stored thereon a text classification program 10, which text classification program 10 is executable by one or more processors to implement the text classification method described above.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method of text classification, the method comprising:
acquiring a sample set which carries text category information and corresponds to each field, and splitting each sample set into a plurality of positive sample pairs and a plurality of negative sample pairs according to the text category information;
determining a target loss function according to the similarity difference between the positive sample pair and the negative sample pair, and training an initial classification model corresponding to each field by minimizing the target loss function to obtain a text classification model corresponding to each field;
receiving a text to be classified, identifying a target field corresponding to the text to be classified, judging whether the distribution of a sample set corresponding to the target field on text categories is balanced or not, and if not, calculating the weight corresponding to each text category of the target field;
inputting the text to be classified into a text classification model corresponding to the target field to obtain a predicted value of each text category of the text to be classified in the target field, determining a target value of each text category of the text to be classified in the target field based on the predicted value and the weight, and determining a target text category corresponding to the text to be classified based on the target value.
2. The method of text classification according to claim 1, wherein said determining a target loss function from the difference in similarity of the positive and negative sample pairs comprises:
selecting a sample set of a field, inputting the selected sample set into a coding network of an initial classification model to execute coding processing, and obtaining a coding vector of each sample in the selected sample set;
calculating a first similarity value of each positive sample pair and a second similarity value of each negative sample pair corresponding to the selected sample set based on the coding vector;
determining a first loss function corresponding to the selected sample set based on a difference between the first similarity value and a second similarity value;
inputting the coding vector into a classification network of the initial classification model to perform classification processing, obtaining a probability value of each text category of each sample in the selected sample set in the corresponding field, and determining a second loss function corresponding to the selected sample set based on the probability value;
and determining a target loss function corresponding to the selected sample set based on the first loss function and the second loss function.
3. The method for classifying texts according to claim 1, wherein the identifying a target domain corresponding to the text to be classified comprises:
performing word segmentation processing on the text to be classified to obtain a word set;
matching each word in the word set with a word library corresponding to each field respectively to obtain a matched word set corresponding to each field;
and taking the field with the maximum number of the matched words in the matched word set as the target field corresponding to the text to be classified.
4. The text classification method of claim 1, wherein the splitting each sample set into a plurality of positive sample pairs and a plurality of negative sample pairs according to the text category information comprises:
combining each sample in each sample set with samples of the same text type in pairs to obtain a plurality of positive sample pairs corresponding to each sample set;
and combining each sample in each sample set with samples of different text types in pairs to obtain a plurality of negative sample pairs corresponding to each sample set.
5. The method for classifying texts according to claim 1, wherein the determining whether the distribution of the sample sets corresponding to the target domain over the text categories is balanced includes:
counting the number of samples of the sample set corresponding to the target field in each text category;
calculating a sample average value corresponding to the target field based on the number of samples;
calculating the variance of a sample set corresponding to the target field based on the number of samples and the average value of the samples;
and determining whether the distribution of the sample set corresponding to the target field on the text category is balanced or not according to the variance.
6. The text classification method according to any one of claims 1 to 5, wherein after the determining whether the distribution of the sample set corresponding to the target domain over the text categories is balanced, the method further comprises:
if the distribution is judged to be balanced, inputting the text to be classified into the text classification model corresponding to the target field to obtain a predicted value of the text to be classified in each text category, and determining the target text category corresponding to the text to be classified based on the predicted value.
7. The text classification method of claim 2, characterized in that the first loss function is:
L = log(1 + Σ_{i=1}^{m} Σ_{j=1}^{n} exp(S_i - S_j))
wherein L is the first loss function corresponding to the selected sample set, S_i is the second similarity value of the i-th negative sample pair corresponding to the selected sample set, S_j is the first similarity value of the j-th positive sample pair corresponding to the selected sample set, m is the total number of negative sample pairs in the selected sample set, and n is the total number of positive sample pairs in the selected sample set.
8. An apparatus for classifying text, the apparatus comprising:
the splitting module is used for acquiring a sample set which carries text category information and corresponds to each field, and splitting each sample set into a plurality of positive sample pairs and a plurality of negative sample pairs according to the text category information;
the training module is used for determining a target loss function according to the similarity difference between the positive sample pair and the negative sample pair, and training an initial classification model corresponding to each field by minimizing the target loss function to obtain a text classification model corresponding to each field;
the judging module is used for receiving a text to be classified, identifying a target field corresponding to the text to be classified, judging whether the distribution of a sample set corresponding to the target field on the text category is balanced or not, and if not, calculating the weight corresponding to each text category of the target field;
the classification module is used for inputting the text to be classified into the text classification model corresponding to the target field to obtain a predicted value of each text category of the text to be classified in the target field, determining a target value of each text category of the text to be classified in the target field based on the predicted value and the weight, and determining a target text category corresponding to the text to be classified based on the target value.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and the number of the first and second groups,
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores a text classification program executable by the at least one processor to enable the at least one processor to perform the text classification method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a text classification program executable by one or more processors to implement the text classification method of any one of claims 1 to 7.
CN202210432887.6A 2022-04-21 2022-04-21 Text classification method and device, electronic equipment and storage medium Pending CN114706985A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210432887.6A CN114706985A (en) 2022-04-21 2022-04-21 Text classification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210432887.6A CN114706985A (en) 2022-04-21 2022-04-21 Text classification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114706985A 2022-07-05

Family

ID=82174727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210432887.6A Pending CN114706985A (en) 2022-04-21 2022-04-21 Text classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114706985A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115048525A (en) * 2022-08-15 2022-09-13 有米科技股份有限公司 Method and device for text classification and text classification model training based on multi-tuple
CN116975299A (en) * 2023-09-22 2023-10-31 腾讯科技(深圳)有限公司 Text data discrimination method, device, equipment and medium
CN116975299B (en) * 2023-09-22 2024-05-28 腾讯科技(深圳)有限公司 Text data discrimination method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN111695033B (en) Enterprise public opinion analysis method, enterprise public opinion analysis device, electronic equipment and medium
CN112417096B (en) Question-answer pair matching method, device, electronic equipment and storage medium
WO2022141861A1 (en) Emotion classification method and apparatus, electronic device, and storage medium
CN114706985A (en) Text classification method and device, electronic equipment and storage medium
CN113178071B (en) Driving risk level identification method and device, electronic equipment and readable storage medium
CN112597135A (en) User classification method and device, electronic equipment and readable storage medium
CN114462412A (en) Entity identification method and device, electronic equipment and storage medium
CN115222443A (en) Client group division method, device, equipment and storage medium
CN115034315A (en) Business processing method and device based on artificial intelligence, computer equipment and medium
CN114416939A (en) Intelligent question and answer method, device, equipment and storage medium
CN114281991A (en) Text classification method and device, electronic equipment and storage medium
CN114220536A (en) Disease analysis method, device, equipment and storage medium based on machine learning
CN113344125A (en) Long text matching identification method and device, electronic equipment and storage medium
CN116628162A (en) Semantic question-answering method, device, equipment and storage medium
CN114818685B (en) Keyword extraction method and device, electronic equipment and storage medium
CN114139530A (en) Synonym extraction method and device, electronic equipment and storage medium
CN113705692B (en) Emotion classification method and device based on artificial intelligence, electronic equipment and medium
CN113688239B (en) Text classification method and device under small sample, electronic equipment and storage medium
CN113656586B (en) Emotion classification method, emotion classification device, electronic equipment and readable storage medium
CN115169360A (en) User intention identification method based on artificial intelligence and related equipment
CN114398877A (en) Theme extraction method and device based on artificial intelligence, electronic equipment and medium
CN113312482A (en) Question classification method and device, electronic equipment and readable storage medium
CN113706252A (en) Product recommendation method and device, electronic equipment and storage medium
CN110059180B (en) Article author identity recognition and evaluation model training method and device and storage medium
CN114742060B (en) Entity identification method, entity identification device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination