CN117236329B - Text classification method and device and related equipment


Info

Publication number
CN117236329B
CN117236329B
Authority
CN
China
Prior art keywords
text
classification
probability
classified
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311519737.XA
Other languages
Chinese (zh)
Other versions
CN117236329A (en)
Inventor
柴晓晋
童琪杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Damo Academy Beijing Technology Co ltd
Original Assignee
Alibaba Damo Academy Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Damo Academy Beijing Technology Co ltd filed Critical Alibaba Damo Academy Beijing Technology Co ltd
Priority to CN202311519737.XA
Publication of CN117236329A
Application granted
Publication of CN117236329B
Active legal-status Current
Anticipated expiration legal-status

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application provides a text classification method, a text classification device and related equipment. The text classification method comprises the following steps: acquiring a text to be classified; segmenting the text to be classified according to a preset word stock; selecting keywords from the used segmentation according to a preset feature vector table and determining feature vectors; acquiring a feature selection rate; acquiring an initial classification label and a classification probability by using the feature vectors and a preset classifier; acquiring the fusion probability of the classification probability and the feature selection rate; and when the fusion probability meets a preset fusion probability condition, determining the final classification label of the text to be classified as the initial classification label. By using the feature selection rate together with the classification probability, the clustering speed of samples to be classified is improved without training on a large number of negative samples, so that the workload of processing negative samples is reduced and no extra computational power or resource consumption is introduced.

Description

Text classification method and device and related equipment
Technical Field
The embodiment of the application relates to the field of artificial intelligence, in particular to a text classification method, a text classification device and related equipment.
Background
Home appliances with offline ASR (automatic speech recognition) have been popular for several years. The chips in such devices are usually limited in computational power and resources; however, because the required dialogue processing and language understanding capabilities are relatively simple, the demands on model capability are low, and the main task is to improve processing precision and generalization capability under limited computational power and memory resources.
However, the existing training methods for language recognition all need a large number of negative samples, which is time-consuming and labor-intensive. At the same time, the negative samples introduce a large number of feature vectors that require additional storage space, increase the complexity of the model, and greatly raise the computational power and resource requirements.
Therefore, how to improve the clustering speed of samples to be classified without introducing extra computational power or increasing resource consumption has become a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the embodiments of the present invention provide a text classification method, apparatus, and related device, so as to improve the clustering speed of samples to be classified without introducing additional computational power or increasing resource consumption.
In order to solve the above problems, the embodiment of the present invention provides the following technical solutions:
in a first aspect, an embodiment of the present invention provides a text classification method, including:
acquiring a text to be classified;
word segmentation is carried out on the text to be classified according to a preset word stock, and used word segmentation is obtained;
selecting a keyword from the used word according to a preset feature vector table, and acquiring a feature vector according to the keyword and the feature vector table;
acquiring a feature selection rate according to the number of the keywords and the number of the used segmented words;
acquiring an initial classification label of the text to be classified and classification probability corresponding to the initial classification label by utilizing the feature vector and a preset classifier;
acquiring the fusion probability of the classification probability and the feature selection rate;
and when the fusion probability meets a preset fusion probability condition, determining a final classification label of the text to be classified as the initial classification label.
In a second aspect, an embodiment of the present application further provides a text classification apparatus, including:
the text acquisition device is used for acquiring texts to be classified;
the word segmentation device is used for segmenting the text to be classified according to a preset word stock to obtain used word segmentation;
the feature vector selection device is used for selecting keywords from the used word segments according to a preset feature vector table and obtaining feature vectors according to the keywords and the feature vector table;
the feature selection rate calculation device is used for obtaining the feature selection rate according to the number of the keywords and the number of the used segmented words;
the classifying device is used for acquiring an initial classifying label of the text to be classified and classifying probability corresponding to the initial classifying label by utilizing the feature vector and a preset classifier;
the fusion probability calculation device is used for acquiring the fusion probability of the classification probability and the feature selection rate;
and the label calculating device is used for determining that the final classification label of the text to be classified is the initial classification label when the fusion probability meets the preset fusion probability condition.
In a third aspect, embodiments of the present application also provide a storage medium storing a program adapted for text classification to implement the text classification method as described above.
In a fourth aspect, embodiments of the present application also provide an electronic device including at least one memory and at least one processor; the memory stores a program that the processor invokes to perform the text classification method as described above.
Compared with the prior art, the technical solution of the present embodiment has the following advantages:
the text classification method provided by the embodiment of the invention obtains the text to be classified; word segmentation is carried out on the text to be classified according to a preset word stock, and used word segmentation is obtained; selecting a keyword from the used word according to a preset feature vector table, and acquiring a feature vector according to the keyword and the feature vector table; acquiring a feature selection rate according to the number of the keywords and the number of the used segmented words; acquiring an initial classification label of the text to be classified and classification probability corresponding to the initial classification label by utilizing the feature vector and a preset classifier; acquiring the fusion probability of the classification probability and the feature selection rate; and when the fusion probability meets a preset fusion probability condition, determining a final classification label of the text to be classified as the initial classification label.
According to the technical scheme, the feature selection rate and the classification probability are introduced in the classification process of the sample to be classified, so that the classification probability is corrected by the feature selection rate when the classifier outputs a result, which improves the accuracy of the classification result. The classifier therefore only needs to be trained on positive samples, without using negative samples. The clustering speed of samples to be classified is thus improved without training on a large number of negative samples, the workload of the classifier for processing negative samples is reduced, no extra computational power is introduced, and the resource consumption required for training the classifier is reduced while the accuracy of the classification result is guaranteed.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained according to the provided drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flow chart of a text processing method;
fig. 2 is a schematic flow chart of a text processing method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a text processing device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Natural language understanding is a popular research direction at present. With the advent of many different natural language models, the development of natural language understanding has been brought to a new height, but these large models are usually used to complete complex man-machine dialogue tasks, and their computational demands on the running device are extremely high.
Home appliances with offline ASR (automatic speech recognition) have been popular for several years. The chips in such devices are usually limited in computational power and resources; however, because the required dialogue processing and language understanding capabilities are relatively simple, the demands on model capability are low, and the main task is to improve processing precision and generalization capability under limited computational power and memory resources.
The first step in language recognition is to make a category decision on the input text to determine the task that needs to be performed subsequently, which is essentially a text classification problem.
In a text classification method, as shown in fig. 1, the method comprises the following steps:
step S1: and inputting the text to be classified.
For example, enter the text "please help me turn on the air conditioner".
Step S2: and cutting the text to be classified into a plurality of words to obtain word segmentation.
For example, the text to be classified is divided into X = {x_0, x_1, …, x_n} and stop words, i.e. words of no practical significance such as "please", are removed. The text may thus be cut into the segmentation X = {"help me", "open", "air conditioner"}.
Step S3: searching the word segmentation in a feature vector table to obtain a feature vector.
Continuing with this case: the segments are looked up in the feature vector table to obtain the corresponding feature vectors. Words such as "help me" usually have weak characterization ability and are removed, so the matched segments are the two words "open" and "air conditioner", and finally X = {"open", "air conditioner"}.
Step S4: and inputting the feature vector into a classifier to obtain a classification result.
The feature vectors are input into the classifier to obtain the classification result y_0 = f(X), y_0 ∈ Y, where Y = {y_0, y_1, …, y_n, y_others}.
In the result for the feature vector X described above, y_0 may correspond to the "open air conditioner" label, y_1 may correspond to the "set temperature" label, and y_others corresponds to the "other" label.
Further, for the samples corresponding to the "other" label, it is generally difficult to exhaust all possibilities in real life. Taking the air-conditioner operation above as an example, the number of corresponding operation-function labels is in the tens, while the "other" label corresponds to all possible data except air-conditioner control, such as "turn on the lamp", "open the curtain", and so on; this data range is open-ended and almost impossible to exhaust. Here we call the data corresponding to the labels y_0 ~ y_n positive samples, and the data corresponding to y_others negative samples. In order to improve the classification accuracy of the classifier, negative samples are required for model training, which is performed through schemes such as direct training and multi-stage model cascading. The direct training scheme usually requires collecting a large amount of negative sample data, sampling the negative samples, and training to obtain a model; multi-stage model cascading connects a clustering model in series before the positive-sample classification model, first performs a simple binary classification of the sample, and then classifies the positive samples.
However, both direct training and multi-stage model cascading require a large number of negative samples, which is time-consuming and labor-intensive; moreover, the negative samples introduce a large number of feature vectors that require additional storage space, increase the complexity of the model, and greatly raise the computational power and resource requirements.
In order to solve the foregoing problems, an embodiment of the present invention provides a text classification method, so as to improve the clustering speed of samples to be classified and reduce computational power and resource consumption.
Referring specifically to fig. 2, fig. 2 is a schematic flow chart of a text processing method provided in an embodiment of the present application, where, as shown in the drawing, the text processing method provided in the embodiment of the present application includes:
step S10: and obtaining the text to be classified.
It is easy to understand that the text to be classified may be obtained in a specific way such as: directly acquiring text input by a user, or obtaining the text to be classified through speech recognition.
Specifically, for the sake of understanding, the following description is made by taking the foregoing "please help me turn on the air conditioner" as an example.
Step S20: and segmenting the text to be classified according to a preset word stock to obtain the used segmentation.
After the text to be classified is obtained, the text to be classified is segmented, and the used segmentation is obtained.
In a specific embodiment, the preset word stock includes all possible usage segments of the product to which the text to be classified belongs, and can be determined according to the possible wordings for the product on which the text classification method provided by the embodiment of the present application runs. Presetting the word stock makes segmentation convenient and improves segmentation efficiency.
Specifically, the words in the text to be classified are compared with the words in the preset word stock. The words matching the preset word stock are obtained as segments and stored in a preset format, such as X = {x_0, x_1, …, x_n}; the unmatched words are stop words, namely words or characters in the text to be classified that can be filtered out, and removing them saves storage space and improves search efficiency. For example, segmenting the text to be classified yields the segmentation X = {"help me", "open", "air conditioner"}.
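For illustration, the segmentation step can be sketched in a few lines of Python. This is a minimal sketch, not the patent's implementation: the word stock, the stop-word list, and the greedy longest-match strategy are all assumptions made for this example.

```python
# Minimal sketch of step S20: dictionary-based segmentation with stop-word removal.
# PRESET_WORD_STOCK and STOP_WORDS are illustrative assumptions, not the patent's data.
PRESET_WORD_STOCK = {"help me", "open", "air conditioner", "set", "temperature"}
STOP_WORDS = {"please"}

def segment(text, lexicon=PRESET_WORD_STOCK, stop_words=STOP_WORDS):
    """Greedy longest-match segmentation against the preset word stock."""
    tokens = text.lower().split()
    segments = []
    i = 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):  # try the longest span first
            candidate = " ".join(tokens[i:j])
            if candidate in lexicon:
                segments.append(candidate)
                i = j
                break
        else:
            # Unmatched, non-stop tokens stay as single units (the analogue of an
            # out-of-vocabulary word falling apart into smaller pieces).
            if tokens[i] not in stop_words:
                segments.append(tokens[i])
            i += 1
    return segments

print(segment("please help me open air conditioner"))
# ['help me', 'open', 'air conditioner']
```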
Step S30: and selecting keywords from the used word according to a preset feature vector table, and acquiring feature vectors according to the keywords and the feature vector table.
After the used segmentation is obtained, it needs to be converted into feature vectors for convenience of subsequent processing.
In a specific embodiment, the preset feature vector table includes feature vectors of all the possible usage words that meet the characterization strength requirement. A used segment that can be found in the preset feature vector table is a keyword. Each keyword in the feature vector table also has a corresponding feature vector used for computation in the classifier, so the feature vectors can be determined at the same time the keywords are determined. Acquiring feature vectors through the preset feature vector table makes the acquisition process simpler, shortens acquisition time, and improves the clustering speed of samples to be classified.
Specifically, by comparing the feature vector table with the words in the used segmentation, the shared words are determined to be keywords. For example, among the used segments, "open" and "air conditioner" appear in the feature vector table, while the characterization ability of "help me" is weak and it does not; the keywords finally obtained are X = {"open", "air conditioner"}.
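A corresponding sketch of step S30 follows; the feature vector table and its three-dimensional vectors are invented for the example and carry no meaning beyond illustration.

```python
import numpy as np

# Minimal sketch of step S30: keyword selection against a preset feature vector table.
# The table below is an illustrative assumption; a real table would hold learned
# vectors for every word with sufficient characterization ability.
FEATURE_VECTOR_TABLE = {
    "open":            np.array([1.0, 0.0, 0.2]),
    "air conditioner": np.array([0.1, 1.0, 0.0]),
    "set":             np.array([0.0, 0.3, 1.0]),
}

def select_keywords(segments, table=FEATURE_VECTOR_TABLE):
    """Keep only the segments found in the table; return them with their vectors."""
    keywords = [s for s in segments if s in table]
    vectors = [table[s] for s in keywords]
    return keywords, vectors

keywords, vectors = select_keywords(["help me", "open", "air conditioner"])
print(keywords)  # ['open', 'air conditioner'] -- "help me" is not in the table
```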
Step S40: and obtaining the feature selection rate according to the number of the keywords and the number of the used segmented words.
In order to improve accuracy of text recognition, the technical scheme provided by the embodiment of the application also acquires the feature selection rate so as to correct the subsequently obtained classification probability.
Specifically, in one embodiment, the step S40 may include: obtaining the ratio of the number of keywords to the number of used segments as the feature selection rate. Computing the feature selection rate as this ratio allows the samples to be classified to be clustered quickly, which greatly reduces the workload of processing negative samples; the acquisition method is simple, and the obtained feature selection rate has high accuracy.
For example, let the number of used segments be n, the number of keywords be n_in, and the feature selection rate be μ. The feature selection rate is then computed as μ = n_in / n. Taking the above example, n_in = 2 and n = 3, so μ = 2/3 ≈ 0.67.
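Computed in code, step S40 is a single ratio; the sketch below merely restates μ = n_in / n, with a guard added as an assumption for empty input.

```python
def feature_selection_rate(num_keywords, num_segments):
    """mu = n_in / n: the share of used segments that were selected as keywords."""
    if num_segments == 0:
        return 0.0  # no usable segments, so nothing was selected (assumed convention)
    return num_keywords / num_segments

print(feature_selection_rate(2, 3))  # 0.666..., matching the worked example above
```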
Step S50: and acquiring an initial classification label of the text to be classified and classification probability corresponding to the initial classification label by utilizing the feature vector and a preset classifier.
It is easy to understand that the classifier is a pre-trained classifier, and the training data of the classifier is all positive sample data.
It is easy to understand that, in the text classification method provided by the present application, there is no need to determine which specific negative-sample result a text corresponds to: all results that are not positive samples are grouped into one class and handled uniformly, and the classification result can still be obtained accurately. Therefore the classifier does not need to be trained with negative samples beforehand. In this way, the resource-intensive processing of negative-sample results can be avoided without affecting the classification performance of the classifier.
The feature vectors are input into the classifier to obtain the classification result y_0 = f(X), y_0 ∈ Y, where Y = {y_0, y_1, …, y_n, y_others}. For example, y_0 may correspond to the "open air conditioner" label, y_1 may correspond to the "set temperature" label, and y_others corresponds to the "other" label, which contains all results that are not positive samples and thus necessarily includes all negative samples. The classifier therefore does not need to determine a specific result when the "other" label is obtained, and accordingly does not need negative samples as training input. The classification result is the initial classification label; in the above example, "open air conditioner" is the label with the highest probability for the text. The classifier computes the classification probability while obtaining the initial classification label; assume the classification probability P(y = y_0) = 0.9.
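The patent does not fix a classifier architecture, so the following stand-in is purely an assumption: class prototypes derived from positive samples only, scored by cosine similarity and a softmax, returning the initial label together with its classification probability.

```python
import numpy as np

# Hypothetical stand-in for the preset classifier of step S50. The prototypes are
# illustrative; a real classifier would be trained on positive sample data only.
CLASS_PROTOTYPES = {
    "open air conditioner": np.array([0.6, 0.6, 0.1]),
    "set temperature":      np.array([0.1, 0.5, 0.9]),
}

def classify(vectors, prototypes=CLASS_PROTOTYPES):
    """Pool the keyword vectors, score each positive label, return (label, probability)."""
    x = np.mean(vectors, axis=0)
    labels = list(prototypes)
    sims = np.array([
        float(np.dot(x, prototypes[l]) / (np.linalg.norm(x) * np.linalg.norm(prototypes[l])))
        for l in labels
    ])
    probs = np.exp(sims) / np.exp(sims).sum()  # softmax over positive labels only
    best = int(np.argmax(probs))
    return labels[best], float(probs[best])

label, p = classify([np.array([1.0, 0.0, 0.2]), np.array([0.1, 1.0, 0.0])])
print(label, round(p, 2))  # the "other" decision is deferred to the fusion step
```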
It should be noted that, since step S40 and step S50 do not affect each other, there is no required order between them: either may run before the other, or both may run at the same time. In a specific embodiment, because step S50 takes longer to run, step S40 and step S50 are run concurrently to improve operating efficiency.
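Since S40 and S50 share no data dependency, running them concurrently is straightforward. The helper below is a sketch using a thread pool; the commented wiring reuses the feature_selection_rate and classify functions from the sketches above, all of which are assumptions of this illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def run_steps_concurrently(step_s40, step_s50):
    """Run the independent steps S40 and S50 at the same time and collect both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        mu_future = pool.submit(step_s40)    # feature selection rate (fast)
        cls_future = pool.submit(step_s50)   # classifier (typically the slower step)
        return mu_future.result(), cls_future.result()

# Example wiring, reusing the earlier sketches:
# mu, (label, p) = run_steps_concurrently(
#     lambda: feature_selection_rate(len(keywords), len(segments)),
#     lambda: classify(vectors),
# )
```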
With continued reference to fig. 2, step S60: and acquiring the fusion probability of the classification probability and the feature selection rate.
In order to improve the accuracy of text classification, after the classification probability and the feature selection rate are obtained, the classification probability and the feature selection rate are fused, and fusion probability is obtained.
Specifically, in one embodiment, the step S60 may include obtaining the fusion probability using the dependency weight, the classification probability, and the feature selection rate.
The dependency weight is a priori parameter preset according to the performance of the classifier, and is used for adjusting the dependency degree of the fusion probability on the classification probability and the feature selection rate respectively.
It is easy to understand that the stronger the performance of the classifier, the more reliable its results, so the posterior probability it outputs is sufficient for judging the final classification label, and the smaller the dependence on the feature selection rate when computing the fusion probability. Thus, the stronger the classifier performance, the greater the dependency weight, and vice versa.
In the above example, let the dependency weight be θ. The fusion probability γ is computed as γ = (1 − θ)μ + θP(y = y_0), where 0 < θ ≤ 1. The larger the dependency weight θ, the larger the share of the classification probability and the more the fusion probability depends on it; when θ = 1, the fusion probability depends entirely on the classification probability of the model. In the above example, taking θ = 0.6 gives the fusion probability γ = 0.4 × (2/3) + 0.6 × 0.9 ≈ 0.81.
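The fusion formula and the threshold decision of the following steps can be written down directly; γ, θ, and δ follow the definitions in the text, and the function names below exist only for this sketch.

```python
def fuse(mu, p, theta=0.6):
    """gamma = (1 - theta) * mu + theta * p, with 0 < theta <= 1."""
    assert 0.0 < theta <= 1.0
    return (1.0 - theta) * mu + theta * p

def final_label(initial_label, gamma, delta=0.6):
    """Steps S70-S90: keep the initial label only if gamma clears the threshold delta."""
    return initial_label if gamma >= delta else "others"

print(round(fuse(2 / 3, 0.9), 2))  # 0.81 for the positive example above
```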
Step S70: judging whether the fusion probability meets the predetermined fusion probability condition; if yes, executing step S80, and if not, executing step S90.
It is easy to understand that the fusion probability is obtained by combining the classification probability and the feature selection rate, so it cannot be used directly as the result of the classifier and needs to be judged against relevant prior knowledge.
Specifically, the fusion probability may be compared with the predetermined fusion probability condition to determine whether the condition is satisfied. When the fusion probability satisfies the predetermined fusion probability condition, step S80 is executed, that is, the final classification label of the text to be classified is the initial classification label; otherwise, step S90 is executed.
The predetermined fusion probability δ can be determined according to the classification capability of the classifier on samples during training.
The final classification label can be obtained easily in the use process through simple comparison and judgment of the fusion probability.
In a specific embodiment, the step S70 may include determining whether the fusion probability is greater than or equal to a predetermined fusion probability.
Step S80: and determining the final classification label of the text to be classified as the initial classification label.
And when the fusion probability meets a preset fusion probability condition, determining a final classification label of the text to be classified as the initial classification label.
Specifically, when the fusion probability is greater than or equal to a predetermined fusion probability, determining that the final classification label of the text to be classified is the initial classification label.
In the above example, the predetermined fusion probability δ = 0.6 is taken. Since the fusion probability γ ≈ 0.81 ≥ δ = 0.6, the final classification label of the sample to be classified "please help me turn on the air conditioner" is the initial classification label "open air conditioner".
Step S90: and determining the final classification labels of the text to be classified as other classification labels.
And when the fusion probability does not meet a preset fusion probability condition, determining that the final classification label of the text to be classified is other classification labels.
Specifically, when the fusion probability is smaller than the preset fusion probability, determining that the final classification label of the text to be classified is other classification labels.
For ease of understanding, the description will be given again taking the negative example "please help me turn on humidifier":
First, the text to be classified is obtained; then the text to be classified is segmented according to the preset word stock to obtain the used segmentation. Because the preset word stock only includes the possible usage words of the product to which the text to be classified belongs, words of other products are not matched as whole words during segmentation.
At this time, segmenting the negative sample "please help me turn on the humidifier" with the preset word stock yields the segmentation X = {"help me", "open", "add", "wet", "ware"}. It can be seen that, because "humidifier" is not in the preset word stock, it is cut into three single-character words.
Then, keywords are selected from the used segmentation according to the preset feature vector table, and feature vectors are acquired according to the keywords and the feature vector table. In this negative sample, the only keyword is "open", so the feature vector set is X = {"open"}.
Then, the feature selection rate is obtained according to the number of keywords and the number of used segments. According to the feature selection rate formula μ = n_in / n, in this negative sample n_in = 1 and n = 5, so μ = 1/5 = 0.2. At this point we can see that the feature selection rate of the negative sample is significantly lower than that of the positive sample.
Then, the initial classification label of the text to be classified and the corresponding classification probability are obtained using the feature vector and the preset classifier. Inputting the feature vector X = {"open"} into the classifier gives the classification result y_0 = f(X). In this negative sample, the probability of the corresponding "open air conditioner" label is relatively large; assume P(y = y_0) = 0.6.
Then, the fusion probability of the classification probability and the feature selection rate is acquired. According to the fusion probability formula γ = (1 − θ)μ + θP(y = y_0), taking θ = 0.6, the fusion probability of the negative sample is γ = 0.4 × 0.2 + 0.6 × 0.6 = 0.44.
Finally, it is judged whether the fusion probability is greater than or equal to the predetermined fusion probability, to decide whether the final classification label of the text to be classified is the initial classification label; if yes, the final classification label is the initial classification label, and if not, the final classification label is the other classification label rather than the initial classification label. With the predetermined fusion probability δ = 0.6, since γ = 0.44 < 0.6, the fusion probability is smaller than the predetermined fusion probability, so the final classification label of the text to be classified is the other classification label. That is, the confidence that the negative sample "please help me turn on the humidifier" belongs to the "open air conditioner" label is low, and it does not belong to the "open air conditioner" label.
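Using the fuse and final_label sketches above, both worked examples reduce to two calls (θ = δ = 0.6 as in the text):

```python
# Positive sample "please help me turn on the air conditioner": mu = 2/3, P = 0.9.
# Negative sample "please help me turn on the humidifier":      mu = 1/5, P = 0.6.
gamma_pos = fuse(2 / 3, 0.9)  # ~0.81
gamma_neg = fuse(1 / 5, 0.6)  # 0.44
print(final_label("open air conditioner", gamma_pos))  # 'open air conditioner' (0.81 >= 0.6)
print(final_label("open air conditioner", gamma_neg))  # 'others'               (0.44 <  0.6)
```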
It should be noted that, in an alternative embodiment, the text classification method described in the present application is deployed in smart home products, including intelligent home appliances such as intelligent air conditioners, intelligent refrigerators, and intelligent curtains. Because the chips in smart home products have limited computational power and few resources, the text classification method can reduce the computational power required for language recognition while ensuring its processing precision.
According to the embodiment of the application, the feature selection rate and the classification probability are introduced in the classification process of the sample to be classified, so that the classification probability is corrected by the feature selection rate when the classifier outputs a result, which improves the accuracy of the classification result. The classifier therefore only needs to be trained on positive samples, without using negative samples. The clustering speed of samples to be classified is thus improved without training on a large number of negative samples, the workload of the classifier for processing negative samples is reduced, no extra computational power is introduced, and the resource consumption required for training the classifier is reduced while the accuracy of the classification result is guaranteed.
The embodiment of the application also provides a text classification device, as shown in fig. 3, including:
a text obtaining device 500, configured to obtain a text to be classified;
the word segmentation device 510 is configured to segment the text to be classified according to a preset word stock, so as to obtain a used word;
the feature vector selecting device 520 is configured to select a keyword from the used word segment according to a preset feature vector table, and obtain a feature vector according to the keyword and the feature vector table;
a feature selection rate calculation device 530, configured to obtain the feature selection rate according to the number of keywords and the number of used segmented words;
the classifying device 540 is configured to obtain an initial classification label of the text to be classified and a classification probability corresponding to the initial classification label by using the feature vector and a preset classifier;
fusion probability calculation means 550, configured to obtain the fusion probability of the classification probability and the feature selection rate;
and the label calculating device 560 is configured to determine that the final classification label of the text to be classified is the initial classification label when the fusion probability satisfies a predetermined fusion probability condition.
Further, in a specific embodiment, the feature selection rate calculation device 530 is configured to obtain the feature selection rate according to the number of keywords and the number of used segmented words by: obtaining the ratio of the number of keywords to the number of used segmented words as the feature selection rate.
Further, in an embodiment, the preset word stock includes all possible usage word segments of the product to which the text to be classified belongs.
Further, in a specific embodiment, the preset feature vector table includes feature vectors of all the possible usage words that satisfy the characterization strength requirement.
Further, in a specific embodiment, the fusion probability calculation device 550 is configured to obtain the fusion probability of the classification probability and the feature selection rate by: obtaining the fusion probability using the dependency weight, the classification probability, and the feature selection rate.
Further, in a specific embodiment, the dependency weight is determined according to the classifier performance.
Further, in a specific embodiment, the tag calculating device 560 is configured to determine, when the fusion probability satisfies a predetermined fusion probability condition, that a final classification tag of the text to be classified is the initial classification tag, including: and when the fusion probability is greater than or equal to a preset fusion probability, determining that the final classification label of the text to be classified is the initial classification label.
Further, in a specific embodiment, the tag calculating device 560 is configured to determine, when the fusion probability satisfies a predetermined fusion probability condition, that a final classification tag of the text to be classified is the initial classification tag, and further includes: and when the fusion probability is smaller than a preset fusion probability, determining that the final classification label of the text to be classified is other classification labels.
Further, in a specific embodiment, the data for training the preset classifier is all positive sample data.
According to the text classification device provided by the embodiment of the application, the feature selection rate and the classification probability are introduced in the classification process of the sample to be classified, so that the classifier can accurately obtain the classification result when outputting it without determining a specific negative-sample result; the classifier therefore does not need to be trained with negative samples beforehand, and all results that are not positive samples are grouped into one class and handled uniformly. The clustering speed of samples to be classified is thus improved without training on a large number of negative samples, the workload of the classifier for processing negative samples is reduced, the resource consumption required for classification is reduced, no extra computational power is introduced, and the classification performance of the classifier is not affected.
The embodiment of the application also provides a storage medium storing a program suitable for text classification to implement the text classification method as described above.
The embodiment of the application also provides electronic equipment, which comprises at least one memory and at least one processor; the memory stores a program that the processor invokes to perform the text classification method as described above.
Although the embodiments of the present invention are disclosed above, the present invention is not limited thereto. Various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention, and the scope of the invention shall therefore be defined by the appended claims.

Claims (12)

1. A method of text classification, comprising:
acquiring a text to be classified;
word segmentation is carried out on the text to be classified according to a preset word stock, and used word segmentation is obtained;
selecting a keyword from the used word according to a preset feature vector table, and acquiring a feature vector according to the keyword and the feature vector table;
acquiring a feature selection rate according to the number of the keywords and the number of the used segmented words;
acquiring an initial classification label of the text to be classified and classification probability corresponding to the initial classification label by utilizing the feature vector and a preset classifier;
acquiring the fusion probability of the classification probability and the feature selection rate;
and when the fusion probability meets a preset fusion probability condition, determining a final classification label of the text to be classified as the initial classification label.
2. The text classification method of claim 1, wherein the step of obtaining a feature selection rate according to the number of keywords and the number of used segmentations comprises:
and obtaining the ratio of the number of the keywords to the number of the used segmented words to obtain the feature selection rate.
3. The text classification method of claim 1 wherein,
the preset word stock comprises all possible use word segmentation of the product to which the text to be classified belongs.
4. A text classification method as claimed in claim 3, wherein said preset feature vector table comprises feature vectors of all of said possible usage words satisfying a characterization strength requirement.
5. The text classification method of claim 1, wherein the step of obtaining a fusion probability of the classification probability and the feature selection rate comprises:
the fusion probability is obtained by using the dependency weight, the classification probability and the feature enrollment rate.
6. The text classification method of claim 5, wherein the dependency weight is determined based on the classifier performance.
7. The text classification method of claim 1, wherein the step of determining the final classification label of the text to be classified as the initial classification label when the fusion probability satisfies a predetermined fusion probability condition comprises:
and when the fusion probability is greater than or equal to a preset fusion probability, determining that the final classification label of the text to be classified is the initial classification label.
8. The text classification method of claim 1, further comprising:
and when the fusion probability is smaller than a preset fusion probability, determining that the final classification label of the text to be classified is other classification labels.
9. The text classification method of claim 1, wherein the data for training the preset classifier are all positive sample data.
10. A text classification device, comprising:
the text acquisition device is used for acquiring texts to be classified;
the word segmentation device is used for segmenting the text to be classified according to a preset word stock to obtain used word segmentation;
the feature vector selection device is used for selecting keywords from the used word segments according to a preset feature vector table and obtaining feature vectors according to the keywords and the feature vector table;
the feature selection rate calculation device is used for obtaining the feature selection rate according to the number of the keywords and the number of the used segmented words;
the classifying device is used for acquiring an initial classifying label of the text to be classified and classifying probability corresponding to the initial classifying label by utilizing the feature vector and a preset classifier;
the fusion probability calculation device is used for acquiring the fusion probability of the classification probability and the feature selection rate;
and the label calculating device is used for determining that the final classification label of the text to be classified is the initial classification label when the fusion probability meets the preset fusion probability condition.
11. A storage medium storing a program adapted for text classification to implement the text classification method according to any one of claims 1 to 9.
12. An electronic device comprising at least one memory and at least one processor; the memory stores a program that the processor invokes to perform the text classification method of any one of claims 1 to 9.
CN202311519737.XA 2023-11-15 2023-11-15 Text classification method and device and related equipment Active CN117236329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311519737.XA CN117236329B (en) 2023-11-15 2023-11-15 Text classification method and device and related equipment


Publications (2)

Publication Number Publication Date
CN117236329A CN117236329A (en) 2023-12-15
CN117236329B true CN117236329B (en) 2024-02-06

Family

ID=89088472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311519737.XA Active CN117236329B (en) 2023-11-15 2023-11-15 Text classification method and device and related equipment

Country Status (1)

Country Link
CN (1) CN117236329B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329836A (en) * 2020-11-02 2021-02-05 成都网安科技发展有限公司 Text classification method, device, server and storage medium based on deep learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543032A (en) * 2018-10-26 2019-03-29 平安科技(深圳)有限公司 File classification method, device, computer equipment and storage medium
CN110781663A (en) * 2019-10-28 2020-02-11 北京金山数字娱乐科技有限公司 Training method and device of text analysis model and text analysis method and device
CN113177121A (en) * 2021-05-20 2021-07-27 中国建设银行股份有限公司 Text topic classification method and device, electronic equipment and storage medium
CN115495579A (en) * 2022-09-20 2022-12-20 号百信息服务有限公司 Method and device for classifying text of 5G communication assistant, electronic equipment and storage medium
CN115510232A (en) * 2022-09-28 2022-12-23 平安科技(深圳)有限公司 Text sentence classification method and classification device, electronic equipment and storage medium
CN116628195A (en) * 2023-04-17 2023-08-22 平安科技(深圳)有限公司 Text classification method, apparatus, electronic device and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Feature Transfer Learning for Text Categorization; Zhao Shichen; Journal of Data Acquisition and Processing; Vol. 32, No. 3; 516-522 *
Text classification with few labeled samples based on PTM latent Dirichlet allocation (基于PTM潜在Dirichlet分配的少量标记样本文本分类); Zhao Li; Qi Xingbin; Li Xuemei; Tian Tao; Application Research of Computers (计算机应用研究); Vol. 32, No. 05; 154-170 *

Also Published As

Publication number Publication date
CN117236329A (en) 2023-12-15

Similar Documents

Publication Publication Date Title
CN110209823B (en) Multi-label text classification method and system
CN106571140B (en) Intelligent electric appliance control method and system based on voice semantics
CN112069310B (en) Text classification method and system based on active learning strategy
Jansen et al. Efficient spoken term discovery using randomized algorithms
CN107066555B (en) On-line theme detection method for professional field
CN112732871B (en) Multi-label classification method for acquiring client intention labels through robot induction
WO2022052505A1 (en) Method and apparatus for extracting sentence main portion on the basis of dependency grammar, and readable storage medium
CN112199501B (en) Scientific and technological information text classification method
CN111914085A (en) Text fine-grained emotion classification method, system, device and storage medium
CN114564964B (en) Unknown intention detection method based on k nearest neighbor contrast learning
CN110597082A (en) Intelligent household equipment control method and device, computer equipment and storage medium
WO2020168754A1 (en) Prediction model-based performance prediction method and device, and storage medium
CN111667817A (en) Voice recognition method, device, computer system and readable storage medium
CN112446209A (en) Method, equipment and device for setting intention label and storage medium
CN114328939B (en) Natural language processing model construction method based on big data
CN112685374B (en) Log classification method and device and electronic equipment
CN117236329B (en) Text classification method and device and related equipment
CN113821637A (en) Long text classification method and device, computer equipment and readable storage medium
CN110458383B (en) Method and device for realizing demand processing servitization, computer equipment and storage medium
CN111950652A (en) Semi-supervised learning data classification algorithm based on similarity
CN114996466B (en) Method and system for establishing medical standard mapping model and using method
CN108573275B (en) Construction method of online classification micro-service
CN116595170A (en) Medical text classification method based on soft prompt
CN116361316A (en) Semantic engine adaptation method, device, equipment and storage medium
CN112668342B (en) Remote supervision relation extraction noise reduction system based on twin network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant