CN112966110A - Text type identification method and related equipment - Google Patents

Text type identification method and related equipment

Info

Publication number
CN112966110A
CN112966110A
Authority
CN
China
Prior art keywords
text
probability set
classification model
inclusionary
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110286227.7A
Other languages
Chinese (zh)
Inventor
李明凡
周凯捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202110286227.7A
Publication of CN112966110A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a text type identification method and related equipment, applied to an electronic device, wherein the method comprises the following steps: obtaining a training sample; training a preset text classification model to be trained by adopting the training sample to obtain a text classification model; acquiring a text to be classified, and inputting the text to be classified into the text classification model to obtain a category prediction probability set, wherein the category prediction probability set comprises the prediction probabilities that the text to be classified belongs to the preset text categories; and determining, based on the category prediction probability set, a target text category to which the text to be classified belongs among the preset text categories. By adopting the embodiments of the application, text classification efficiency is improved.

Description

Text type identification method and related equipment
Technical Field
The present application relates to the field of electronic technologies, and in particular, to a text type recognition method and a related device.
Background
With the development of the internet, large amounts of text data are generated continuously, so text classification occupies an important position in information processing. Because text data carries a large amount of information, failing to manage and extract that information quickly and effectively causes significant losses for enterprises and for society. Therefore, how to identify texts with an effective and fast method so as to classify them is a key problem that urgently needs to be solved.
Disclosure of Invention
The embodiment of the application provides a text type identification method and related equipment, which are beneficial to quickly and effectively classifying texts.
In a first aspect, an embodiment of the present application provides a text category identification method, where the method includes:
obtaining a training sample;
training a preset text classification model to be trained by adopting the training sample to obtain a text classification model;
acquiring a text to be classified, and inputting the text to be classified into the text classification model to obtain a category prediction probability set, wherein the category prediction probability set comprises the prediction probabilities that the text to be classified belongs to the preset text categories;
and determining a target text category to which the text to be classified belongs in the preset text categories based on the category prediction probability set.
In a second aspect, an embodiment of the present application provides a text type identification apparatus, including:
a first obtaining unit for obtaining a training sample;
the training unit is used for training a preset text classification model to be trained by adopting the training sample to obtain a text classification model;
the second acquisition unit is used for acquiring texts to be classified;
the input unit is used for inputting the text to be classified into the text classification model to obtain a category prediction probability set, and the category prediction probability set comprises the prediction probabilities that the text to be classified belongs to the preset text categories;
and the determining unit is used for determining a target text category to which the text to be classified belongs in the preset text categories based on the category prediction probability set.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the program includes instructions for executing steps in the method according to the first aspect of the embodiment of the present application.
In a fourth aspect, the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, where the computer program causes a computer to perform some or all of the steps described in the method according to the first aspect of the present application.
In a fifth aspect, the present application provides a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, where the computer program is operable to cause a computer to perform some or all of the steps described in the method according to the first aspect of the present application. The computer program product may be a software installation package.
It can be seen that, in the embodiment of the application, the electronic device first obtains a training sample and trains a preset text classification model to be trained with it to obtain a text classification model. It then obtains a text to be classified and inputs it into the text classification model to obtain a category prediction probability set, which comprises the prediction probabilities that the text to be classified belongs to the preset text categories. Finally, based on the category prediction probability set, it determines the target text category to which the text to be classified belongs among the preset text categories. Because the text classification model to be trained is trained first and the trained model is then used to classify texts, the scheme is beneficial to classifying texts quickly and effectively.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a text category identification method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of another electronic device provided in the embodiment of the present application;
fig. 4 is a schematic structural diagram of a text type identification device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
The following are detailed below.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Hereinafter, some terms in the present application are explained to facilitate understanding by those skilled in the art.
The electronic device may include a computing device or other processing device connected to a wireless modem, or the like.
As shown in fig. 1, fig. 1 is a schematic structural diagram of an electronic device provided in an embodiment of the present application. The electronic device includes at least one of: a processor, a memory, a signal processor, a transceiver, a random access memory (RAM), sensors, and so on. The memory, the signal processor, the RAM, and the sensors are connected to the processor, and the transceiver is connected to the signal processor.
Wherein the sensor comprises at least one of: light-sensitive sensors, gyroscopes, infrared proximity sensors, fingerprint sensors, pressure sensors, etc. Among them, the light sensor, also called an ambient light sensor, is used to detect the ambient light brightness. The light sensor may include a light sensitive element and an analog to digital converter. The photosensitive element is used for converting collected optical signals into electric signals, and the analog-to-digital converter is used for converting the electric signals into digital signals. Optionally, the light sensor may further include a signal amplifier, and the signal amplifier may amplify the electrical signal converted by the photosensitive element and output the amplified electrical signal to the analog-to-digital converter. The photosensitive element may include at least one of a photodiode, a phototransistor, a photoresistor, and a silicon photocell.
The processor is the control center of the electronic device. It connects all parts of the entire electronic device using various interfaces and lines, and performs the various functions of the electronic device and processes data by running or executing the software programs and/or modules stored in the memory and calling the data stored in the memory, thereby monitoring the electronic device as a whole.
The processor may integrate an application processor and a modem processor, wherein the application processor mainly handles operating systems, user interfaces, application programs, and the like, and the modem processor mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor.
The memory is used for storing software programs and/or modules, and the processor executes the various functional applications and data processing of the electronic device by running the software programs and/or modules stored in the memory. The memory mainly comprises a program storage area and a data storage area, wherein the program storage area can store an operating system, a software program required by at least one function, and the like, and the data storage area can store data created according to the use of the electronic device, and the like. Further, the memory may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The following describes embodiments of the present application in detail.
As shown in fig. 2, which is a flowchart of a text category identification method provided in an embodiment of the present application, the method is applied to the above electronic device and specifically includes the following steps:
step 201: training samples are obtained.
The training samples may include inclusionary training samples and exclusionary training samples, where an inclusionary training sample is a sample that belongs to a certain sample category, and an exclusionary training sample is a sample that does not belong to a certain sample category.
Step 202: training the preset text classification model to be trained by adopting the training samples to obtain a text classification model.
The text classification model to be trained may include an exclusionary text classification model to be trained and an inclusionary text classification model to be trained; the exclusionary text classification model to be trained corresponds to the exclusionary training samples, and the inclusionary text classification model to be trained corresponds to the inclusionary training samples.

The text classification model comprises an exclusionary text classification model and an inclusionary text classification model; the exclusionary text classification model corresponds to the exclusionary text classification model to be trained, and the inclusionary text classification model corresponds to the inclusionary text classification model to be trained.
Step 203: acquiring a text to be classified, and inputting the text to be classified into the text classification model to obtain a category prediction probability set, wherein the category prediction probability set comprises the prediction probabilities that the text to be classified belongs to the preset text categories.
The text to be classified may be an inclusionary text or an exclusionary text.
The preset text categories comprise at least one text category, and the category prediction probability set comprises the probability that the text to be classified belongs to each preset text category.
The class prediction probabilities of the texts to be classified on different text classes may be the same or different.
Where the text category may be news topics, spam, user comments, and the like.
Step 204: determining, based on the category prediction probability set, a target text category to which the text to be classified belongs among the preset text categories.

Specifically, the text category corresponding to the maximum prediction probability in the category prediction probability set is determined as the target text category, as in the sketch below.
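As a minimal sketch (not part of the patent disclosure; the category names, probability values, and variable names are illustrative assumptions), this argmax-style selection can be expressed as follows:

```python
# Minimal illustrative sketch of step 204: selecting the target text
# category as the category with the maximum prediction probability.
# The category names and probability values are hypothetical examples.
category_prediction_probability_set = {
    "news topic": 0.62,
    "spam": 0.13,
    "user comment": 0.25,
}

# The target text category is the key with the largest predicted probability.
target_text_category = max(
    category_prediction_probability_set,
    key=category_prediction_probability_set.get,
)
print(target_text_category)  # -> news topic
```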
It can be seen that, in the embodiment of the application, the electronic device first obtains a training sample and trains a preset text classification model to be trained with it to obtain a text classification model. It then obtains a text to be classified and inputs it into the text classification model to obtain a category prediction probability set, which comprises the prediction probabilities that the text to be classified belongs to the preset text categories. Finally, based on the category prediction probability set, it determines the target text category to which the text to be classified belongs among the preset text categories. Because the text classification model to be trained is trained first and the trained model is then used to classify texts, the scheme is beneficial to classifying texts quickly and effectively.
In a possible implementation manner, the training a preset text classification model to be trained by using the training sample to obtain a text classification model includes:
dividing the training samples into inclusionary training samples and exclusionary training samples based on sample identifications;
training an exclusionary text classification model to be trained based on the exclusionary training sample to obtain an exclusionary text classification model;
and training the inclusionary text classification model to be trained based on the inclusionary training sample to obtain the inclusionary text classification model.
The sample identifications of all inclusionary training samples are the same, the sample identifications of all exclusionary training samples are the same, and the sample identifications of the inclusionary training samples differ from those of the exclusionary training samples.

Wherein, there is at least one inclusionary training sample and at least one exclusionary training sample.
The number of inclusionary training samples may be greater than or equal to the number of exclusionary training samples, or the number of inclusionary training samples may be less than the number of exclusionary training samples.
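A minimal sketch of this division step is given below, assuming each training sample carries a sample identification field; the field names and identification values are illustrative assumptions, not definitions from the patent.

```python
# Sketch: dividing training samples into inclusionary and exclusionary
# subsets based on a per-sample identification. "sample_id" and its
# values are assumed names for illustration only.
INCLUSIONARY_ID = 1
EXCLUSIONARY_ID = 0

training_samples = [
    {"text": "a sample that belongs to some text category", "sample_id": INCLUSIONARY_ID},
    {"text": "a sample that belongs to no text category", "sample_id": EXCLUSIONARY_ID},
]

# All inclusionary samples share one identification, all exclusionary
# samples share another, and the two identifications differ.
inclusionary_samples = [s for s in training_samples if s["sample_id"] == INCLUSIONARY_ID]
exclusionary_samples = [s for s in training_samples if s["sample_id"] == EXCLUSIONARY_ID]
```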
It can be seen that, in the embodiment of the application, the inclusionary training samples are used to train the inclusionary text classification model to be trained, and the exclusionary training samples are used to train the exclusionary text classification model to be trained, so that a problematic text to be classified can be classified as exclusionary text, which ensures the accuracy of text classification.
In a possible implementation manner, the training the exclusionary text classification model to be trained based on the exclusionary training sample to obtain the exclusionary text classification model includes:
inputting the exclusionary training samples into a first preset classification model to obtain an inclusionary prediction probability set and an exclusionary prediction probability set, wherein the inclusionary prediction probability set comprises the probability that the exclusionary training samples belong to the preset text category, and the exclusionary prediction probability set comprises the probability that the exclusionary training samples are excluded from the preset text category;
and training the exclusionary text classification model to be trained based on the inclusionary prediction probability set and the exclusionary prediction probability set to obtain the exclusionary text classification model.
The preset text categories comprise at least one text category, and the number of exclusionary training samples is greater than or equal to one. The first preset classification model performs normalization processing on the exclusionary training samples, so that an inclusionary prediction probability set and an exclusionary prediction probability set corresponding to each exclusionary training sample can be obtained; the inclusionary prediction probability set comprises the probability that the corresponding exclusionary training sample belongs to each text category, and the exclusionary prediction probability set comprises the probability that the corresponding exclusionary training sample is excluded from each text category.
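The patent does not disclose the internal structure of the first preset classification model. The sketch below is only one hedged possibility, in which two independent per-category score heads are normalized with a sigmoid; this matches the worked example further down, where the inclusion and exclusion probabilities for a category need not sum to one. All names in the sketch are assumptions.

```python
import math

def sigmoid(x: float) -> float:
    """Normalize a raw score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def first_preset_classification_model(inclusion_scores, exclusion_scores):
    """Hypothetical sketch: for each preset text category, produce the
    probability that an exclusionary training sample belongs to the
    category (inclusionary prediction probability set) and the
    probability that it is excluded from the category (exclusionary
    prediction probability set). Deriving the two sets from independent
    per-category scores is an assumption, not the patent's definition."""
    inclusionary_prediction_probability_set = [sigmoid(s) for s in inclusion_scores]
    exclusionary_prediction_probability_set = [sigmoid(s) for s in exclusion_scores]
    return inclusionary_prediction_probability_set, exclusionary_prediction_probability_set
```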
In a possible implementation manner, the training the exclusionary text classification model to be trained based on the inclusionary prediction probability set and the exclusionary prediction probability set to obtain the exclusionary text classification model includes:
determining, among the inclusionary prediction probability sets, the inclusionary prediction probability set corresponding to the exclusionary training sample, and determining, among the exclusionary prediction probability sets, the exclusionary prediction probability set corresponding to the exclusionary training sample;
obtaining a first loss value based on a first loss function, the inclusionary prediction probability set corresponding to the exclusionary training sample, and the exclusionary prediction probability set corresponding to the exclusionary training sample;
and under the condition that the first loss value is greater than or equal to a first preset loss value, adjusting a first preset model parameter based on the first loss value and a back propagation algorithm until the first loss value is smaller than the first preset loss value, and obtaining the exclusionary text classification model.
For example, an exclusionary training sample A is randomly selected from the exclusionary training samples, and the first loss value is determined according to the inclusionary prediction probability set corresponding to exclusionary training sample A, the exclusionary prediction probability set corresponding to exclusionary training sample A, and the first loss function. If the first loss value is greater than or equal to the first preset loss value, the first preset model parameters are adjusted, and one exclusionary training sample other than exclusionary training sample A is randomly selected from the exclusionary training samples to serve as the new exclusionary training sample A; this process is repeated until the first loss value is smaller than the first preset loss value.
Wherein the first loss function is

L_1 = -\sum_i p_i \log(q_i)

where p_i represents the probability of the training sample belonging to the i-th text category, and q_i represents the probability of the training sample being excluded from the i-th text category.
For example, if there are 3 text categories (text category A, text category B, and text category C), the probability that training sample A belongs to text category A is 0.8, the probability that it belongs to text category B is 0.7, and the probability that it belongs to text category C is 0.5, while the probability that training sample A is excluded from text category A is 0.8, from text category B is 0.6, and from text category C is 0.3, then the loss value corresponding to training sample A is -(0.8·log(0.8) + 0.7·log(0.6) + 0.5·log(0.3)).
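The arithmetic of this example can be checked directly; the short sketch below reproduces only the numbers given above and assumes the natural logarithm.

```python
import math

# p[i]: probability that training sample A belongs to text category i.
# q[i]: probability that training sample A is excluded from text category i.
p = [0.8, 0.7, 0.5]
q = [0.8, 0.6, 0.3]

# First loss function: L1 = -sum_i p_i * log(q_i) (natural log assumed).
first_loss_value = -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q))
print(round(first_loss_value, 4))  # 1.1381
```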
The adjustment of the current first preset model parameter is performed on the basis of the first preset model parameter obtained by the last adjustment.
The first preset model parameter may be a positive number or a negative number.
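Putting the pieces above together, the training loop for the exclusionary text classification model can be sketched as follows. The model interface (forward, first_loss, backward_update) and the sampling helpers are assumptions for illustration; the patent itself only fixes the loss function, the use of a back propagation algorithm, and the stopping condition.

```python
import random

def train_exclusionary_model(model, exclusionary_samples, first_preset_loss_value):
    """Hedged sketch of the loop described above; not the patent's exact
    procedure. `model` is assumed to expose:
      forward(sample) -> (inclusionary_set, exclusionary_set)
      first_loss(p_set, q_set) -> float  (the L1 above)
      backward_update(loss)  (back-propagation parameter adjustment)
    """
    sample_a = random.choice(exclusionary_samples)
    while True:
        p_set, q_set = model.forward(sample_a)
        first_loss_value = model.first_loss(p_set, q_set)
        if first_loss_value < first_preset_loss_value:
            # Stopping condition met: the exclusionary text
            # classification model is obtained.
            return model
        # Loss still too large: adjust the first preset model parameters
        # and randomly pick a different exclusionary training sample A.
        model.backward_update(first_loss_value)
        others = [s for s in exclusionary_samples if s is not sample_a]
        sample_a = random.choice(others or exclusionary_samples)
```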
It can be seen that, in the embodiment of the present application, the exclusionary text classification model is obtained under the condition that the first loss value is smaller than the first preset loss value, which is beneficial to improving the accuracy of text category determination.
In a possible implementation manner, the training the inclusionary text classification model to be trained based on the inclusionary training sample to obtain an inclusionary text classification model includes:
inputting the inclusionary training sample into a second preset classification model to obtain an inclusionary prediction probability set, wherein the inclusionary prediction probability set comprises the probability that the inclusionary training sample belongs to the preset text categories;
determining a labeling probability set based on the labeling of the inclusionary training sample, wherein the labeling probability set comprises the probability that the inclusionary training sample is labeled as the preset text categories;
and training the inclusionary text classification model to be trained based on the inclusionary prediction probability set and the labeling probability set to obtain the inclusionary text classification model.
The preset text categories comprise at least one text category, and the number of inclusionary training samples is greater than or equal to one. The second preset classification model performs normalization processing on the inclusionary training samples, so that an inclusionary prediction probability set and a labeling probability set corresponding to each inclusionary training sample can be obtained; the inclusionary prediction probability set comprises the probability that the corresponding inclusionary training sample belongs to each text category, and the labeling probability set comprises the probability that the corresponding inclusionary training sample is labeled as each text category.
Wherein different text categories correspond to different labels.
Which text category an inclusionary training sample is labeled as is determined through the labeling of the training text: if the inclusionary training sample is labeled as a certain text category, the labeling probability corresponding to that text category is 1, and the labeling probabilities corresponding to the other text categories are 0.
For example, if there are 3 text categories (text category A, text category B, and text category C) and the training sample is labeled as belonging to text category A, the labeling probability set corresponding to the training sample is (1, 0, 0).
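A short sketch of building such a one-hot labeling probability set from an annotation (category names and ordering are assumptions):

```python
# Sketch: one-hot labeling probability set for the example above, with
# preset text categories (A, B, C) and a sample labeled as category A.
preset_text_categories = ["A", "B", "C"]
annotation = "A"

labeling_probability_set = [
    1.0 if category == annotation else 0.0 for category in preset_text_categories
]
print(labeling_probability_set)  # [1.0, 0.0, 0.0]
```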
In a possible implementation manner, the training the inclusionary text classification model to be trained based on the inclusionary prediction probability set and the labeling probability set to obtain the inclusionary text classification model includes:
determining, among the inclusionary prediction probability sets, the inclusionary prediction probability set corresponding to the inclusionary training sample, and determining, among the labeling probability sets, the labeling probability set corresponding to the inclusionary training sample;
obtaining a second loss value based on a second loss function, the inclusionary prediction probability set corresponding to the inclusionary training sample, and the labeling probability set corresponding to the inclusionary training sample;
and under the condition that the second loss value is greater than or equal to a second preset loss value, adjusting the second preset model parameters based on the second loss value and a back propagation algorithm until the second loss value is smaller than the second preset loss value, and obtaining the inclusionary text classification model.
For example, an inclusionary training sample B is randomly selected from the inclusionary training samples, and the second loss value is determined based on the inclusionary prediction probability set corresponding to inclusionary training sample B, the labeling probability set corresponding to inclusionary training sample B, and the second loss function. If the second loss value is greater than or equal to the second preset loss value, the second preset model parameters are adjusted, and one inclusionary training sample other than inclusionary training sample B is selected from the inclusionary training samples to serve as the new inclusionary training sample B; this process is repeated until the second loss value is smaller than the second preset loss value.
And adjusting the current second preset model parameter on the basis of the second preset model parameter obtained by the last adjustment.
The second preset model parameter may be a positive number or a negative number.
Wherein the second loss function is

L_2 = -\sum_i \hat{y}_i \log(p_i)

where p_i represents the probability that the training sample belongs to the i-th text category, and \hat{y}_i is the probability that the training sample is labeled as the i-th text category.
For example, if there are 3 text categories (text category A, text category B, and text category C), the probability that training sample A belongs to text category A is 0.8, to text category B is 0.7, and to text category C is 0.5, and training sample A is labeled as text category A, then the loss value corresponding to training sample A is -1·log(0.8).
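Again the arithmetic can be verified directly; as before, the natural logarithm is assumed.

```python
import math

# p[i]: probability that training sample A belongs to text category i.
# y_hat[i]: one-hot labeling probability set (sample labeled category A).
p = [0.8, 0.7, 0.5]
y_hat = [1.0, 0.0, 0.0]

# Second loss function: L2 = -sum_i y_hat_i * log(p_i).
second_loss_value = -sum(y_i * math.log(p_i) for y_i, p_i in zip(y_hat, p))
print(round(second_loss_value, 4))  # 0.2231, i.e. -1*log(0.8)
```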
It can be seen that, in the embodiment of the present application, the inclusionary text classification model is obtained under the condition that the second loss value is smaller than the second preset loss value, which is beneficial to improving the accuracy of text category determination.
In a possible implementation manner, after determining the target text category to which the text to be classified belongs based on the category prediction probability set, the method further includes:
and outputting the target text category and the prediction probability corresponding to the target text category.
Wherein, the target text category and the corresponding prediction probability may be output in association with each other.
It can be seen that, in the embodiment of the present application, outputting the target text category and its prediction probability is beneficial to understanding the accuracy of the text classification.
Referring to fig. 3, in accordance with the embodiment shown in fig. 2, fig. 3 is a schematic structural diagram of another electronic device provided in an embodiment of the present application, as shown, the electronic device includes a processor, a memory, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for performing the following steps:
obtaining a training sample;
training a preset text classification model to be trained by adopting the training sample to obtain a text classification model;
acquiring a text to be classified, and inputting the text to be classified into the text classification model to obtain a category prediction probability set, wherein the category prediction probability set comprises the prediction probabilities that the text to be classified belongs to the preset text categories;
and determining a target text category to which the text to be classified belongs in the preset text categories based on the category prediction probability set.
In an implementation manner of the present application, in terms of training a preset text classification model to be trained by using the training samples to obtain a text classification model, the above-mentioned program is specifically used for executing instructions of the following steps:
dividing the training samples into inclusionary training samples and exclusionary training samples based on sample identifications;
training an exclusionary text classification model to be trained based on the exclusionary training sample to obtain an exclusionary text classification model;
and training the inclusionary text classification model to be trained based on the inclusionary training sample to obtain the inclusionary text classification model.
In an implementation manner of the present application, in terms of training an exclusionary text classification model to be trained based on the exclusionary training samples to obtain the exclusionary text classification model, the above-mentioned program is specifically configured to execute instructions of the following steps:
inputting the exclusionary training samples into a first preset classification model to obtain an inclusionary prediction probability set and an exclusionary prediction probability set, wherein the inclusionary prediction probability set comprises the probability that the exclusionary training samples belong to the preset text category, and the exclusionary prediction probability set comprises the probability that the exclusionary training samples are excluded from the preset text category;
and training the exclusionary text classification model to be trained based on the inclusionary prediction probability set and the exclusionary prediction probability set to obtain the exclusionary text classification model.
In an implementation manner of the present application, in terms of training the exclusionary text classification model to be trained based on the inclusionary prediction probability set and the exclusionary prediction probability set to obtain the exclusionary text classification model, the above-mentioned program is specifically configured to execute instructions of the following steps:
determining, among the inclusionary prediction probability sets, the inclusionary prediction probability set corresponding to the exclusionary training sample, and determining, among the exclusionary prediction probability sets, the exclusionary prediction probability set corresponding to the exclusionary training sample;
obtaining a first loss value based on a first loss function, the inclusionary prediction probability set corresponding to the exclusionary training sample, and the exclusionary prediction probability set corresponding to the exclusionary training sample;
and under the condition that the first loss value is greater than or equal to a first preset loss value, adjusting a first preset model parameter based on the first loss value and a back propagation algorithm until the first loss value is smaller than the first preset loss value, and obtaining the exclusionary text classification model.
In an implementation manner of the present application, in terms of training an inclusionary text classification model to be trained based on the inclusionary training sample to obtain an inclusionary text classification model, the above-mentioned program is specifically used for executing instructions of the following steps:
inputting the inclusionary training sample into a second preset classification model to obtain an inclusionary prediction probability set, wherein the inclusionary prediction probability set comprises the probability that the inclusionary training sample belongs to the preset text categories;
determining a labeling probability set based on the labeling of the inclusionary training sample, wherein the labeling probability set comprises the probability that the inclusionary training sample is labeled as the preset text categories;
and training the inclusionary text classification model to be trained based on the inclusionary prediction probability set and the labeling probability set to obtain the inclusionary text classification model.
In an implementation manner of the present application, in terms of training the inclusionary text classification model to be trained based on the inclusionary prediction probability set and the labeling probability set to obtain the inclusionary text classification model, the above-mentioned program is specifically configured to execute instructions of the following steps:
determining, among the inclusionary prediction probability sets, the inclusionary prediction probability set corresponding to the inclusionary training sample, and determining, among the labeling probability sets, the labeling probability set corresponding to the inclusionary training sample;
obtaining a second loss value based on a second loss function, the inclusionary prediction probability set corresponding to the inclusionary training sample, and the labeling probability set corresponding to the inclusionary training sample;
and under the condition that the second loss value is greater than or equal to a second preset loss value, adjusting the second preset model parameters based on the second loss value and a back propagation algorithm until the second loss value is smaller than the second preset loss value, and obtaining the inclusionary text classification model.
In an implementation manner of the present application, after determining a target text category to which the text to be classified belongs in the preset text categories, the program is further specifically configured to execute instructions of the following steps:
and outputting the target text category and the prediction probability corresponding to the target text category.
It should be noted that, for the specific implementation process of the present embodiment, reference may be made to the specific implementation process described in the above method embodiment, and a description thereof is omitted here.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a text type recognition apparatus according to an embodiment of the present application, where the apparatus includes:
a first obtaining unit 401, configured to obtain a training sample;
a training unit 402, configured to train a preset text classification model to be trained by using the training sample, so as to obtain a text classification model;
a second obtaining unit 403, configured to obtain a text to be classified;
an input unit 404, configured to input the text to be classified into the text classification model to obtain a category prediction probability set, where the category prediction probability set includes the prediction probabilities that the text to be classified belongs to the preset text categories;
a determining unit 405, configured to determine, based on the category prediction probability set, a target text category to which the text to be classified belongs in the preset text categories.
In an implementation manner of the present application, in terms of training a preset text classification model to be trained by using the training samples to obtain a text classification model, the training unit 402 is configured to execute instructions of the following steps:
dividing the training samples into inclusionary training samples and exclusionary training samples based on sample identifications;
training an exclusionary text classification model to be trained based on the exclusionary training sample to obtain an exclusionary text classification model;
and training the inclusionary text classification model to be trained based on the inclusionary training sample to obtain the inclusionary text classification model.
In an implementation manner of the present application, in terms of training an exclusionary text classification model to be trained based on the exclusionary training samples to obtain the exclusionary text classification model, the training unit 402 is configured to execute the following instructions:
inputting the exclusionary training samples into a first preset classification model to obtain an inclusionary prediction probability set and an exclusionary prediction probability set, wherein the inclusionary prediction probability set comprises the probability that the exclusionary training samples belong to the preset text category, and the exclusionary prediction probability set comprises the probability that the exclusionary training samples are excluded from the preset text category;
and training the exclusionary text classification model to be trained based on the inclusionary prediction probability set and the exclusionary prediction probability set to obtain the exclusionary text classification model.
In an implementation manner of the present application, in terms of training the exclusionary text classification model to be trained based on the inclusionary prediction probability set and the exclusionary prediction probability set to obtain the exclusionary text classification model, the training unit 402 is configured to execute the following steps:
determining, among the inclusionary prediction probability sets, the inclusionary prediction probability set corresponding to the exclusionary training sample, and determining, among the exclusionary prediction probability sets, the exclusionary prediction probability set corresponding to the exclusionary training sample;
obtaining a first loss value based on a first loss function, the inclusionary prediction probability set corresponding to the exclusionary training sample, and the exclusionary prediction probability set corresponding to the exclusionary training sample;
and under the condition that the first loss value is greater than or equal to a first preset loss value, adjusting a first preset model parameter based on the first loss value and a back propagation algorithm until the first loss value is smaller than the first preset loss value, and obtaining the exclusionary text classification model.
In an implementation manner of the present application, in terms of training a to-be-trained inclusionary text classification model based on the inclusionary training sample to obtain an inclusionary text classification model, the training unit 402 is configured to execute instructions of the following steps:
inputting the inclusionary training sample into a second preset classification model to obtain an inclusionary prediction probability set, wherein the inclusionary prediction probability set comprises the probability that the inclusionary training sample belongs to the preset text categories;
determining a labeling probability set based on the labeling of the inclusionary training sample, wherein the labeling probability set comprises the probability that the inclusionary training sample is labeled as the preset text categories;
and training the inclusionary text classification model to be trained based on the inclusionary prediction probability set and the labeling probability set to obtain the inclusionary text classification model.
In an implementation manner of the present application, in terms of training the inclusionary text classification model to be trained based on the inclusionary prediction probability set and the labeling probability set to obtain the inclusionary text classification model, the training unit 402 is configured to execute the following steps:
determining, among the inclusionary prediction probability sets, the inclusionary prediction probability set corresponding to the inclusionary training sample, and determining, among the labeling probability sets, the labeling probability set corresponding to the inclusionary training sample;
obtaining a second loss value based on a second loss function, the inclusionary prediction probability set corresponding to the inclusionary training sample, and the labeling probability set corresponding to the inclusionary training sample;
and under the condition that the second loss value is greater than or equal to a second preset loss value, adjusting the second preset model parameters based on the second loss value and a back propagation algorithm until the second loss value is smaller than the second preset loss value, and obtaining the inclusionary text classification model.
In a possible implementation, the text category identification apparatus further comprises an output unit 406.
In an implementation manner of the present application, after determining a target text category to which the text to be classified belongs in the preset text categories, the output unit 406 is configured to execute the following steps:
and outputting the target text category and the prediction probability corresponding to the target text category.
It should be noted that the first obtaining unit 401, the training unit 402, the second obtaining unit 403, the input unit 404, the determining unit 405, and the output unit 406 may be implemented by a processor.
Embodiments of the present application further provide a computer storage medium, where the computer storage medium stores a computer program, and the computer program is executed by a processor to implement part or all of the steps of any one of the text category identification methods as described in the above method embodiments.
Embodiments of the present application also provide a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform some or all of the steps described for the electronic device in the above method embodiments. The computer program product may be a software installation package.
The steps of a method or algorithm described in the embodiments of the present application may be implemented in hardware, or may be implemented by a processor executing software instructions. The software instructions may be composed of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in an access network device, a target network device, or a core network device. Of course, the processor and the storage medium may also reside as discrete components in an access network device, a target network device, or a core network device.
Those skilled in the art will appreciate that, in one or more of the examples described above, the functionality described in the embodiments of the present application may be implemented, in whole or in part, by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired connection (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or a wireless connection (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., Digital Video Disk (DVD)), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the embodiments of the present application in further detail, and it should be understood that the above-mentioned embodiments are only specific embodiments of the present application, and are not intended to limit the scope of the embodiments of the present application, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the embodiments of the present application should be included in the scope of the embodiments of the present application.

Claims (10)

1. A text category identification method is applied to an electronic device, and comprises the following steps:
obtaining a training sample;
training a preset text classification model to be trained by adopting the training sample to obtain a text classification model;
acquiring a text to be classified, and inputting the text to be classified into the text classification model to obtain a category prediction probability set, wherein the category prediction probability set comprises the prediction probabilities that the text to be classified belongs to the preset text categories;
and determining a target text category to which the text to be classified belongs in the preset text categories based on the category prediction probability set.
2. The method according to claim 1, wherein the training a preset text classification model to be trained by using the training samples to obtain a text classification model comprises:
dividing the training samples into inclusionary training samples and exclusionary training samples based on sample identifications;
training an exclusionary text classification model to be trained based on the exclusionary training sample to obtain an exclusionary text classification model;
and training the inclusionary text classification model to be trained based on the inclusionary training sample to obtain the inclusionary text classification model.
3. The method of claim 2, wherein training the exclusionary text classification model to be trained based on the exclusionary training samples to obtain an exclusionary text classification model comprises:
inputting the exclusionary training samples into a first preset classification model to obtain an inclusionary prediction probability set and an exclusionary prediction probability set, wherein the inclusionary prediction probability set comprises the probability that the exclusionary training samples belong to the preset text category, and the exclusionary prediction probability set comprises the probability that the exclusionary training samples are excluded from the preset text category;
and training the exclusionary text classification model to be trained based on the inclusionary prediction probability set and the exclusionary prediction probability set to obtain the exclusionary text classification model.
4. The method according to claim 3, wherein the training the exclusionary text classification model to be trained based on the inclusionary prediction probability set and the exclusionary prediction probability set to obtain the exclusionary text classification model comprises:
determining, among the inclusionary prediction probability sets, the inclusionary prediction probability set corresponding to the exclusionary training sample, and determining, among the exclusionary prediction probability sets, the exclusionary prediction probability set corresponding to the exclusionary training sample;
obtaining a first loss value based on a first loss function, the inclusionary prediction probability set corresponding to the exclusionary training sample, and the exclusionary prediction probability set corresponding to the exclusionary training sample;
and under the condition that the first loss value is greater than or equal to a first preset loss value, adjusting a first preset model parameter based on the first loss value and a back propagation algorithm until the first loss value is smaller than the first preset loss value, and obtaining the exclusionary text classification model.
5. The method of claim 2, wherein training the inclusionary text classification model to be trained based on the inclusionary training samples to obtain an inclusionary text classification model comprises:
inputting the inclusionary training sample into a second preset classification model to obtain an inclusionary prediction probability set, wherein the inclusionary prediction probability set comprises the probability that the inclusionary training sample belongs to the preset text categories;
determining a labeling probability set based on the labeling of the inclusionary training sample, wherein the labeling probability set comprises the probability that the inclusionary training sample is labeled as the preset text categories;
and training the inclusionary text classification model to be trained based on the inclusionary prediction probability set and the labeling probability set to obtain the inclusionary text classification model.
6. The method of claim 5, wherein the training the inclusionary text classification model to be trained based on the inclusionary prediction probability set and the labeling probability set to obtain the inclusionary text classification model comprises:
determining, among the inclusionary prediction probability sets, the inclusionary prediction probability set corresponding to the inclusionary training sample, and determining, among the labeling probability sets, the labeling probability set corresponding to the inclusionary training sample;
obtaining a second loss value based on a second loss function, the inclusionary prediction probability set corresponding to the inclusionary training sample, and the labeling probability set corresponding to the inclusionary training sample;
and under the condition that the second loss value is greater than or equal to a second preset loss value, adjusting the second preset model parameters based on the second loss value and a back propagation algorithm until the second loss value is smaller than the second preset loss value, and obtaining the inclusionary text classification model.
7. The method according to claim 4 or 6, wherein after determining a target text category to which the text to be classified belongs from the preset text categories, the method further comprises:
and outputting the target text category and the prediction probability corresponding to the target text category.
8. A text category identification apparatus, characterized in that the apparatus comprises:
a first obtaining unit for obtaining a training sample;
the training unit is used for training a preset text classification model to be trained by adopting the training sample to obtain a text classification model;
the second acquisition unit is used for acquiring texts to be classified;
the input unit is used for inputting the text to be classified into the text classification model to obtain a category prediction probability set, and the category prediction probability set comprises the prediction probabilities that the text to be classified belongs to the preset text categories;
and the determining unit is used for determining a target text category to which the text to be classified belongs in the preset text categories based on the category prediction probability set.
9. An electronic device, comprising a processor, memory, and one or more programs stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps in the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the steps in the method according to any one of claims 1-7.
CN202110286227.7A 2021-03-17 2021-03-17 Text type identification method and related equipment Pending CN112966110A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110286227.7A CN112966110A (en) 2021-03-17 2021-03-17 Text type identification method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110286227.7A CN112966110A (en) 2021-03-17 2021-03-17 Text type identification method and related equipment

Publications (1)

Publication Number Publication Date
CN112966110A 2021-06-15

Family

ID=76279029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110286227.7A Pending CN112966110A (en) 2021-03-17 2021-03-17 Text type identification method and related equipment

Country Status (1)

Country Link
CN (1) CN112966110A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170323202A1 (en) * 2016-05-06 2017-11-09 Fujitsu Limited Recognition apparatus based on deep neural network, training apparatus and methods thereof
WO2019019860A1 (en) * 2017-07-24 2019-01-31 华为技术有限公司 Method and apparatus for training classification model
CN110059647A (en) * 2019-04-23 2019-07-26 杭州智趣智能信息技术有限公司 A kind of file classification method, system and associated component
CN111814810A (en) * 2020-08-11 2020-10-23 Oppo广东移动通信有限公司 Image recognition method and device, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113806542A (en) * 2021-09-18 2021-12-17 上海幻电信息科技有限公司 Text analysis method and system
CN113806542B (en) * 2021-09-18 2024-05-17 上海幻电信息科技有限公司 Text analysis method and system

Similar Documents

Publication Publication Date Title
CN112041815B (en) Malware detection
CN109872162B (en) Wind control classification and identification method and system for processing user complaint information
US20120136812A1 (en) Method and system for machine-learning based optimization and customization of document similarities calculation
CN111814923B (en) Image clustering method, system, device and medium
US20170185913A1 (en) System and method for comparing training data with test data
WO2017173093A1 (en) Method and device for identifying spam mail
CN109766496B (en) Content risk identification method, system, device and medium
WO2023272850A1 (en) Decision tree-based product matching method, apparatus and device, and storage medium
CN115221516B (en) Malicious application program identification method and device, storage medium and electronic equipment
CN112966102A (en) Classification model construction and text sentence classification method, equipment and storage medium
CN115935344A (en) Abnormal equipment identification method and device and electronic equipment
CN112181835A (en) Automatic testing method and device, computer equipment and storage medium
CN115080972A (en) Method and device for detecting abnormal access of interface of electric mobile terminal
CN112966110A (en) Text type identification method and related equipment
CN113051911B (en) Method, apparatus, device, medium and program product for extracting sensitive words
CN113918949A (en) Recognition method of fraud APP based on multi-mode fusion
CN108021713B (en) Document clustering method and device
CN111444364B (en) Image detection method and device
CN110855740B (en) Information pushing method and related equipment
CN112669850A (en) Voice quality detection method and device, computer equipment and storage medium
US11934421B2 (en) Unified extraction platform for optimized data extraction and processing
CN111353039A (en) File class detection method and device
CN110727759A (en) Method and device for determining theme of voice information
CN115171136A (en) Method, equipment and storage medium for classifying and identifying content of banking business material
CN111966339B (en) Buried point parameter input method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination