CN114519399A - Text classification method, device, equipment and storage medium based on artificial intelligence - Google Patents

Text classification method, device, equipment and storage medium based on artificial intelligence Download PDF

Info

Publication number
CN114519399A
CN114519399A CN202210161803.XA
Authority
CN
China
Prior art keywords
text
text information
classified
supplementary
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210161803.XA
Other languages
Chinese (zh)
Inventor
陈浩 (Chen Hao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210161803.XA priority Critical patent/CN114519399A/en
Priority to PCT/CN2022/090727 priority patent/WO2023159762A1/en
Publication of CN114519399A publication Critical patent/CN114519399A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/166 Editing, e.g. inserting or deleting
    • G06F 40/186 Templates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to artificial intelligence and provides a text classification method, a text classification device, text classification equipment and a storage medium based on the artificial intelligence. The text classification method based on artificial intelligence comprises the following steps: acquiring text information to be classified and template text information, wherein the template text information comprises at least one supplementary text information which corresponds to preset text categories one by one; for each text information to be classified, generating reconstructed text information corresponding to the preset text categories one by one according to the supplementary text information; calculating a next sentence prediction probability value of the reconstructed text information, wherein the next sentence prediction probability value represents the matching degree between the text information to be classified and the supplementary text information; and determining the predicted text category of the text information to be classified according to the prediction probability value of the next sentence. According to the technical scheme of the embodiment of the invention, the classification result of the input text can be quickly and accurately obtained by the NSP mechanism based on the BERT model without a fine-tuning stage, so that the waste of user time and computing resources is reduced.

Description

Text classification method, device, equipment and storage medium based on artificial intelligence
Technical Field
The embodiments of the present invention relate to, but are not limited to, the technical field of artificial intelligence, and in particular, to a text classification method, a text classification apparatus, a computer device, and a computer-readable storage medium based on artificial intelligence.
Background
With the rapid development of computer technology, a huge amount of information resources exist and are continuously produced on the internet, and the resources often exist in the form of unlabeled text data. Since a large amount of real and valuable information is often contained in massive information resources, natural language processing and information mining on text data become one of important research directions. The text data is mined and analyzed, so that the text data is classified, and the method has important significance in many scenes. For example, in a scene of news topic mining, by analyzing huge amounts of news every day and then quickly and accurately classifying the news, the time for manual classification can be greatly saved.
In the related art, a BERT series model is generally applied to classify unlabeled text data. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language representation model. The training of the model is mainly divided into a pre-training stage and a fine-tuning stage: first, the grammar, syntax, semantic relations and the like of text are learned from massive data in the pre-training stage; then, the information learned in the pre-training stage is applied to real text classification data in the fine-tuning stage. This training mode can be compared to transfer learning, and, compared with traditional supervised learning, it can greatly improve the accuracy of the model.
However, in the training process of the BERT series model, if too little training data is available in the fine-tuning stage, the model cannot be sufficiently trained, and thus text classification cannot be accurately performed, resulting in waste of information. In addition, in the training process of the fine-tuning stage, the text data needs to be labeled manually, thereby causing a great deal of waste of time and resources.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
The embodiment of the invention provides a text classification method based on artificial intelligence, a text classification device, computer equipment and a computer readable storage medium, which can quickly and accurately classify texts, thereby reducing the waste of time and computing resources of users.
In a first aspect, an embodiment of the present invention provides a text classification method based on artificial intelligence, including:
acquiring text information to be classified and template text information, wherein the template text information comprises at least one supplementary text information which corresponds to preset text categories one by one;
for each text information to be classified, generating reconstructed text information corresponding to the preset text categories one by one according to the supplementary text information;
calculating a next sentence prediction probability value of the reconstructed text information, wherein the next sentence prediction probability value represents the matching degree between the text information to be classified and the supplementary text information;
and determining the predicted text category of the text information to be classified according to the prediction probability value of the next sentence.
According to some embodiments of the first aspect of the present invention, before the obtaining the text information to be classified and the template text information, the method further includes:
acquiring a preset supplementary text template and at least one preset text category;
and generating supplementary text information according to the supplementary text template for each preset text category.
According to some embodiments of the first aspect of the present invention, after the obtaining the text information to be classified and the template text information, the method further includes:
and removing duplicated and useless characters from the text information to be classified.
According to some embodiments of the first aspect of the present invention, for each of the text information to be classified, generating reconstructed text information corresponding to the preset text category in a one-to-one manner according to the supplementary text information includes:
and combining the text information to be classified with the separators and the supplementary text information in sequence to obtain reconstructed text information.
According to some embodiments of the first aspect of the present invention, the calculating a next sentence prediction probability value of the reconstructed text information comprises:
and for each text information to be classified, inputting the corresponding reconstructed text information into a BERT model, and respectively calculating the next sentence prediction probability value corresponding to the reconstructed text information.
According to some embodiments of the first aspect of the present invention, the determining a predicted text category of the text information to be classified according to the next sentence prediction probability value comprises:
and for each text information to be classified, taking the preset text category corresponding to the reconstructed text information with the maximum prediction probability value of the next sentence as a predicted text category.
According to some embodiments of the first aspect of the present invention, the text category of the text information to be classified may be determined according to the next sentence prediction probability value by the following formula:
ŷ = argmax_{y_j ∈ Y} P(NSP = 0 | x_i, y_j)
where ŷ represents the predicted text category, Y represents the set of preset text categories, x_i represents the i-th text information to be classified, and y_j represents the j-th supplementary text information.
In a second aspect, an embodiment of the present invention further provides a text classification apparatus, including:
the device comprises an acquisition unit, a classification unit and a classification unit, wherein the acquisition unit is used for acquiring text information to be classified and template text information, and the template text information comprises at least one supplementary text information which corresponds to preset text categories one to one;
The reconstruction unit is used for generating reconstructed text information which corresponds to the preset text type one by one according to the supplementary text information for each text information to be classified;
the data processing unit is used for calculating a next sentence prediction probability value of the reconstructed text information, wherein the next sentence prediction probability value represents the matching degree between the text information to be classified and the supplementary text information;
and the prediction unit is used for determining the predicted text category of the text information to be classified according to the next sentence prediction probability value.
According to some embodiments of the second aspect of the present invention, the data processing unit is further configured to input the reconstructed text information corresponding to each text information to be classified into a BERT model, and respectively calculate a next sentence prediction probability value corresponding to the reconstructed text information.
In a third aspect, an embodiment of the present invention further provides a computer device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing the computer program implementing the method of text classification as described in the first aspect above.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions for performing the text classification method according to the first aspect.
The invention provides a text classification method based on artificial intelligence, a text classification device, computer equipment and a computer readable storage medium, wherein the text classification method based on artificial intelligence comprises the following steps: acquiring text information to be classified and template text information, wherein the template text information comprises at least one supplementary text information corresponding to a preset text category one to one; for each text information to be classified, generating reconstructed text information corresponding to the preset text categories one by one according to the supplementary text information; calculating a next sentence prediction probability value of the reconstructed text information, wherein the next sentence prediction probability value represents the matching degree between the text information to be classified and the supplementary text information; and determining the predicted text type of the text information to be classified according to the predicted probability value of the next sentence. According to the technical scheme provided by the embodiment of the invention, the template text information corresponding to the preset text category is obtained, the reconstructed text information is respectively generated by the text information to be classified and each supplementary text information in the template text information, and the next sentence prediction probability value of the reconstructed text information is calculated, so that the predicted text category of the text to be classified can be determined according to the matching degree between the text information to be classified and the supplementary text information, and the text classification is realized. 
Due to the technical scheme of the embodiment of the invention, the classification result of the input text can be quickly and accurately obtained based on the NSP mechanism of the BERT model without passing through a fine-tuning stage or using tagged text data as a training sample, so that the waste of time and computing resources of a user can be greatly reduced, and the working efficiency of text classification is improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
Additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic diagram of a system architecture platform for performing an artificial intelligence based text classification method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps of a method for artificial intelligence based text classification according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating steps of a method for artificial intelligence based text classification according to another embodiment of the present invention;
FIG. 4 is a flowchart illustrating steps of a method for artificial intelligence based text classification according to another embodiment of the present invention;
FIG. 5 is a flowchart illustrating steps of a method for artificial intelligence based text classification according to another embodiment of the present invention;
FIG. 6 is a flowchart illustrating steps of a method for artificial intelligence based text classification according to another embodiment of the present invention;
FIG. 7 is a schematic diagram of a BERT model of an artificial intelligence based text classification method according to an embodiment of the present invention;
FIG. 8 is a flowchart illustrating steps of a method for artificial intelligence based text classification according to another embodiment of the present invention;
fig. 9 is a schematic structural diagram of a text classification apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms "first," "second," and the like in the description, in the claims, or in the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
With the rapid development of computer technology, a huge amount of information resources exist and are continuously produced on the internet, and the resources often exist in the form of unlabeled text data. Since a large amount of real and valuable information is often contained in massive information resources, natural language processing and information mining on text data become one of important research directions. The text data is mined and analyzed, so that the text data is classified, and the method has important significance in many scenes. For example, in a scene of news topic mining, by analyzing huge amounts of news every day and then quickly and accurately classifying the news, the time for manual classification can be greatly saved.
In the related art, the BERT series model is generally applied to classify unlabeled text data. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language representation model. The training of the model is mainly divided into a pre-training stage and a fine-tuning stage. First, in the pre-training stage, the model is trained on massive data through two tasks, MLM (Masked Language Modeling) and NSP (Next Sentence Prediction); second, in the fine-tuning stage, real text classification data is input into the BERT model, and the text category is then predicted using the [CLS] token in combination with a fully connected layer network. A model trained in this way can be compared to transfer learning: the grammar, syntax, semantic relations and the like of text are first learned from massive data in the pre-training stage, and the information learned there is then applied to real text classification data in the fine-tuning stage, so that, compared with traditional supervised learning, this method can greatly improve the accuracy of the model.
However, in the training process of the BERT series model, if too little training data is available in the fine-tuning stage, the model cannot be sufficiently trained, and thus text classification cannot be accurately performed, resulting in waste of information. In addition, in the training process of the fine-tuning stage, the text data needs to be labeled manually, thereby causing a great deal of waste of time and resources.
Based on the above situation, the present invention provides an artificial intelligence based text classification method, a text classification apparatus, a computer device, and a computer-readable storage medium, wherein the artificial intelligence based text classification method includes: acquiring text information to be classified and template text information, wherein the template text information comprises at least one supplementary text information which corresponds to preset text categories one by one; for each text information to be classified, generating reconstructed text information corresponding to the preset text categories one by one according to the supplementary text information; calculating a next sentence prediction probability value of the reconstructed text information, wherein the next sentence prediction probability value represents the matching degree between the text information to be classified and the supplementary text information; and determining the predicted text type of the text information to be classified according to the predicted probability value of the next sentence. According to the technical scheme provided by the embodiment of the invention, the template text information corresponding to the preset text category is obtained, the reconstructed text information is respectively generated by the text information to be classified and each supplementary text information in the template text information, and the next sentence prediction probability value of the reconstructed text information is calculated, so that the predicted text category of the text to be classified can be determined according to the matching degree between the text information to be classified and the supplementary text information, and the text classification is realized. 
Due to the technical scheme of the embodiment of the invention, the classification result of the input text can be quickly and accurately obtained based on the NSP mechanism of the BERT model without passing through a fine-tuning stage or using text data with labels as training samples, so that the waste of user time and computing resources can be greatly reduced, and the work efficiency of text classification is improved.
The embodiments of the present invention will be further explained with reference to the drawings.
As shown in fig. 1, fig. 1 is a schematic diagram of a system architecture platform for executing an artificial intelligence based text classification method according to an embodiment of the present invention.
In the example of fig. 1, the system architecture platform 100 includes a processor 110 and a memory 120, wherein the processor 110 and the memory 120 may be connected by a bus or other means, and fig. 1 illustrates the example of the connection by the bus.
The memory 120, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory 120 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 120 optionally includes memory located remotely from processor 110, which may be connected to the system architecture platform via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
It can be understood by those skilled in the art that the system architecture platform may be applied to a 3G communication network system, an LTE communication network system, a 5G communication network system, a mobile communication network system that is evolved later, and the like, which is not limited in this embodiment.
Those skilled in the art will appreciate that the system architecture platform illustrated in FIG. 1 does not constitute a limitation on embodiments of the invention, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components.
In the system architecture platform shown in FIG. 1, the processor 110 may invoke a control program stored in the memory 120 to perform an artificial intelligence based text classification method.
Based on the system architecture platform, the following provides various embodiments of the text classification method based on artificial intelligence.
FIG. 2 is a flowchart illustrating steps of a text classification method based on artificial intelligence according to an embodiment of the present invention; the method includes, but is not limited to, step S100, step S200, step S300, and step S400.
Step S100: acquiring text information to be classified and template text information, wherein the template text information comprises at least one supplementary text information which corresponds to preset text categories one by one;
It should be noted that the number of the text information to be classified is one, one text information to be classified corresponds to one template text information, and one template text information includes at least one supplementary text information, that is, one text information to be classified corresponds to at least one supplementary text information. In addition, the text information to be classified may be text data with a tag or text data without a tag, which is not limited in this embodiment.
It should be noted that the supplementary text information is used to emphasize the consistency and fluency of the context language between the text information to be classified and the supplementary text information, that is, the specific template text information may be supplemented by the user according to the application scenarios in different fields, which is not limited in this embodiment.
Step S200: for each text information to be classified, generating reconstructed text information corresponding to the preset text categories one by one according to the supplementary text information;
specifically, each piece of text information to be classified and one piece of supplemental text information generate reconstructed text information each time, that is, according to the number of preset text categories, each piece of text information to be classified generates a corresponding number of reconstructed text information.
Step S300: calculating a next sentence prediction probability value of the reconstructed text information, wherein the next sentence prediction probability value represents the matching degree between the text information to be classified and the supplementary text information;
specifically, the next sentence prediction probability value is obtained as follows: the reconstructed text information is input into the BERT model, and the NSP probability in the model, i.e., the value of P(NSP = 0), is calculated. In the BERT model, NSP = 0 indicates that the two input sentences have a contextual relationship.
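In practice, P(NSP = 0) comes from a softmax over the two logits of BERT's NSP head (e.g. `BertForNextSentencePrediction` in the `transformers` library, whose index 0 means "is next sentence"). The sketch below shows only that conversion step; the logit values are invented for illustration:

```python
import math

def nsp_probability(logit_is_next: float, logit_not_next: float) -> float:
    """Softmax over the two NSP-head logits; index 0 ("is next sentence")
    gives P(NSP = 0), i.e. the matching degree between the two sentences."""
    e0, e1 = math.exp(logit_is_next), math.exp(logit_not_next)
    return e0 / (e0 + e1)

# Hypothetical logits produced for one reconstructed text:
p = nsp_probability(2.0, -1.0)
print(round(p, 3))  # 0.953
```

A higher P(NSP = 0) means the model judges the supplementary text to be a more natural continuation of the text to be classified.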
Step S400: and determining the predicted text type of the text information to be classified according to the predicted probability value of the next sentence.
It can be understood that, in the BERT model, the larger the value of P(NSP = 0), the higher the matching degree of the two sentences and the more matched the contained information, that is, the better the coherence and fluency of the context language between the text information to be classified and the supplementary text information. The preset text category corresponding to that supplementary text information can therefore be used as the predicted text category of the text information to be classified.
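Step S400 then reduces to an argmax over the per-category probabilities. A minimal sketch, with invented probability values and category names:

```python
def predict_category(nsp_probs: dict) -> str:
    """Pick the preset text category whose reconstructed text obtained the
    highest next sentence prediction probability P(NSP = 0) (step S400)."""
    return max(nsp_probs, key=nsp_probs.get)

# Hypothetical P(NSP = 0) per preset category for one text to classify:
probs = {"sports": 0.91, "finance": 0.22, "technology": 0.35}
print(predict_category(probs))  # sports
```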
Through the steps from S100 to S400, template text information corresponding to the preset text category is obtained, reconstructed text information is generated by the text information to be classified and each supplementary text information in the template text information, next sentence prediction probability values of the reconstructed text information are calculated, the predicted text category of the text to be classified can be determined according to the matching degree between the text information to be classified and the supplementary text information, and therefore text classification is achieved.
It can be understood that, according to the technical solution of the embodiment of the present invention, the classification result of the input text can be quickly and accurately obtained based on the NSP mechanism of the BERT model without going through the fine-tuning stage or using the text data with tags as the training sample, so that the waste of user time and computing resources can be greatly reduced, thereby improving the work efficiency of text classification.
It is worth noting that compared with the traditional supervised learning, the technical scheme provided by the embodiment of the invention does not need to label the sample manually or train the model, so that the training consumption of the model and the consumption of manual labeling can be effectively reduced, and the time consumption and the resource waste of the user can be saved.
In addition, compared with the two-stage pre-training and fine-tuning tasks of the BERT model, the technical scheme of the embodiment of the invention only uses the pre-trained parameters without performing the fine-tuning task, that is, without manually labeling the text information to be classified, and can obtain a good effect in small-sample or even zero-sample tasks.
It can be understood that the technical scheme of the embodiment of the invention deeply mines the capability of the model from the pre-training stage, so that the grammar, semantics, syntax and other information learned in that stage can be fully utilized. Although the text information to be classified needs to be combined with each preset text category and input into the model multiple times, the training time of the model is saved; therefore, compared with directly applying the BERT model for classification, the scheme can significantly improve working efficiency in terms of computing resources, while the accuracy is also significantly improved compared with the fine-tuning stage.
The embodiment of the invention can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Based on the technical scheme of the embodiment of the invention, the embodiment of the invention can classify and process the text data in an artificial intelligence mode.
Referring to fig. 3, before the step S100, the text classification method may specifically include, but is not limited to, the following steps S110 and S120.
Step S110: acquiring a preset supplementary text template and at least one preset text category;
Step S120: and generating supplementary text information according to the supplementary text template for each preset text category.
Specifically, for each preset text category, supplementary text information is generated according to the supplementary text template; that is, the pieces of supplementary text information differ only in the preset text category, so that the variables are better controlled and a more accurate text classification effect can be obtained.
Referring to fig. 4, after the step S100, the text classification method may specifically include, but is not limited to, the following step S130.
Step S130: removing useless characters from the text information to be classified.
It can be understood that text data often contains a great number of useless characters, so the useless characters need to be removed from the text information to be classified. Since the BERT model in the pre-training stage is an end-to-end model, operations such as word segmentation and stop-word removal are not needed.
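As a minimal sketch of this preprocessing step (the exact character classes kept are an assumption; the patent does not fix a cleaning rule), the useless characters can be stripped with a regular expression, leaving word segmentation and stop-word removal to the end-to-end model:

```python
import re

def remove_useless_chars(text: str) -> str:
    """Keep CJK characters, letters and digits; drop everything else.
    The retained character classes are illustrative, not mandated by the method."""
    return re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9]", "", text)

cleaned = remove_useless_chars("Breaking!!  News-2024 @sports")
# cleaned == "BreakingNews2024sports"
```

Punctuation, whitespace and symbols are removed, while Chinese characters survive the filter unchanged.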
Referring to fig. 5, as for the step S200, the text classification method may specifically include, but is not limited to, the following step S210.
Step S210: and combining the text information to be classified with the separators and the supplementary text information in sequence to obtain reconstructed text information.
Specifically, a separator is set between the text information to be classified and the supplementary text information to divide the two sentences, so that the continuity and fluency of language between the contexts of the reconstructed text information can be calculated. Illustratively, the reconstructed text information is: information to be classified + separator + supplementary text information. For each piece of information to be classified, the information to be classified, the separator and the supplementary text information are combined in sequence, so that the information to be classified and the supplementary text information form an upper sentence and a lower sentence, yielding the reconstructed text information. In this way, the technical scheme of determining the text category by calculating the next sentence prediction probability value of the reconstructed text information can be applied.
Referring to fig. 6 and 7, as an example, regarding the step S300, the text classification method may specifically include, but is not limited to, the following step S310.
Step S310: and for each text information to be classified, inputting the corresponding reconstructed text information into a BERT model, and respectively calculating the prediction probability value of the next sentence corresponding to the reconstructed text information.
Specifically, $(x_i, y_j)$ denotes the reconstructed text information containing the $i$-th text information to be classified $x_i$ and the $j$-th supplementary text information $y_j$. The reconstructed text information is input into the BERT model, and the NSP probability in the model, that is, the value of $P(\mathrm{NSP}=0)$, is calculated, so as to obtain the probability that the text information to be classified $x_i$ and the supplementary text information $y_j$ form a context.
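As a sketch of how $P(\mathrm{NSP}=0)$ is obtained from the model's next-sentence head (the assumption here follows the convention of, e.g., Hugging Face's BertForNextSentencePrediction, where the head outputs two logits and index 0 means "is the next sentence"; the patent itself does not name a library), a softmax over the pair of logits yields the probability:

```python
import math

def nsp_probability(logits):
    """Turn the NSP head's two raw logits [isNext, notNext] into P(NSP = 0),
    i.e. the probability that the second sentence follows the first."""
    m = max(logits)                         # subtract the max for numerical stability
    exp = [math.exp(z - m) for z in logits]
    return exp[0] / sum(exp)

p = nsp_probability([2.0, -1.0])            # strongly "is next": p ≈ 0.953
```

Equal logits give $P(\mathrm{NSP}=0) = 0.5$, and swapping the two logits gives the complementary probability.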
Referring to fig. 8, as an example, regarding the step S400, the text classification method may specifically include, but is not limited to, the following step S410.
Step S410: and for each text information to be classified, taking the preset text category corresponding to the reconstructed text information with the maximum prediction probability value of the lower sentence as a prediction text category.
Specifically, in the BERT model, the larger the value of $P(\mathrm{NSP}=0)$, the more the two sentences match and the more matched the information they contain, that is, the better the continuity and fluency of the contextual language between the text information to be classified and the supplementary text information. Therefore, the preset text category corresponding to that supplementary text information can be used as the predicted text category of the text information to be classified.
Illustratively, regarding the step S400, the predicted text category of the text information to be classified may be determined from the next sentence prediction probability values by the following formula:

$$\hat{y}_i = \arg\max_{y_j \in Y} P(\mathrm{NSP}=0 \mid x_i, y_j)$$

wherein $\hat{y}_i$ represents the predicted text category, $Y$ represents the set of preset text categories, $x_i$ represents the $i$-th text information to be classified, and $y_j$ represents the $j$-th supplementary text information.
Based on the above artificial-intelligence-based text classification method, an overall embodiment of the technical scheme of the invention is provided below.
The method takes news topic mining as the scene: given a piece of news text data, the news category of the text is predicted. The specific steps, with reference to the figures, are as follows:
(1) Data set construction. Suppose the news text data set is denoted $G = \{x_1, x_2, \ldots, x_i, \ldots, x_n\}$, where $x_i$ represents the $i$-th piece of text data in the data set $G$ and $n$ represents the size of the data set. Suppose there are five news categories, denoted $Y = \{y_1, y_2, y_3, y_4, y_5\}$, where the categories may be entertainment, sports, e-sports, economics, education, etc. The text in the data set $G$ is unlabeled news text.
(2) Data preprocessing. The news text data contains a large number of useless characters, so the text from step (1) needs data preprocessing, which mainly consists of removing useless characters. Since pre-trained models such as BERT are end-to-end models, operations such as word segmentation and stop-word removal are not needed. Suppose the $i$-th piece of text data after preprocessing can be represented as $x_i = [w_{i1}, w_{i2}, \ldots, w_{il}]$, where $w$ represents a preprocessed word or character (token) and $l$ represents the preprocessed text length.
(3) Template construction
Based on the preprocessing in step (2) and the news categories in step (1), each preprocessed sample is reconstructed together with each category into a template text. Taking data $x_i$ as an example, the reconstructed texts for the five categories are as follows:

$(x_i, y_1) = [w_{i1}, w_{i2}, \ldots, w_{il}, \mathrm{[SEP]}, \text{this is}, \text{one}, y_1, \text{news}]$

$(x_i, y_2) = [w_{i1}, w_{i2}, \ldots, w_{il}, \mathrm{[SEP]}, \text{this is}, \text{one}, y_2, \text{news}]$

$(x_i, y_3) = [w_{i1}, w_{i2}, \ldots, w_{il}, \mathrm{[SEP]}, \text{this is}, \text{one}, y_3, \text{news}]$

$(x_i, y_4) = [w_{i1}, w_{i2}, \ldots, w_{il}, \mathrm{[SEP]}, \text{this is}, \text{one}, y_4, \text{news}]$

$(x_i, y_5) = [w_{i1}, w_{i2}, \ldots, w_{il}, \mathrm{[SEP]}, \text{this is}, \text{one}, y_5, \text{news}]$
where [SEP] is the separator between the two sentences in the BERT model. The words "this is", "one" and "news" are called the supplementary text; their purpose is to emphasize the continuity and fluency of language between the contexts of the reconstructed text, so customized supplementary texts can be used in different fields. The specific template is not limited and can be customized according to the user's specific scene. Based on the above template, each sample of the data set $G$ is combined with each category in the category set $Y$ into a new reconstructed text.
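The template construction in step (3) can be sketched as follows (the English rendering of the supplementary text template and the category names are assumptions made for illustration; the patent leaves the concrete wording to the user's scene):

```python
def build_reconstructed_texts(text, categories, template="This is a {} news item."):
    """Combine one preprocessed sample with every category into (upper sentence,
    lower sentence) pairs, ready for BERT's two-sentence [SEP]-joined input."""
    return [(text, template.format(c)) for c in categories]

categories = ["entertainment", "sports", "e-sports", "economics", "education"]
pairs = build_reconstructed_texts("The team won the league final last night", categories)
# pairs[1] == ("The team won the league final last night", "This is a sports news item.")
```

Only the category slot varies between the pairs, which is exactly the controlled-variable property the method relies on.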
(4) Model construction.
The method uses the probability value of the NSP mechanism in the pre-training stage of the BERT model to predict the category of the original text $x_i$; the specific model diagram is mainly shown in fig. 7. Here $(x_i, y_j)$ represents the reconstructed text from step (3) containing the original text $x_i$ and the category $y_j$; it is input into the BERT model, and the NSP probability in the model, that is, the value of $P(\mathrm{NSP}=0)$, is calculated, where $\mathrm{NSP}=0$ in the BERT model indicates that the two input sentences have a contextual relationship. The purpose of the model is therefore to predict the probability that the original text $x_i$ and the category-bearing supplementary text $y_j$ belong to the same context. In the BERT model, the larger $P(\mathrm{NSP}=0)$ is, the greater the matching degree of the two sentences and the more matched the contained information; in the present invention this represents the matching degree between the original sentence and the category.
(5) Text category prediction
Input each reconstructed text from step (3) into the model of step (4); the final category of the original text $x_i$ can then be expressed as:

$$\hat{y}_i = \arg\max_{y_j \in Y} P(\mathrm{NSP}=0 \mid x_i, y_j)$$

where $\hat{y}_i$ represents the predicted category of the original text $x_i$. The formula can thus be read as: after the text $x_i$ is reconstructed, the category that maximizes the probability of $\mathrm{NSP}=0$ is taken as the label of the text $x_i$.
Specifically, in the embodiment of the invention that takes news topic classification as the scene, a data set G is constructed in step (1): 1213 pieces of short news text data are acquired, and the categories of the data are the five categories of entertainment, sports, e-sports, economics and education. Then, in step (2), useless characters in the news data are removed. In step (3), each original sample is combined with each of the five categories to reconstruct new template samples. In step (4), the Chinese pre-trained model bert-wwm from Harbin Institute of Technology is selected as the language model. Finally, in step (5), the reconstructed samples from step (3) are input in turn into the model of step (4) to obtain the probability value of NSP = 0, and then, based on the formula in step (5), the category that maximizes the probability value of NSP = 0 for the original text is taken as the predicted category of the sample.
It is worth noting that, in experiments on the embodiment of the invention, the text classification method based on the BERT NSP mechanism significantly improves the accuracy of the model on the test set, and in the online AB test the accuracy of the model is improved by 4.2%.
Based on the above text classification method based on artificial intelligence, the following respectively proposes various embodiments of the text classification apparatus, the computer device and the computer-readable storage medium of the present invention.
As shown in fig. 9, fig. 9 is a schematic structural diagram of a text classification apparatus according to an embodiment of the present invention. The text classification apparatus 200 of the embodiment of the present invention includes, but is not limited to, an acquisition unit 210, a reconstruction unit 220, a data processing unit 230, and a prediction unit 240.
Specifically, the obtaining unit 210 is configured to obtain text information to be classified and template text information, where the template text information includes at least one supplementary text information corresponding to a preset text category one to one; the reconstruction unit 220 is configured to generate, for each text information to be classified, reconstructed text information corresponding to the preset text category one to one according to the supplementary text information; the data processing unit 230 is configured to calculate a next sentence prediction probability value of the reconstructed text information, where the next sentence prediction probability value represents a matching degree between the text information to be classified and the supplemental text information; the prediction unit 240 is configured to determine a predicted text category of the text information to be classified according to the next sentence prediction probability value.
It should be noted that the embodiment of the text classification device 200 and the corresponding technical effects of the embodiment of the present invention can be referred to the above embodiment of the text classification method based on artificial intelligence and the corresponding technical effects.
Illustratively, the data processing unit 230 is further configured to input reconstructed text information corresponding to each text information to be classified into the BERT model, and calculate a next sentence prediction probability value corresponding to the reconstructed text information respectively.
Specifically, referring to fig. 7, $(x_i, y_j)$ represents the reconstructed text information containing the $i$-th text information to be classified $x_i$ and the $j$-th supplementary text information $y_j$. The data processing unit 230 inputs the reconstructed text information into the BERT model and then calculates the next sentence prediction probability value in the model, that is, the value of $P(\mathrm{NSP}=0)$, so as to obtain the probability that the text information to be classified $x_i$ and the supplementary text information $y_j$ form a context.
In addition, an embodiment of the present invention also provides a computer apparatus including: a memory, a processor, and a computer program stored on the memory and executable on the processor.
The processor and memory may be connected by a bus or other means.
It should be noted that, the computer device in this embodiment may be applied to the system architecture platform in the embodiment shown in fig. 1, and the computer device in this embodiment can form a part of the system architecture platform in the embodiment shown in fig. 1, and both belong to the same inventive concept, so both have the same implementation principle and beneficial effect, and are not described in detail herein.
The non-transitory software programs and instructions required to implement the artificial intelligence based text classification method of the above embodiments are stored in a memory and, when executed by a processor, perform the artificial intelligence based text classification method of the above embodiments, e.g., perform the method steps in fig. 2-6 and 8 described above.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, which stores computer-executable instructions for performing the artificial intelligence based text classification method described above. For example, by a processor of the text classification apparatus, the processor may be caused to perform the artificial intelligence based text classification method in the above embodiment, for example, the method steps in fig. 2 to 6 and 8 described above.
It will be understood by those of ordinary skill in the art that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media, as is well known to those skilled in the art.
While the preferred embodiments of the present invention have been described in detail, it will be understood by those skilled in the art that the foregoing and various other changes, omissions and deviations in the form and detail thereof may be made without departing from the scope of this invention.

Claims (10)

1. A text classification method based on artificial intelligence is characterized by comprising the following steps:
acquiring text information to be classified and template text information, wherein the template text information comprises at least one supplementary text information which corresponds to preset text categories one by one;
for each text information to be classified, generating reconstructed text information corresponding to the preset text categories one by one according to the supplementary text information;
calculating a next sentence prediction probability value of the reconstructed text information, wherein the next sentence prediction probability value represents the matching degree between the text information to be classified and the supplementary text information;
and determining the predicted text category of the text information to be classified according to the next sentence prediction probability value.
2. The method for classifying texts according to claim 1, wherein before the obtaining the text information to be classified and the template text information, the method further comprises:
Acquiring a preset supplementary text template and at least one preset text category;
and generating supplementary text information according to the supplementary text template for each preset text category.
3. The method for classifying texts according to claim 1, wherein for each of the text information to be classified, generating reconstructed text information corresponding to the preset text category in a one-to-one manner according to the supplementary text information comprises:
and combining the text information to be classified with the separators and the supplementary text information in sequence to obtain reconstructed text information.
4. The method of claim 1, wherein the calculating a probability value of a next sentence prediction for the reconstructed text information comprises:
and for each text information to be classified, inputting the corresponding reconstructed text information into a BERT model, and respectively calculating the prediction probability value of the next sentence corresponding to the reconstructed text information.
5. The method for classifying texts according to claim 1, wherein the determining the predicted text category of the text information to be classified according to the next sentence prediction probability value comprises:
and for each text information to be classified, taking the preset text category corresponding to the reconstructed text information with the maximum next sentence prediction probability value as a predicted text category.
6. The method for classifying texts according to any one of claims 1 to 5, wherein the predicted text category of the text information to be classified is determined according to the next sentence prediction probability value by the following formula:

$$\hat{y}_i = \arg\max_{y_j \in Y} P(\mathrm{NSP}=0 \mid x_i, y_j)$$

wherein $\hat{y}_i$ represents the predicted text category, $Y$ represents the set of the preset text categories, $x_i$ represents the $i$-th text information to be classified, and $y_j$ represents the $j$-th supplementary text information.
7. A text classification apparatus, comprising:
the device comprises an acquisition unit, a classification unit and a classification unit, wherein the acquisition unit is used for acquiring text information to be classified and template text information, and the template text information comprises at least one supplementary text information which corresponds to preset text categories one to one;
the reconstruction unit is used for generating reconstructed text information which corresponds to the preset text type one by one according to the supplementary text information for each text information to be classified;
the data processing unit is used for calculating a next sentence prediction probability value of the reconstructed text information, wherein the next sentence prediction probability value represents the matching degree between the text information to be classified and the supplementary text information;
and the prediction unit is used for determining the predicted text type of the text information to be classified according to the next sentence prediction probability value.
8. The text classification device of claim 7, wherein the data processing unit is further configured to input the reconstructed text information corresponding to each piece of text information to be classified into a BERT model, and respectively calculate a next sentence prediction probability value corresponding to the reconstructed text information.
9. A computer device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the text classification method according to any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium having stored thereon computer-executable instructions for performing the method of text classification of any of claims 1 to 6.
CN202210161803.XA 2022-02-22 2022-02-22 Text classification method, device, equipment and storage medium based on artificial intelligence Pending CN114519399A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210161803.XA CN114519399A (en) 2022-02-22 2022-02-22 Text classification method, device, equipment and storage medium based on artificial intelligence
PCT/CN2022/090727 WO2023159762A1 (en) 2022-02-22 2022-04-29 Text classification method and apparatus based on artificial intelligence, device, and storage medium


Publications (1)

Publication Number Publication Date
CN114519399A true CN114519399A (en) 2022-05-20

Family

ID=81598312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210161803.XA Pending CN114519399A (en) 2022-02-22 2022-02-22 Text classification method, device, equipment and storage medium based on artificial intelligence

Country Status (2)

Country Link
CN (1) CN114519399A (en)
WO (1) WO2023159762A1 (en)

Also Published As

Publication number Publication date
WO2023159762A1 (en) 2023-08-31


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination