CN114676255A - Text processing method, device, equipment, storage medium and computer program product

Info

Publication number
CN114676255A
Authority
CN
China
Prior art keywords
text
classification
classification model
model
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210318205.9A
Other languages
Chinese (zh)
Inventor
杨韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210318205.9A priority Critical patent/CN114676255A/en
Publication of CN114676255A publication Critical patent/CN114676255A/en
Pending legal-status Critical Current

Classifications

    • G06F 16/35: Information retrieval of unstructured textual data; Clustering; Classification
    • G06F 18/214: Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2415: Pattern recognition; Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 40/247: Handling natural language data; Natural language analysis; Thesauruses; Synonyms
    • G06F 40/284: Handling natural language data; Lexical analysis, e.g. tokenisation or collocates
    • G06N 20/20: Machine learning; Ensemble learning

Abstract

The application discloses a text processing method, apparatus, device, storage medium and computer program product, wherein the method comprises: acquiring a text to be processed; and inputting the text to be processed into a target text classification model for processing to obtain a text classification result of the text to be processed. The target text classification model is obtained by performing model training with a plurality of text sets, each text set comprising a first text and a second text that is a synonymous text of the first text; the target text classification model is obtained by adjusting model parameters of an initial text classification model based on a classification loss parameter and a matching loss parameter; the classification loss parameter is determined based on a first text feature obtained by processing the first text with the initial text classification model; and the matching loss parameter is determined based on the first text feature and a second text feature obtained by processing the second text with the initial text classification model. Through the text classification method and apparatus, the accuracy of text classification can be effectively improved.

Description

Text processing method, device, equipment, storage medium and computer program product
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a text processing method, a text processing apparatus, a text processing device, a computer storage medium, and a computer program product.
Background
The application of artificial intelligence in various scientific fields greatly improves the efficiency of business processing. Among them, machine learning techniques in artificial intelligence have been increasingly applied to natural language processing tasks.
Text classification is a typical natural language processing task executed by a computer and is widely applied in various service implementation scenarios. For example, in an intelligent knowledge question-and-answer system, a posed question needs to be classified as input text so that intention recognition, automatic question answering, or information retrieval can be provided according to the classification result of the input text.
In various business implementation scenarios, the accuracy of text classification is a core concern. Therefore, a scheme capable of further improving the text classification accuracy is a research hotspot at present.
Disclosure of Invention
The application provides a text processing method, a text processing device, text processing equipment, a storage medium and a computer program product, which can effectively improve the accuracy of text classification.
In one aspect, the present application provides a text classification method, including:
acquiring a text to be processed, inputting the text to be processed into a target text classification model for processing, and obtaining a text classification result of the text to be processed;
the target text classification model is obtained by performing model training by using a plurality of text sets, wherein each text set comprises a first text and a second text which is a synonymous text with the first text; the target text classification model is obtained by adjusting model parameters of the initial text classification model based on the classification loss parameters and the matching loss parameters determined in the training process; the classification loss parameter is determined based on a first text feature obtained by processing a first text by the initial text classification model; the matching loss parameter is determined based on the first text feature and a second text feature obtained by processing a second text through the initial text classification model.
In one aspect, the present application provides a text classification apparatus, including:
the acquisition unit is used for acquiring a text to be processed;
the processing unit is used for inputting the text to be processed into the target text classification model for processing to obtain a text classification result of the text to be processed;
the target text classification model is obtained by performing model training by using a plurality of text sets, and each text set comprises a first text and a second text which is the same as the first text; the target text classification model is obtained by adjusting model parameters of the initial text classification model based on the classification loss parameters and the matching loss parameters determined in the training process; the classification loss parameter is determined based on a first text feature obtained by processing a first text by the initial text classification model; the matching loss parameter is determined based on the first text feature and a second text feature obtained by processing a second text through the initial text classification model.
In one implementation, the obtaining unit may be further configured to obtain a plurality of text sets for training the initial text classification model.
In one implementation, the text classification device further comprises a training unit, wherein the training unit is used for inputting a first text included in any text set into the initial text classification model for processing in the process of training the initial text classification model to obtain a first text feature; inputting a second text included in any text set into the initial text classification model for processing to obtain a second text feature; determining a classification loss parameter based on the first text feature and a matching loss parameter based on the first text feature and the second text feature; and determining a target loss parameter based on the classification loss parameter and the matching loss parameter, and adjusting the model parameters of the initial text classification model based on the target loss parameter to obtain the target text classification model.
In one implementation, the obtaining unit may be further configured to obtain a labeling label of the first text feature; the training unit may also be used to perform classification processing based on the first text feature and determine the matching probability between the first text feature and the labeling label; and a classification loss parameter is determined based on the matching probability between the first text feature and the labeling label.
In one implementation, the training unit may be further configured to perform a multiplication operation on the first text feature and the weight matrix to obtain an initial matching probability between the first text feature and each of the plurality of reference tags; the weight matrix is generated in the process of training the initial text classification model, and the plurality of reference labels comprise labeling labels; normalizing the plurality of initial matching probabilities to obtain matching probabilities between the first text feature and each reference label; and determining the matching probability between the first text feature and the label from the matching probability between the first text feature and each reference label.
In one implementation mode, the number of the second texts is one or more, and each second text corresponds to one second text feature; the training unit can be further used for respectively matching the first text features with the second text features to obtain matching parameters between the first text features and the second text features; determining target text features with the same prediction labels as the labeling labels of the first text features from the second text features; a match loss parameter is determined based on a match parameter between the first text feature and the target text feature.
In an implementation manner, if the second text included in any text set is determined based on the first text, the training unit may be further configured to perform synonym replacement processing on the first text to obtain the second text which is the synonym text with the first text; or translating the first text into a reference language, and translating the translation result back to the original language to which the first text belongs to obtain a second text which is the synonymous text with the first text; or inputting the first text into the synonymy text generation model for processing to obtain a second text which is the synonymy text with the first text.
In one implementation, the training unit may be further configured to determine gradient loss parameters corresponding to the plurality of text sets in the process of training the initial text classification model by using the plurality of text sets; determine adversarial perturbation information based on the gradient loss parameters, and determine adversarial perturbation samples based on the adversarial perturbation information; and train the initial text classification model based on the adversarial perturbation samples and the determined classification loss parameter and matching loss parameter to obtain the target text classification model.
In one aspect, the present application provides a text processing apparatus comprising a processor adapted to implement one or more computer programs; and a computer storage medium storing one or more computer programs that are loaded by the processor and implement the text processing method in an aspect of the present application.
In one aspect, the present application provides a computer-readable storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to implement the text processing method in the above-described aspect.
In one aspect, the present application provides a computer program product comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device realizes the text processing method provided in various alternatives of the above aspect and the like.
The text classification model for performing text classification on a text to be processed is obtained by adjusting model parameters of an initial text classification model based on a classification loss parameter and a matching loss parameter; the classification loss parameter is determined based on a first text feature obtained by processing a first text with the initial text classification model, and the matching loss parameter is determined based on the first text feature and a second text feature obtained by processing a second text with the initial text classification model. The first text and the second text are synonymous texts. By adopting the training mode of the text classification model provided by the application, joint learning of the model can be realized by utilizing a plurality of synonymous texts; that is, the loss parameter referenced when adjusting the model parameters contains not only the classification loss of the training texts but also the matching loss between synonymous texts, so that the trained text classification model has strong generalization capability and the accuracy of text classification can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the present application or prior art, the drawings used in the embodiments or prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art without inventive efforts.
Fig. 1 is a schematic application flow diagram of a text classification system according to an embodiment of the present application;
fig. 2a is a schematic diagram of an implementation environment of a text processing method according to an embodiment of the present application;
fig. 2b is a schematic diagram of an implementation environment of another text processing method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a text processing method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a text classification model training method according to an embodiment of the present application;
fig. 5a is a schematic view of a scene of data enhancement provided by an embodiment of the present application;
fig. 5b is a schematic diagram of a method for generating a synonymous text according to an embodiment of the present application;
fig. 6 is a schematic flowchart of a method for generating a synonymous text according to an embodiment of the present application;
FIG. 7 is a diagram of a BERT model architecture provided in an embodiment of the present application;
FIG. 8 is a diagram illustrating a text classification model architecture based on joint learning according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the present application will be described clearly and completely with reference to the accompanying drawings in the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For ease of understanding, the terms referred to in this application will be described first.
1. Machine Learning (ML)
Machine learning studies how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to improve their performance. It is the core of artificial intelligence and the fundamental way to make computers intelligent. Deep Learning (DL) is a new research direction in the field of machine learning that learns the intrinsic rules and representation levels of sample data; the information obtained in the learning process is greatly helpful for the interpretation of data such as characters, images and sounds. The final goal of deep learning is to enable a machine to perform human-like analytical learning and to recognize data such as characters, images, and sounds.
2. Cloud Technology
Cloud technology is a hosting technology for unifying a series of resources, such as hardware, software and network resources, in a wide area network or a local area network to realize the calculation, storage, processing and sharing of data. A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms; it is essentially a decentralized database, a series of data blocks generated in association using cryptographic methods, where each data block contains information of a batch of network transactions used to verify the validity (anti-counterfeiting) of the information and generate the next block.
3. Natural Language Processing (NLP)
Natural language processing studies the theories and methods that enable efficient communication between humans and computers using natural language, including both Natural Language Understanding (NLU) and Natural Language Generation (NLG). Natural language processing is a science integrating linguistics, computer science and mathematics. Research in this field involves natural language, i.e., the language that people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graphs, and the like.
4. Text classification
Text classification is a common and important technology in NLP. It automatically classifies and marks a text set (or other entities or objects) according to a certain classification system or standard: a relationship model between text features and text categories is learned from a labeled training text set, and the learned relationship model is then used to judge the category of a new document. Text classification is an important basic technology in NLP and is widely applied in various service scenarios, such as news information classification and advertisement classification. Text classification is generally divided into short text classification and long text classification according to the length of the text. A long text is, as the name implies, a relatively long text, such as an article of news information; a short text has little content and few words, typically only a few words or a few dozen characters, such as an advertisement title, a search term, and the like.
To implement text classification, the usual solution is to train a model. A model is a function with learnable parameters that can map inputs to outputs. The optimal parameters are obtained by training the model on data, and the trained model can provide an accurate mapping from input to desired output.
A text classification model is used for classifying an input text and determining the type of the text. For example, the type may be an emotion expressed by the text, an attribute of an object embodied by the text, or an intention expressed by the text. Depending on the specific application of the text classification model, the types of texts to be determined by the text classification model differ.
5. BERT
BERT (Bidirectional Encoder Representations from Transformers), introduced in "Pre-training of Deep Bidirectional Transformers for Language Understanding", is a pre-trained language model. BERT can extract the word vectors used in a sample, store the word vectors into a vector file, and provide embedding vectors, namely word embedding information, for subsequent models.
6. Word embedding information
Word embedding is the general term for language models and characterization learning techniques in natural language processing. Conceptually, it refers to embedding a high-dimensional space with dimensions of the number of all words into a continuous vector space with much lower dimensions, each word or phrase being mapped as a vector on the real number domain. The word embedding process is a dimension reduction process, and is used for mapping a word to a real number domain to obtain a vector expression of the word, and the vector expression of the word can be called as word embedding information.
Word embedding is a kind of embedding, which is a way to convert discrete variables into continuous vectors; embedding is the process of mapping source data into another space. Word embedding, which may also be referred to as a word vector, can be understood as mapping a word from the space X to which it belongs, as a multidimensional vector, into a space Y, which corresponds to embedding it into the space Y. The mapping process is the process of generating an expression in the new space. Representing words or phrases by word embedding information can improve the analysis effect of texts in natural language processing.
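To make the above description concrete, the following minimal sketch (in Python with PyTorch; the toy vocabulary and embedding dimension are illustrative assumptions, not part of this application) shows how discrete word indices are mapped to low-dimensional real-valued vectors, i.e., word embedding information:

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary: each word is assigned a discrete index.
vocab = {"[PAD]": 0, "how": 1, "to": 2, "plant": 3, "strawberry": 4}

# Embedding table: maps each of the |vocab| discrete word indices to an
# 8-dimensional real-valued vector (the word embedding information).
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

token_ids = torch.tensor([[vocab["how"], vocab["to"], vocab["plant"], vocab["strawberry"]]])
word_embeddings = embedding(token_ids)  # shape: (1, 4, 8)
print(word_embeddings.shape)
```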
The text processing method provided by the embodiment of the application can be applied to any scene needing text classification, such as emotion analysis, commodity classification, intention classification and the like. For example, the method can be applied to a text classification system for classifying the intention of the text. The text classification system is used for analyzing the intention of the text. For example, in a search scenario, the text may be text input by an operator, and after the text is input by the operator, the text classification system can analyze the corresponding intention. Referring to fig. 1, as shown in fig. 1, a specific search application scenario is provided, an operator may input a searched text as an input text 101 into a text classification system 102, and the text classification system 102 may perform text classification on the input text 101, recognize an intention of the input text 101, and classify the intention, thereby determining a function 103 required by the input text 101, where the function 103 may include one or more specific functions, such as functions 1 to 3 shown in fig. 1. For example, if the input text 101 is "weather of city tomorrow", the text classification system 102 determines that the intention of the input text is: and inquiring weather. The corresponding query function may be provided to the operator according to the intent.
Please refer to fig. 2a, which is a schematic diagram illustrating an environment of a text processing method according to the present application. As shown, the implementation environment includes one or more text processing devices 201, and one or more text databases 202, where the text databases 202 store a plurality of texts, which can be used for model training and can be classified as texts to be processed. The text processing device 201 may be a server or a terminal having a data (e.g., text) processing function, the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, a cloud database, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, an intelligent voice interaction device, an intelligent appliance, a vehicle-mounted terminal, and the like. The text database 202 shown in fig. 2a may be a local database of the text processing device 201, or may be a cloud database accessible by the text processing device 201.
The text processing method provided by the embodiment of the application can be executed by the text processing device 201, for example, the text to be processed is acquired from the text database 202, and the text to be processed is input into the target text classification model for processing, so as to obtain the classification result of the text to be processed. The training method of the target text classification model may also be executed by the text processing apparatus 201. The text processing device 201 can obtain a plurality of text sets from the text database 202, each text set including a first text and a second text which is a synonymous text with the first text, obtain a classification loss parameter and a matching loss parameter by inputting the plurality of text sets into an initial text classification model for training, and adjust a model parameter of the initial text classification model based on the classification loss parameter and the matching loss parameter to obtain a target text classification model.
Referring to fig. 2b, fig. 2b is a schematic diagram of an implementation environment of another text processing method according to an embodiment of the present application. The implementation environment includes one or more terminals 203 and a text processing platform 204. The terminal 203 establishes a communication connection with the text processing platform 204 through a network (a wireless network or a wired network) and performs data interaction.
The terminal 203 may be a smart phone, a tablet computer, a desktop computer, a smart speaker, a smart watch, a vehicle-mounted terminal, a smart home appliance, a smart voice interaction device, or the like. The terminal 203 is installed and running with an application that supports text classification, for example, the application may be a system application, an instant messaging application, a news push application, a shopping application, a social application, and the like.
The terminal 203 may obtain a text to be processed, and input the text to be processed into a target text classification model for processing, so as to obtain a classification result of the text to be processed, where the target text classification model may be obtained by the text processing platform 204 through training the initial text classification model, or may be obtained by the terminal 203 through training the initial text classification model. Optionally, when the target text classification model is obtained by training at the terminal 203, the terminal 203 may further obtain a plurality of text sets, each text set including a first text and a second text which is a synonymous text with the first text, and train the initial text classification model by using the plurality of text sets to obtain a classification loss parameter and a matching loss parameter, so as to adjust the model parameters of the initial text classification model to obtain the target text classification model. Optionally, during the training of the initial text classification model by the terminal 203, the plurality of text sets used may be obtained from the text processing platform 204.
The text processing platform 204 includes one or more servers, where the servers may be independent physical servers, may also be a server cluster or distributed system formed by a plurality of physical servers, and may also be cloud servers providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, and big data and artificial intelligence platforms.
The text processing platform 204 may be configured to classify a text to be processed by using a target text classification model, and may also be configured to train an initial text classification model to obtain the target text classification model, which is not limited in the present application. When the text processing platform 204 is used for classifying the text to be processed, the text to be processed can be obtained, and the text to be processed is input into the target text classification model, so that a classification result of the text to be processed is obtained; when the text processing platform 204 is used for model training, a plurality of text sets can be obtained, wherein each text set comprises a first text and a second text which is a synonymous text with the first text; training the initial text classification model by using a plurality of text sets to obtain a classification loss parameter and a matching loss parameter, so as to adjust the model parameter of the initial text classification model to obtain a target text classification model. Optionally, the text processing platform 204 may obtain the text to be processed by obtaining the text from the terminal 203.
The text processing method provided by the embodiment of the present application is briefly introduced above, and a specific implementation of the text processing method is described in detail below.
Referring to fig. 3, fig. 3 is a schematic flowchart of a text processing method according to an embodiment of the present application, where the method is applied to a text processing device, and the text processing device is a terminal or a text processing platform. As shown in fig. 3, the method includes, but is not limited to, the following steps:
s301: and acquiring a text to be processed.
The text to be processed refers to a sample in the form of text, the sample (specimen) refers to a part of an individual under observation or investigation, and in this embodiment, the text to be processed is a character string, may be one or more sentences, or may be one or more words, and the like. The text to be processed may be text acquired in various application scenarios, for example, in a search application scenario, a searcher inputs a search word "strawberry" in a search input box, and then takes the "strawberry" as the text to be processed. For another example, in the speech recognition application scenario, the text content after the obtained speech is recognized is "weather in the future of fifteen days in the province a", and the text to be processed is "weather in the future of fifteen days in the province a".
Optionally, the text to be processed may be a short text or a long text according to the long-short classification of the text, the short text is determined if the number of characters of the text to be processed is less than N characters, and the long text is determined if the number of characters of the text to be processed is greater than or equal to N. Where N is a positive integer, for example N may be 7. Common long texts include articles, treatises and the like of news information; the short text may be news headlines, slogans, search queries, etc., such as:
news headlines: cherry blossom and crabapple are all in one, and tourist can enjoy flower and tide
Query 1: air temperature in B zone
Query 2: constellation of C
S302: Input the text to be processed into the target text classification model for processing to obtain a text classification result of the text to be processed.
The target text classification model is obtained by performing model training by using a plurality of text sets, and each text set comprises a first text and a second text which is a synonymous text with the first text; the target text classification model is obtained by adjusting model parameters of the initial text classification model based on the classification loss parameters and the matching loss parameters determined in the training process; the classification loss parameter is determined based on a first text feature obtained by processing a first text by an initial text classification model; the matching loss parameter is determined based on the first text feature and a second text feature obtained by processing the second text through the initial text classification model.
For example, if the text to be processed is "Air temperature in B zone" from Query 1 in the above example, the category to which the text to be processed belongs may be obtained as "weather" through the target text classification model.
In an embodiment of the present application, the training sample for training the initial text classification model may be a plurality of text sets, each of which includes a first text and a second text that is synonymous with the first text. Inputting the first text into the initial text classification model to obtain a classification loss parameter, inputting the second text into the initial text classification model to obtain a matching loss parameter, and adjusting the model parameter of the initial text classification model based on the classification loss parameter and the matching loss parameter to obtain the target text classification model.
In an implementation manner, the execution subject using the target text classification model and the execution subject trained by the initial text training model to obtain the target text classification model may be the same text processing device or two different text processing devices, which is not limited in this application. For example, the executing agent using the target text classification model may be the terminal 203 shown in fig. 2b, and the executing agent training the initial text classification model may be the text processing platform 204 shown in fig. 2 b.
The text classification model for performing text classification on a text to be processed is obtained by adjusting model parameters of an initial text classification model based on a classification loss parameter and a matching loss parameter; the classification loss parameter is determined based on a first text feature obtained by processing a first text with the initial text classification model, and the matching loss parameter is determined based on the first text feature and a second text feature obtained by processing a second text with the initial text classification model. The first text and the second text are synonymous texts. By adopting the training mode of the text classification model provided by the application, joint learning of the model can be realized by utilizing a plurality of synonymous texts; that is, the loss parameter referenced when adjusting the model parameters contains not only the classification loss of the training texts but also the matching loss between synonymous texts, so that the trained text classification model has strong generalization capability and the accuracy of text classification can be improved.
The traditional solution for classifying short texts is to directly train a classification model, which leads to two problems. First, because a short text has few characters, its text features are usually not obvious enough, so the model is difficult to train and the classification effect of the model is poor. Second, also because a short text has few characters, a large number of labeled training samples are usually required when training the model, which directly means that a large number of training samples need to be labeled, and labeling is time-consuming and labor-intensive. Aiming at these two problems, the embodiment of the application provides a text classification model based on sample enhancement and adversarial learning. It first constructs training samples in multiple ways for sample enhancement so as to reduce the number of manually labeled samples; the model also adopts an adversarial learning technique, which can significantly improve the generalization capability of the model; in addition, a plurality of samples are used for joint learning in the model training process to adjust the model parameters and enhance the training effect of the model. Experiments show that a model trained in the above way can significantly improve the effect of short text classification models and improve the classification accuracy of short texts. Meanwhile, the model also achieves notable classification accuracy when classifying long texts.
Referring to fig. 4, fig. 4 is a schematic flow chart of a text classification model training method adopted in a text processing method according to the present application. The main execution body of the text classification model training method is a text processing device, and the text processing device and the main execution body of the text processing method can be the same device or two different devices. As shown in fig. 4, the text classification model training method includes, but is not limited to, the following steps:
s401: and acquiring a plurality of text sets, and training the initial text classification model by using the plurality of text sets. Wherein each text set comprises a first text and a second text that is synonymous text with the first text. Optionally, the text set may further include an annotation tag of the first text, and the annotation tag may be determined manually.
In the embodiment of the present application, the text set refers to a sample in the form of text, and the sample (specimen) refers to a part of individuals observed or investigated. The first text and the second text are synonymous texts, that is, the first text and the second text express the same meaning, or the first text and the second text are synonymous words. For example, the first text may be "a contributes primarily to B" and the second text may be "a contributes primarily to B"; for another example, the first text may be "happy", the second text may be "happy", and in the above two examples, the first text and the second text are synonymous texts. The plurality of text sets are used to train an initial text classification model. The label of the first text is a label considered to be labeled and represents the real category to which the first text belongs. For example, when the first text is "happy", its annotation tag may be "mood".
In one implementation, the second text included in any of the text sets is determined based on the first text. The second text may be determined from the first text by a synonym replacement method, a back-translation method, a synonymous text generation model method, and the like. Referring to fig. 5a, fig. 5a is a schematic view of a data enhancement scenario according to an embodiment of the present disclosure. As shown in fig. 5a, data enhancement is performed on the first text as the original sample in any one of the above manners to obtain a plurality of second texts, and a text set composed of the first text and the plurality of second texts is used in model training. Please refer to fig. 5b, which is a schematic diagram of a method for generating a synonymous text according to an embodiment of the present application; the figure includes the three methods for generating a synonymous text and a synonymy discrimination mechanism according to the embodiment of the present application, which are described in detail below.
Optionally, the manner of generating the synonymous text by using the synonym replacement method may be as follows: keywords of the first text are extracted, words that are synonyms of the keywords of the first text are found based on a synonym table, and the keywords in the first text are replaced with those words to obtain the second text. For example, if the content of the first text is "A has which main achievements", the keyword of the first text may be extracted as "achievements", the synonym of "achievements" may then be determined as "contributions" through the synonym table, and the second text may be "A has which contributions".
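The following is a minimal sketch of the synonym replacement idea described above; the synonym table contents and the simple substring-based keyword matching are illustrative assumptions rather than the exact procedure of this application:

```python
# A minimal sketch of synonym replacement, assuming a hand-built synonym table.
SYNONYM_TABLE = {
    "achievements": ["contributions", "accomplishments"],
    "happy": ["glad", "joyful"],
}

def synonym_replace(first_text: str) -> list[str]:
    """Generate candidate second texts by replacing keywords with synonyms."""
    candidates = []
    for keyword, synonyms in SYNONYM_TABLE.items():
        if keyword in first_text:
            for syn in synonyms:
                candidates.append(first_text.replace(keyword, syn))
    return candidates

print(synonym_replace("A has which main achievements"))
# e.g. ["A has which main contributions", "A has which main accomplishments"]
```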
Optionally, the manner of generating the synonymous text by using the back-translation method may be as follows: the first text is translated into a reference language, and the translation result is then translated back into the original language to which the first text belongs, so as to obtain a second text that is synonymous with the first text. The reference language may be English, French, Japanese, Korean, Italian, or any other language different from the original language to which the first text belongs; the present application does not limit the reference language. Illustratively, the first text is "What constellation is B", the translation into English is "What is B's constellation", and translating this back into Chinese yields the second text "What is the constellation of B".
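A sketch of the back-translation idea is given below; the translate function is a placeholder for any machine-translation model or service and is not a specific API of this application:

```python
# A sketch of back-translation; `translate` is a placeholder, not a real API.
def translate(text: str, source_lang: str, target_lang: str) -> str:
    raise NotImplementedError("plug in a machine-translation model or service here")

def back_translate(first_text: str, original_lang: str = "zh",
                   reference_lang: str = "en") -> str:
    """Translate to a reference language and back to obtain a candidate synonymous text."""
    intermediate = translate(first_text, original_lang, reference_lang)
    return translate(intermediate, reference_lang, original_lang)
```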
Alternatively, the manner of generating the synonymous text by using the synonymous text generation model method may be as follows: a synonymous text generation model is trained, and the second text is generated from the first text. The synonymous text generation model may be a Sequence to Sequence (Seq2Seq) model. The Seq2Seq model includes two parts, an encoder and a decoder, where the encoder is a coding model and may adopt a long short-term memory (LSTM), convolutional neural network (CNN), or Transformer architecture, which is not limited in this application. Illustratively, as shown in fig. 6, the encoder encodes the first text (decomposed into X1, X2, X3 and X4 in the figure) to obtain the codes C1, C2 and C3; the obtained codes are input into the decoder to generate each word Y1, Y2 and Y3 in turn, and finally a complete text, namely the second text, is generated.
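The following is a minimal Seq2Seq sketch of a synonymous text generation model (PyTorch, LSTM encoder and decoder, greedy decoding); the vocabulary size, hidden size and decoding loop are illustrative assumptions:

```python
import torch
import torch.nn as nn

# A minimal Seq2Seq sketch for generating a synonymous text from the first text.
class Seq2Seq(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src_ids, max_len=20, bos_id=1):
        # Encode the first text X1..Xn into context states (the codes C).
        _, (h, c) = self.encoder(self.embed(src_ids))
        # Greedily decode the words Y1, Y2, ... one at a time.
        prev = torch.full((src_ids.size(0), 1), bos_id, dtype=torch.long)
        generated = []
        for _ in range(max_len):
            dec_out, (h, c) = self.decoder(self.embed(prev), (h, c))
            prev = self.out(dec_out).argmax(dim=-1)
            generated.append(prev)
        return torch.cat(generated, dim=1)  # token ids of the second text
```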
The text generated in the three above-described ways may be noisy and may not be a synonymous text of the first text. For example, the above back-translation manner may generate the text "where B lives" for the first text "what is the constellation of B", and this text cannot be used as a synonymous text of the first text. In view of this problem, a synonymy discrimination mechanism as shown in fig. 6 may be employed to determine whether a generated text is a synonymous text of the first text. Alternatively, a BERT-based interactive matching model may be used: the first text and the generated text are spliced together and input to the BERT model, and the model performs binary classification to determine whether the first text and the generated text are synonymous texts.
Illustratively, referring to fig. 7, description a and description b are each decomposed into a plurality of words, and each word in description a and description b is regarded as a token; the tokenized description a and description b are spliced, a [CLS] symbol is inserted at the beginning of description a, and a [SEP] symbol is inserted at the end of description a and at the end of description b; the spliced and tagged sequence is input into the BERT model for processing to obtain a classification vector T_cls. The classification vector is a vector that fuses all semantic information of description a and description b, and the classification vector can be assigned a label through binary classification. For example, if description b is synonymous with description a, then label = 1; otherwise label = 0. Through this synonymy discrimination mechanism, the second text that is a synonymous text of the first text can be determined, and the first text and the second text are then used to train the initial text classification model, which expands the amount of training samples and can improve the generalization capability of the target text classification model.
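A possible sketch of such a BERT-based interactive matching model, using the open-source transformers library, is shown below; the checkpoint name and the label convention (1 for synonymous, 0 otherwise) are assumptions for illustration:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Sketch of the synonymy discrimination mechanism: a binary classifier over the
# [CLS] representation of the spliced pair (checkpoint name is an assumption).
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

description_a = "what is the constellation of B"
description_b = "which constellation does B belong to"

# The tokenizer splices the pair as: [CLS] a [SEP] b [SEP]
inputs = tokenizer(description_a, description_b, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits       # classification over the fused pair vector
label = logits.argmax(dim=-1).item()      # 1: synonymous, 0: not synonymous
```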
In order to further improve the generalization capability of the target text classification model and make it more robust to interference, adversarial training can be performed in the process of training the initial text classification model. Adversarial training is a way to enhance model robustness: during adversarial training, small adversarial perturbations can be mixed into the original samples to obtain adversarial samples. Although an adversarial sample changes only slightly compared with the original sample, when used as model input it can make the model give a wrong output with high confidence, causing misclassification; the model is then made to adapt to this change and thus becomes robust to adversarial samples. In the text field, an adversarial perturbation refers to an interference factor added to an original sample; the adversarial perturbation may be some word changes added to the text, or changes to the word embedding information of the text, and the like.
In one implementation, adversarial training may be utilized to enhance the robustness of the target text classification model. When training the initial text classification model using a plurality of text sets, adversarial training is performed, and the overall optimization goal of the adversarial training can be expressed by the following formula:
min_θ E_{(x,y)~D} [ max_{r_adv ∈ S} L(x + r_adv, y; θ) ]
where θ represents the model parameters, x represents a text set, y represents the classification result of the text set (which may be a label if the classification result is represented in the form of a label), L is the classification loss function, E_{(x,y)~D} denotes the expectation over the data distribution D of the text sets, r_adv represents the adversarial perturbation, and S represents the range to which the norm of the adversarial perturbation belongs (ensuring that the adversarial perturbation does not change the meaning of the corpus within this range). The overall optimization goal can be understood as obtaining an adversarial perturbation by maximizing the classification loss function, adding the adversarial perturbation to the original word embeddings of the text set to obtain an adversarial sample in word-embedding form, and minimizing the classification loss and reconstruction loss of the adversarial sample, so that the model can correctly classify the adversarial sample and can restore the adversarial sample to the original text set.
Optionally, applied to the embodiment of the present application, based on the above-mentioned overall optimization objective, gradient loss parameters corresponding to a plurality of text sets are determined, and a process of determining the gradient parameters may be represented by the following formula:
g = ∇_x L(θ, x, y)
where g represents the gradient, i.e., the gradient loss parameter corresponding to the text set, θ represents the model parameters, x represents the text set, y represents the classification result of the text set (which may be a label if the classification result is represented in the form of a label), and L is the classification loss function. The adversarial perturbation information, namely the adversarial perturbation r_adv, is determined based on the obtained gradient loss parameter by the following formula:
r_adv = ε · g / ||g||
where ε is a coefficient controlling the magnitude of the perturbation and ||g|| represents the norm of the gradient loss parameter. Through the adversarial perturbation information, adversarial perturbation samples can be determined. Optionally, determining an adversarial perturbation sample may be implemented in various ways; the following two ways are provided in this application, and the embodiments of the present application may adopt either way without limitation.
The first way: the adversarial perturbation information is added to the word embedding information of the text set to obtain the word embedding information of the adversarial perturbation sample corresponding to the text set. In subsequent S402 and S404, the adversarial perturbation information in word-embedding form is then included in the word embedding information of both the first text and the second text.
The second way: the adversarial perturbation information is added to the text content of the text set to obtain the text content of the adversarial perturbation sample corresponding to the text set. In subsequent S402, the text content of any first text then includes the text content of the adversarial perturbation sample; the first text and the words in the adversarial perturbation text content of the first text may be mapped to the real number domain to obtain the word embedding information of the first text, and the subsequent steps are then performed. Similarly, in S404, the text content of any second text includes the text content of the adversarial perturbation sample, and the word embedding information of the second text is obtained through a process similar to the above, where the word embedding information of the second text includes the word embedding information of the adversarial perturbation sample.
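The first way can be sketched roughly as follows (PyTorch); the perturbation coefficient, the helper classify_from_embeddings and the overall structure are assumptions used only to illustrate computing g and r_adv on the word embeddings:

```python
import torch

# Sketch of adding an adversarial perturbation to word embeddings.
# `model.classify_from_embeddings` is a hypothetical hook that runs the
# classifier directly on word embeddings; epsilon is an assumed coefficient.
def build_adversarial_embeddings(model, word_embeddings, labels, loss_fn, epsilon=1.0):
    word_embeddings = word_embeddings.detach().requires_grad_(True)
    loss = loss_fn(model.classify_from_embeddings(word_embeddings), labels)
    loss.backward()                               # gradient loss parameter g
    g = word_embeddings.grad
    r_adv = epsilon * g / (g.norm() + 1e-12)      # r_adv = eps * g / ||g|| (global norm for simplicity)
    return (word_embeddings + r_adv).detach()     # adversarial sample in word-embedding form
```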
S402: in the process of training the initial text classification model, a first text included in any text set is input into the initial text classification model and processed to obtain a first text feature. Any text set is any one of the plurality of text sets in S401.
The initial text classification model may be a pre-trained language model; a BERT model is taken as an example in the embodiment of the present application, and the model architecture of BERT is shown in fig. 7. When the BERT model is used for a text classification task, a [CLS] symbol is inserted in front of the single sentence input into the BERT model, the hidden state vector of [CLS] in the last layer of the encoder is taken as the semantic vector of the whole sentence, and the semantic vector is input into a classifier. Specifically, referring to fig. 7, the sentence description a may be decomposed into a plurality of words, resulting in token 1, token 2, and so on. The BERT model converts each token into word embedding information, and then performs feature extraction on the word embedding information through a plurality of hidden layers. Based on the above process, the first text is input into the initial text classification model for training, so as to obtain the first text feature and one or more reference labels corresponding to the first text, and then S403 is executed. The first text feature is in vector form; a reference label is a classification result of the initial text classification model for the first text and represents a category to which the first text belongs.
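A sketch of extracting the first text feature (the hidden state vector of [CLS] in the last layer) with a BERT model is shown below; the checkpoint name is an assumption for illustration:

```python
import torch
from transformers import BertTokenizer, BertModel

# Sketch: obtain the first text feature V_cls from the [CLS] position of the
# last encoder layer (checkpoint name is an illustrative assumption).
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

first_text = "how to plant strawberries"
inputs = tokenizer(first_text, return_tensors="pt")   # [CLS] is prepended automatically
with torch.no_grad():
    last_hidden = bert(**inputs).last_hidden_state     # (1, seq_len, hidden)
v_cls = last_hidden[:, 0, :]                            # first text feature V_cls
```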
It should be noted that the present application does not limit the execution sequence between S402 and S404, S402 and S404 may be executed simultaneously, S402 may be executed first and S404 may be executed first, or S404 may be executed first and S402 may be executed later.
S403: a classification loss parameter is determined based on the first text feature.
Through the initial text classification model, such as the BERT model, the vector output by the last hidden layer of the BERT model is used as the text feature of a text set; that is, the first text feature is input into a classifier for classification to obtain a plurality of reference labels of the first text (including the labeling label of the first text), where a reference label of the first text is a reference label of the first text feature, the labeling label of the first text is the labeling label of the first text feature, and the labeling label of the first text represents the real category to which the first text belongs. The first text feature is multiplied by the weight matrix to obtain the initial matching probabilities (logits) between the first text and each of the plurality of reference labels; the plurality of initial matching probabilities are normalized to obtain the matching probabilities Probs between the first text and each reference label; and from the obtained plurality of matching probabilities, the matching probability Probs_i between the first text and the labeling label can be determined. The above calculation process can be expressed by the following formulas:
logits = W · V_cls
Probs = Softmax(logits)
Loss_c = -log(Probs_i)
the weight matrix W is generated in the process of training the initial text classification model, and is related to reference labels corresponding to training texts such as a first text and a second text, or the weight matrix is related to reference labels learned in the model training process, for example, if the number of the reference labels is 5, the weight matrix has 5 values; vclsA first text feature representing a first text; softmax is a logistic regression function, and logits of the reference label can be normalized into the probability with the value range of 0-1 through Softmax; losscRepresenting a classification loss parameter.
S404: Input a second text included in any text set into the initial text classification model for processing to obtain a second text feature. The second text included in any text set has a corresponding relationship with the first text in S402, that is, the two texts belong to the same text set.
Alternatively, the pre-trained language model BERT can be adopted: the second text input into the BERT model is decomposed into a plurality of words, a [CLS] symbol is inserted in front of the sentence, the hidden state vector of [CLS] in the last layer of the encoder is taken as the semantic vector of the whole sentence, and the semantic vector is input into a classifier. Each word of the second text is converted into word embedding information through the BERT model, and feature extraction is then performed on the word embedding information through a plurality of hidden layers. Based on the above process, the second text is input into the initial text classification model for training, so that the second text feature and one or more prediction labels corresponding to the second text can be obtained. The second text feature is in vector form; a prediction label is a classification result of the initial text classification model for the second text and represents a category to which the second text belongs.
S405: a match loss parameter is determined based on the first textual feature and the second textual feature.
And if the second text is the synonymous text of the first text, the second text can be one or more, and each second text corresponds to one second text characteristic. Calculating inner products of the first text features and the second text features respectively to obtain matching parameters between the first text features and the second text features; obtaining the matching probability corresponding to each matching parameter through a logistic regression function Softmax; determining target text features with the same prediction labels as the labeling labels of the first text from the second text features; and calculating a matching loss parameter through a classification loss function based on the matching probability between the first text feature and the target text feature.
The above process can be expressed by the following formula:
score_k = <V_cls, V_k>, 1 ≤ k ≤ n
Probs_k = Softmax(score_k), 1 ≤ k ≤ n
Loss_m = -log(Probs_i)
where V_cls is the first text feature; V_k is the k-th second text feature of the second text; n is a positive integer, the second text having n second text features; score_k represents the matching score between the first text feature V_cls and the second text feature V_k; Probs_k represents the matching probability between V_cls and V_k; Probs_i is the matching probability between the first text feature V_cls and the target text feature V_i; and Loss_m is the matching loss parameter calculated by the classification loss function.
Illustratively, if the second text features of the second texts are V_1, V_2, V_3 and V_4, the first text feature V_cls needs to have its inner product calculated with each of the four vectors. If the prediction label of the second text corresponding to V_2 is the same as the labeling label of the first text, then V_2 is the target text feature V_i, and the matching probability Probs_i between the first text feature V_cls and the target text feature V_i is equal to the matching probability between V_cls and V_2. If the prediction labels of several second texts are the same as the labeling label of the first text, for example the prediction labels corresponding to both V_1 and V_2 are the same as the labeling label of the first text, then the matching probabilities of V_cls with V_1 and of V_cls with V_2 can be calculated by the above formulas, giving Probs_1 and Probs_2 respectively; candidate matching loss parameters Loss_m1 and Loss_m2 are then determined, and the matching loss parameter Loss_m can be obtained from Loss_m1 and Loss_m2 by averaging or a similar operation.
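Under the same assumptions, the inner-product matching and the matching loss of S405 might look roughly as follows in PyTorch; the four second text features and the indices of the target text features are invented for this example.

import torch
import torch.nn.functional as F

hidden_size = 768
v_cls = torch.randn(hidden_size)                    # first text feature V_cls
second_feats = torch.randn(4, hidden_size)          # second text features V_1 .. V_4

scores = second_feats @ v_cls                       # score_k = <V_cls, V_k>
probs = F.softmax(scores, dim=-1)                   # Probs_k = Softmax(score_k)

target_idx = [0, 1]                                 # assumed: V_1 and V_2 share the labeling label of the first text
candidate_losses = -torch.log(probs[target_idx])    # candidate losses Loss_m1, Loss_m2
loss_m = candidate_losses.mean()                    # matching loss Loss_m, e.g. by averaging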
The above is an example of a method for processing a first text in one of a plurality of text sets and a second text that is a synonymous text with the first text, and determining a classification loss parameter and a matching loss parameter, and it can be understood that the method for processing other text sets in the plurality of text sets is similar to the method described in S401-S405, and is not described herein again.
S406: and determining a target loss parameter based on the classification loss parameter and the matching loss parameter, and adjusting the model parameter of the initial text classification model based on the target loss parameter to obtain the target text classification model.
Referring to fig. 8, fig. 8 is a diagram illustrating a text classification model architecture based on joint learning according to an embodiment of the present application. As shown in fig. 8, the first text is "how to plant strawberries", the second text, which is synonymous with the first text, is "how to breed strawberries", the first text and the second text are simultaneously used for model training, and the classification result of the first text and the classification result of the second text are subjected to matching processing. Based on the classification loss parameter Loss_c determined in S403 and the matching loss parameter Loss_m determined in S405, the target loss parameter Loss can be found by the following formula:
Loss = Loss_c + Loss_m
The model parameters of the initial text classification model are adjusted based on the obtained target loss parameter to obtain the target text classification model. With the joint-learning-based text classification model provided by the embodiment of the present application, a plurality of synonymous texts are used for model training: the classification loss parameter of the training text (such as the first text) is considered, and at the same time the training text is matched against its synonymous texts to obtain the matching loss parameter between the synonymous texts. In other words, the loss parameter used to adjust the model parameters includes not only the classification loss of the training text but also the matching loss between the synonymous texts, so that the text classification model obtained by training has strong generalization capability, and the accuracy of text classification can be improved.
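A minimal sketch of combining the two losses into one optimization step is given below; the optimizer and the model whose parameters it updates are assumed to exist and to have produced loss_c and loss_m as in the sketches above.

loss = loss_c + loss_m       # Loss = Loss_c + Loss_m, the target loss parameter
optimizer.zero_grad()
loss.backward()              # back-propagate the target loss
optimizer.step()             # adjust the model parameters of the initial text classification model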
It should be noted that the target loss parameter obtained above is the result of training with any one text set; in order to determine the loss parameter finally used for adjusting the model parameters, the initial text classification model may be trained multiple times by using a plurality of different or identical text sets. In this case, the model parameters of the initial text classification model may be adjusted in the following manner to obtain the target text classification model: steps S402-S406 are repeated for each text set in the plurality of text sets, and the initial text classification model is trained by using the two synonymous texts in each text set, so as to obtain the target loss parameter corresponding to each of the plurality of text sets; a fusion loss parameter is then determined based on the obtained target loss parameters, and the fusion loss parameter is used to adjust the model parameters of the initial text classification model to obtain the target text classification model. Optionally, the fusion loss parameter may be determined by weighted averaging of the plurality of target loss parameters corresponding to the plurality of text sets, and the fusion loss parameter is used to adjust the model parameters of the initial text classification model, so that the target text classification model is obtained through training. Optionally, the fusion loss parameter may be determined by averaging the plurality of target loss parameters corresponding to the plurality of text sets, and the fusion loss parameter is used to adjust the model parameters.
For example, the fusion loss parameter may be determined by averaging: for N text sets, the target loss parameters (Loss_1, Loss_2, …, Loss_N) corresponding to the N text sets are obtained in the manner of S401-S406 described above, where N is a positive integer. The final loss parameter Loss, i.e. the fusion loss parameter, is obtained by the following formula:
Loss = (Loss_1 + Loss_2 + … + Loss_N) / N
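For illustration only, the averaging of the N target loss parameters might be written as follows; the list of per-text-set losses is a made-up placeholder.

import torch

per_set_losses = [torch.tensor(0.9), torch.tensor(1.2), torch.tensor(0.7)]   # Loss_1 .. Loss_N (N = 3 here)
fusion_loss = torch.stack(per_set_losses).mean()    # Loss = (Loss_1 + ... + Loss_N) / N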
The text classification model used for performing text classification on the text to be processed is obtained by adjusting the model parameters of the initial text classification model based on the classification loss parameter and the matching loss parameter; the classification loss parameter is determined based on the first text feature obtained by processing the first text by the initial text classification model, and the matching loss parameter is determined based on the first text feature and the second text feature obtained by processing the second text by the initial text classification model, wherein the first text and the second text are synonymous texts. By adopting this training mode of the text classification model, data enhancement can be carried out on the training samples to obtain second texts that are synonymous with the first text, so that the samples for model training are expanded and the generalization capability of the model is improved; counter-disturbance samples can be used for model training through counter-training, so that the robustness of the model to interference is improved; and joint learning of the model can be realized by utilizing a plurality of synonymous texts, that is, the loss parameter used to adjust the model parameters includes not only the classification loss of the training text but also the matching loss between the synonymous texts, so that the text classification model obtained by training has strong generalization capability, and the accuracy of text classification can be improved.
Please refer to fig. 9, which is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application. As shown in fig. 9, the text processing apparatus includes:
an obtaining unit 901, configured to obtain a text to be processed;
the processing unit 902 is configured to input the text to be processed into the target text classification model for processing, so as to obtain a text classification result of the text to be processed;
the target text classification model is obtained by performing model training by using a plurality of text sets, and each text set comprises a first text and a second text which is a synonymous text with the first text; the target text classification model is obtained by adjusting model parameters of the initial text classification model based on the classification loss parameters and the matching loss parameters determined in the training process; the classification loss parameter is determined based on a first text feature obtained by processing a first text by the initial text classification model; the matching loss parameter is determined based on the first text feature and a second text feature obtained by processing a second text through the initial text classification model.
In one implementation, the obtaining unit 901 is further configured to obtain a plurality of text sets, where the text sets are used for training an initial text classification model.
The text processing device further comprises a training unit 903, configured to input a first text included in any text set into the initial text classification model for processing in a process of training the initial text classification model, so as to obtain a first text feature; inputting a second text included in any text set into the initial text classification model for processing to obtain a second text characteristic; determining a classification loss parameter based on the first text feature and a matching loss parameter based on the first text feature and the second text feature; and determining a target loss parameter based on the classification loss parameter and the matching loss parameter, and adjusting the model parameter of the initial text classification model based on the target loss parameter to obtain a target text classification model.
In one implementation, the obtaining unit 901 is further configured to obtain a labeling label of the first text feature; the training unit 903 is further configured to perform classification processing based on the first text feature, determine a matching probability between the first text feature and the labeling label, and determine a classification loss parameter based on the matching probability between the first text feature and the labeling label.
In one implementation, the training unit 903 may further be configured to perform multiplication operation on the first text feature and the weight matrix to obtain an initial matching probability between the first text feature and each reference label in the multiple reference labels; the weight matrix is generated in the process of training the initial text classification model, and the plurality of reference labels comprise labeling labels; normalizing the plurality of initial matching probabilities to obtain matching probabilities between the first text feature and each reference label; and determining the matching probability between the first text feature and the label from the matching probability between the first text feature and each reference label.
In one implementation mode, the number of the second texts is one or more, and each second text corresponds to one second text feature; the training unit 903 may also be configured to perform matching processing on the first text feature and each second text feature, to obtain a matching parameter between the first text feature and each second text feature; determining target text features with the same prediction labels as the labeling labels of the first text features from the second text features; a match loss parameter is determined based on a match parameter between the first text feature and the target text feature.
In an implementation manner, if the second text included in any text set is determined based on the first text, the training unit 903 may further be configured to perform synonym replacement processing on the first text to obtain a second text which is a synonym text with the first text; or translating the first text into a reference language, and translating the translation result back to the original language to which the first text belongs to obtain a second text which is the synonymous text with the first text; or inputting the first text into the synonymy text generation model for processing to obtain a second text which is the synonymy text with the first text.
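Purely to illustrate the three ways of obtaining a synonymous second text, a hypothetical sketch is given below; the synonym dictionary, the translate() helper and the generator callable are placeholders, not components defined by this application.

SYNONYMS = {"种植": "培育"}   # hypothetical synonym dictionary, e.g. "plant" -> "breed"

def synonym_replace(text: str) -> str:
    # Replace words that have entries in the (hypothetical) synonym dictionary.
    for word, synonym in SYNONYMS.items():
        text = text.replace(word, synonym)
    return text

def back_translate(text: str, translate) -> str:
    # Translate into a reference language and back; translate() is a placeholder callable.
    english = translate(text, src="zh", tgt="en")
    return translate(english, src="en", tgt="zh")

def generate_synonym(text: str, generator) -> str:
    # generator stands for any synonymous-text generation model.
    return generator(text)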
In an implementation manner, the training unit 903 may be further configured to determine gradient loss parameters corresponding to a plurality of text sets in a process of training an initial text classification model by using the plurality of text sets; determining countermeasure disturbance information based on the gradient loss parameters, and determining countermeasure disturbance samples based on the countermeasure disturbance information; and training the initial text classification model based on the anti-disturbance sample and the determined classification loss parameter and matching loss parameter to obtain a target text classification model.
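The gradient-based counter-disturbance described here can be illustrated, for example, with an FGM-style perturbation of the word-embedding weights; this is a generic sketch of that well-known technique under assumed names, not the exact procedure of this application.

import torch

def fgm_perturb(embedding: torch.nn.Embedding, epsilon: float = 1.0) -> torch.Tensor:
    # Assumes loss.backward() has already populated embedding.weight.grad.
    grad = embedding.weight.grad
    norm = torch.norm(grad)
    if norm != 0 and not torch.isnan(norm):
        r_adv = epsilon * grad / norm        # counter-disturbance information r = eps * g / ||g||
        embedding.weight.data.add_(r_adv)    # perturbed embeddings act as the counter-disturbance sample
        return r_adv
    return torch.zeros_like(embedding.weight)

def fgm_restore(embedding: torch.nn.Embedding, r_adv: torch.Tensor) -> None:
    # Remove the perturbation to restore the original embedding weights after the adversarial pass.
    embedding.weight.data.sub_(r_adv)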
According to an embodiment of the present application, the steps involved in the text processing methods shown in fig. 3 and 4 may be performed by respective units in the text processing apparatus shown in fig. 9. For example, step S301 shown in fig. 3 and step S401 shown in fig. 4 may be executed by the acquisition unit 901 in fig. 9, and step S302 shown in fig. 3 may be executed by the processing unit 902 in fig. 9; steps S402, S403, S404, S405, S406 shown in fig. 4 may be performed by the training unit 903 in fig. 9.
According to an embodiment of the present application, each unit in the text processing apparatus shown in fig. 9 may be respectively or entirely combined into one or several units to form the text processing apparatus, or some unit(s) may be further split into multiple sub-units with smaller functions, which may implement the same operation without affecting implementation of technical effects of embodiments of the present application. The units are divided based on logic functions, and in practical application, the functions of one unit can be realized by a plurality of units, or the functions of a plurality of units can be realized by one unit. In other embodiments of the present application, the text processing apparatus may also include other units, and in practical applications, these functions may also be implemented by assistance of other units, and may be implemented by cooperation of a plurality of units.
It can be understood that the functions of the functional units of the text processing apparatus described in the embodiments of the present application can be specifically implemented according to the method in the foregoing method embodiments, and the specific implementation process of the method can refer to the relevant description of the foregoing method embodiments, which is not described herein again.
The text classification model for performing text classification on a text to be processed is obtained by adjusting model parameters of an initial text classification model based on classification loss parameters and matching loss parameters, the classification loss parameters are determined based on first text characteristics obtained by processing a first text by the initial text classification model, and the matching loss parameters are determined based on the first text characteristics and second text characteristics obtained by processing a second text by the initial text classification model. Wherein the first text and the second text are synonymous texts. By adopting the training mode of the text classification model provided by the application, the joint learning of the model can be realized by utilizing a plurality of synonymous texts, namely, the loss parameters for referring to the parameters of the adjusted model not only contain the classification loss of the training texts, but also contain the matching loss between the synonymous texts, so that the text classification model obtained by training has strong generalization capability, and the accuracy of text classification can be improved.
Please refer to fig. 10, which is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application. The text processing device described in the embodiment of the present application is configured to execute the text processing method described above, and includes: a processor 1001, a communication interface 1002, and a memory 1003. The processor 1001, the communication interface 1002, and the memory 1003 may be connected by a bus or in other manners, and in the embodiment of the present application, the connection by the bus is taken as an example.
The processor 1001 (or Central Processing Unit (CPU)) is the computing core and control core of the computer device, and can analyze various instructions in the computer device and process various data of the computer device. For example, the CPU can be used for analyzing a startup or shutdown instruction sent to the computer device and controlling the computer device to perform startup or shutdown operations; as another example, the CPU may transmit various types of interactive data between the internal structures of the computer device, and so on. The communication interface 1002 may optionally include a standard wired interface or a wireless interface (e.g., Wi-Fi, mobile communication interface, etc.), and is controlled by the processor 1001 for transmitting and receiving data. The memory 1003 (Memory) is a memory device in the computer device for storing programs and data. It is understood that the memory 1003 here may include a built-in memory of the computer device, and may also include an expansion memory supported by the computer device. The memory 1003 provides storage space that stores the operating system of the computer device, which may include, but is not limited to: an Android system, an iOS system, a Windows Phone system, etc., which are not limited in this application.
In the embodiment of the present application, the processor 1001 executes the executable program code in the memory 1003 to perform the following operations:
and acquiring a text to be processed, and inputting the text to be processed into the target text classification model for processing to obtain a text classification result of the text to be processed. The target text classification model is obtained by performing model training by using a plurality of text sets, wherein each text set comprises a first text and a second text which is a synonymous text with the first text; the target text classification model is obtained by adjusting model parameters of the initial text classification model based on the classification loss parameters and the matching loss parameters determined in the training process; the classification loss parameter is determined based on a first text feature obtained by processing a first text by the initial text classification model; the matching loss parameter is determined based on the first text feature and a second text feature obtained by processing a second text through the initial text classification model.
In one implementation, the processor 1001, by executing the executable program code in the memory 1003, may further perform the following operations: acquiring a plurality of text sets for training an initial text classification model; in the process of training the initial text classification model, inputting a first text included in any text set into the initial text classification model for processing to obtain a first text characteristic; inputting a second text included in any text set into the initial text classification model for processing to obtain a second text characteristic; determining a classification loss parameter based on the first text feature and a matching loss parameter based on the first text feature and the second text feature; and determining a target loss parameter based on the classification loss parameter and the matching loss parameter, and adjusting the model parameter of the initial text classification model based on the target loss parameter to obtain the target text classification model.
In one implementation, the processor 1001, by executing the executable program code in the memory 1003, may further perform the following operations: acquiring a labeling label of the first text feature; performing classification processing based on the first text feature, and determining a matching probability between the first text feature and the labeling label; and determining a classification loss parameter based on the matching probability between the first text feature and the labeling label.
In one implementation, the processor 1001, by executing the executable program code in the memory 1003, may further perform the following operations: multiplying the first text characteristic with the weight matrix to obtain an initial matching probability between the first text characteristic and each reference label in the plurality of reference labels; the weight matrix is generated in the process of training the initial text classification model, and the plurality of reference labels comprise labeling labels; normalizing the plurality of initial matching probabilities to obtain matching probabilities between the first text feature and each reference label; and determining the matching probability between the first text feature and the label from the matching probability between the first text feature and each reference label.
In one implementation mode, the number of the second texts is one or more, and each second text corresponds to one second text feature; the processor 1001, by executing the executable program code in the memory 1003, may also perform the following operations: matching the first text features with the second text features respectively to obtain matching parameters between the first text features and the second text features; determining target text features of which the predicted labels are the same as the labeling labels of the first text features from the second text features; a match loss parameter is determined based on a match parameter between the first text feature and the target text feature.
In one implementation, the second text included in any text set is determined based on the first text; the processor 1001, by executing the executable program code in the memory 1003, may also perform the following operations: carrying out synonym replacement processing on the first text to obtain a second text which is the synonym text with the first text; or translating the first text into a reference language, and translating the translation result back to the original language to which the first text belongs to obtain a second text which is the synonymous text with the first text; or inputting the first text into the synonymy text generation model for processing to obtain a second text which is the synonymy text with the first text.
In one implementation, the processor 1001, by executing the executable program code in the memory 1003, may further perform the following operations: determining gradient loss parameters corresponding to a plurality of text sets in the process of training the initial text classification model by using the plurality of text sets; determining countermeasure disturbance information based on the gradient loss parameters, and determining countermeasure disturbance samples based on the countermeasure disturbance information; and training the initial text classification model based on the anti-disturbance sample and the determined classification loss parameter and matching loss parameter to obtain a target text classification model.
The text classification model for performing text classification on a text to be processed is obtained by adjusting model parameters of an initial text classification model based on classification loss parameters and matching loss parameters, the classification loss parameters are determined based on first text characteristics obtained by processing a first text by the initial text classification model, and the matching loss parameters are determined based on the first text characteristics and second text characteristics obtained by processing a second text by the initial text classification model. Wherein the first text and the second text are synonymous texts. By adopting the training mode of the text classification model provided by the application, the joint learning of the model can be realized by utilizing a plurality of synonymous texts, namely, the loss parameters for referring to the parameters of the adjusted model not only contain the classification loss of the training texts, but also contain the matching loss between the synonymous texts, so that the text classification model obtained by training has strong generalization capability, and the accuracy of text classification can be improved.
Embodiments of the present application further provide a computer-readable storage medium, in which instructions are stored, and when the instructions are executed on a computer, the computer is enabled to execute a text processing method according to an embodiment of the present application. For a specific implementation, reference may be made to the foregoing description, which is not repeated herein.
Embodiments of the present application also provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and executes the computer instructions, so that the computer device executes the text processing method according to the embodiment of the application. For specific implementation, reference may be made to the foregoing description, which is not repeated herein.
The terms "first," "second," and the like in the description and claims of embodiments of the present application and in the drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprises" and any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, product, or apparatus that comprises a list of steps or elements is not limited to the listed steps or modules, but may alternatively include other steps or modules not listed or inherent to such process, method, apparatus, product, or apparatus.
Those of ordinary skill in the art will appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The method and the related apparatus provided by the embodiments of the present application are described with reference to the flowchart and/or the structural diagram of the method provided by the embodiments of the present application, and each flow and/or block of the flowchart and/or the structural diagram of the method, and the combination of the flow and/or block in the flowchart and/or the block diagram can be specifically implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block or blocks.
The above disclosure is only for the purpose of illustrating the preferred embodiments of the present application and is not to be construed as limiting the scope of the present application, so that the present application is not limited thereto, and all equivalent variations and modifications can be made to the present application.

Claims (11)

1. A method of text processing, the method comprising:
acquiring a text to be processed;
inputting the text to be processed into a target text classification model for processing to obtain a text classification result of the text to be processed;
the target text classification model is obtained by performing model training by using a plurality of text sets, and each text set comprises a first text and a second text which is a synonymous text with the first text; the target text classification model is obtained by adjusting model parameters of an initial text classification model based on classification loss parameters and matching loss parameters determined in a training process; the classification loss parameter is determined based on a first text feature obtained by processing the first text by the initial text classification model; the matching loss parameter is determined based on the first text feature and a second text feature obtained by processing the second text by the initial text classification model.
2. The method of claim 1, further comprising:
acquiring the plurality of text sets, and training the initial text classification model by using the plurality of text sets;
in the process of training the initial text classification model, inputting the first text included in any text set into the initial text classification model for processing to obtain the first text feature;
inputting the second text included in any text set into the initial text classification model for processing to obtain the second text characteristics;
determining the classification loss parameter based on the first text feature and the matching loss parameter based on the first text feature and the second text feature;
and determining a target loss parameter based on the classification loss parameter and the matching loss parameter, and adjusting the model parameter of the initial text classification model based on the target loss parameter to obtain the target text classification model.
3. The method of claim 2, wherein determining the classification loss parameter based on the first text feature comprises:
acquiring a labeling label of the first text feature;
classifying the first text feature, and determining a matching probability between the first text feature and the labeling label;
determining the classification loss parameter based on the matching probability between the first text feature and the labeling label.
4. The method of claim 3, wherein the classifying the first text feature and determining the matching probability between the first text feature and the labeling label comprises:
multiplying the first text feature by a weight matrix to obtain an initial matching probability between the first text feature and each reference label in a plurality of reference labels; the weight matrix is generated in the process of training the initial text classification model, and the plurality of reference labels comprise the labeling label;
normalizing the plurality of initial matching probabilities to obtain matching probabilities between the first text feature and the respective reference labels;
determining the matching probability between the first text feature and the labeling label from the matching probabilities between the first text feature and the respective reference labels.
5. The method according to any one of claims 2-4, wherein the second text is one or more, and each second text corresponds to one second text feature;
the determining the match loss parameter based on the first textual feature and the second textual feature includes:
matching the first text features with the second text features respectively to obtain matching parameters between the first text features and the second text features;
determining, from the second text features, target text features whose prediction labels are the same as the labeling label of the first text feature;
determining the match loss parameter based on a match parameter between the first text feature and the target text feature.
6. The method of claim 1, wherein the second text included in any set of text is determined based on the first text, the method further comprising:
carrying out synonym replacement processing on the first text to obtain a second text which is a synonymous text of the first text; or,
translating the first text into a reference language, and translating the translation result back to the original language to which the first text belongs to obtain a second text which is a synonymous text of the first text; or,
and inputting the first text into a synonymy text generation model for processing to obtain the second text which is the synonymy text with the first text.
7. The method of claim 1, further comprising:
determining gradient loss parameters corresponding to the plurality of text sets in the process of training the initial text classification model by using the plurality of text sets;
determining countermeasure disturbance information based on the gradient loss parameters, and determining countermeasure disturbance samples based on the countermeasure disturbance information;
and training the initial text classification model based on the anti-disturbance sample and the determined classification loss parameter and the matching loss parameter to obtain the target text classification model.
8. A text processing apparatus, characterized in that the apparatus comprises:
the acquisition unit is used for acquiring a text to be processed;
the processing unit is used for inputting the text to be processed into a target text classification model for processing to obtain a text classification result of the text to be processed;
the target text classification model is obtained by performing model training by using a plurality of text sets, and each text set comprises a first text and a second text which is a synonymous text with the first text; the target text classification model is obtained by adjusting model parameters of an initial text classification model based on classification loss parameters and matching loss parameters determined in a training process; the classification loss parameter is determined based on a first text feature obtained by processing the first text by the initial text classification model; the matching loss parameter is determined based on the first text feature and a second text feature obtained by processing the second text by the initial text classification model.
9. A text processing apparatus characterized by comprising:
a processor adapted to implement one or more computer programs; and,
a computer storage medium storing one or more computer programs loaded by the processor and implementing the text processing method of any of claims 1-7.
10. A computer storage medium, characterized in that the computer storage medium comprises a computer program which, when being executed by a processor, is adapted to carry out the text processing method according to any one of claims 1-7.
11. A computer program product, characterized in that the computer program product comprises a computer program stored in a computer storage medium, which computer program, when being executed by a processor, is adapted to carry out the text processing method of any one of claims 1-7.
CN202210318205.9A 2022-03-29 2022-03-29 Text processing method, device, equipment, storage medium and computer program product Pending CN114676255A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210318205.9A CN114676255A (en) 2022-03-29 2022-03-29 Text processing method, device, equipment, storage medium and computer program product

Publications (1)

Publication Number Publication Date
CN114676255A true CN114676255A (en) 2022-06-28

Family

ID=82075974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210318205.9A Pending CN114676255A (en) 2022-03-29 2022-03-29 Text processing method, device, equipment, storage medium and computer program product

Country Status (1)

Country Link
CN (1) CN114676255A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190095432A1 (en) * 2017-09-26 2019-03-28 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for building text classification model, and text classification method and apparatus
CN111767405A (en) * 2020-07-30 2020-10-13 腾讯科技(深圳)有限公司 Training method, device and equipment of text classification model and storage medium
CN111897964A (en) * 2020-08-12 2020-11-06 腾讯科技(深圳)有限公司 Text classification model training method, device, equipment and storage medium
CN113392210A (en) * 2020-11-30 2021-09-14 腾讯科技(深圳)有限公司 Text classification method and device, electronic equipment and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115357720A (en) * 2022-10-20 2022-11-18 暨南大学 Multi-task news classification method and device based on BERT
CN115905547A (en) * 2023-02-10 2023-04-04 中国航空综合技术研究所 Aeronautical field text classification method based on belief learning
CN115905547B (en) * 2023-02-10 2023-11-14 中国航空综合技术研究所 Aviation field text classification method based on confidence learning
CN115880120A (en) * 2023-02-24 2023-03-31 江西微博科技有限公司 Online government affair service system and service method
CN116467607A (en) * 2023-03-28 2023-07-21 阿里巴巴(中国)有限公司 Information matching method and storage medium
CN116467607B (en) * 2023-03-28 2024-03-01 阿里巴巴(中国)有限公司 Information matching method and storage medium
CN116738298A (en) * 2023-08-16 2023-09-12 杭州同花顺数据开发有限公司 Text classification method, system and storage medium
CN116738298B (en) * 2023-08-16 2023-11-24 杭州同花顺数据开发有限公司 Text classification method, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination