WO2023173555A1 - Model training method and apparatus, text classification method and apparatus, device, and medium - Google Patents

Model training method and apparatus, text classification method and apparatus, device, and medium

Info

Publication number
WO2023173555A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
training data
target
text
model
Application number
PCT/CN2022/090737
Other languages
French (fr)
Chinese (zh)
Inventor
王彦
谢淋
马骏
王少军
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2023173555A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a model training method, text classification method and device, equipment, and media.
  • Embodiments of this application propose a model training method, which is used to train a target classification model.
  • The method includes:
  • obtaining original training data, wherein the original training data includes first original data and second original data;
  • training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
  • Embodiments of this application propose a text classification method, which includes:
  • inputting the target text data into a target classification model for label classification processing to obtain label text data, wherein the target classification model is trained according to a model training method, and the model training method includes: obtaining original training data, wherein the original training data includes first original data and second original data;
  • training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
  • Embodiments of the present application propose a model training device.
  • The device includes:
  • a training data acquisition module, configured to acquire original training data, where the original training data includes first original data and second original data;
  • an upsampling module, configured to upsample the second original data to obtain initial training data;
  • a data enhancement module, configured to enhance the initial training data according to preset enhancement parameters to obtain enhanced training data;
  • an encoding module, configured to encode the enhanced training data to obtain a target word embedding vector;
  • a perturbation module, configured to perturb the target word embedding vector to obtain target training data;
  • a model training module, configured to train a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
  • Embodiments of the present application propose a text classification device, which includes:
  • a text data acquisition module, configured to obtain target text data to be classified;
  • a label classification module, configured to input the target text data into a target classification model for label classification processing to obtain label text data, wherein the target classification model is trained according to a model training method, and the model training method includes: obtaining original training data, wherein the original training data includes first original data and second original data;
  • training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
  • Embodiments of the present application provide an electronic device.
  • The electronic device includes a memory, a processor, a program stored in the memory and executable on the processor, and a data bus for connection and communication between the processor and the memory; when the program is executed by the processor, a model training method or a text classification method is implemented.
  • The model training method includes:
  • obtaining original training data, wherein the original training data includes first original data and second original data;
  • training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
  • The text classification method includes:
  • inputting the target text data into a target classification model for label classification processing to obtain label text data, wherein the target classification model is trained according to a model training method, and the model training method includes: obtaining original training data, wherein the original training data includes first original data and second original data;
  • training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
  • Embodiments of the present application provide a storage medium.
  • The storage medium is a computer-readable storage medium that stores one or more programs, and the one or more programs can be executed by one or more processors to implement a model training method or a text classification method.
  • The model training method includes:
  • obtaining original training data, wherein the original training data includes first original data and second original data;
  • training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
  • The text classification method includes:
  • inputting the target text data into a target classification model for label classification processing to obtain label text data, wherein the target classification model is trained according to a model training method, and the model training method includes: obtaining original training data, wherein the original training data includes first original data and second original data;
  • training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
  • The model training method, text classification method and device, electronic equipment and storage medium proposed by this application obtain original training data, where the original training data includes first original data and second original data. The second original data is upsampled to obtain the initial training data, which can effectively correct the abnormal data in the second original data and improve the rationality of the data. Furthermore, the initial training data is enhanced according to the preset enhancement parameters to obtain enhanced training data, the enhanced training data is encoded to obtain the target word embedding vector, and the target word embedding vector is perturbed to obtain the target training data. In this way, target training data that meets the needs can be obtained conveniently, so that it better highlights the characteristics of the minority-class training data and improves the neural network model's attention to the minority-class training data.
  • Training the preset neural network model based on the first original data and the target training data can improve the model's recognition accuracy on sample text data, improve the training effect of the model, and yield a target classification model that meets the needs. The target classification model is a text classification model that can be used to classify target text data; classifying target text data with the target classification model can improve the accuracy of text classification.
  • Figure 1 is a flow chart of a model training method provided by an embodiment of the present application;
  • Figure 2 is a flow chart of step S103 in Figure 1;
  • Figure 3 is another flow chart of step S103 in Figure 1;
  • Figure 4 is a flow chart of step S106 in Figure 1;
  • Figure 5 is a flow chart of the text classification method provided by an embodiment of the present application;
  • Figure 6 is a flow chart of step S502 in Figure 5;
  • Figure 7 is a schematic structural diagram of a model training device provided by an embodiment of the present application;
  • Figure 8 is a schematic structural diagram of a text classification device provided by an embodiment of the present application;
  • Figure 9 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the present application.
  • Artificial intelligence (AI): a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can respond in a manner similar to human intelligence. Research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information process of human consciousness and thinking. Artificial intelligence is also a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • Natural language processing (NLP): uses computers to process, understand and use human languages (such as Chinese and English). NLP is a branch of artificial intelligence and an interdisciplinary subject of computer science and linguistics, often called computational linguistics. Natural language processing includes syntax analysis, semantic analysis, text understanding, etc. It is commonly used in technical fields such as machine translation, handwritten and printed character recognition, speech recognition and text-to-text conversion, information intent recognition, information extraction and filtering, text classification and clustering, and public opinion analysis and opinion mining. It involves data mining, machine learning, knowledge acquisition, knowledge engineering and artificial intelligence research related to language processing, as well as linguistic research related to language computing.
  • Information extraction (NER): a text processing technology that extracts specified types of factual information, such as entities, relationships and events, from natural language text and outputs it as structured data.
  • Information extraction is a technique for extracting specific information from text data.
  • Text data is composed of some specific units, such as sentences, paragraphs, and chapters.
  • Text information is composed of some small specific units, such as characters, words, phrases, sentences, paragraphs, or a combination of these specific units.
  • Extracting noun phrases, person names, place names, etc. from text data is text information extraction.
  • the information extracted by text information extraction technology can be various types of information.
  • Data upsampling: amplifying the minority-class samples until their number equals that of the majority-class samples. For example, take one sample from the minority class, compute the distance between it and the other samples, sort the other samples by Euclidean distance, and take the first 5.
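The neighbour lookup in the example above can be sketched as follows. This is a minimal illustration, assuming the samples are already numeric feature vectors; the function name `nearest_neighbours` and the toy vectors are hypothetical, and the source does not state what is done with the five neighbours afterwards.

```python
import math


def nearest_neighbours(sample, others, k=5):
    """Return the k vectors in `others` closest to `sample` (Euclidean distance)."""
    # Sort the candidate vectors by their distance to the chosen sample.
    ranked = sorted(others, key=lambda v: math.dist(sample, v))
    return ranked[:k]


# Toy minority-class samples (hypothetical 2-D feature vectors).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0),
            (0.5, 0.5), (4.0, 0.0), (1.0, 1.0)]
chosen, rest = minority[0], minority[1:]
neighbours = nearest_neighbours(chosen, rest, k=5)
```

Here `neighbours` holds the five vectors nearest to `chosen`, ordered from closest to farthest.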
  • Data Augmentation is also called data amplification, which means that limited data can generate value equivalent to more data without substantially increasing the data.
  • Data augmentation can be divided into supervised data augmentation and unsupervised data augmentation methods. Among them, supervised data enhancement can be divided into single-sample data enhancement and multi-sample data enhancement methods, while unsupervised data enhancement can be divided into two directions: generating new data and learning enhancement strategies.
  • Encoding (encoder) converts the input sequence into a fixed-length vector; decoding (decoder) converts the previously generated fixed-length vector into an output sequence. The input sequence can be text, voice, image, or video; the output sequence can be text or images.
  • BERT (Bidirectional Encoder Representations from Transformers): a language representation model. BERT is built by connecting Transformer encoder blocks and is a typical bidirectional encoding model.
  • Embedding: a vector representation, that is, a low-dimensional vector used to represent an object. The object can be a word, a product, a movie, etc.
  • The key property of an embedding vector is that objects whose vectors are close to each other have similar meanings. For example, the distance between embedding(Avengers) and embedding(Iron Man) will be small, whereas the distance between embedding(Avengers) and embedding(Gone with the Wind) will be larger.
  • Embedding is essentially a mapping, a mapping from semantic space to vector space, while maintaining the relationship between the original sample and the semantic space in the vector space as much as possible.
  • Embedding can encode objects with low-dimensional vectors and retain their meaning. It is often used in machine learning. In the process of building a machine learning model, the object is encoded into a low-dimensional dense vector and then passed to the DNN to improve efficiency.
  • Softmax classifier: a generalization of the logistic regression classifier to multi-class classification; its output is the probability of belonging to each category.
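As a small illustration of the softmax definition above, the following sketch turns a list of raw class scores into probabilities. The scores are made-up values, and the function is a generic textbook formulation rather than code from the application.

```python
import math


def softmax(logits):
    """Map raw class scores to probabilities that sum to 1."""
    # Subtract the maximum score first for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])           # hypothetical scores for three categories
predicted_class = probs.index(max(probs))  # index of the most probable category
```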
  • Embodiments of the present application provide a model training method, text classification method and device, equipment, and media, aiming to improve the model's recognition accuracy on sample text data, thereby improving the training effect of the model.
  • The model training method, text classification method and device, equipment, and medium provided by the embodiments of the present application are specifically described through the following embodiments. First, the model training method in the embodiment of the present application is described.
  • Artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometric technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • The model training method, text classification method and device, equipment, and media provided by the embodiments of this application relate to the field of artificial intelligence technology.
  • The model training method, text classification method and device, equipment, and media provided by the embodiments of the present application can be applied to terminals or servers, or can be software running in terminals or servers.
  • The terminal can be a smartphone, a tablet, a laptop, a desktop computer, etc.
  • The server can be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
  • The software can be an application that implements the text classification method, etc., but is not limited to the above forms.
  • The application may be used in a variety of general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics devices, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
  • The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • Generally, program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • The present application may also be practiced in distributed computing environments where tasks are performed by remote processing devices connected through a communications network.
  • In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
  • Figure 1 is an optional flow chart of the model training method provided by the embodiment of the present application.
  • The method in Figure 1 may include, but is not limited to, steps S101 to S106.
  • Step S101: Obtain original training data, wherein the original training data includes first original data and second original data;
  • Step S102: Perform upsampling processing on the second original data to obtain initial training data;
  • Step S103: Perform enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data;
  • Step S104: Encode the enhanced training data to obtain a target word embedding vector;
  • Step S105: Perform perturbation processing on the target word embedding vector to obtain target training data;
  • Step S106: Train a preset neural network model according to the first original data and the target training data to obtain a target classification model, where the target classification model is a text classification model used to classify target text data.
  • Steps S101 to S106 illustrated in the embodiment of the present application obtain initial training data by upsampling the second original data, which can effectively correct abnormal data in the second original data and improve the rationality of the data.
  • The initial training data is then enhanced according to the preset enhancement parameters to obtain enhanced training data, the enhanced training data is encoded to obtain the target word embedding vector, and the target word embedding vector is perturbed to obtain the target training data. In this way, target training data that meets the needs can be obtained conveniently, so that it better highlights the characteristics of the minority-class training data and improves the neural network model's attention to the minority-class training data.
  • Training the preset neural network model based on the first original data and the target training data can improve the model's recognition accuracy on sample text data, improve the training effect of the model, and yield a target classification model that meets the needs.
  • Sample data can be obtained by writing a web crawler, setting the data source, and then crawling data in a targeted manner.
  • Sample data can also be obtained through other methods and is not limited to this. It should be noted that this sample data is text data with text category labels. According to the preset proportion parameters, the sample data is divided into original training data, original verification data and original test data. To improve the training effect of the model, data enhancement must be performed on the original training data. Specifically, data statistics are first computed on the original training data to obtain the number of samples of each text category in the original training data.
  • The original training data is divided into the first original data and the second original data; that is, the first original data and the second original data can be distinguished according to the text category labels on the original training data.
  • The first original data is the original training data whose number of samples is greater than a preset quantity threshold, and is labeled as majority-class sample data (label 0); the second original data is the original training data whose number of samples is less than or equal to the preset quantity threshold, and is labeled as minority-class sample data (label 1). The majority-class sample data with label 0 (i.e., the first original data) does not need data enhancement, whereas the minority-class sample data with label 1 (i.e., the second original data) needs to undergo data enhancement processing.
  • Suppose the number of samples of the majority-class sample data with label 0 (i.e., the first original data) is m and the number of samples of the minority-class sample data with label 1 (i.e., the second original data) is n. Data enhancement must then be performed on the second original data to obtain m-n additional samples, so that the ratio between the number of samples of the enhanced second original data and that of the first original data is 1:1.
  • The minority-class sample data with label 1 (i.e., the second original data) that needs to be enhanced is randomly upsampled until its number of samples equals that of the majority-class sample data with label 0 (i.e., the first original data). The upsampled second original data therefore contains m-n repeated samples; the resulting new training data is recorded as the initial training data.
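The split by quantity threshold and the random upsampling described above can be sketched as follows. This is a minimal illustration under stated assumptions: samples are `(text, category)` pairs, the helper names `split_by_threshold` and `random_upsample`, the categories, and the threshold value are all hypothetical, and sampling with replacement is one plausible reading of "random upsampling".

```python
import random
from collections import Counter


def split_by_threshold(samples, threshold):
    """Split (text, category) pairs into majority and minority sets by category count."""
    counts = Counter(category for _, category in samples)
    first = [s for s in samples if counts[s[1]] > threshold]    # majority class, label 0
    second = [s for s in samples if counts[s[1]] <= threshold]  # minority class, label 1
    return first, second


def random_upsample(second, target_size, rng):
    """Repeat randomly chosen minority samples until `target_size` samples exist."""
    extra = [rng.choice(second) for _ in range(target_size - len(second))]
    return second + extra


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
samples = ([("text a%d" % i, "billing") for i in range(6)]
           + [("text b%d" % i, "fraud") for i in range(2)])
first_original, second_original = split_by_threshold(samples, threshold=3)
initial_training = random_upsample(second_original, len(first_original), rng)
```

With m = 6 majority samples and n = 2 minority samples, `initial_training` contains the 2 original minority samples plus m - n = 4 repeated ones.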
  • The enhancement parameter includes a first disturbance ratio, and step S103 may include, but is not limited to, steps S201 to S203:
  • Step S201: Obtain the first sentence length of the initial training data;
  • Step S202: Calculate the first disturbance amount based on the first sentence length and the first disturbance ratio;
  • Step S203: Delete content from the initial training data according to the first disturbance amount to obtain enhanced training data.
  • In step S201 of some embodiments, the first sentence length s1 of each text sentence in the initial training data set is counted in units of characters. For example, if a text sentence consists of five characters and three punctuation marks, then the first sentence length s1 of that sentence is 8.
  • The first disturbance ratio may be set according to actual requirements. For example, if the first disturbance ratio r1 is set to 0.1, the first disturbance amount d1 is calculated from the first sentence length s1 and the first disturbance ratio r1: d1 is the value of s1*r1 converted to an integer, that is, d1 = int(s1*r1).
  • In step S203 of some embodiments, int(s1*r1) positions are randomly selected from the current text sentence as replacement positions, and the characters at these positions are replaced with nulls. This deletes content from the text sentences of the initial training data and yields the enhanced training data.
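Steps S201 to S203 can be sketched as below. This is a minimal illustration, assuming that "replacing with nulls" means dropping the selected characters; the function name `delete_characters` and the example sentence are hypothetical.

```python
import random


def delete_characters(sentence, ratio, rng):
    """Delete int(len(sentence) * ratio) randomly chosen characters."""
    amount = int(len(sentence) * ratio)  # first disturbance amount d1 = int(s1 * r1)
    positions = set(rng.sample(range(len(sentence)), amount))
    # Replacing a character with "null" is modelled here as dropping it.
    return "".join(ch for i, ch in enumerate(sentence) if i not in positions)


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
original = "the claim was filed, reviewed, and paid."
augmented = delete_characters(original, ratio=0.1, rng=rng)
```

Note that for short sentences `int(s1 * r1)` can be 0, in which case the sentence is returned unchanged.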
  • The enhancement parameter includes a second disturbance ratio, and step S103 may include, but is not limited to, steps S301 to S303:
  • Step S301: Obtain the second sentence length of the initial training data;
  • Step S302: Calculate the second disturbance amount based on the second sentence length and the second disturbance ratio;
  • Step S303: Expand the initial training data according to the second disturbance amount and preset punctuation marks to obtain enhanced training data.
  • In step S301 of some embodiments, the second sentence length s2 of each text sentence in the initial training data set is counted in units of characters. For example, if a text sentence consists of six characters and two punctuation marks, then the second sentence length s2 of that sentence is 8.
  • The second disturbance ratio may be set according to actual requirements. For example, if the second disturbance ratio r2 is set to 0.1, the second disturbance amount d2 is calculated from the second sentence length s2 and the second disturbance ratio r2: d2 is the value of s2*r2 converted to an integer, that is, d2 = int(s2*r2).
  • The preset punctuation mark p is a neutral mark, such as a comma, a colon, a semicolon, a period, or an ellipsis. int(s2*r2) positions are randomly selected from the current text sentence as replacement positions, int(s2*r2) symbols are randomly drawn from p (repeated draws are allowed), and the characters at the replacement positions are replaced with the drawn punctuation marks. In this way, the text sentences of the initial training data are expanded and the enhanced training data is obtained.
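Steps S301 to S303 can be sketched as follows. The sketch follows the literal description in the text, in which the characters at the selected positions are replaced with punctuation marks; the set of marks in `PUNCTUATION`, the function name `replace_with_punctuation`, and the example sentence are hypothetical choices for illustration.

```python
import random

PUNCTUATION = [",", ":", ";", ".", "…"]  # hypothetical set of neutral marks


def replace_with_punctuation(sentence, ratio, rng, marks=PUNCTUATION):
    """Replace int(len(sentence) * ratio) random characters with punctuation marks."""
    amount = int(len(sentence) * ratio)  # second disturbance amount d2 = int(s2 * r2)
    positions = rng.sample(range(len(sentence)), amount)
    chars = list(sentence)
    for pos in positions:
        chars[pos] = rng.choice(marks)   # the same mark may be drawn more than once
    return "".join(chars)


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
original = "the claim was filed, reviewed, and paid."
augmented = replace_with_punctuation(original, ratio=0.1, rng=rng)
```

If the intent is to lengthen the sentence instead, the assignment inside the loop could be swapped for an insertion at `pos`; the text does not settle which variant is meant.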
  • The first disturbance ratio and the second disturbance ratio can be understood as enhancement ratios, which determine the proportion of characters that need to be modified in a given text sentence. The first disturbance amount and the second disturbance amount can be understood as the number of enhanced characters, which determines how many characters need to be modified in a given text sentence.
  • The above two data enhancement methods can be used together, or either one can be used alone.
  • For example, if the two data enhancement methods are used together on the initial training data and the proportion of one of them is set to k, then (m-n)*k samples in the initial training data are enhanced using that method, while the other samples in the initial training data are enhanced using the other method.
  • For instance, the (m-n)*k samples are processed by deletion through steps S201 to S203 above, and the remaining samples, excluding those (m-n)*k samples, are expanded through steps S301 to S303 above, thereby obtaining the enhanced training data.
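The allocation between the two enhancement methods is simple arithmetic and can be sketched as below. The values of m, n and k are made up for illustration, and truncating (m-n)*k to an integer is an assumption, since the text does not say how a fractional count is handled.

```python
def assign_methods(num_samples, k):
    """Return how many samples go to each of the two enhancement methods."""
    by_first = int(num_samples * k)  # samples enhanced by deletion (steps S201 to S203)
    return by_first, num_samples - by_first  # the rest are expanded (steps S301 to S303)


m, n, k = 1000, 200, 0.5  # hypothetical class sizes and proportion
deleted_count, expanded_count = assign_methods(m - n, k)
```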
  • In step S104 of some embodiments, a BERT encoder may be used to encode the enhanced training data to obtain a target word embedding vector. Because BERT is built by connecting Transformer encoder blocks, it is a typical bidirectional encoding model. The enhanced training data can therefore be bidirectionally encoded by the BERT encoder, that is, encoded from left to right and from right to left, to obtain the target word embedding vector (token embedding).
  • In step S105 of some embodiments, when performing perturbation processing on the target word embedding vector, a perturbation can be added to the target word embedding vector (token embedding) along the gradient direction according to a preset perturbation factor.
  • The preset perturbation factor can be represented as a word embedding weight matrix; that is, the target word embedding vector and the preset word embedding weight matrix are matrix-multiplied along the gradient direction to obtain the target training data.
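One way to picture "adding a perturbation along the gradient direction" is the normalised-gradient step used in fast-gradient-style adversarial training. The sketch below swaps in that technique as an assumption; it is not the exact expression of the application, which the text does not give. The toy embedding, the linear classifier `weights`, and `epsilon` are all hypothetical.

```python
import math


def softmax(z):
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grad(x, weights, label):
    """Cross-entropy loss of a linear classifier and its gradient with respect to x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    probs = softmax(logits)
    loss = -math.log(probs[label])
    # d(loss)/d(logit_c) = p_c - 1[c == label]; chain rule through logits = W x.
    delta = [p - (1.0 if c == label else 0.0) for c, p in enumerate(probs)]
    grad = [sum(delta[c] * weights[c][j] for c in range(len(weights)))
            for j in range(len(x))]
    return loss, grad


def perturb(x, grad, epsilon):
    """Move x by a step of length epsilon along the normalised gradient direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(x)
    return [xi + epsilon * g / norm for xi, g in zip(x, grad)]


embedding = [0.2, -0.1, 0.4]                    # toy token embedding
weights = [[0.5, -0.3, 0.1], [-0.2, 0.4, 0.3]]  # toy 2-class linear layer
clean_loss, grad = loss_and_grad(embedding, weights, label=0)
perturbed = perturb(embedding, grad, epsilon=0.1)
perturbed_loss, _ = loss_and_grad(perturbed, weights, label=0)
```

Because the step follows the gradient of the loss, the perturbed embedding is a harder example for the classifier than the clean one, which is the usual motivation for this kind of training.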
  • Step S106 may include, but is not limited to, steps S401 to S403:
  • Step S401: Perform perturbation calculation on the first original data and the target training data through a preset function to obtain the text perturbation value;
  • Step S402: Calculate the loss function of the neural network model based on the text perturbation value to obtain the loss value;
  • Step S403: Use the loss value as a backpropagation amount to adjust the model parameters of the neural network model, so as to train the neural network model and obtain a text classification model.
  • In step S401 of some embodiments, the first original data and the target training data are first input into the preset neural network model, the number of iterations (epoches_num) and the data batch size (batch size) of the neural network model are set, and the first original data and the target training data are divided into multiple batches according to the data batch size to obtain batch data.
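The batching described in step S401 can be sketched as follows. This is a minimal illustration; the names `make_batches` and `epochs_num`, the toy data, and the chosen sizes are hypothetical, and the loop body is left empty because the actual forward and backward passes depend on the model.

```python
def make_batches(data, batch_size):
    """Split the training data into consecutive batches of at most batch_size items."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]


epochs_num = 3   # number of iterations over the whole training set
batch_size = 4   # data batch size
training_set = [("sample %d" % i, i % 2) for i in range(10)]  # toy (text, label) pairs
batches = make_batches(training_set, batch_size)

for epoch in range(epochs_num):
    for batch in batches:
        pass  # a real loop would compute the perturbation, the loss and the update here
```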
  • The preset function is the cross-entropy function.
  • [...] norm. The text perturbation value is [...]. The value range of the perturbation hyperparameter is (0, 1]. To add a larger perturbation to the target word embedding vector, set the hyperparameter to a larger value. Repeated verification showed that the model trains better when the hyperparameter is between 0.1 and 0.3.
  • A parameter k is set to control the number of perturbations, and the above intermediate parameter calculation and target parameter calculation are repeated k times. Since an excessive number of perturbations introduces too much noise and affects the prediction accuracy of the neural network model on each text category, the number of perturbations is generally set to 2 or 3 to obtain the final text perturbation value.
  • the loss function of the neural network model is calculated based on the final text perturbation value to obtain the loss value. Specifically, the loss function corresponding to the fully connected layer of the neural network model can be calculated to obtain the loss value.
  • the loss value is used as the backpropagation quantity to adjust the model parameters of the neural network model, thereby training the neural network model and obtaining a text classification model. As a result, the labeled text data generated by the neural network model is more accurate, and the recognition accuracy of the neural network model for minority-class text data is improved.
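  • The loss calculation and parameter update can be illustrated with a single linear layer trained with cross-entropy loss. This is a simplified stand-in for the preset neural network model, not the model of the application; the learning rate `lr`, the feature vector, and the function names are assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], label: int) -> float:
    """Cross-entropy loss for one sample with an integer class label."""
    return -math.log(probs[label])


def train_step(weights: List[List[float]], features: List[float],
               label: int, lr: float) -> float:
    """One gradient-descent step; returns the loss measured before the update."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    probs = softmax(logits)
    loss = cross_entropy(probs, label)
    for c, row in enumerate(weights):
        grad_logit = probs[c] - (1.0 if c == label else 0.0)  # d loss / d logit_c
        for j, x in enumerate(features):
            row[j] -= lr * grad_logit * x  # propagate the loss gradient to the weights
    return loss


weights = [[0.0, 0.0], [0.0, 0.0]]   # two classes, two input features
features = [1.0, 2.0]                # hypothetical (perturbed) embedding features
losses = [train_step(weights, features, label=1, lr=0.1) for _ in range(5)]
```

  • In this toy setting the loss decreases on every step, which is the behavior the parameter adjustment is meant to produce.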
  • the model training method of the embodiment of the present application obtains original training data, where the original training data includes first original data and second original data, and performs upsampling processing on the second original data to obtain initial training data, which can effectively correct abnormal data in the second original data and improve the rationality of the data. Furthermore, the initial training data is enhanced according to the preset enhancement parameters to obtain enhanced training data, the enhanced training data is encoded to obtain the target word embedding vector, and the target word embedding vector is perturbed to obtain the target training data. In this way, target training data that meets the requirements can be obtained conveniently, so that the target training data better highlights the characteristics of the minority-class training data and increases the attention the neural network model pays to it. Finally, training the preset neural network model based on the first original data and the target training data can improve the model's recognition accuracy on sample text data, improve the training effect of the model, and yield a target classification model that meets the requirements.
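  • The upsampling step at the start of this pipeline can be sketched as random oversampling with replacement. The passage does not say which upsampling scheme is used, so this scheme, the helper name `upsample`, and the choice of matching the size of the first original data are assumptions.

```python
import random
from typing import List, Tuple

Sample = Tuple[str, int]  # (text, label) pair; this layout is an assumption


def upsample(minority: List[Sample], target_size: int, seed: int = 0) -> List[Sample]:
    """Randomly duplicate minority samples until target_size samples exist."""
    if not minority:
        raise ValueError("minority set must not be empty")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    missing = max(0, target_size - len(minority))
    extra = [rng.choice(minority) for _ in range(missing)]
    return list(minority) + extra


first_original = [("text a", 0), ("text b", 0), ("text c", 0), ("text d", 0)]
second_original = [("rare text x", 1), ("rare text y", 1)]

# Bring the minority class up to the size of the majority class.
initial_training = upsample(second_original, len(first_original))
```

  • Because plain oversampling only repeats existing samples, the later enhancement and perturbation steps are what introduce variation into the duplicated minority-class data.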
  • Figure 5 is an optional flow chart of the text classification method provided by the embodiment of the present application.
  • the method in Figure 5 may include, but is not limited to, steps S501 to S502.
  • Step S501: Obtain the target text data to be classified;
  • Step S502: Input the target text data into a target classification model for label classification processing to obtain label text data, wherein the target classification model is trained according to the training method of the embodiment of the first aspect.
  • the target text data to be classified can be obtained by writing a web crawler, setting the data source, and then crawling data in a targeted manner.
  • Sample data can also be obtained through other methods and is not limited to this. It should be noted that the target text data can be articles, text fields, text segments, etc.
  • the target text data is input into the target classification model, the target text data is mapped to a preset vector space through the target classification model to obtain a target text vector, and the target text vector is subjected to label classification processing through a preset classification function to obtain label text data.
  • step S502 may also include, but is not limited to, steps S601 to S602:
  • Step S601: Map the target text data to a preset vector space through the fully connected layer of the target classification model to obtain the target text vector;
  • Step S602: Perform label classification processing on the target text vector through the classification function of the fully connected layer and the preset text category labels to obtain label text data.
  • In step S601 of some embodiments, the feature dimension of the preset text category labels is obtained, and the target text data is mapped from semantic space to vector space through the MLP network of the fully connected layer, so that the target text data has the same feature dimension as the preset text category labels in the vector space, thereby obtaining the target text vector.
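  • The mapping to a vector whose dimension matches the text category labels can be sketched with a single linear layer. A real MLP would stack several such layers with nonlinearities; the weights, bias, input features, and label names below are hypothetical.

```python
from typing import List


def fully_connected(vector: List[float], weights: List[List[float]],
                    bias: List[float]) -> List[float]:
    """Linear layer: one output value per row of weights."""
    if len(weights) != len(bias):
        raise ValueError("weights and bias must have the same number of rows")
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


labels = ["prose", "novels", "poetry collections"]  # preset text category labels
text_features = [1.0, 2.0]                          # hypothetical semantic features

# One weight row per label, so the output dimension equals the label count.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, -1.0]

target_text_vector = fully_connected(text_features, weights, bias)
```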
  • the classification function may be a softmax function. For example, a probability distribution is created on each text category label through the softmax function to obtain a predicted probability value that the target text vector belongs to each text category. Finally, according to the size of the classification probability value, the text category judgment and labeling processing are performed on the target text vector to obtain label text data.
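  • The softmax classification described above can be sketched as follows: softmax turns the target text vector into a probability distribution over the text category labels, and the label with the highest probability is assigned. The scores and the function name `classify` are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(target_text_vector: List[float], labels: List[str]) -> Tuple[str, float]:
    """Return the most probable label and its predicted probability."""
    if len(target_text_vector) != len(labels):
        raise ValueError("vector dimension must match the number of labels")
    probs = softmax(target_text_vector)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]


labels = ["classical literature", "foreign literature", "prose",
          "novels", "poetry collections"]
scores = [0.1, 0.3, 2.0, 0.5, 0.2]   # hypothetical target text vector
label, prob = classify(scores, labels)
```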
  • preset text category labels can be set according to actual needs, and the text category labels in different business scenarios can be different.
  • the preset text category labels include classical literature, foreign literature, prose, novels, poetry collections, etc.
  • preset text category labels can include transportation, weather conditions, time information, etc.
  • the text classification method of the embodiment of the present application obtains the target text data to be classified and inputs the target text data into the target classification model for label classification processing.
  • the target classification model has good recognition accuracy for minority text data.
  • the target classification model can identify target text data of different categories, and classify the target text data according to different category labels to obtain labeled text data, which improves the accuracy of text classification.
  • the model training device includes:
  • the training data acquisition module 701 is used to acquire original training data, where the original training data includes first original data and second original data;
  • the upsampling module 702 is used to perform upsampling processing on the second original data to obtain initial training data;
  • the data enhancement module 703 is used to enhance the initial training data according to preset enhancement parameters to obtain enhanced training data;
  • the encoding module 704 is used to encode the enhanced training data to obtain the target word embedding vector;
  • the perturbation module 705 is used to perturb the target word embedding vector to obtain target training data;
  • the model training module 706 is used to train a preset neural network model according to the first original data and the target training data to obtain a target classification model, where the target classification model is a text classification model used to classify target text data.
  • data enhancement module 703 includes:
  • the first sentence length acquisition unit is used to obtain the first sentence length of the initial training data;
  • the first perturbation amount calculation unit is used to calculate the first perturbation amount based on the first sentence length and the first perturbation ratio;
  • the data deletion unit is used to delete data from the initial training data according to the first perturbation amount to obtain enhanced training data.
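  • The deletion-based enhancement can be sketched as follows: the perturbation amount is the sentence length multiplied by the perturbation ratio, and that many tokens are removed. Word-level tokens, rounding to the nearest integer, and uniformly random positions are assumptions that this passage does not confirm.

```python
import random
from typing import List


def delete_tokens(tokens: List[str], ratio: float, seed: int = 0) -> List[str]:
    """Remove round(len(tokens) * ratio) tokens at random positions."""
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must lie in [0, 1)")
    amount = round(len(tokens) * ratio)     # first perturbation amount
    rng = random.Random(seed)               # fixed seed keeps the sketch reproducible
    drop = set(rng.sample(range(len(tokens)), amount))
    return [tok for i, tok in enumerate(tokens) if i not in drop]


tokens = "the model often fails to recognise rare minority class text".split()
enhanced = delete_tokens(tokens, ratio=0.2)   # 10 tokens * 0.2 = 2 deletions
```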
  • the data enhancement module 703 includes:
  • the second sentence length acquisition unit is used to obtain the second sentence length of the initial training data;
  • the second perturbation amount calculation unit is used to calculate the second perturbation amount based on the second sentence length and the second perturbation ratio;
  • the data expansion unit is used to expand the initial training data according to the second perturbation amount and preset punctuation marks to obtain enhanced training data.
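  • The expansion-based enhancement can be sketched in the same way: the perturbation amount is the sentence length multiplied by the perturbation ratio, and that many preset punctuation marks are inserted. The punctuation set, the rounding rule, and the random insertion positions are assumptions.

```python
import random
from typing import List, Sequence

PUNCTUATION = [",", ".", ";", ":", "?", "!"]   # assumed preset punctuation marks


def insert_punctuation(tokens: List[str], ratio: float,
                       marks: Sequence[str] = PUNCTUATION, seed: int = 0) -> List[str]:
    """Insert round(len(tokens) * ratio) punctuation marks at random positions."""
    if ratio < 0.0:
        raise ValueError("ratio must be non-negative")
    amount = round(len(tokens) * ratio)        # second perturbation amount
    rng = random.Random(seed)                  # fixed seed keeps the sketch reproducible
    result = list(tokens)
    for _ in range(amount):
        position = rng.randint(0, len(result))  # any gap, including both ends
        result.insert(position, rng.choice(marks))
    return result


tokens = "the model often fails to recognise rare minority class text".split()
expanded = insert_punctuation(tokens, ratio=0.2)   # 10 tokens * 0.2 = 2 insertions
```

  • Inserting punctuation leaves every original token in place and in order, so the label of the sentence is unlikely to change while its surface form does.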
  • model training module 706 includes:
  • the perturbation calculation unit is used to perform perturbation calculation on the first original data and the target training data through a preset function to obtain a text perturbation value;
  • the loss value calculation unit is used to calculate the loss function of the neural network model based on the text perturbation value to obtain a loss value;
  • the training unit is used to use the loss value as the backpropagation quantity to adjust the model parameters of the neural network model, thereby training the neural network model and obtaining a text classification model.
  • the model training device in the embodiment of the present application is used to perform the model training method in the above embodiment.
  • the specific processing process is the same as the model training method in the above embodiment, and will not be described again here.
  • the text classification device includes:
  • the text data acquisition module 801 is used to acquire the target text data to be classified;
  • the label classification module 802 is used to input the target text data into a target classification model for label classification processing to obtain label text data, wherein the target classification model is trained according to the training method of any one of the embodiments of the first aspect.
  • the label classification module 802 includes:
  • the mapping unit is used to map the target text data to the preset vector space through the fully connected layer of the target classification model to obtain the target text vector;
  • the label classification unit is used to perform label classification processing on the target text vector through the classification function of the fully connected layer and the preset text category label to obtain label text data.
  • the text classification device in the embodiment of the present application is used to perform the text classification method in the above embodiment. Its specific processing process is the same as the text classification method in the above embodiment, and will not be described again here.
  • Embodiments of the present application also provide an electronic device.
  • the electronic device includes: a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for realizing connection and communication between the processor and the memory.
  • a model training method or a text classification method is implemented, wherein the model training method includes: obtaining original training data, wherein the original training data includes first original data and second original data; performing upsampling processing on the second original data to obtain initial training data; performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data; performing encoding processing on the enhanced training data to obtain the target word embedding vector; performing perturbation processing on the target word embedding vector to obtain target training data; training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data; wherein the text classification method includes: obtaining target text data to be classified; in
  • the electronic device includes:
  • the processor 901 can be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided by the embodiments of this application;
  • the memory 902 can be implemented in the form of read-only memory (ROM), a static storage device, a dynamic storage device, or random access memory (RAM).
  • the memory 902 can store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 902 and called by the processor 901 to execute the model training method or the text classification method of the embodiments of this application;
  • Communication interface 904 is used to realize communication interaction between this device and other devices. Communication can be achieved through wired means (such as USB, network cable, etc.) or wirelessly (such as mobile network, WIFI, Bluetooth, etc.);
  • Bus 905 which transmits information between various components of the device (such as processor 901, memory 902, input/output interface 903, and communication interface 904);
  • the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 implement communication connections between each other within the device through the bus 905.
  • Embodiments of the present application also provide a storage medium.
  • the storage medium is a computer-readable storage medium for computer-readable storage.
  • the storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement a model training method or a text classification method, wherein the model training method includes: obtaining original training data, wherein the original training data includes first original data and second original data; performing upsampling processing on the second original data to obtain initial training data; performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data; encoding the enhanced training data to obtain the target word embedding vector; performing perturbation processing on the target word embedding vector to obtain target training data; training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data; wherein the text classification method includes: obtaining target text data to be classified; inputting the target
  • memory can be used to store non-transitory software programs and non-transitory computer executable programs.
  • the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device.
  • the memory may optionally include memory located remotely from the processor, and the remote memory may be connected to the processor via a network. Examples of the above-mentioned networks include but are not limited to the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
  • the model training method, model training device, text classification method, text classification device, electronic device and storage medium provided by the embodiments of the present application obtain original training data, where the original training data includes first original data and second original data, and perform upsampling processing on the second original data to obtain initial training data, which can effectively correct abnormal data in the second original data and improve the rationality of the data. Furthermore, the initial training data is enhanced according to the preset enhancement parameters to obtain enhanced training data, the enhanced training data is encoded to obtain the target word embedding vector, and the target word embedding vector is perturbed to obtain the target training data. In this way, target training data that meets the requirements can be obtained conveniently, so that the target training data better highlights the characteristics of the minority-class training data and increases the attention the neural network model pays to it. Training the preset neural network model based on the first original data and the target training data can improve the model's recognition accuracy on sample text data, improve the training effect of the model, and yield a target classification model that meets the requirements, where the target classification model is a text classification model that can be used to classify target text data. Classifying target text data through the target classification model can improve the accuracy of text classification.
  • the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separate, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of this embodiment.
  • "At least one (item)" refers to one or more, and "plurality" refers to two or more.
  • "And/or" describes the relationship between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of the following" or similar expressions refers to any combination of these items, including any combination of a single item or plural items.
  • "At least one of a, b or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c can each be single or multiple.
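  • The seven cases listed for a, b and c are exactly the non-empty combinations of three items. A minimal sketch that enumerates them, using a hypothetical helper name `at_least_one_of`:

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def at_least_one_of(items: Sequence[str]) -> List[Tuple[str, ...]]:
    """Enumerate every non-empty combination of the given items."""
    return [combo
            for size in range(1, len(items) + 1)
            for combo in combinations(items, size)]


cases = at_least_one_of(["a", "b", "c"])
for case in cases:
    print(" and ".join(case))
```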
  • the disclosed devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the above units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.
  • the units described above as separate components may or may not be physically separated.
  • the components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above integrated units can be implemented in the form of hardware or software functional units.
  • Integrated units may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as independent products.
  • the technical solution of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes multiple instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods of the various embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive, a mobile hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and other media that can store programs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A model training method and apparatus, a text classification method and apparatus, a device, and a storage medium, relating to the technical field of artificial intelligence. The training method comprises: obtaining original training data, the original training data comprising first original data and second original data (S101); performing up-sampling processing on the second original data to obtain initial training data (S102); performing enhancement processing on the initial training data according to a preset enhancement parameter to obtain enhanced training data (S103); encoding the enhanced training data to obtain a target word embedding vector (S104); performing disturbance processing on the target word embedding vector to obtain target training data (S105); and training a preset neural network model according to the first original data and the target training data to obtain a target classification model, the target classification model being a text classification model and being used for classifying target text data (S106). The present method can improve the recognition accuracy of a model on sample text data and the training effect of the model.

Description

Model training method, text classification method and apparatus, device, and medium
This application claims priority to Chinese patent application No. 202210253301.X, filed with the China Patent Office on March 15, 2022 and entitled "Model training method, text classification method and apparatus, device, and medium", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of artificial intelligence technology, and in particular to a model training method, a text classification method and apparatus, a device, and a medium.
Background
At present, text is commonly classified by inputting the relevant text data set into a trained supervised learning model, which then classifies the data set.
Technical problem
The following are technical problems of the prior art that the inventors are aware of: in the related art, commonly used supervised learning models often cannot accurately identify minority-class text data, which degrades the training effect of the model. Therefore, how to improve the model's recognition accuracy on sample text data, and thereby improve the training effect of the model, has become an urgent technical problem.
Technical solution
In a first aspect, embodiments of this application propose a model training method, which is used to train a target classification model. The method includes:
obtaining original training data, wherein the original training data includes first original data and second original data;
performing upsampling processing on the second original data to obtain initial training data;
performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data;
encoding the enhanced training data to obtain a target word embedding vector;
performing perturbation processing on the target word embedding vector to obtain target training data;
training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
In a second aspect, embodiments of this application propose a text classification method. The method includes:
obtaining target text data to be classified;
inputting the target text data into a target classification model for label classification processing to obtain label text data, wherein the target classification model is trained according to a model training method, and the model training method includes: obtaining original training data, wherein the original training data includes first original data and second original data;
performing upsampling processing on the second original data to obtain initial training data;
performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data;
encoding the enhanced training data to obtain a target word embedding vector;
performing perturbation processing on the target word embedding vector to obtain target training data;
training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
In a third aspect, embodiments of this application propose a model training apparatus. The apparatus includes:
a training data acquisition module, used to acquire original training data, wherein the original training data includes first original data and second original data;
an upsampling module, used to perform upsampling processing on the second original data to obtain initial training data;
a data enhancement module, used to perform enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data;
an encoding module, used to encode the enhanced training data to obtain a target word embedding vector;
a perturbation module, used to perform perturbation processing on the target word embedding vector to obtain target training data;
a model training module, used to train a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
In a fourth aspect, embodiments of this application propose a text classification apparatus. The apparatus includes:
a text data acquisition module, used to acquire target text data to be classified;
a label classification module, used to input the target text data into a target classification model for label classification processing to obtain label text data, wherein the target classification model is trained according to a model training method, and the model training method includes: obtaining original training data, wherein the original training data includes first original data and second original data;
performing upsampling processing on the second original data to obtain initial training data;
performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data;
encoding the enhanced training data to obtain a target word embedding vector;
performing perturbation processing on the target word embedding vector to obtain target training data;
training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
In a fifth aspect, embodiments of this application propose an electronic device. The electronic device includes a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for connection and communication between the processor and the memory. When the program is executed by the processor, a model training method or a text classification method is implemented;
wherein the model training method includes:
obtaining original training data, wherein the original training data includes first original data and second original data;
performing upsampling processing on the second original data to obtain initial training data;
performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data;
encoding the enhanced training data to obtain a target word embedding vector;
performing perturbation processing on the target word embedding vector to obtain target training data;
training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data;
wherein the text classification method includes:
obtaining target text data to be classified;
inputting the target text data into a target classification model for label classification processing to obtain label text data, wherein the target classification model is trained according to a model training method, and the model training method includes: obtaining original training data, wherein the original training data includes first original data and second original data;
performing upsampling processing on the second original data to obtain initial training data;
performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data;
encoding the enhanced training data to obtain a target word embedding vector;
performing perturbation processing on the target word embedding vector to obtain target training data;
training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
In a sixth aspect, embodiments of the present application provide a storage medium. The storage medium is a computer-readable storage medium that stores one or more programs, and the one or more programs can be executed by one or more processors to implement a model training method or a text classification method;
Wherein the model training method includes:
Obtain original training data, wherein the original training data includes first original data and second original data;
Upsample the second original data to obtain initial training data;
Enhance the initial training data according to preset enhancement parameters to obtain enhanced training data;
Encode the enhanced training data to obtain a target word embedding vector;
Perturb the target word embedding vector to obtain target training data;
Train a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data;
Wherein the text classification method includes:
Obtain the target text data to be classified;
Input the target text data into a target classification model for label classification to obtain labeled text data, wherein the target classification model is trained according to a model training method, and the model training method includes: obtaining original training data, wherein the original training data includes first original data and second original data;
Upsample the second original data to obtain initial training data;
Enhance the initial training data according to preset enhancement parameters to obtain enhanced training data;
Encode the enhanced training data to obtain a target word embedding vector;
Perturb the target word embedding vector to obtain target training data;
Train a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
Beneficial effects
The model training method, text classification method and apparatus, electronic device, and storage medium proposed in this application obtain original training data, where the original training data includes first original data and second original data, and upsample the second original data to obtain initial training data, which can effectively correct abnormal data in the second original data and improve the reasonableness of the data. The initial training data is then enhanced according to preset enhancement parameters to obtain enhanced training data, the enhanced training data is encoded to obtain a target word embedding vector, and the target word embedding vector is perturbed to obtain target training data. In this way, target training data that meets requirements can be obtained conveniently, so that it better highlights the features of the minority-class training data and increases the attention the neural network model pays to that data. Finally, training the preset neural network model on the first original data and the target training data can improve the model's recognition accuracy on sample text data and improve the training effect, yielding a target classification model that meets requirements. The target classification model is a text classification model that can be used to classify target text data, and classifying target text data with it can improve the accuracy of text classification.
Description of the drawings
Figure 1 is a flowchart of the model training method provided by an embodiment of the present application;
Figure 2 is a flowchart of step S103 in Figure 1;
Figure 3 is another flowchart of step S103 in Figure 1;
Figure 4 is a flowchart of step S106 in Figure 1;
Figure 5 is a flowchart of the text classification method provided by an embodiment of the present application;
Figure 6 is a flowchart of step S502 in Figure 5;
Figure 7 is a schematic structural diagram of the model training apparatus provided by an embodiment of the present application;
Figure 8 is a schematic structural diagram of the text classification apparatus provided by an embodiment of the present application;
Figure 9 is a schematic diagram of the hardware structure of the electronic device provided by an embodiment of the present application.
Embodiments of the invention
To make the purpose, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the application and are not intended to limit it.
It should be noted that although functional modules are divided in the apparatus diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a module division different from that in the apparatus, or in an order different from that in the flowcharts. The terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
Unless otherwise defined, all technical and scientific terms used in this application have the same meaning as commonly understood by a person skilled in the technical field of this application. The terms used in this application are only for the purpose of describing the embodiments of the application and are not intended to limit the application.
First, several terms used in this application are explained:
Artificial intelligence (AI): a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. Artificial intelligence is also the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
Natural language processing (NLP): NLP uses computers to process, understand, and use human languages (such as Chinese and English). NLP is a branch of artificial intelligence and an interdisciplinary field between computer science and linguistics, and is often called computational linguistics. Natural language processing includes syntactic analysis, semantic analysis, discourse understanding, and so on. It is commonly used in technical fields such as machine translation, handwritten and printed character recognition, speech recognition and text-to-speech conversion, intent recognition, information extraction and filtering, text classification and clustering, public opinion analysis, and opinion mining. It involves data mining, machine learning, knowledge acquisition, knowledge engineering, and artificial intelligence research related to language processing, as well as linguistic research related to language computing.
Information extraction (IE): a text processing technology that extracts factual information of specified types, such as entities, relations, and events, from natural language text and outputs it as structured data. Information extraction is a technique for extracting specific information from text data. Text data is made up of concrete units such as sentences, paragraphs, and chapters, and text information is made up of smaller concrete units such as characters, words, phrases, sentences, and paragraphs, or combinations of these units. Extracting noun phrases, person names, place names, and the like from text data is text information extraction, and the information extracted by this technology can be of many types.
Data upsampling (Data SMOTE): data upsampling expands the minority-class samples until their number equals that of the majority-class samples. For example, take one sample from the minority class, compute its distance to the other samples, sort them by Euclidean distance, and take the five nearest.
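The neighbour-selection step in the example above can be sketched as follows. This is a minimal illustration that assumes samples are already numeric feature vectors; the function name `nearest_neighbors` and the toy vectors are hypothetical and do not come from the application.

```python
import math

def nearest_neighbors(sample, others, k=5):
    """Return the k vectors in `others` closest to `sample` by Euclidean distance."""
    # math.dist computes the Euclidean distance between two points.
    return sorted(others, key=lambda other: math.dist(sample, other))[:k]

# Toy feature vectors (assumed for illustration).
anchor = (0.0, 0.0)
candidates = [(3.0, 4.0), (1.0, 0.0), (0.0, 2.0)]
closest = nearest_neighbors(anchor, candidates, k=2)
```

The definition stops at selecting the five nearest samples, so the sketch does the same and does not synthesize new samples from the neighbours.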
Data augmentation: also called data amplification, data augmentation means making limited data yield value equivalent to more data without substantially adding new data. Data augmentation can be divided into supervised and unsupervised methods. Supervised data augmentation can be further divided into single-sample and multi-sample methods, while unsupervised data augmentation has two directions: generating new data and learning augmentation strategies.
Encoding (encoder): encoding converts an input sequence into a fixed-length vector. Decoding (decoder) converts the previously generated fixed-length vector back into an output sequence. The input sequence can be text, speech, images, or video, and the output sequence can be text or images.
BERT (Bidirectional Encoder Representations from Transformers): a language representation model. BERT is built from connected Transformer encoder blocks and is a typical bidirectional encoding model.
Embedding: an embedding is a vector representation, meaning that an object is represented by a low-dimensional vector. The object can be a word, a product, a movie, and so on. The key property of an embedding vector is that objects whose vectors are close together have similar meanings. For example, the distance between embedding(Avengers) and embedding(Iron Man) is small, whereas the distance between embedding(Avengers) and embedding(Gone with the Wind) is larger. An embedding is essentially a mapping from semantic space to vector space that preserves, as far as possible, the relationships the original samples have in semantic space, so that two words with similar meanings are also located close together in vector space. An embedding can encode an object as a low-dimensional vector while retaining its meaning, and is widely used in machine learning: when building a machine learning model, objects are encoded as low-dimensional dense vectors before being passed to a DNN, which improves efficiency.
Softmax classifier: a generalization of the logistic regression classifier to multiple classes. Its output is the probability that the input belongs to each class.
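As a reminder of how such a classifier turns raw scores into class probabilities, here is a minimal sketch of the standard softmax function. The name `softmax` and the example scores are illustrative assumptions, not part of the application.

```python
import math

def softmax(logits):
    """Map raw class scores to probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing without changing the result.
    peak = max(logits)
    exps = [math.exp(score - peak) for score in logits]
    total = sum(exps)
    return [value / total for value in exps]

# Two equal scores give two equal probabilities.
probs = softmax([0.0, 0.0])
```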
At present, text is usually classified by feeding the relevant text data set into a trained supervised learning model, which then classifies it. The training effect of a supervised learning model depends heavily on the quantity and quality of the training set, and imbalanced training data is widespread in text classification scenarios: the sample categories of interest are often minority categories that make up only a small proportion of the whole data set. If such data is fed directly into training, the model tends to predict every sample as the majority class and recognizes minority-class samples poorly. How to improve the model's recognition accuracy on sample text data, and thereby improve the training effect, has therefore become a technical problem that urgently needs to be solved.
Based on this, embodiments of the present application provide a model training method, a text classification method and apparatus, a device, and a medium, which aim to improve the model's recognition accuracy on sample text data and thereby improve the training effect.
The model training method, text classification method and apparatus, device, and medium provided by the embodiments of the present application are described through the following embodiments, beginning with the model training method.
The embodiments of this application can acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
The model training method, text classification method and apparatus, device, and medium provided by the embodiments of this application relate to the field of artificial intelligence. They can be applied in a terminal or on a server, or can be software running in a terminal or on a server. In some embodiments, the terminal can be a smartphone, a tablet, a laptop, a desktop computer, or the like. The server can be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The software can be an application that implements the text classification method, but is not limited to these forms.
This application can be used in many general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices. This application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. The application may also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
Figure 1 is an optional flowchart of the model training method provided by an embodiment of the present application. The method in Figure 1 may include, but is not limited to, steps S101 to S106.
Step S101: obtain original training data, wherein the original training data includes first original data and second original data;
Step S102: upsample the second original data to obtain initial training data;
Step S103: enhance the initial training data according to preset enhancement parameters to obtain enhanced training data;
Step S104: encode the enhanced training data to obtain a target word embedding vector;
Step S105: perturb the target word embedding vector to obtain target training data;
Step S106: train a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
In steps S101 to S106 of this embodiment, upsampling the second original data to obtain initial training data can effectively correct abnormal data in the second original data and improve the reasonableness of the data. Enhancing the initial training data according to preset enhancement parameters, encoding the resulting enhanced training data into a target word embedding vector, and perturbing that vector to obtain target training data makes it convenient to obtain target training data that meets requirements, so that the target training data better highlights the features of the minority-class training data and increases the attention the neural network model pays to that data. Training the preset neural network model on the first original data and the target training data can improve the model's recognition accuracy on sample text data and improve the training effect, yielding a target classification model that meets requirements.
In step S101 of some embodiments, sample data can be obtained by writing a web crawler, configuring the data source, and then crawling data in a targeted manner. Sample data can also be obtained in other ways. It should be noted that the sample data is text data carrying text category labels. According to preset proportion parameters, the sample data is divided into original training data, original validation data, and original test data. To improve the training effect, the original training data needs data enhancement. Specifically, statistics are first computed on the original training data to obtain the number of samples in each text category. Based on the number of samples under each text category label, the original training data is divided into first original data and second original data, so the two can be distinguished by the text category labels on the original training data. The first original data is the original training data whose sample count is greater than a preset quantity threshold and is marked as majority-class sample data label 0. The second original data is the original training data whose sample count is less than or equal to the preset quantity threshold and is marked as minority-class sample data label 1. No data enhancement is applied to the majority-class sample data label 0 (the first original data), whereas the minority-class sample data label 1 (the second original data) does need data enhancement. For example, if the majority-class sample data label 0 contains m samples and the minority-class sample data label 1 contains n samples, the minority-class sample data label 1 needs to be enhanced to produce m-n additional samples, so that after enhancement the ratio of second original data to first original data is 1:1.
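The threshold-based split described for step S101 can be sketched as follows. This is a minimal sketch under assumptions: samples are modelled as (text, label) pairs, and the function name `split_by_class_size` and the toy data are hypothetical.

```python
from collections import Counter

def split_by_class_size(samples, threshold):
    """Split (text, label) pairs into majority (first) and minority (second) data."""
    # Count how many samples carry each text category label.
    counts = Counter(label for _, label in samples)
    first = [s for s in samples if counts[s[1]] > threshold]    # majority classes
    second = [s for s in samples if counts[s[1]] <= threshold]  # minority classes
    return first, second

# Toy labelled data (assumed for illustration).
samples = [("text a", 0), ("text b", 0), ("text c", 0), ("text d", 1)]
first, second = split_by_class_size(samples, threshold=2)
needed = len(first) - len(second)  # m - n samples still to be produced
```

Here `needed` corresponds to the m-n samples that enhancement must supply to reach the 1:1 ratio.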
In step S102 of some embodiments, the minority-class sample data label 1 (i.e., the second original data) that needs to be enhanced is randomly upsampled, with the sampled quantity equal to the number of samples m in the majority-class sample data label 0 (i.e., the first original data). The sampled second original data therefore contains m-n repeated samples, which yields new training data, recorded as the initial training data.
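Random upsampling of this kind can be sketched as drawing duplicates with replacement. The sketch below returns only the m-n duplicates; whether the initial training data also includes the n originals is not spelled out in the text, so treat that choice, the function name `random_upsample`, and the fixed seed as assumptions.

```python
import random

def random_upsample(minority, target_size, seed=0):
    """Draw duplicates with replacement until the class would reach target_size."""
    rng = random.Random(seed)
    n_extra = max(0, target_size - len(minority))  # m - n
    return [rng.choice(minority) for _ in range(n_extra)]

minority = ["text d", "text e"]                        # n = 2
duplicates = random_upsample(minority, target_size=5)  # m = 5
```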
Referring to Figure 2, in some embodiments the enhancement parameters include a first perturbation ratio, and step S103 may include, but is not limited to, steps S201 to S203:
Step S201: obtain the first sentence length of the initial training data;
Step S202: calculate the first perturbation amount from the first sentence length and the first perturbation ratio;
Step S203: delete characters from the initial training data according to the first perturbation amount to obtain enhanced training data.
In step S201 of some embodiments, the first sentence length s1 of each text sentence in the initial training data set is counted in characters. For example, if a text sentence consists of five characters and three punctuation marks, its first sentence length s1 is 8.
In step S202 of some embodiments, the first perturbation ratio can be set according to actual requirements. For example, if the first perturbation ratio r1 is set to 0.1, the first perturbation amount d1 is calculated from the first sentence length s1 and the first perturbation ratio r1 as the integer part of s1*r1, that is, d1 = int(s1*r1).
In step S203 of some embodiments, int(s1*r1) positions are randomly selected from the current text sentence as replacement positions, and the characters at these positions are replaced with the empty string. This deletes them from the text sentence of the initial training data and yields the enhanced training data.
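Steps S201 to S203 can be sketched in a few lines. This is a minimal sketch, assuming sentences are plain Python strings counted per character; the function name `delete_chars` and the seeded generator are hypothetical.

```python
import random

def delete_chars(sentence, ratio, rng):
    """Remove int(len(sentence) * ratio) randomly chosen characters."""
    amount = int(len(sentence) * ratio)                    # d1 = int(s1 * r1)
    positions = set(rng.sample(range(len(sentence)), amount))
    # Replacing a character with the empty string is the same as skipping it.
    return "".join(ch for i, ch in enumerate(sentence) if i not in positions)

rng = random.Random(0)
shortened = delete_chars("abcdefghij", ratio=0.1, rng=rng)  # one character removed
```

Note that with r1 = 0.1 any sentence shorter than 10 characters yields d1 = 0 and is returned unchanged.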
Referring to Figure 3, in other embodiments the enhancement parameters include a second perturbation ratio, and step S103 may include, but is not limited to, steps S301 to S303:
Step S301: obtain the second sentence length of the initial training data;
Step S302: calculate the second perturbation amount from the second sentence length and the second perturbation ratio;
Step S303: expand the initial training data according to the second perturbation amount and preset punctuation marks to obtain enhanced training data.
In step S301 of some embodiments, the second sentence length s2 of each text sentence in the initial training data set is counted in characters. For example, if a text sentence consists of six characters and two punctuation marks, its second sentence length s2 is 8.
In step S302 of some embodiments, the second perturbation ratio can be set according to actual requirements. For example, if the second perturbation ratio r2 is set to 0.1, the second perturbation amount d2 is calculated from the second sentence length s2 and the second perturbation ratio r2 as the integer part of s2*r2, that is, d2 = int(s2*r2).
In step S303 of some embodiments, the preset punctuation marks p are neutral marks, for example the comma, enumeration comma, colon, semicolon, period, and ellipsis. int(s2*r2) positions are randomly selected from the current text sentence as replacement positions, int(s2*r2) marks are randomly drawn from p (repeated draws are allowed), and the characters at the replacement positions are replaced with those punctuation marks. This expands the text sentences of the initial training data and yields the enhanced training data.
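Steps S301 to S303 can be sketched the same way. The sketch follows the text literally and replaces the characters at the chosen positions; the particular mark list `NEUTRAL_MARKS`, the function name `replace_with_marks`, and the seeded generator are assumptions for illustration.

```python
import random

# Illustrative set of neutral marks; the actual preset set p is not specified.
NEUTRAL_MARKS = [",", "、", ":", ";", "。", "…"]

def replace_with_marks(sentence, ratio, rng, marks=NEUTRAL_MARKS):
    """Overwrite int(len(sentence) * ratio) random positions with punctuation."""
    amount = int(len(sentence) * ratio)            # d2 = int(s2 * r2)
    positions = rng.sample(range(len(sentence)), amount)
    chars = list(sentence)
    for pos in positions:
        chars[pos] = rng.choice(marks)             # marks may repeat across positions
    return "".join(chars)

rng = random.Random(0)
marked = replace_with_marks("abcdefghij", ratio=0.2, rng=rng)  # two positions changed
```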
It should be noted that the first perturbation ratio and the second perturbation ratio can be understood as enhancement ratios, which determine the proportion of characters in a text sentence that need to be modified. The first perturbation amount and the second perturbation amount can be understood as enhancement character counts, which determine the number of characters in a text sentence that need to be modified.
Taking steps S201 to S203 as an example, setting the first perturbation ratio r1 to 0.1 means that 10% of the characters in a text sentence need to be modified. If the length of a text sentence is 10, the first perturbation amount d1 is int(10*0.1) = 1, so one character of the sentence needs to be modified. One position in the sentence is then randomly selected as the replacement position, and the character at that position is replaced with an empty string. This implements the deletion processing of the text sentence and yields the enhanced training data.
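The deletion example above can be sketched in a few lines of Python. The function name `delete_characters` and the optional `rng` argument are hypothetical and used only for illustration.

```python
import random


def delete_characters(sentence, r1=0.1, rng=None):
    """Delete int(len(sentence) * r1) randomly chosen characters from a sentence."""
    rng = rng or random.Random()
    s1 = len(sentence)                       # first sentence length, in characters
    d1 = int(s1 * r1)                        # first perturbation amount
    drop = set(rng.sample(range(s1), d1))    # positions whose characters become empty
    return "".join(ch for i, ch in enumerate(sentence) if i not in drop)
```

For a sentence of length 10 and r1 = 0.1, exactly one character is removed, matching the worked example in the text.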
It should be noted that, when performing data enhancement on the initial training data, the two data enhancement methods above may be used together, or either one may be used alone. For example, to improve the efficiency of data enhancement, both methods are applied to the initial training data and the proportion of one of them is set to k. Then (m-n)*k samples of the initial training data are enhanced with that method, and the remaining samples are enhanced with the other method. For example, the (m-n)*k samples are subjected to deletion processing through steps S201 to S203, and the samples other than the (m-n)*k samples are subjected to expansion processing through steps S301 to S303, thereby obtaining the enhanced training data.
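A minimal sketch of this split follows, assuming the samples to be enhanced are held in a list and the two enhancement methods are passed in as callables. The name `augment_mixed` and its parameters are hypothetical, and the random choice of which samples receive which method is also an assumption, since the text only fixes the proportion k.

```python
import random


def augment_mixed(samples, k, delete_fn, expand_fn, rng=None):
    """Apply delete_fn to a fraction k of the samples and expand_fn to the rest."""
    rng = rng or random.Random()
    indices = list(range(len(samples)))
    rng.shuffle(indices)                       # assumed: the split is chosen at random
    n_delete = int(len(samples) * k)           # number of samples for the deletion method
    delete_idx = set(indices[:n_delete])
    return [
        delete_fn(s) if i in delete_idx else expand_fn(s)
        for i, s in enumerate(samples)
    ]
```

Setting k to 0 or 1 reduces this to using a single enhancement method, which the text also allows.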
In step S104 of some embodiments, a BERT encoder may be used to encode the enhanced training data to obtain the target word embedding vector. Because BERT is built from connected Transformer Encoder blocks, it is a typical bidirectional encoding model. The enhanced training data can therefore be bidirectionally encoded by the BERT encoder, that is, encoded from left to right and from right to left respectively, to obtain the target word embedding vector (token embedding).
In step S105 of some embodiments, when the target word embedding vector is perturbed, a perturbation may be added to the target word embedding vector (token embedding) along the gradient direction according to a preset perturbation factor. The preset perturbation factor may be expressed as a word embedding weight matrix; that is, the target word embedding vector and the preset word embedding weight matrix are matrix-multiplied along the gradient direction to obtain the target training data.
Referring to Figure 4, in some embodiments, step S106 may include, but is not limited to, steps S401 to S403:
Step S401: perform a perturbation calculation on the first original data and the target training data through a preset function to obtain a text perturbation value;
Step S402: calculate the loss function of the neural network model according to the text perturbation value to obtain a loss value;
Step S403: use the loss value as the backpropagation quantity and adjust the model parameters of the neural network model to train the neural network model and obtain a text classification model.
In step S401 of some embodiments, the first original data and the target training data are first input into the preset neural network model, and the number of iterations (epoches_num) and the data batch size (batch size) of the neural network model are set. The first original data and the target training data are then divided into multiple batches according to the batch size to obtain batch data. The preset function is the cross-entropy function.
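The batching and per-batch cross entropy described for step S401 can be sketched as follows. This is a minimal illustration in plain Python rather than the application's implementation; the function names and the representation of model outputs as lists of class probabilities are assumptions.

```python
import math


def make_batches(samples, batch_size):
    """Split a list of samples into consecutive batches of at most batch_size items."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


def batch_cross_entropy(probabilities, labels):
    """Mean cross entropy of one batch.

    probabilities: one list of predicted class probabilities per sample.
    labels: the index of the true class for each sample.
    """
    losses = [-math.log(p[y]) for p, y in zip(probabilities, labels)]
    return sum(losses) / len(losses)
```

In a real training loop the probabilities would come from the neural network's forward pass on each batch, repeated for the configured number of iterations.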
Specifically, in each iteration, cross entropy is used to obtain the loss value loss1 of each batch of data, and the parameter gradients for that batch are calculated. The original gradient value grad_β_i of each original parameter β_i of the batch is divided by the norm L_2 of the original parameters and multiplied by a hyperparameter α to obtain the text perturbation value, which is added to the original parameter to obtain the intermediate parameter β′_i of each batch. The calculation is shown in formula (1):
$$\beta'_i = \beta_i + \alpha \cdot \frac{\mathrm{grad\_}\beta_i}{L_2} \qquad \text{Formula (1)}$$
where the norm L_2 is defined by an equation image (Figure PCTCN2022090737-appb-000002) that is not reproduced in this text, and the text perturbation value is $\alpha \cdot \mathrm{grad\_}\beta_i / L_2$. The value range of the hyperparameter α is (0,1]. If a larger perturbation of the target word embedding vector is desired, α is set to a larger value. Repeated verification shows that the model trains well when α is between 0.1 and 0.3.
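The update in formula (1) can be illustrated with a short Python sketch. The exact definition of the norm L_2 is not available in this text, so the sketch assumes it is the Euclidean norm of the batch's gradient vector, as is common in gradient-based adversarial training; this is an assumption, not a statement of the application's definition. The function name `perturb_parameters` is likewise hypothetical.

```python
import math


def perturb_parameters(params, grads, alpha=0.2):
    """Return (intermediate parameters, perturbation values) following formula (1).

    Assumption: L_2 is the Euclidean norm of the gradient vector `grads`.
    """
    l2 = math.sqrt(sum(g * g for g in grads))
    if l2 == 0.0:
        # No gradient signal: leave the parameters unchanged.
        return list(params), [0.0] * len(params)
    deltas = [alpha * g / l2 for g in grads]            # text perturbation values
    return [p + d for p, d in zip(params, deltas)], deltas
```

Under this assumption the perturbation vector always has length α, so α directly sets the perturbation size, consistent with the remark that a larger α gives a larger perturbation.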
Further, the absolute value r_i of the difference between the intermediate parameter β′_i and the original parameter β_i is calculated, and a threshold ε with a value range of (0,1] is set, so that ε controls whether the perturbation is added to the original parameter.
For example, if r_i > ε, then ε*r_i/Norm(r_i) is added to the original parameter β_i as the perturbation amount to obtain the final target parameter β″_i, as shown in formula (2) and formula (3):
$$r_i = \operatorname{abs}(\beta'_i - \beta_i) \qquad \text{Formula (2)}$$
$$\beta''_i = \beta_i + \varepsilon \cdot \frac{r_i}{\mathrm{Norm}(r_i)}, \quad r_i > \varepsilon \qquad \text{Formula (3)}$$
It should be noted that the larger ε is, the harder it is for the perturbation amount to be added to the parameter matrix of the original parameters. Repeated verification shows that the model trains well when ε is between 0.8 and 1.
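Formulas (2) and (3) can be sketched as below. Two points are assumptions made for this illustration: Norm(r_i) is taken to be the Euclidean norm of the whole vector r, and a parameter whose r_i does not exceed ε is left at its original value, which is one reading of the statement that ε controls whether the perturbation is added. The function name `apply_threshold` is hypothetical.

```python
import math


def apply_threshold(original, intermediate, epsilon=0.9):
    """Compute target parameters from original and intermediate parameters."""
    # Formula (2): absolute difference per parameter.
    r = [abs(b_new - b) for b_new, b in zip(intermediate, original)]
    norm_r = math.sqrt(sum(x * x for x in r))           # assumed meaning of Norm(r_i)
    target = []
    for b, r_i in zip(original, r):
        if r_i > epsilon and norm_r > 0.0:
            target.append(b + epsilon * r_i / norm_r)   # formula (3)
        else:
            target.append(b)                            # assumed: no perturbation added
    return target
```

Because r_i is an absolute value, the added amount in this sketch is never negative, which follows from formula (2) as written.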
Further, to improve the training effect of the model, a parameter k is set to control the number of perturbations, and the calculation of the intermediate parameters and of the target parameters described above is repeated k times. Because too many perturbations introduce too much noise and reduce the prediction accuracy of the neural network model on each text category, the number of perturbations is generally set to 2 or 3, which gives the final text perturbation value.
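The k-fold repetition can be sketched by combining the two calculations in a loop. As before, this is an illustration under assumptions: `grad_fn` is a hypothetical callable that returns the gradient for the current parameters, L_2 is taken as the Euclidean norm of that gradient, and parameters whose r_i does not exceed ε are left unchanged in that round.

```python
import math


def multi_step_perturb(params, grad_fn, alpha=0.2, epsilon=0.9, k=3):
    """Repeat the intermediate and target parameter calculations k times."""
    current = list(params)
    for _ in range(k):
        grads = grad_fn(current)
        l2 = math.sqrt(sum(g * g for g in grads))
        if l2 == 0.0:
            break                                        # nothing to perturb along
        intermediate = [p + alpha * g / l2 for p, g in zip(current, grads)]
        r = [abs(b1 - b0) for b1, b0 in zip(intermediate, current)]
        norm_r = math.sqrt(sum(x * x for x in r))        # positive because l2 > 0
        current = [
            b0 + epsilon * r_i / norm_r if r_i > epsilon else b0
            for b0, r_i in zip(current, r)
        ]
    return current
```

Keeping k small (2 or 3, as suggested in the text) bounds how far the parameters can drift from their original values.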
In step S402 of some embodiments, the loss function of the neural network model is calculated according to the final text perturbation value to obtain the loss value. Specifically, the loss function corresponding to the fully connected layer of the neural network model may be calculated to obtain the loss value.
In step S403 of some embodiments, the loss value is used as the backpropagation quantity and the model parameters of the neural network model are adjusted to train the neural network model and obtain the text classification model. This makes the label text data generated by the neural network model more accurate and improves the model's recognition accuracy for minority-class text data.
In the model training method of the embodiments of the present application, original training data is obtained, where the original training data includes first original data and second original data. Upsampling the second original data to obtain the initial training data can effectively correct abnormal data in the second original data and improve the rationality of the data. The initial training data is then enhanced according to preset enhancement parameters to obtain enhanced training data, the enhanced training data is encoded to obtain a target word embedding vector, and the target word embedding vector is perturbed to obtain target training data. In this way, target training data that meets the requirements can be obtained conveniently, the target training data better highlights the characteristics of the minority-class training data, and the neural network model pays more attention to the minority-class training data. Finally, training the preset neural network model according to the first original data and the target training data improves the model's recognition accuracy for sample text data, improves the training effect, and yields a target classification model that meets the requirements.
Figure 5 is an optional flowchart of the text classification method provided by an embodiment of the present application. The method in Figure 5 may include, but is not limited to, steps S501 to S502.
Step S501: obtain target text data to be classified;
Step S502: input the target text data into a target classification model for label classification processing to obtain label text data, where the target classification model is trained according to the training method of the embodiments of the first aspect.
In step S501 of some embodiments, a web crawler may be written and, once the data source has been configured, used to crawl data in a targeted manner to obtain the target text data to be classified. The data may also be obtained in other ways, without limitation. It should be noted that the target text data may be an article, a text field, a text segment, and so on.
In step S502 of some embodiments, the target text data is input into the target classification model, which maps the target text data to a preset vector space to obtain a target text vector. A preset classification function then performs label classification processing on the target text vector to obtain the label text data.
Referring to Figure 6, in some embodiments, step S502 may further include, but is not limited to, steps S601 to S602:
Step S601: map the target text data to a preset vector space through the fully connected layer of the target classification model to obtain a target text vector;
Step S602: perform label classification processing on the target text vector through the classification function of the fully connected layer and the preset text category labels to obtain label text data.
In step S601 of some embodiments, the feature dimension of the preset text category labels is obtained, and the MLP network of the fully connected layer maps the target text data from the semantic space to the vector space, so that the target text data is mapped to a vector space with the same feature dimension as the preset text category labels, giving the target text vector.
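The mapping in step S601 can be pictured as a single fully connected (linear) layer whose output dimension equals the number of text category labels. The sketch below is a plain-Python illustration; the function name `linear_layer` and the use of one layer rather than a deeper MLP are assumptions.

```python
def linear_layer(features, weights, bias):
    """Map a feature vector to a vector with one entry per text category label.

    weights has one row per label; each row has one weight per input feature.
    """
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
```

For instance, a two-dimensional feature vector and a 3x2 weight matrix give a three-dimensional target text vector, one score per label.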
In step S602 of some embodiments, the classification function may be the softmax function. For example, the softmax function creates a probability distribution over the text category labels, giving the predicted probability that the target text vector belongs to each text category. Finally, according to the magnitude of the predicted probabilities, the text category of the target text vector is determined and labeled to obtain the label text data.
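The softmax-based labeling of step S602 can be sketched as follows. The function names and the choice of the highest-probability label as the final category are assumptions consistent with, but not stated verbatim in, the text.

```python
import math


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)                                  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores, labels):
    """Return the most probable label and the full probability distribution."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs
```

The labels passed in correspond to the preset text category labels, which may differ from one business scenario to another.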
It should be noted that the preset text category labels may be set according to actual requirements, and may differ between business scenarios. For example, in a book classification scenario, the preset text category labels include classical literature, foreign literature, prose, novels, poetry collections, and so on. In a daily life scenario, the preset text category labels may include transportation, weather conditions, time information, and so on.
In the text classification method of the embodiments of the present application, target text data to be classified is obtained and input into the target classification model for label classification processing. Because the target classification model has good recognition accuracy for minority-class text data, it can identify target text data of different categories and classify the target text data according to the different category labels to obtain label text data, which improves the accuracy of text classification.
Referring to Figure 7, an embodiment of the present application further provides a model training apparatus that can implement the above model training method. The model training apparatus includes:
a training data acquisition module 701, configured to obtain original training data, where the original training data includes first original data and second original data;
an upsampling module 702, configured to perform upsampling processing on the second original data to obtain initial training data;
a data enhancement module 703, configured to enhance the initial training data according to preset enhancement parameters to obtain enhanced training data;
an encoding module 704, configured to encode the enhanced training data to obtain a target word embedding vector;
a perturbation module 705, configured to perturb the target word embedding vector to obtain target training data;
a model training module 706, configured to train a preset neural network model according to the first original data and the target training data to obtain a target classification model, where the target classification model is a text classification model used to classify target text data.
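The way modules 702 to 706 hand data to one another can be sketched as a small pipeline object. This is only an illustration of the data flow; the class name `TrainingPipeline`, its field names, and the representation of each module as a callable are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class TrainingPipeline:
    upsample: Callable[[Any], Any]        # role of upsampling module 702
    augment: Callable[[Any], Any]         # role of data enhancement module 703
    encode: Callable[[Any], Any]          # role of encoding module 704
    perturb: Callable[[Any], Any]         # role of perturbation module 705
    train: Callable[[Any, Any], Any]      # role of model training module 706

    def run(self, first_original, second_original):
        initial = self.upsample(second_original)     # initial training data
        enhanced = self.augment(initial)             # enhanced training data
        embeddings = self.encode(enhanced)           # target word embedding vectors
        target = self.perturb(embeddings)            # target training data
        return self.train(first_original, target)    # target classification model
```

Passing each stage in as a callable keeps the sketch independent of any particular encoder or model library.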
In some embodiments, the data enhancement module 703 includes:
a first sentence length acquisition unit, configured to obtain the first sentence length of the initial training data;
a first perturbation amount calculation unit, configured to calculate the first perturbation amount according to the first sentence length and the first perturbation ratio;
a data deletion unit, configured to perform deletion processing on the initial training data according to the first perturbation amount to obtain enhanced training data.
In other embodiments, the data enhancement module 703 includes:
a second sentence length acquisition unit, configured to obtain the second sentence length of the initial training data;
a second perturbation amount calculation unit, configured to calculate the second perturbation amount according to the second sentence length and the second perturbation ratio;
a data expansion unit, configured to perform expansion processing on the initial training data according to the second perturbation amount and the preset punctuation marks to obtain enhanced training data.
In some embodiments, the model training module 706 includes:
a perturbation calculation unit, configured to perform a perturbation calculation on the first original data and the target training data through a preset function to obtain a text perturbation value;
a loss value calculation unit, configured to calculate the loss function of the neural network model according to the text perturbation value to obtain a loss value;
a training unit, configured to use the loss value as the backpropagation quantity and adjust the model parameters of the neural network model to train the neural network model and obtain a text classification model.
The model training apparatus of the embodiments of the present application is used to perform the model training method of the above embodiments. Its specific processing is the same as that of the model training method in the above embodiments and is not repeated here.
Referring to Figure 8, an embodiment of the present application further provides a text classification apparatus that can implement the above text classification method. The text classification apparatus includes:
a text data acquisition module 801, configured to obtain target text data to be classified;
a label classification module 802, configured to input the target text data into a target classification model for label classification processing to obtain label text data, where the target classification model is trained according to the training method of any one of the embodiments of the first aspect.
In some embodiments, the label classification module 802 includes:
a mapping unit, configured to map the target text data to a preset vector space through the fully connected layer of the target classification model to obtain a target text vector;
a label classification unit, configured to perform label classification processing on the target text vector through the classification function of the fully connected layer and the preset text category labels to obtain label text data.
The text classification apparatus of the embodiments of the present application is used to perform the text classification method of the above embodiments. Its specific processing is the same as that of the text classification method in the above embodiments and is not repeated here.
An embodiment of the present application further provides an electronic device. The electronic device includes a memory, a processor, a program stored in the memory and executable on the processor, and a data bus for connection and communication between the processor and the memory. When executed by the processor, the program implements a model training method or a text classification method. The model training method includes: obtaining original training data, where the original training data includes first original data and second original data; performing upsampling processing on the second original data to obtain initial training data; enhancing the initial training data according to preset enhancement parameters to obtain enhanced training data; encoding the enhanced training data to obtain a target word embedding vector; perturbing the target word embedding vector to obtain target training data; and training a preset neural network model according to the first original data and the target training data to obtain a target classification model, where the target classification model is a text classification model used to classify target text data. The text classification method includes: obtaining target text data to be classified; and inputting the target text data into the target classification model for label classification processing to obtain label text data, where the target classification model is trained according to the above model training method. The electronic device may be any smart terminal, such as a tablet computer or a vehicle-mounted computer.
Referring to Figure 9, which illustrates the hardware structure of an electronic device according to another embodiment, the electronic device includes:
a processor 901, which may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and which is used to execute the relevant programs to implement the technical solutions provided by the embodiments of the present application;
a memory 902, which may be implemented as Read Only Memory (ROM), a static storage device, a dynamic storage device, or Random Access Memory (RAM). The memory 902 may store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 902 and is called by the processor 901 to execute the model training method or the text classification method of the embodiments of the present application;
an input/output interface 903, used for information input and output;
a communication interface 904, used for communication between this device and other devices, either in a wired manner (for example USB or a network cable) or in a wireless manner (for example a mobile network, WIFI, or Bluetooth);
a bus 905, which transfers information between the components of the device (for example the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);
where the processor 901, the memory 902, the input/output interface 903, and the communication interface 904 are communicatively connected to one another inside the device through the bus 905.
An embodiment of the present application further provides a storage medium. The storage medium is a computer-readable storage medium that stores one or more programs, and the one or more programs can be executed by one or more processors to implement a model training method or a text classification method. The model training method includes: obtaining original training data, where the original training data includes first original data and second original data; performing upsampling processing on the second original data to obtain initial training data; enhancing the initial training data according to preset enhancement parameters to obtain enhanced training data; encoding the enhanced training data to obtain a target word embedding vector; perturbing the target word embedding vector to obtain target training data; and training a preset neural network model according to the first original data and the target training data to obtain a target classification model, where the target classification model is a text classification model used to classify target text data. The text classification method includes: obtaining target text data to be classified; and inputting the target text data into the target classification model for label classification processing to obtain label text data, where the target classification model is trained according to the above model training method. In addition, the computer-readable storage medium may be non-volatile or volatile.
As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory may optionally include memory located remotely from the processor, and such remote memory may be connected to the processor through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
In the model training method, model training apparatus, text classification method, text classification apparatus, electronic device, and storage medium provided by the embodiments of the present application, original training data is obtained, where the original training data includes first original data and second original data. Upsampling the second original data to obtain the initial training data can effectively correct abnormal data in the second original data and improve the rationality of the data. The initial training data is then enhanced according to preset enhancement parameters to obtain enhanced training data, the enhanced training data is encoded to obtain a target word embedding vector, and the target word embedding vector is perturbed to obtain target training data. In this way, target training data that meets the requirements can be obtained conveniently, the target training data better highlights the characteristics of the minority-class training data, and the neural network model pays more attention to the minority-class training data. Finally, training the preset neural network model according to the first original data and the target training data improves the model's recognition accuracy for sample text data, improves the training effect, and yields a target classification model that meets the requirements. The target classification model is a text classification model that can be used to classify target text data, and classifying target text data with it improves the accuracy of text classification.
The embodiments described in the present application are intended to explain the technical solutions of the embodiments of the present application more clearly and do not limit those technical solutions. Those skilled in the art will appreciate that, as technology evolves and new application scenarios emerge, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
本领域技术人员可以理解的是,图1-4、图5-6中示出的技术方案并不构成对本申请实施例的限定,可以包括比图示更多或更少的步骤,或者组合某些步骤,或者不同的步骤。Those skilled in the art can understand that the technical solutions shown in Figures 1-4 and 5-6 do not limit the embodiments of the present application, and may include more or fewer steps than those shown in the figures, or a combination of certain some steps, or different steps.
以上所描述的装置实施例仅仅是示意性的,其中作为分离部件说明的单元可以是或者也可以不是物理上分开的,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根 据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。The device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separate, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of this embodiment.
本领域普通技术人员可以理解,上文中所公开方法中的全部或某些步骤、系统、设备中的功能模块/单元可以被实施为软件、固件、硬件及其适当的组合。Those of ordinary skill in the art can understand that all or some steps, systems, and functional modules/units in the devices disclosed above can be implemented as software, firmware, hardware, and appropriate combinations thereof.
The terms "first", "second", "third", "fourth", and so on (if present) in the description of the present application and in the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present application described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
It should be understood that, in the present application, "at least one (item)" means one or more, and "a plurality of" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean that only A exists, that only B exists, or that both A and B exist, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following (items)" or a similar expression refers to any combination of those items, including any combination of single items or plural items. For example, at least one of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or multiple.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatuses and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, may each exist physically on their own, or two or more of them may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a plurality of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media that can store programs, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, which does not thereby limit the scope of rights of the embodiments of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and essence of the embodiments of the present application shall fall within the scope of rights of the embodiments of the present application.

Claims (20)

  1. A model training method, wherein the method is used to train a target classification model and comprises:
    obtaining original training data, wherein the original training data includes first original data and second original data;
    performing upsampling processing on the second original data to obtain initial training data;
    performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data;
    encoding the enhanced training data to obtain a target word embedding vector;
    performing perturbation processing on the target word embedding vector to obtain target training data;
    training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
  2. The training method according to claim 1, wherein the enhancement parameters include a first perturbation ratio, and the step of performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data comprises:
    obtaining a first sentence length of the initial training data;
    calculating a first perturbation amount according to the first sentence length and the first perturbation ratio;
    performing deletion processing on the initial training data according to the first perturbation amount to obtain the enhanced training data.
  3. The training method according to claim 1, wherein the enhancement parameters include a second perturbation ratio, and the step of performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data comprises:
    obtaining a second sentence length of the initial training data;
    calculating a second perturbation amount according to the second sentence length and the second perturbation ratio;
    performing expansion processing on the initial training data according to the second perturbation amount and preset punctuation marks to obtain the enhanced training data.
  4. The training method according to any one of claims 1 to 3, wherein the step of training a preset neural network model according to the first original data and the target training data to obtain a target classification model comprises:
    performing a perturbation calculation on the first original data and the target training data through a preset function to obtain a text perturbation value;
    calculating a loss function of the neural network model according to the text perturbation value to obtain a loss value;
    using the loss value as a back-propagation quantity to adjust model parameters of the neural network model, so as to train the neural network model and obtain the text classification model.
  5. A text classification method, comprising:
    obtaining target text data to be classified;
    inputting the target text data into a target classification model for label classification processing to obtain label text data, wherein the target classification model is trained according to a model training method, and the model training method comprises: obtaining original training data, wherein the original training data includes first original data and second original data;
    performing upsampling processing on the second original data to obtain initial training data;
    performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data;
    encoding the enhanced training data to obtain a target word embedding vector;
    performing perturbation processing on the target word embedding vector to obtain target training data;
    training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
  6. The text classification method according to claim 5, wherein the enhancement parameters include a first perturbation ratio, and the step of performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data comprises:
    obtaining a first sentence length of the initial training data;
    calculating a first perturbation amount according to the first sentence length and the first perturbation ratio;
    performing deletion processing on the initial training data according to the first perturbation amount to obtain the enhanced training data.
  7. The text classification method according to claim 5, wherein the enhancement parameters include a second perturbation ratio, and the step of performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data comprises:
    obtaining a second sentence length of the initial training data;
    calculating a second perturbation amount according to the second sentence length and the second perturbation ratio;
    performing expansion processing on the initial training data according to the second perturbation amount and preset punctuation marks to obtain the enhanced training data.
  8. The text classification method according to any one of claims 5 to 7, wherein the step of training a preset neural network model according to the first original data and the target training data to obtain a target classification model comprises:
    performing a perturbation calculation on the first original data and the target training data through a preset function to obtain a text perturbation value;
    calculating a loss function of the neural network model according to the text perturbation value to obtain a loss value;
    using the loss value as a back-propagation quantity to adjust model parameters of the neural network model, so as to train the neural network model and obtain the text classification model.
  9. The text classification method according to claim 5, wherein the step of inputting the target text data into a target classification model for label classification processing to obtain label text data comprises:
    mapping the target text data to a preset vector space through a fully connected layer of the target classification model to obtain a target text vector;
    performing label classification processing on the target text vector through a classification function of the fully connected layer and preset text category labels to obtain the label text data.
  10. A model training apparatus, comprising:
    a training data acquisition module, configured to obtain original training data, wherein the original training data includes first original data and second original data;
    an upsampling module, configured to perform upsampling processing on the second original data to obtain initial training data;
    a data enhancement module, configured to perform enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data;
    an encoding module, configured to encode the enhanced training data to obtain a target word embedding vector;
    a perturbation module, configured to perform perturbation processing on the target word embedding vector to obtain target training data;
    a model training module, configured to train a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
  11. A text classification apparatus, comprising:
    a text data acquisition module, configured to obtain target text data to be classified;
    a label classification module, configured to input the target text data into a target classification model for label classification processing to obtain label text data, wherein the target classification model is trained according to a model training method, and the model training method comprises: obtaining original training data, wherein the original training data includes first original data and second original data;
    performing upsampling processing on the second original data to obtain initial training data;
    performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data;
    encoding the enhanced training data to obtain a target word embedding vector;
    performing perturbation processing on the target word embedding vector to obtain target training data;
    training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
  12. An electronic device, comprising a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for implementing connection and communication between the processor and the memory, wherein the program, when executed by the processor, implements the steps of a model training method or a text classification method;
    wherein the model training method comprises:
    obtaining original training data, wherein the original training data includes first original data and second original data;
    performing upsampling processing on the second original data to obtain initial training data;
    performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data;
    encoding the enhanced training data to obtain a target word embedding vector;
    performing perturbation processing on the target word embedding vector to obtain target training data;
    training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data;
    wherein the text classification method comprises:
    obtaining target text data to be classified;
    inputting the target text data into a target classification model for label classification processing to obtain label text data, wherein the target classification model is trained according to a model training method, and the model training method comprises: obtaining original training data, wherein the original training data includes first original data and second original data;
    performing upsampling processing on the second original data to obtain initial training data;
    performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data;
    encoding the enhanced training data to obtain a target word embedding vector;
    performing perturbation processing on the target word embedding vector to obtain target training data;
    training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
  13. The electronic device according to claim 12, wherein the enhancement parameters include a first perturbation ratio, and the step of performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data comprises:
    obtaining a first sentence length of the initial training data;
    calculating a first perturbation amount according to the first sentence length and the first perturbation ratio;
    performing deletion processing on the initial training data according to the first perturbation amount to obtain the enhanced training data.
  14. The electronic device according to claim 12, wherein the enhancement parameters include a second perturbation ratio, and the step of performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data comprises:
    obtaining a second sentence length of the initial training data;
    calculating a second perturbation amount according to the second sentence length and the second perturbation ratio;
    performing expansion processing on the initial training data according to the second perturbation amount and preset punctuation marks to obtain the enhanced training data.
  15. The electronic device according to any one of claims 12 to 14, wherein the step of training a preset neural network model according to the first original data and the target training data to obtain a target classification model comprises:
    performing a perturbation calculation on the first original data and the target training data through a preset function to obtain a text perturbation value;
    calculating a loss function of the neural network model according to the text perturbation value to obtain a loss value;
    using the loss value as a back-propagation quantity to adjust model parameters of the neural network model, so as to train the neural network model and obtain the text classification model.
  16. The electronic device according to claim 12, wherein the step of inputting the target text data into a target classification model for label classification processing to obtain label text data comprises:
    mapping the target text data to a preset vector space through a fully connected layer of the target classification model to obtain a target text vector;
    performing label classification processing on the target text vector through a classification function of the fully connected layer and preset text category labels to obtain the label text data.
  17. A storage medium, the storage medium being a computer-readable storage medium for computer-readable storage, wherein the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the steps of a model training method or a text classification method:
    wherein the model training method comprises:
    obtaining original training data, wherein the original training data includes first original data and second original data;
    performing upsampling processing on the second original data to obtain initial training data;
    performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data;
    encoding the enhanced training data to obtain a target word embedding vector;
    performing perturbation processing on the target word embedding vector to obtain target training data;
    training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data;
    wherein the text classification method comprises:
    obtaining target text data to be classified;
    inputting the target text data into a target classification model for label classification processing to obtain label text data, wherein the target classification model is trained according to a model training method, and the model training method comprises: obtaining original training data, wherein the original training data includes first original data and second original data;
    performing upsampling processing on the second original data to obtain initial training data;
    performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data;
    encoding the enhanced training data to obtain a target word embedding vector;
    performing perturbation processing on the target word embedding vector to obtain target training data;
    training a preset neural network model according to the first original data and the target training data to obtain a target classification model, wherein the target classification model is a text classification model used to classify target text data.
  18. The storage medium according to claim 17, wherein the enhancement parameters include a first perturbation ratio, and the step of performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data comprises:
    obtaining a first sentence length of the initial training data;
    calculating a first perturbation amount according to the first sentence length and the first perturbation ratio;
    performing deletion processing on the initial training data according to the first perturbation amount to obtain the enhanced training data.
  19. The storage medium according to claim 17, wherein the enhancement parameters include a second perturbation ratio, and the step of performing enhancement processing on the initial training data according to preset enhancement parameters to obtain enhanced training data comprises:
    obtaining a second sentence length of the initial training data;
    calculating a second perturbation amount according to the second sentence length and the second perturbation ratio;
    performing expansion processing on the initial training data according to the second perturbation amount and preset punctuation marks to obtain the enhanced training data.
  20. The storage medium according to any one of claims 17 to 19, wherein the step of training a preset neural network model according to the first original data and the target training data to obtain a target classification model comprises:
    performing a perturbation calculation on the first original data and the target training data through a preset function to obtain a text perturbation value;
    calculating a loss function of the neural network model according to the text perturbation value to obtain a loss value;
    using the loss value as a back-propagation quantity to adjust model parameters of the neural network model, so as to train the neural network model and obtain the text classification model.
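
By way of non-limiting illustration, the ratio-based enhancement steps recited in claims 2 and 3 (and repeated in claims 6, 7, 13, 14, 18, and 19) may be sketched in Python as follows. The sketch is not the claimed implementation and makes assumptions the claims do not state: sentences are split into whitespace tokens, the amount is the sentence length multiplied by the ratio and rounded down, and the tokens to delete and the insertion positions are chosen at random. The function names, the default punctuation set, and the sample sentence are hypothetical.

```python
import random


def perturbation_amount(sentence_length, ratio):
    """Number of tokens to perturb; rounding down is an assumption."""
    return int(sentence_length * ratio)


def delete_tokens(tokens, ratio, seed=0):
    """Deletion-style enhancement: drop `amount` randomly chosen tokens."""
    rng = random.Random(seed)
    amount = perturbation_amount(len(tokens), ratio)
    drop = set(rng.sample(range(len(tokens)), amount))
    return [t for i, t in enumerate(tokens) if i not in drop]


def insert_punctuation(tokens, ratio, marks=(",", ".", ";", "?", "!"), seed=0):
    """Expansion-style enhancement: insert `amount` preset marks at random positions."""
    rng = random.Random(seed)
    amount = perturbation_amount(len(tokens), ratio)
    out = list(tokens)
    for _ in range(amount):
        # randint is inclusive, so a mark may also be appended at the end.
        out.insert(rng.randint(0, len(out)), rng.choice(marks))
    return out


tokens = "the claim was rejected because the policy had already expired".split()
shorter = delete_tokens(tokens, ratio=0.2)       # 10 tokens, 2 removed
longer = insert_punctuation(tokens, ratio=0.2)   # 10 tokens, 2 marks added
```

Both operations leave the order of the surviving original tokens unchanged, so the enhanced sentence stays close to the original while still differing from it.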
PCT/CN2022/090737 2022-03-15 2022-04-29 Model training method and apparatus, text classification method and apparatus, device, and medium WO2023173555A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210253301.X 2022-03-15
CN202210253301.XA CN114637847A (en) 2022-03-15 2022-03-15 Model training method, text classification method and device, equipment and medium

Publications (1)

Publication Number Publication Date
WO2023173555A1 true WO2023173555A1 (en) 2023-09-21

Family

ID=81947559

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090737 WO2023173555A1 (en) 2022-03-15 2022-04-29 Model training method and apparatus, text classification method and apparatus, device, and medium

Country Status (2)

Country Link
CN (1) CN114637847A (en)
WO (1) WO2023173555A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115688868B (en) * 2022-12-30 2023-10-20 荣耀终端有限公司 Model training method and computing equipment
CN117171625B (en) * 2023-10-23 2024-02-06 云和恩墨(北京)信息技术有限公司 Intelligent classification method and device for working conditions, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339292A (en) * 2018-12-18 2020-06-26 北京京东尚科信息技术有限公司 Training method, system, equipment and storage medium of text classification network
CN111813939A (en) * 2020-07-13 2020-10-23 南京睿晖数据技术有限公司 Text classification method based on representation enhancement and fusion
CN112131384A (en) * 2020-08-27 2020-12-25 科航(苏州)信息科技有限公司 News classification method and computer-readable storage medium
CN112508243A (en) * 2020-11-25 2021-03-16 国网浙江省电力有限公司信息通信分公司 Training method and device for multi-fault prediction network model of power information system
US20210117718A1 (en) * 2019-10-21 2021-04-22 Adobe Inc. Entropy Based Synthetic Data Generation For Augmenting Classification System Training Data
CN113869398A (en) * 2021-09-26 2021-12-31 平安科技(深圳)有限公司 Unbalanced text classification method, device, equipment and storage medium
CN114022737A (en) * 2021-11-16 2022-02-08 胜斗士(上海)科技技术发展有限公司 Method and apparatus for updating training data set

Also Published As

Publication number Publication date
CN114637847A (en) 2022-06-17

Similar Documents

Publication Publication Date Title
CN113792818B (en) Intention classification method and device, electronic equipment and computer readable storage medium
WO2022037256A1 (en) Text sentence processing method and device, computer device and storage medium
WO2022022163A1 (en) Text classification model training method, device, apparatus, and storage medium
CN109359297B (en) Relationship extraction method and system
WO2023173555A1 (en) Model training method and apparatus, text classification method and apparatus, device, and medium
WO2021121198A1 (en) Semantic similarity-based entity relation extraction method and apparatus, device and medium
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
WO2023108991A1 (en) Model training method and apparatus, knowledge classification method and apparatus, and device and medium
CN111475622A (en) Text classification method, device, terminal and storage medium
WO2023159767A1 (en) Target word detection method and apparatus, electronic device and storage medium
CN114897060B (en) Training method and device for sample classification model, and sample classification method and device
CN113849661A (en) Entity embedded data extraction method and device, electronic equipment and storage medium
EP4165554A1 (en) Semantic representation of text in document
CN112101031A (en) Entity identification method, terminal equipment and storage medium
CN113705315A (en) Video processing method, device, equipment and storage medium
CN114416995A (en) Information recommendation method, device and equipment
CN115859980A (en) Semi-supervised named entity identification method, system and electronic equipment
CN112699685A (en) Named entity recognition method based on label-guided word fusion
CN116258137A (en) Text error correction method, device, equipment and storage medium
Wang et al. A text classification method based on LSTM and graph attention network
CN114048314A (en) Natural language steganalysis method
CN116975292A (en) Information identification method, apparatus, electronic device, storage medium, and program product
CN116775875A (en) Question corpus construction method and device, question answering method and device and storage medium
WO2023134085A1 (en) Question answer prediction method and prediction apparatus, electronic device, and storage medium
CN115204300A (en) Data processing method, device and storage medium for text and table semantic interaction

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
Ref document number: 22931582
Country of ref document: EP
Kind code of ref document: A1