WO2022095682A1 - Training method for text classification model, text classification method, apparatus, device, storage medium, and computer program product - Google Patents

Training method for text classification model, text classification method, apparatus, device, storage medium, and computer program product

Info

Publication number
WO2022095682A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
classification model
samples
language
text classification
Prior art date
Application number
PCT/CN2021/124335
Other languages
English (en)
French (fr)
Inventor
缪畅宇
Original Assignee
Tencent Technology (Shenzhen) Company Limited
Priority date: 2020-11-04 (CN 202011217057.9)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Priority to JP2023514478A (published as JP2023539532A)
Publication of WO2022095682A1
Priority to US17/959,402 (published as US20230025317A1)

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06F 16/316 — Information retrieval of unstructured textual data; Indexing structures
    • G06F 16/35 — Information retrieval of unstructured textual data; Clustering; Classification
    • G06F 16/353 — Clustering; Classification into predefined classes
    • G06F 18/214 — Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2431 — Classification techniques relating to the number of classes; Multiple classes
    • G06F 40/58 — Handling natural language data; Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06N 3/045 — Neural networks; Architecture; Combinations of networks
    • G06N 3/08 — Neural networks; Learning methods

Definitions

  • The embodiments of the present application are based on, and claim the priority of, the Chinese patent application No. 202011217057.9 filed on November 4, 2020, the entire content of which is incorporated into the embodiments of the present application by reference.
  • the present application relates to artificial intelligence technology, and in particular, to a training method for a text classification model, a text classification method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
  • Artificial intelligence is a comprehensive technology of computer science: by studying the design principles and implementation methods of various intelligent machines, it enables machines to perceive, reason, and make decisions. Artificial intelligence technology is a comprehensive subject covering a wide range of fields, such as natural language processing and machine learning/deep learning. With the development of technology, artificial intelligence will be applied in more fields and play an increasingly important role.
  • the text classification model is one of the important applications in the field of artificial intelligence.
  • the text classification model can identify the category to which the text belongs.
  • Text classification models are widely used in news recommendation, intent recognition systems, etc., that is, text classification models are the basic components of these complex systems.
  • The text classification model in the related art targets a certain language. When migrating to other languages, it faces the pressure of lacking labeled samples in those languages and cannot smoothly perform text classification tasks in those languages.
  • Embodiments of the present application provide a text classification model training method, text classification method, apparatus, electronic device, computer-readable storage medium, and computer program product, which can automatically obtain cross-language text samples and improve the accuracy of text classification.
  • The embodiment of the present application provides a training method for a text classification model, including: performing machine translation processing on a plurality of first text samples in a first language to obtain a plurality of second text samples corresponding to the plurality of first text samples one-to-one; training a first text classification model for the second language based on a plurality of third text samples in the second language and their corresponding category labels; performing a confidence-based screening process on the plurality of second text samples through the trained first text classification model; and training a second text classification model for the second language based on the second text samples obtained by the screening process; wherein the network depth of the second text classification model is greater than the network depth of the first text classification model.
  • The embodiment of the present application provides a text classification method, including: obtaining the text to be classified, wherein the text to be classified adopts a second language different from the first language; encoding the text to be classified through a second text classification model whose network depth is greater than that of the first text classification model to obtain an encoding vector of the text to be classified; and performing nonlinear mapping on the encoding vector to obtain the category corresponding to the text to be classified; wherein the second text classification model is obtained by training on text samples of the second language screened by the first text classification model, and the text samples of the second language are obtained by machine-translating the text samples of the first language.
  • the embodiment of the present application provides a training device for a text classification model, including:
  • a translation module configured to perform machine translation processing on a plurality of first text samples in a first language to obtain a plurality of second text samples corresponding to the plurality of first text samples one-to-one;
  • a first training module configured to train a first text classification model for the second language based on a plurality of third text samples in the second language and their corresponding category labels;
  • a screening module configured to perform a confidence-based screening process on the plurality of second text samples through the trained first text classification model; and
  • a second training module configured to train a second text classification model for the second language based on the second text samples obtained by the screening process; wherein the network depth of the second text classification model is greater than the network depth of the first text classification model.
  • An embodiment of the present application provides a text classification device, including:
  • an obtaining module configured to obtain the text to be classified; wherein, the text to be classified adopts a second language different from the first language;
  • a processing module configured to encode the text to be classified through a second text classification model whose network depth is greater than that of the first text classification model to obtain an encoding vector of the text to be classified, and to perform nonlinear mapping on the encoding vector to obtain the category corresponding to the text to be classified; wherein the second text classification model is obtained by training on the text samples of the second language screened by the first text classification model, and the text samples of the second language are obtained by machine-translating the text samples in the first language.
  • An embodiment of the present application provides an electronic device for training a text classification model, the electronic device comprising:
  • a memory configured to store executable instructions; and a processor configured to implement, when executing the executable instructions stored in the memory, the training method for a text classification model or the text classification method provided by the embodiments of the present application.
  • Embodiments of the present application provide a computer-readable storage medium storing executable instructions for causing a processor to execute the training method for a text classification model or a text classification method provided by the embodiments of the present application.
  • Embodiments of the present application provide a computer program product, including computer programs or instructions, which, when executed by a processor, implement the text classification model training method or text classification method provided by the embodiments of the present application.
  • In the embodiments of the present application, second text samples in a second language different from the first language are obtained through machine translation, and the second text samples are filtered through the first text classification model, so that cross-language text samples are acquired automatically and the pressure caused by the lack of text samples is reduced; the second text classification model is then trained with the high-quality text samples obtained by the screening, so that the second text classification model can perform accurate text classification, improving the accuracy of text classification.
  • FIG. 1 is a schematic diagram of an application scenario of a text classification system provided by an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of an electronic device for text classification model training provided by an embodiment of the present application
  • FIGS. 3-5 are schematic flowcharts of a training method for a text classification model provided by an embodiment of the present application;
  • FIG. 6 is a schematic flowchart of an iterative training provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a hierarchical softmax provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a cascaded encoder provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a text set A and a text set B provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a text set B1 provided by an embodiment of the present application.
  • FIG. 11 is a schematic flowchart of active learning provided by an embodiment of the present application.
  • FIG. 12 is a schematic flowchart of reinforcement learning provided by an embodiment of the present application.
  • The term "first/second" involved below is only used to distinguish similar objects and does not represent a specific ordering of objects. It should be understood that "first/second" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the application described herein can be implemented in an order other than that illustrated or described herein.
  • Convolutional Neural Network (CNN): a class of feedforward neural networks (FNN, Feedforward Neural Networks) that contains convolution computations and has a deep structure; it is one of the representative algorithms of deep learning.
  • Convolutional neural networks have representation learning capabilities and can perform shift-invariant classification of input images according to their hierarchical structure.
  • Cross-language few-shot text classification: when migrating from a language-A scenario to a language-B scenario with only a small budget for annotating language-B samples, large-scale annotation of language-B text can be realized with only a small amount of annotated language-B text and a large amount of annotated language-A text, and a text classification model can then be trained on the large-scale annotated language-B text to perform language-B text classification.
  • Cross-language zero-shot text classification: when migrating from a language-A scenario to a language-B scenario without the budget to label language-B samples (no labor, or a tight product promotion schedule), large-scale annotation of language-B text is realized only with the help of a large number of labeled texts in language A, and a text classification model is trained on the large-scale annotated language-B text to perform language-B text classification.
  • Text classification is widely used in content-related products, such as news classification, article classification, intent classification, information flow products, forums, communities, e-commerce, etc.
  • text classification is for texts in a certain language, such as Chinese, English, etc., but when the product needs to expand its business in other languages, it will encounter the problem of insufficient labeled text in the early stage of the product.
  • For example, when a product is promoted from the Chinese market to the English market, it is necessary to quickly label news in English; when positive and negative sentiment analysis is performed on the comments of Chinese users, as the number of users increases or the product is launched in overseas markets, there will be many non-Chinese comments, and these comments also need to be labeled with the corresponding sentiment polarity.
  • To this end, the embodiments of the present application provide a text classification model training method, text classification method, apparatus, electronic device, computer-readable storage medium and computer program product, which can automatically obtain cross-language text samples and improve the accuracy of text classification.
  • The training method and text classification method of the text classification model provided by the embodiments of the present application can be implemented by a terminal or a server alone, or by a terminal and a server collaboratively. For example, the terminal alone undertakes the training method of the text classification model described below; or the terminal sends a text classification request for a certain language to the server, and the server executes the training method of the text classification model according to the received request and performs the text classification task of that language based on the trained text classification model.
  • The electronic device used for text classification model training may be various types of terminal devices or servers. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services; the terminal may be a smartphone, tablet computer, notebook computer, desktop computer, smart speaker, smart watch, vehicle-mounted device, etc., but is not limited thereto.
  • the terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
  • Taking a server as an example, it may be a server cluster deployed in the cloud that provides users with artificial intelligence cloud services (AIaaS, AI as a Service). The AIaaS platform splits several types of common AI services and provides independent or packaged services in the cloud. This service model is similar to an AI-themed mall: all users can access one or more of the artificial intelligence services provided by the AIaaS platform through application programming interfaces.
  • one of the artificial intelligence cloud services may be a text classification model training service, that is, a server in the cloud encapsulates the text classification model training program provided by the embodiment of the present application.
  • The user invokes the text classification model training service in the cloud service through a terminal (running a client, such as a news client or a reading client), so that the server deployed in the cloud invokes the encapsulated text classification model training program: based on the first text samples in the first language, a machine translation model is used to obtain second text samples in a second language different from the first language; the second text samples are screened by the first text classification model, and the screened second text samples are used to train the second text classification model, which then performs text classification for subsequent news applications, reading applications, and the like.
  • For example, in a news application, the text is English news, and the category of each piece of news to be recommended (such as entertainment news, sports news, etc.) is determined by the trained second text classification model (for English news classification), so that the news to be recommended is screened based on its category to obtain news for recommendation.
  • In a reading application, the category of each article to be recommended is determined by the trained second text classification model (for Chinese article classification), such as inspirational "chicken soup" articles, legal articles, educational articles, etc., so that each article to be recommended is screened based on its category to obtain articles for recommendation, and the articles for recommendation are displayed to users to achieve targeted article recommendation.
  • FIG. 1 is a schematic diagram of an application scenario of the text classification system 10 provided by the embodiment of the present application.
  • the terminal 200 is connected to the server 100 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of the two.
  • the terminal 200 (running a client, such as a news client) can be used to obtain text to be classified in a certain language. For example, a developer inputs text to be classified in a certain language through the terminal, and the terminal automatically obtains a text classification request for a certain language.
  • In some embodiments, a text classification model training plug-in may be embedded in the client running in the terminal, so as to implement the training method of the text classification model locally on the client. For example, after acquiring the text to be classified in a second language different from the first language, the terminal 200 invokes the text classification model training plug-in to implement the training method of the text classification model: it obtains the second text samples (in the second language) corresponding to the first text samples (in the first language), screens the second text samples through the first text classification model, and trains the second text classification model with the screened second text samples, which is then used for text classification in subsequent news applications, reading applications, and the like.
  • In some embodiments, after the terminal 200 issues a text classification request for a certain language, it calls the text classification model training interface of the server 100 (which can be provided in the form of a cloud service, that is, a text classification model training service). The server 100 obtains, through the machine translation model, the second text samples (in the second language) corresponding to the first text samples (in the first language), screens the second text samples through the first text classification model, and trains the second text classification model with the screened second text samples; the trained second text classification model then performs text classification for subsequent news applications, reading applications, and the like.
  • FIG. 2 is a schematic structural diagram of the electronic device 500 for text classification model training provided by the embodiment of the present application.
  • In FIG. 2, the electronic device 500 is illustrated as a server.
  • the electronic device 500 for text classification model training shown in FIG. 2 includes: at least one processor 510 , memory 550 , at least one network interface 520 and user interface 530 .
  • the various components in electronic device 500 are coupled together by bus system 540 .
  • The bus system 540 is used to implement connection and communication between these components.
  • the bus system 540 also includes a power bus, a control bus and a status signal bus.
  • the various buses are labeled as bus system 540 in FIG. 2 .
  • the processor 510 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., where a general-purpose processor may be a microprocessor or any conventional processor or the like.
  • Memory 550 includes volatile memory or non-volatile memory, and may also include both volatile and non-volatile memory.
  • the non-volatile memory may be a read-only memory (ROM, Read Only Memory), and the volatile memory may be a random access memory (RAM, Random Access Memory).
  • the memory 550 described in the embodiments of the present application is intended to include any suitable type of memory.
  • Memory 550 optionally includes one or more storage devices that are physically remote from processor 510 .
  • memory 550 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
  • the operating system 551 includes system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
  • The apparatus for training the text classification model provided by the embodiments of the present application may be implemented in software, for example, as the text classification model training plug-in in the terminal described above, or as the text classification model training service in the server described above.
  • The apparatus for training the text classification model provided by the embodiments of the present application may be provided in various software forms, including application programs, software, software modules, scripts, or code.
  • FIG. 2 shows a training device 555 for a text classification model stored in the memory 550, which can be software in the form of programs and plug-ins, such as a text classification model training plug-in, and includes a series of modules: a translation module 5551, a first training module 5552, a screening module 5553, and a second training module 5554. These modules are used to implement the text classification model training function provided by the embodiments of the present application.
  • FIG. 3 is a schematic flowchart of a training method for a text classification model provided by an embodiment of the present application, which is described in conjunction with the steps shown in FIG. 3.
  • The network depth of the second text classification model is greater than the network depth of the first text classification model, that is, the text classification capability of the second text classification model is stronger than that of the first text classification model. Therefore, the number of text samples required for training the second text classification model is greater than the number required for training the first text classification model.
  • The first text samples are in a first language, and the second text samples and the third text samples are in a second language different from the first language. For example, the first text samples are Chinese samples, and the second text samples and the third text samples are English samples.
  • In step 101, machine translation processing is performed on a plurality of first text samples in a first language to obtain a plurality of second text samples corresponding to the plurality of first text samples one-to-one.
  • For example, the terminal automatically obtains a text classification request for the second language and sends it to the server, and the server receives the text classification request in the second language.
  • In step 102, a first text classification model for the second language is trained based on the plurality of third text samples in the second language and their corresponding category labels.
  • There is no fixed execution order between step 101 and step 102.
  • After the server receives the text classification request in the second language, it obtains a small number of labeled third text samples from the sample library and trains the first text classification model through the multiple third text samples and the corresponding category labels, so that the trained first text classification model can perform text classification in the second language.
  • In some embodiments, training a first text classification model for the second language based on the plurality of third text samples in the second language and the corresponding category labels includes: performing the t-th training on the first text classification model based on the plurality of third text samples in the second language and the corresponding category labels; performing the confidence-based t-th screening process on the plurality of second text samples through the first text classification model trained for the t-th time; performing the (t+1)-th training on the first text classification model based on the results of the previous t screening processes, the plurality of third text samples, and the corresponding category labels; and using the first text classification model trained for the T-th time as the trained first text classification model; wherein t is a positive integer that increases sequentially and satisfies 1 ≤ t ≤ T-1, and T is an integer greater than 2 that represents the total number of training iterations.
  • In this way, the first text classification model is iteratively trained, so that more high-quality second text samples can be filtered out through the gradually optimized first text classification model for the subsequent enhanced training of the second text classification model.
  • For example, based on the plurality of third text samples in the second language and the corresponding category labels, the first training of the first text classification model is performed, and the first text classification model trained for the first time performs the first confidence-based screening process on the second text samples. Based on the first screening results, the plurality of third text samples, and the corresponding category labels, the first text classification model is trained for the second time, and the first text classification model trained for the second time performs the second confidence-based screening process on the second text samples other than the first screening results. Based on the first and second screening results, the third text samples, and the corresponding category labels, the first text classification model is trained for the third time; this process is iterated until the T-th training of the first text classification model is performed, and the first text classification model trained for the T-th time is used as the trained first text classification model.
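  • The iterative screening-and-retraining loop described above can be illustrated with the following minimal sketch; the `train` function and the `predict_proba` interface (returning a per-category confidence mapping) are hypothetical placeholders rather than the patent's concrete implementation.

```python
def iterative_screening(third_samples, third_labels,      # small labeled set in the second language
                        second_samples, second_labels,     # translated set (labels copied from the first language)
                        train, confidence_threshold=0.9, max_rounds=5):
    """Iteratively train a weak classifier and move high-confidence translated samples into the training pool."""
    pool = list(zip(second_samples, second_labels))        # unscreened translated samples
    selected = []                                          # screened (high-quality) samples
    model = train(third_samples, third_labels)             # t = 1: train on the small labeled set only

    for _ in range(max_rounds):
        kept, remaining = [], []
        for text, label in pool:
            confidences = model.predict_proba([text])[0]    # assumed: dict mapping category -> confidence
            if confidences.get(label, 0.0) > confidence_threshold:
                kept.append((text, label))                  # copied label predicted with high confidence
            else:
                remaining.append((text, label))
        if not kept:                                        # no sample exceeds the threshold: stop iterating
            break
        selected.extend(kept)
        pool = remaining
        # training t + 1: labeled set plus everything screened so far
        texts = list(third_samples) + [t for t, _ in selected]
        labels = list(third_labels) + [l for _, l in selected]
        model = train(texts, labels)

    return model, selected
```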
  • FIG. 4 is an optional schematic flowchart of a training method for a text classification model provided by an embodiment of the present application.
  • FIG. 4 shows that step 102 in FIG. 3 can be implemented through steps 1021 to 1023 shown in FIG. 4. In step 1021, prediction processing is performed on a plurality of third text samples in the second language through the first text classification model to obtain the confidences of the predicted categories corresponding to the plurality of third text samples. In step 1022, the loss function of the first text classification model is constructed based on the confidences of the predicted categories and the category labels of the third text samples. In step 1023, the parameters of the first text classification model are updated until the loss function converges, and the updated parameters of the first text classification model when the loss function converges are used as the parameters of the trained first text classification model.
  • For example, the error signal of the first text classification model is determined based on its loss function, the error information is back-propagated in the first text classification model, and the model parameters of each layer are updated during propagation. Taking a neural network model as an example to illustrate back propagation: the training sample data is input into the input layer of the neural network model, passes through the hidden layers, and finally reaches the output layer, which outputs the result; this is the forward propagation process of the neural network model. Since there is an error between the output result and the actual result, the error between them is calculated and propagated back from the output layer toward the input layer through the hidden layers, and the values of the model parameters are adjusted according to the error during back propagation; the above process is iterated until convergence.
  • the first text classification model belongs to a neural network model.
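  • As a concrete illustration of steps 1021 to 1023 and the back propagation described above, a generic PyTorch-style training loop is sketched below; the model, the data loader, and the convergence test are assumed stand-ins and not the patent's concrete implementation.

```python
import torch
import torch.nn as nn

def train_until_convergence(model, data_loader, epochs=10, lr=1e-3, tol=1e-4):
    """Update the classifier's parameters with a cross-entropy loss until the loss stops improving."""
    criterion = nn.CrossEntropyLoss()                 # loss built from predicted scores and category labels
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    previous = float("inf")
    for _ in range(epochs):
        total = 0.0
        for inputs, labels in data_loader:            # labels are integer category ids
            logits = model(inputs)                    # forward pass: encode, fuse, map to category scores
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()                           # back-propagate the error signal layer by layer
            optimizer.step()                          # adjust each layer's parameters
            total += loss.item()
        if abs(previous - total) < tol:               # crude convergence check
            break
        previous = total
    return model
```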
  • In some embodiments, performing prediction processing on a plurality of third text samples in the second language through the first text classification model to obtain the confidences of the predicted categories corresponding to the plurality of third text samples includes performing the following processing on any third text sample among the plurality of third text samples through the first text classification model: encoding the third text sample to obtain an encoding vector of the third text sample; fusing the encoding vector of the third text sample to obtain a fusion vector; and performing nonlinear mapping on the fusion vector to obtain the confidence of the predicted category corresponding to the third text sample.
  • the first text classification model is a fast text classification model (fasttext).
  • the first text classification model in this embodiment of the present application is not limited to fasttext.
  • Fasttext includes an input layer, a hidden layer, and an output layer.
  • A small number of samples can quickly train fasttext, enabling fasttext to quickly perform the text classification task in the second language.
  • Specifically, the third text sample is encoded through the input layer to obtain the encoding vector of the third text sample; the encoding vector of the third text sample is then fused through the hidden layer to obtain the fusion vector; finally, the fusion vector is nonlinearly mapped through the output layer, that is, mapped through an activation function (e.g., softmax), to obtain the confidence of the predicted category corresponding to the third text sample.
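  • As a rough sketch of the input-hidden-output structure described above (not fastText's actual source), the encoding, fusion, and softmax mapping could look like this in PyTorch; the vocabulary size, embedding width, and id lookup are assumptions.

```python
import torch
import torch.nn as nn

class ShallowTextClassifier(nn.Module):
    """fastText-like model: embed fragment ids, average them, then map to class confidences."""
    def __init__(self, vocab_size=50_000, embed_dim=100, num_classes=10):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")  # encode + fuse (average)
        self.output = nn.Linear(embed_dim, num_classes)                       # hidden -> output layer

    def forward(self, fragment_ids, offsets):
        # fragment_ids is a flat tensor of fragment ids; offsets marks where each sample starts
        fused = self.embedding(fragment_ids, offsets)     # fusion vector of each sample
        logits = self.output(fused)
        return torch.softmax(logits, dim=-1)              # confidence of each predicted category
```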
  • In some embodiments, the first text classification model includes multiple cascaded activation layers; performing nonlinear mapping on the fusion vector to obtain the confidence of the predicted category corresponding to the third text sample includes: performing the mapping processing of the first activation layer on the fusion vector through the first of the multiple cascaded activation layers; outputting the mapping result of the first activation layer to the subsequent cascaded activation layers, and continuing the mapping processing and output of mapping results through the subsequent cascaded activation layers until the last activation layer; and using the activation result output by the last activation layer as the confidence of the predicted category corresponding to the third text sample.
  • The activation operation is performed through hierarchical softmax, which avoids obtaining the confidence of the predicted category with a single activation operation and instead uses a multi-layer activation operation, thereby reducing computational complexity.
  • For example, the hierarchical softmax includes T activation layers, and each activation layer performs one level of the softmax operation.
  • The fusion vector is mapped through the first activation layer to obtain a first mapping result.
  • The first mapping result is then mapped through the second activation layer to obtain a second mapping result, and so on until the T-th activation layer.
  • the activation results output by the T activation layers are used as the confidence of the predicted category corresponding to the third text sample.
  • T is the total number of active layers.
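  • A minimal sketch of the cascaded activation layers as they are described here (layer-by-layer mapping, with the last layer's output taken as the confidence); this illustrates the cascade only, not a tree-based hierarchical-softmax implementation, and the layer widths are assumptions.

```python
import torch
import torch.nn as nn

class CascadedActivation(nn.Module):
    """Apply T cascaded mapping layers to the fusion vector; the last output gives the class confidences."""
    def __init__(self, input_dim=100, num_classes=10, num_layers=3):
        super().__init__()
        dims = [input_dim] * num_layers + [num_classes]
        self.layers = nn.ModuleList(nn.Linear(dims[i], dims[i + 1]) for i in range(num_layers))

    def forward(self, fusion_vector):
        result = fusion_vector
        for layer in self.layers[:-1]:
            result = torch.relu(layer(result))                  # mapping result passed to the next layer
        return torch.softmax(self.layers[-1](result), dim=-1)   # last layer's activation = confidences
```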
  • performing encoding processing on the third text sample to obtain an encoding vector of the third text sample includes: performing window sliding processing on the third text sample to obtain multiple segment sequences; wherein the size of the window is N, N is a natural number; perform mapping processing on multiple fragment sequences based on the vocabulary library to obtain sequence vectors corresponding to multiple fragment sequences; combine sequence vectors corresponding to multiple fragment sequences to obtain the encoding vector of the third text sample .
  • In some embodiments, performing window sliding processing on the third text sample to obtain multiple fragment sequences includes performing the following processing on the i-th word of the third text sample: obtaining the i-th to (i+N-1)-th words of the third text sample, combining the i-th to (i+N-1)-th words, and using the combined result as a fragment sequence; wherein 0 < i ≤ M-N+1, M is the number of words in the third text sample, and M is a natural number. In this way, better encoding vectors can be generated for rare words: even if a word does not appear in the training corpus or the vocabulary, its encoding vector can still be constructed from word-granularity windows. The word-granularity windows also allow the first text classification model to learn part of the local word-order information, so that the first text classification model can retain word-order information during training.
  • In some embodiments, performing window sliding processing on the third text sample to obtain multiple fragment sequences includes performing the following processing on the j-th word of the third text sample: obtaining the j-th to (j+N-1)-th words of the third text sample, combining them, and using the combined result as a fragment sequence; wherein 0 < j ≤ K-N+1, K is the number of words in the third text sample, and K is a natural number.
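  • The size-N window sliding over the words of a sample can be sketched as follows; the hashing of fragments into a fixed-size vocabulary is an assumption added purely for illustration.

```python
def window_fragments(words, n=3):
    """Return all size-n fragment sequences obtained by sliding a window over the words (i = 1 .. M-N+1)."""
    m = len(words)
    return [" ".join(words[i:i + n]) for i in range(m - n + 1)] if m >= n else [" ".join(words)]

def fragment_ids(fragments, vocab_size=50_000):
    """Map fragment sequences to indices; unseen fragments are hashed, so rare words still get a representation."""
    return [hash(fragment) % vocab_size for fragment in fragments]

# Example: window_fragments(["the", "match", "ended", "in", "a", "draw"], n=3)
# -> ["the match ended", "match ended in", "ended in a", "in a draw"]
```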
  • In step 103, a confidence-based screening process is performed on the plurality of second text samples through the trained first text classification model.
  • The trained first text classification model can perform confidence-based screening on the plurality of second text samples so as to filter out high-quality second text samples, and the second text classification model is then trained with these high-quality second text samples.
  • In some embodiments, performing confidence-based screening on a plurality of second text samples through the trained first text classification model includes performing the following processing on any second text sample among the plurality of second text samples: performing prediction processing on the second text sample through the trained first text classification model to obtain the confidences of multiple predicted categories corresponding to the second text sample; determining the category label of the first text sample corresponding to the second text sample as the category label of the second text sample; and, based on the confidences of the multiple predicted categories and the category label of the second text sample, using the second text samples whose confidence exceeds the confidence threshold as the second text samples obtained by the screening process.
  • For example, the second text sample is encoded by the trained first text classification model to obtain the encoding vector of the second text sample, the encoding vector is fused to obtain the fusion vector, and the fusion vector is nonlinearly mapped to obtain the confidences of the multiple predicted categories corresponding to the second text sample; from the multiple predicted categories, the predicted category that matches the category label of the second text sample is determined, and when the confidence of the matching predicted category exceeds the confidence threshold, the second text sample is used as a second text sample obtained by the screening process.
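  • The per-sample screening against the label copied from the corresponding first text sample might look like the following sketch; `model.predict_proba` returning a per-category confidence mapping is an assumed interface.

```python
def screen_samples(model, second_samples, first_sample_labels, confidence_threshold=0.9):
    """Keep translated samples whose copied category label is predicted with confidence above the threshold."""
    screened = []
    for text, label in zip(second_samples, first_sample_labels):
        confidences = model.predict_proba([text])[0]         # assumed: dict mapping category -> confidence
        if confidences.get(label, 0.0) > confidence_threshold:
            screened.append((text, label))                   # high-quality sample kept for the strong model
    return screened
```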
  • In step 104, a second text classification model for the second language is trained based on the second text samples obtained by the screening process.
  • After the server selects a large number of high-quality second text samples through the trained first text classification model, the automatic construction of cross-language text samples is realized (that is, the second text samples in the second language carry the category labels of the corresponding first text samples, so no manual labeling is needed). The second text classification model is trained with a large number of high-quality second text samples, so that the trained second text classification model can accurately perform text classification in the second language, improving the accuracy of text classification in the second language.
  • In some embodiments, the second text classification model may also be trained using only the second text samples obtained by the screening process.
  • After training, text classification is performed on the text to be classified, that is, the text to be classified is encoded by the trained second text classification model to obtain the encoding vector of the text to be classified, and the encoding vector is nonlinearly mapped to obtain the category corresponding to the text to be classified; subsequent news applications, reading applications, and the like can then make use of the category corresponding to the text to be classified.
  • FIG. 5 is an optional schematic flowchart of a training method for a text classification model provided by an embodiment of the present application.
  • FIG. 5 shows that step 104 in FIG. 3 can be implemented through steps 1041 to 1043 shown in FIG. 5 .
  • In step 1041, the distribution of the second text samples obtained by the screening process across multiple categories is determined. In step 1042, when the distribution of the screened second text samples across the multiple categories satisfies the distribution balance condition and the number of samples in each category exceeds the corresponding category-number threshold, text samples corresponding to the category-number threshold are randomly selected from the text samples of each category among the screened second text samples to construct a training set. In step 1043, a second text classification model for the second language is trained based on the training set.
  • After the server obtains a large number of second text samples for training the second text classification model, it analyzes the distribution of the screened second text samples across multiple categories to determine whether the distribution balance condition is satisfied, that is, the jitter of the sample counts of the different categories; for example, the mean square error is used to measure the jitter of the counts of the different categories, and the larger the jitter, the more uneven the distribution of text samples across the categories.
  • When the distribution of the screened second text samples across multiple categories satisfies the distribution balance condition and the number of samples in each category exceeds the category-number threshold, text samples corresponding to the category-number threshold are randomly extracted from the text samples of each category among the screened second text samples to construct a training set, thereby improving the accuracy of text classification.
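  • One way to check the balance condition (mean-square jitter of per-category counts) and then draw an equal number of samples per category is sketched below; the relative-jitter criterion and the per-category threshold are assumed hyperparameters, not values given in the embodiments.

```python
import random
from collections import defaultdict

def build_balanced_training_set(samples, per_category_threshold=1000, max_relative_jitter=0.25):
    """samples: list of (text, category). Check count jitter across categories, then sample equally per category."""
    by_category = defaultdict(list)
    for text, category in samples:
        by_category[category].append(text)

    counts = [len(texts) for texts in by_category.values()]
    mean = sum(counts) / len(counts)
    jitter = sum((c - mean) ** 2 for c in counts) / len(counts)   # mean squared deviation of category counts

    balanced = jitter <= max_relative_jitter * mean ** 2          # assumed balance criterion
    if not balanced or min(counts) < per_category_threshold:
        return None  # fall back to synonym expansion or supplementation with third text samples

    training_set = []
    for category, texts in by_category.items():
        training_set += [(t, category) for t in random.sample(texts, per_category_threshold)]
    return training_set
```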
  • In some embodiments, training a second text classification model for the second language based on the second text samples obtained by the screening process includes: when the distribution of the screened second text samples across multiple categories does not satisfy the distribution balance condition, performing synonym-based expansion on the second text samples of the under-represented categories so that the distribution of the expanded second text samples across the multiple categories satisfies the distribution balance condition; constructing a training set based on the second text samples obtained by the expansion; and training a second text classification model for the second language based on the training set.
  • When the number of samples in a category does not exceed the corresponding category-number threshold, synonym-based expansion is performed on the second text samples of that category so that the number of expanded second text samples in each category exceeds the corresponding category-number threshold, and a training set is constructed based on the second text samples obtained by the expansion.
  • The specific expansion process is as follows: the following processing is performed on any text sample among the plurality of third text samples and the screened second text samples: the words in the text sample are matched against a synonym dictionary (containing the correspondences between synonyms) to obtain matching words for the words in the text sample; the words in the text sample are replaced by the matching words to obtain a new text sample; and the category label of the original text sample is used as the category label of the new text sample.
  • By replacing words with synonyms, the text samples in the second language can be greatly expanded, so as to support the training of the second text classification model.
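  • A sketch of the synonym-replacement expansion: each word is looked up in a synonym dictionary and replaced to produce a new sample carrying the original category label; the toy dictionary below is purely illustrative.

```python
import random

def expand_with_synonyms(text, label, synonym_dict, num_new=3):
    """Generate new (text, label) pairs by replacing words with dictionary synonyms; the label is reused."""
    words = text.split()
    new_samples = []
    for _ in range(num_new):
        replaced = [random.choice(synonym_dict[w]) if w in synonym_dict else w for w in words]
        candidate = " ".join(replaced)
        if candidate != text:
            new_samples.append((candidate, label))   # new sample keeps the original sample's category label
    return new_samples

# Illustrative usage with a toy synonym dictionary:
toy_dict = {"movie": ["film", "picture"], "great": ["excellent", "wonderful"]}
print(expand_with_synonyms("a great movie about space", "entertainment", toy_dict))
```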
  • In some embodiments, training a second text classification model for the second language based on the second text samples obtained by the screening process includes: constructing a training set based on the plurality of third text samples and the screened second text samples, and training a second text classification model for the second language based on the training set.
  • In some embodiments, constructing a training set based on the plurality of third text samples and the screened second text samples includes traversing each category of the screened second text samples and performing the following processing: when the number of second text samples in the category is lower than the category-number threshold of that category, randomly selecting third text samples of the category from the plurality of third text samples to supplement the second text samples of the category, so as to update the screened second text samples; and constructing the training set based on the updated screened second text samples.
  • That is, when the text samples of a category are insufficient, third text samples can be used as a supplement: when the number of second text samples in a category is lower than the category-number threshold of that category, it means that there are relatively few text samples of this category, and third text samples of this category can be randomly selected from the plurality of third text samples and added to the second text samples of the category, so as to update the screened second text samples and make the text samples of that category more sufficient.
  • In some embodiments, the number of text samples can be matched to the computing power of the second text classification model for appropriate training. Before training the second text classification model for the second language based on the screened second text samples, the target number of samples matching the computing power available for training the second text classification model is determined according to the correspondence between the computing power of the text classification model and the number of text samples that can be processed per unit time; then, text samples corresponding to the target number are selected from the training set constructed based on the screened second text samples as the training samples of the second text classification model for the second language.
  • In some embodiments, training a second text classification model for the second language based on the second text samples obtained by the screening process includes: performing prediction processing on the screened second text samples through the second text classification model to obtain the predicted categories corresponding to the screened second text samples; constructing the loss function of the second text classification model based on the predicted categories corresponding to the screened second text samples and the corresponding category labels; and updating the parameters of the second text classification model until the loss function converges, and using the updated parameters of the second text classification model when the loss function converges as the parameters of the trained second text classification model.
  • For example, after the value of the loss function of the second text classification model is determined based on the predicted categories corresponding to the screened second text samples and the corresponding category labels, it can be determined whether the value of the loss function exceeds a preset threshold; when it does, the error signal of the second text classification model is determined based on the loss function, the error information is back-propagated in the second text classification model, and the model parameters of each layer are updated during propagation.
  • In some embodiments, the second text classification model includes a plurality of cascaded encoders. Performing prediction processing on the screened second text samples through the second text classification model to obtain the predicted categories corresponding to the screened second text samples includes performing the following processing on any text sample among the screened second text samples: performing the encoding processing of the first encoder on the text sample through the first of the plurality of cascaded encoders; outputting the encoding result of the first encoder to the subsequent cascaded encoders, and continuing the encoding processing and output of encoding results through the subsequent cascaded encoders until the last encoder; using the encoding result output by the last encoder as the encoding vector of the text sample; and performing nonlinear mapping on the encoding vector of the text sample to obtain the predicted category corresponding to the text sample.
  • By performing encoding operations with cascaded encoders, rich feature information of the text samples can be extracted. For example, the first encoder performs its encoding processing on the text sample to obtain a first encoding result; the first encoding result is output to the second encoder, which encodes it to obtain a second encoding result; this continues until the S-th (last) encoder, where S is the total number of encoders. Finally, the encoding vector of the text sample is nonlinearly mapped to obtain the predicted category corresponding to the text sample. In general, the y-th encoder encodes the output of the (y-1)-th encoder, where y is a positive integer that increases sequentially and satisfies 2 ≤ y ≤ H-1, and H is an integer greater than 2 representing the number of cascaded encoders.
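  • A minimal sketch of a strong classifier built from cascaded encoders, using standard Transformer encoder layers as stand-ins for the encoders (the embodiments name BERT as one example of such a deep model); all dimensions here are assumptions.

```python
import torch
import torch.nn as nn

class CascadedEncoderClassifier(nn.Module):
    """Stack S cascaded encoders; the last encoder's output is pooled into the encoding vector, then mapped to categories."""
    def __init__(self, vocab_size=30_000, hidden=256, num_encoders=4, num_heads=4, num_classes=10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden)
        self.encoders = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=num_heads, batch_first=True)
            for _ in range(num_encoders)                      # encoder y consumes encoder y-1's output
        )
        self.classifier = nn.Linear(hidden, num_classes)      # mapping to category scores

    def forward(self, token_ids):
        states = self.embedding(token_ids)                    # (batch, seq_len, hidden)
        for encoder in self.encoders:
            states = encoder(states)                          # pass encoding results down the cascade
        encoding_vector = states.mean(dim=1)                  # pool the last encoder's output
        return torch.softmax(self.classifier(encoding_vector), dim=-1)
```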
  • In some embodiments, after training, text classification in the second language is performed through the trained second text classification model. The text classification method is as follows: obtain the text to be classified, wherein the text to be classified adopts a second language different from the first language; encode the text to be classified through a second text classification model whose network depth is greater than that of the first text classification model to obtain an encoding vector of the text to be classified; and perform nonlinear mapping on the encoding vector of the text to be classified to obtain the category corresponding to the text to be classified; wherein the second text classification model is obtained by training on the text samples of the second language screened by the first text classification model, and the text samples of the second language are obtained by machine-translating the text samples of the first language.
  • In some embodiments, the second text classification model includes a plurality of cascaded encoders, and the following processing is performed on the text to be classified: the first of the plurality of cascaded encoders performs its encoding processing on the text to be classified; the encoding result of the first encoder is output to the subsequent cascaded encoders, which continue the encoding processing and output of encoding results until the last encoder; the encoding result output by the last encoder is used as the encoding vector of the text to be classified; and the encoding vector is nonlinearly mapped to obtain the category corresponding to the text to be classified.
  • By performing encoding operations with cascaded encoders, rich feature information of the text to be classified can be extracted. For example, the first encoder performs its encoding processing on the text to be classified to obtain a first encoding result; the first encoding result is output to the second encoder, which encodes it to obtain a second encoding result; this continues until the S-th (last) encoder, where S is the total number of encoders. Finally, the encoding vector of the text to be classified is nonlinearly mapped to obtain the category corresponding to the text to be classified. In general, the y-th encoder encodes the output of the (y-1)-th encoder, where y is a positive integer that increases sequentially and satisfies 2 ≤ y ≤ H-1, and H is an integer greater than 2 representing the number of cascaded encoders.
  • Text classification is widely used in content-related products, such as news classification, article classification, intent classification, information flow products, forums, communities, e-commerce, etc., so as to perform text recommendation and emotional guidance based on the categories of text classification.
  • In the related art, text classification targets texts in a certain language, such as Chinese or English, but products need to expand their business in other languages. For example, when a news reading product is promoted from the Chinese market to the English market, news can be recommended based on the labels of English news as users read, so that English news matching the user's interests is recommended; when positive and negative sentiment analysis is performed on the comments of Chinese users and the product is promoted to overseas markets, users can be appropriately guided based on the labels of English comments when they comment, so as to avoid negative emotions.
  • The method includes three parts, namely A) data preparation, B) algorithm framework, and C) prediction:
  • The embodiments of the present application are aimed at the situation where a large number of (even unlabeled) samples are not available, so it is impossible to train a large-scale pre-trained model to extract text content.
  • For data preparation, a text set A (Text A, a text set in language A, including the first text samples) and a small text set B (Text B, a text set in language B, including the third text samples) are prepared, where Text A and Text B are samples with category annotations. Text B has only a small amount of annotation, so its proportion is very small.
  • the algorithm framework in the embodiment of the present application includes: 1) sample enhancement, 2) active learning, and 3) enhanced training.
  • Sample enhancement, active learning, and enhanced training are described in detail below:
  • Sample enhancement: each text X_A in language A in Text A is converted, through machine translation, into text in language B to form the corresponding text set B1 (Text B1, the text set in language B formed by translation).
  • Active learning, Step 1: a weak classifier (the first text classification model, such as a shallow classifier like fasttext) is trained on Text B and then applied to predict the samples in Text B1; samples whose confidence exceeds the specified confidence threshold are selected, and these samples inherit the category labels of their corresponding Text A samples.
  • Step 2: these high-confidence, labeled samples form a new training sample set (text set B1', Text B1'); based on Text B1' and Text B, the weak classifier continues to be trained, and after training is completed, step 1 is repeated and the weak classifier is applied to the remaining samples of Text B1 (the remaining samples refer to the text remaining after the high-confidence samples have been selected from Text B1).
  • Step 3: the iteration continues until the confidence obtained by predicting the remaining samples in Text B1 can no longer exceed the specified confidence threshold, that is, the remaining samples of Text B1 are considered to be of poor quality, and the iterative training is stopped at this point.
  • Enhanced training: the Text B' obtained in the above steps and Text B are mixed together, and a strong classifier (the second text classification model, such as a deep neural network, e.g., BERT, Bidirectional Encoder Representations from Transformers) is trained on the mixture.
  • the trained strong classifier is used as the final text classification model for text classification of language B.
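  • Putting the three stages together, an end-to-end sketch of the pipeline might look like this; `machine_translate`, `train_weak`, and `train_strong` are hypothetical helpers standing in for the translation model, the weak classifier (fastText-style), and the strong classifier (BERT-style), and `predict_proba` is again the assumed per-category confidence interface.

```python
def cross_language_pipeline(text_a, labels_a, text_b, labels_b,
                            machine_translate, train_weak, train_strong,
                            confidence_threshold=0.9):
    """text_a/labels_a: large annotated set in language A; text_b/labels_b: small annotated set in language B."""
    # Sample enhancement: translate Text A into language B, copying the category labels.
    text_b1 = [machine_translate(x) for x in text_a]
    labels_b1 = list(labels_a)

    # Active learning: iteratively screen Text B1 with the weak classifier.
    weak = train_weak(text_b, labels_b)
    selected_texts, selected_labels = [], []
    pool = list(zip(text_b1, labels_b1))
    while pool:
        kept = [(t, l) for t, l in pool
                if weak.predict_proba([t])[0].get(l, 0.0) > confidence_threshold]
        if not kept:
            break                                   # remaining samples are considered poor quality
        selected_texts += [t for t, _ in kept]
        selected_labels += [l for _, l in kept]
        pool = [p for p in pool if p not in kept]
        weak = train_weak(text_b + selected_texts, labels_b + selected_labels)

    # Enhanced training: mix the screened samples (Text B') with Text B and train the strong classifier.
    strong = train_strong(text_b + selected_texts, labels_b + selected_labels)
    return strong
```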
  • for example, when a news-reading product is promoted from the Chinese market to the English (language B) market, the trained strong classifier can quickly label English news, and news can then be recommended based on those labels so that English news matching the user's interests is recommended to users;
  • when positive/negative sentiment analysis is performed on the comments of Chinese users and the product is launched in an overseas (language B) market, many non-Chinese comments, i.e., English comments, will appear; the trained strong classifier can quickly attach the corresponding sentiment labels to these English comments, and when a user comments, the user's emotions can be guided appropriately based on those labels, avoiding sustained negative emotions. A short inference sketch is given below.
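  • A hedged usage sketch of the prediction stage follows, labelling a language-B (e.g. English) text with the trained strong classifier; the id2label mapping and the example categories are illustrative assumptions.

    import torch

    def classify(text, model, tok, id2label):
        model.eval()
        with torch.no_grad():
            enc = tok(text, truncation=True, max_length=128, return_tensors="pt")
            logits = model(**enc).logits
        return id2label[int(logits.argmax(dim=-1))]

    # e.g. classify("Late goal seals the league title ...", model, tok,
    #               {0: "entertainment", 1: "sports"})  ->  "sports"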
  • the training method for a text classification model and the text classification method of the embodiments of the present application obtain, through a machine translation model, second text samples in language B that differs from language A, and screen the second text samples with a weak classifier,
  • so that cross-language text samples are acquired automatically, which relieves the pressure caused by the lack of text samples;
  • in addition, the high-quality text samples obtained by the screening are used to train a strong classifier, so that the strong classifier can perform accurate text classification and the accuracy of text classification is improved.
  • each functional module in the training device for a text classification model may be implemented collaboratively by hardware resources of an electronic device (such as a terminal device, a server, or a server cluster), for example computing resources such as a processor, communication resources (for example, supporting communication over optical cable, cellular networks and other means), and memory.
  • a training device 555 for a text classification model is stored in the memory 550, which may be software in the form of programs and plug-ins, for example software modules designed in programming languages such as C/C++ or Java, application software designed in such programming languages, or dedicated software modules, application program interfaces, plug-ins or cloud services in large-scale software systems. Examples of different implementation methods are described below.
  • Example 1 The training device of the text classification model is a mobile application and module
  • the training device 555 of the text classification model in the embodiment of the present application can be provided as a software module designed in a programming language such as C/C++ or Java and embedded in various mobile applications based on systems such as Android or iOS (stored as executable instructions in the storage medium of the mobile terminal and executed by the processor of the mobile terminal), so that the computing resources of the mobile terminal itself are used directly to complete the relevant information recommendation tasks, and the processing results are transmitted periodically or irregularly to a remote server through various network communication methods, or saved locally on the mobile terminal.
  • Example 2 The training device of the text classification model is a server application and a platform
  • the training device 555 of the text classification model in this embodiment of the present application can be provided as application software designed in programming languages such as C/C++ or Java, or as a dedicated software module in a large-scale software system, running on the server side (stored as executable instructions in the server-side storage medium and run by the server-side processor); the server uses its own computing resources to complete the relevant information recommendation tasks.
  • the embodiments of the present application can also be provided on a distributed, parallel computing platform composed of multiple servers, equipped with a customized, easy-to-use web interface or other user interfaces (UI, User Interface), to form an information recommendation platform (for recommendation lists) used by individuals, groups or organizations.
  • Example 3 The training device of the text classification model is a server-side application program interface (API, Application Program Interface) and a plug-in
  • the text classification model training device 555 in this embodiment of the present application may be provided as a server-side API or plug-in for the user to invoke to execute the text classification model training method of the embodiment of the present application, and be embedded in various application programs.
  • Example 4 The training device of the text classification model is the mobile device client API and plug-in
  • the apparatus 555 for training the text classification model in the embodiment of the present application may be provided as an API or plug-in on the mobile device, for the user to call, so as to execute the training method of the text classification model in the embodiment of the present application.
  • Example 5 The training device of the text classification model is an open cloud service
  • the training device 555 of the text classification model in the embodiment of the present application may provide a cloud service for information recommendation developed for users, so that individuals, groups or units can obtain a recommendation list.
  • the training device 555 of the text classification model includes a series of modules, including a translation module 5551, a first training module 5552, a screening module 5553, and a second training module 5554.
  • the following continues to describe how the modules in the text classification model training device 555 provided by the embodiments of the present application cooperate to implement the training scheme for the text classification model.
  • the translation module 5551 is configured to perform machine translation processing on a plurality of first text samples in the first language through a machine translation model to obtain a plurality of second text samples corresponding to the plurality of first text samples one-to-one;
  • the plurality of second text samples are in a second language different from the first language;
  • the first training module 5552 is configured to train the first text classification model for the second language based on the plurality of third text samples in the second language and the corresponding category labels;
  • the screening module 5553 is configured to perform confidence-based screening processing on the plurality of second text samples through the trained first text classification model;
  • the second training module 5554 is configured to train a second text classification model for the second language based on the second text samples obtained by the screening process; wherein the network depth of the second text classification model is greater than the network depth of the first text classification model.
  • the first training module 5552 is further configured to perform the t-th training of the first text classification model based on the plurality of third text samples in the second language and the corresponding category labels;
  • perform the t-th confidence-based screening of the plurality of second text samples with the first text classification model obtained from the t-th training; and perform the (t+1)-th training of the first text classification model based on the results of the first t screenings, the plurality of third text samples and the corresponding category labels;
  • the first text classification model obtained from the T-th training is used as the trained first text classification model; where t is a successively increasing positive integer whose value range satisfies 1 ≤ t ≤ T-1, and T is an integer greater than 2 that represents the total number of training iterations.
  • the second training module 5554 is further configured to determine the distribution of the screened second text samples over multiple categories; when that distribution satisfies the distribution-balance condition and the number of samples in each category exceeds the corresponding category-count threshold, a training set is constructed by randomly drawing, from the screened second text samples of each category, text samples up to the category-count threshold; a second text classification model for the second language is then trained based on the training set. A sketch of this balance check and sampling step is given below.
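  • A hedged sketch of the balance check and per-category sampling follows; measuring the imbalance with a standard deviation and the particular cap/threshold values are illustrative assumptions.

    import random
    from collections import defaultdict
    from statistics import pstdev

    def build_balanced_training_set(samples, per_class_cap, max_std):
        by_class = defaultdict(list)
        for text, label in samples:                  # samples: screened second text samples
            by_class[label].append((text, label))
        counts = [len(v) for v in by_class.values()]
        balanced = pstdev(counts) <= max_std         # distribution-balance condition
        enough = all(c >= per_class_cap for c in counts)
        if balanced and enough:
            train = []
            for items in by_class.values():
                train += random.sample(items, per_class_cap)  # random draw up to the cap
            return train
        return None  # fall back to expansion / supplementation described below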
  • the second training module 5554 is further configured to, when the distribution of the screened second text samples over multiple categories does not satisfy the distribution-balance condition, perform synonym-based expansion on the second text samples of the under-represented categories,
  • so that the distribution of the expanded second text samples over the categories satisfies the distribution-balance condition; a training set is constructed based on the expanded second text samples, and a second text classification model for the second language is trained on that training set. A sketch of the synonym-based expansion is given below.
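  • A hedged sketch of the synonym-based expansion for under-represented categories follows; `synonyms` stands in for any language-B synonym dictionary, and the replacement probability and number of new samples are illustrative assumptions.

    import random

    def expand_with_synonyms(sample, synonyms, n_new=2, p_replace=0.3):
        text, label = sample
        out = []
        for _ in range(n_new):
            words = text.split()
            for i, w in enumerate(words):
                if w in synonyms and random.random() < p_replace:
                    words[i] = random.choice(synonyms[w])   # swap in a matched synonym
            new_text = " ".join(words)
            if new_text != text:
                out.append((new_text, label))               # the category label carries over
        return out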
  • the second training module 5554 is further configured to construct a training set based on the plurality of third text samples and the screened second text samples, and to train a second text classification model for the second language on that training set.
  • the second training module 5554 is further configured to traverse each category of the screened second text samples and perform the following processing: when the number of second text samples in a category is below that category's count threshold, third text samples of that category are randomly drawn from the plurality of third text samples and added to the second text samples of that category, so as to update the screened second text samples; a training set is then constructed from the updated screened second text samples.
  • the second training module 5554 is further configured to determine, according to the correspondence between a text classification model's computing power and the number of text samples it can process per unit of time, the target number of samples matching the computing power available for training the second text classification model; from the training set built from the screened second text samples, text samples up to that target number are selected as the samples for training the second text classification model for the second language. A sketch of this capping step is given below.
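  • A hedged sketch of matching the training-set size to the available compute follows; the throughput figure and time budget are illustrative assumptions.

    import random

    def cap_by_compute(train_set, samples_per_hour, budget_hours):
        target = int(samples_per_hour * budget_hours)   # target sample count matched to available compute
        if len(train_set) <= target:
            return train_set
        return random.sample(train_set, target)         # keep only as many samples as can be processed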
  • the first training module 5552 is further configured to perform prediction on the plurality of third text samples in the second language with the first text classification model to obtain the confidence of the predicted category corresponding to each third text sample; to construct the loss function of the first text classification model based on the confidences of the predicted categories and the category labels of the third text samples; and to update the parameters of the first text classification model until the loss function converges, taking the updated parameters at convergence as the parameters of the trained first text classification model.
  • the first training module 5552 is further configured to perform, for any third text sample among the plurality of third text samples, the following processing through the first text classification model: encode the third text sample to obtain its encoding vector; fuse the encoding vector of the third text sample to obtain a fusion vector; and apply a non-linear mapping to the fusion vector to obtain the confidence of the predicted category corresponding to the third text sample.
  • the first text classification model includes a plurality of cascaded activation layers; the first training module 5552 is further configured to apply the mapping of the first of the cascaded activation layers to the fusion vector, output the mapping result of the first activation layer to the subsequent cascaded activation layers, which continue the mapping and pass on their results until the last activation layer, and take the activation result output by the last activation layer as the confidence of the predicted category corresponding to the third text sample. A sketch of this shallow forward pass is given below.
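  • A hedged PyTorch sketch of such a shallow (fastText-style) forward pass follows: token/n-gram ids are encoded and fused by averaging, then mapped to category confidences; a flat softmax stands in for the hierarchical softmax that would normally make the final mapping cheaper, and the dimensions are illustrative.

    import torch
    import torch.nn as nn

    class ShallowClassifier(nn.Module):
        def __init__(self, vocab_size, dim, num_classes):
            super().__init__()
            self.embed = nn.EmbeddingBag(vocab_size, dim, mode="mean")  # encode + fuse (mean pooling)
            self.out = nn.Linear(dim, num_classes)

        def forward(self, token_ids, offsets):
            fused = self.embed(token_ids, offsets)           # fusion vector per sample
            return torch.softmax(self.out(fused), dim=-1)    # confidence of each predicted category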
  • the screening module 5553 is further configured to perform the following processing for any second text sample among the plurality of second text samples: predict the second text sample with the trained first text classification model to obtain the confidences of the multiple predicted categories corresponding to that sample; determine the category label of the first text sample corresponding to the second text sample as the category label of the second text sample; and, based on those confidences and the category label of the second text sample, take the second text samples that exceed the confidence threshold as the second text samples obtained by the screening process.
  • the second training module 5554 is further configured to predict the screened second text samples with the second text classification model to obtain the predicted categories of the screened second text samples; construct the loss function of the second text classification model based on those predicted categories and the corresponding category labels; and update the parameters of the second text classification model until the loss function converges, taking the updated parameters at convergence as the parameters of the trained second text classification model.
  • the second text classification model includes a plurality of cascaded encoders; the second training module 5554 is further configured to perform the following processing for any text sample among the screened second text samples: encode the text sample with the first of the cascaded encoders; output the encoding result of the first encoder to the subsequent cascaded encoders, which continue encoding and passing on their results until the last encoder; take the encoding result output by the last encoder as the encoding vector of the text sample; and apply a non-linear mapping to that encoding vector to obtain the predicted category of the text sample.
  • the second training module 5554 is further configured to perform the following processing through the y-th of the cascaded encoders: apply self-attention to the encoding result of the (y-1)-th encoder to obtain the y-th self-attention vector; apply a residual connection to the y-th self-attention vector and the encoding result of the (y-1)-th encoder to obtain the y-th residual vector;
  • apply a non-linear mapping to the y-th residual vector to obtain the y-th mapping vector; apply a residual connection to the y-th mapping vector and the y-th residual vector, take the result of the residual connection as the encoding result of the y-th encoder, and output it to the (y+1)-th encoder;
  • where y is a successively increasing positive integer whose value range satisfies 2 ≤ y ≤ H-1, and H is an integer greater than 2 that represents the number of cascaded encoders. A sketch of one such encoder is given below.
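  • A hedged PyTorch sketch of one encoder in the cascaded stack follows, showing the self-attention, residual connection, non-linear mapping and second residual connection described above; the layer normalisation and the head/hidden sizes are additions typical of BERT-style encoders rather than details spelled out in this application.

    import torch
    import torch.nn as nn

    class EncoderBlock(nn.Module):
        def __init__(self, dim, heads=8, hidden=2048):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

        def forward(self, prev):                        # prev: encoding result of encoder y-1
            attn_out, _ = self.attn(prev, prev, prev)   # y-th self-attention vector
            res = self.norm1(prev + attn_out)           # first residual connection -> y-th residual vector
            mapped = self.mlp(res)                      # non-linear mapping -> y-th mapping vector
            return self.norm2(res + mapped)             # second residual connection -> output to encoder y+1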
  • an embodiment of the present application further provides a text classification device, and the text classification device includes a series of modules, including an acquisition module and a processing module.
  • the obtaining module is configured to obtain the text to be classified; wherein, the text to be classified is in a second language different from the first language;
  • the processing module is configured to encode the text to be classified through a second text classification model whose network depth is greater than that of the first text classification model, obtaining the encoding vector of the text to be classified; and to apply a non-linear mapping to that encoding vector to obtain the category corresponding to the text to be classified; wherein the second text classification model is trained on text samples in the second language screened by the first text classification model, and the text samples in the second language are obtained by machine-translating text samples in the first language.
  • Embodiments of the present application provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the electronic device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the electronic device executes the text classification model training method or text classification method described above in the embodiments of the present application.
  • the embodiments of the present application provide a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to execute the artificial-intelligence-based information recommendation method provided by the embodiments of the present application, or the text classification method, for example the training method of the text classification model shown in Figures 3-5.
  • the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc or a CD-ROM; it may also be any device including one of, or any combination of, the foregoing memories.
  • executable instructions may take the form of programs, software, software modules, scripts or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine or other unit suitable for use in a computing environment.
  • executable instructions may, but do not necessarily, correspond to files in a file system; they may be stored as part of a file that holds other programs or data, for example in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple cooperating files (e.g., files that store one or more modules, subroutines or code sections).
  • executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A training method for a text classification model, a text classification method, an apparatus, an electronic device, and a computer-readable storage medium, relating to artificial intelligence technology. The method comprises: performing machine translation on a plurality of first text samples in a first language by means of a machine translation model to obtain a plurality of second text samples in one-to-one correspondence with the plurality of first text samples (101), wherein the plurality of second text samples are in a second language different from the first language; training a first text classification model for the second language on the basis of a plurality of third text samples in the second language and their respective category labels (102); performing confidence-based screening on the plurality of second text samples by means of the trained first text classification model (103); and training a second text classification model for the second language on the basis of the second text samples obtained by the screening (104).

Description

文本分类模型的训练方法、文本分类方法、装置、设备、存储介质及计算机程序产品
相关申请的交叉引用
本申请实施例基于申请号为202011217057.9、申请日为2020年11月04日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本申请实施例作为参考。
技术领域
本申请涉及人工智能技术,尤其涉及一种文本分类模型的训练方法、文本分类方法、装置、电子设备、计算机可读存储介质及计算机程序产品。
背景技术
人工智能(AI,Artificial Intelligence)是计算机科学的一个综合技术,通过研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。人工智能技术是一门综合学科,涉及领域广泛,例如自然语言处理技术以及机器学习/深度学习等几大方向,随着技术的发展,人工智能技术将在更多的领域得到应用,并发挥越来越重要的价值。
文本分类模型是人工智能领域的重要应用之一,文本分类模型可以识别出文本所属的类别。文本分类模型在新闻推荐、意图识别系统等中都有广泛的应用,即文本分类模型是这些复杂系统的基础组件。
但是,相关技术中的文本分类模型是针对某一种语言,当将文本分类模型扩展到其他语言的文本分类时,该文本分类模型将会面临缺乏其他语言的标注样本的压力,无法顺利进行其他语言的文本分类任务。
发明内容
本申请实施例提供一种文本分类模型的训练方法、文本分类方法、装置、电子设备、计算机可读存储介质及计算机程序产品,能够自动获取跨语言的文本样本,提高文本分类的准确性。
本申请实施例的技术方案是这样实现的:
本申请实施例提供一种文本分类模型的训练方法,包括:
对第一语言的多个第一文本样本进行机器翻译处理,得到与所述多个第一文本样本一一对应的多个第二文本样本;
其中,所述多个第二文本样本采用不同于所述第一语言的第二语言;
基于所述第二语言的多个第三文本样本以及分别对应的类别标签,训练用于所述第二语言的第一文本分类模型;
通过训练后的所述第一文本分类模型对所述多个第二文本样本进行基于置信度的筛选处理;
基于所述筛选处理得到的第二文本样本,训练用于所述第二语言的第二文本分类模型;
其中,所述第二文本分类模型的网络深度大于所述第一文本分类模型的网络深度。
本申请实施例提供一种文本分类方法,包括:
获取待分类文本;
其中,所述待分类文本采用不同于第一语言的第二语言;
通过网络深度大于第一文本分类模型的第二文本分类模型对所述待分类文本进行编码处理,得到所述待分类文本的编码向量;
对所述待分类文本的编码向量进行非线性映射,得到所述待分类文本对应的类别;
其中,所述第二文本分类模型是通过所述第一文本分类模型筛选得到的第二语言的文本样本训练得到的,所述第二语言的文本样本是通过对所述第一语言的文本样本进行机器翻译得到的。
本申请实施例提供一种文本分类模型的训练装置,包括:
翻译模块,配置为对第一语言的多个第一文本样本进行机器翻译处理,得到与所述多个第一文本样本一一对应的多个第二文本样本;
第一训练模块,配置为基于所述第二语言的多个第三文本样本以及分别对应的类别标签,训练用于所述第二语言的第一文本分类模型;
筛选模块,配置为通过训练后的所述第一文本分类模型对所述多个第二文本样本进行基于置信度的筛选处理;
第二训练模块,配置为基于所述筛选处理得到的第二文本样本,训练用于所述第二语言的第二文本分类模型;其中,所述第二文本分类模型的网络深度大于所述第一文本分类模型的网络深度。
本申请实施例提供一种文本分类装置,包括:
获取模块,配置为获取待分类文本;其中,所述待分类文本采用不同于第一语言的第二语言;
处理模块,配置为通过网络深度大于第一文本分类模型的第二文本分类模型对所述待分类文本进行编码处理,得到所述待分类文本的编码向量;对所述待分类文本的编码向量进行非线性映射,得到所述待分类文本对应的类别;其中,所述第二文本分类模型是通过所述第一文本分类模型筛选得到的第二语言的文本样本训练得到的,所述第二语言的文本样本是通过对所述第一语言的文本样本进行机器翻译得到的。
本申请实施例提供一种用于文本分类模型训练的电子设备,所述电子设备包括:
存储器,用于存储可执行指令;
处理器,用于执行所述存储器中存储的可执行指令时,实现本申请实施例提供的文本分类模型的训练方法,或文本分类方法。
本申请实施例提供一种计算机可读存储介质,存储有可执行指令,用于引起处理器执行时,实现本申请实施例提供的文本分类模型的训练方法,或文本分类方法。
本申请实施例提供一种计算机程序产品,包括计算机程序或指令,所述计算机程序或指令被处理器执行时,实现本申请实施例提供的文本分类模型的训练方法,或文本分类方法。
本申请实施例具有以下有益效果:
通过机器翻译获取采用不同于第一语言的第二语言的第二文本样本,并通过第一文本分类模型对第二文本样本进行筛选,从而实现自动获取跨语言的文本样本,降低由于缺乏文本样本所带来的压力;并且,通过筛选得到的优质文本样本训练第二文本分类模型,使得第二文本分类模型能够进行准确的文本分类,提高文本分类的准确性。
附图说明
图1是本申请实施例提供的文本分类系统的应用场景示意图;
图2是本申请实施例提供的用于文本分类模型训练的电子设备的结构示意图;
图3-5是本申请实施例提供的基于文本分类模型的训练方法的流程示意图;
图6是本申请实施例提供的迭代训练的流程示意图;
图7是本申请实施例提供的层次softmax的示意图;
图8是本申请实施例提供的级联的编码器的示意图;
图9是本申请实施例提供的文本集A和文本集B的示意图;
图10是本申请实施例提供的文本集B1的示意图;
图11是本申请实施例提供的主动学习的流程示意图;
图12是本申请实施例提供的增强学习的流程示意图。
具体实施方式
为了使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请作进一步地详细描述,所描述的实施例不应视为对本申请的限制,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本申请保护的范围。
在以下的描述中,所涉及的术语“第一\第二”仅仅是是区别类似的对象,不代表针对对象的特定排序,可以理解地,“第一\第二”在允许的情况下可以互换特定的顺序或先后次序,以使这里描述的本申请实施例能够以除了在这里图示或描述的以外的顺序实施。
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本文中所使用的术语只是为了描述本申请实施例的目的,不是旨在限制本申请。
对本申请实施例进行进一步详细说明之前,对本申请实施例中涉及的名词和术语进行说明,本申请实施例中涉及的名词和术语适用于如下的解释。
1)卷积神经网络(CNN,Convolutional Neural Networks):一类包含卷积计算且具有深度结构的前馈神经网络(FNN,Feedforward Neural Networks),是深度学习(deep learning)的代表算法之一。卷积神经网络具有表征学习(representation learning)能力,能够按其阶层结构对输入图像进行平移不变分类(shift-invariant classification)。
2)跨语言少量(few shot)文本分类:当从A语言场景迁移到B语言场景、且有少量预算来做B语言样本标注时,只需少量的B语言的标注文本、大量A语言标注文本,便可以实现B语言文本的大规模标注,通过B语言文本的大规模标注训练文本分类模型,以实现B语言文本分类。
3)跨语言零次(zero shot)文本分类:当从A语言场景迁移到B语言场景、且缺乏预算(没有人工或产品推广时间紧迫)时,无法对B语言样本进行标注,即:仅借助大量A语言标注文本,来实现B语言的大规模标注,并通过B语言文本的大规模标注训练文本分类模型,以实现B语言文本分类。
文本分类被广泛应用在内容相关的产品中,例如新闻分类、文章分类、意图分类、信息流产品、论坛、社区、电商等等。一般情况下,文本分类都是针对某一种语言的文本,例如中文、英文等等,但当产品需要拓展其他语言业务时,在产品初期会遇到标注文本不足的问题,例如,将新闻阅读产品从中文市场推广到英文市场时,则需要快速地对英文领域的新闻打上相应的标签;对中文用户的评论进行正负情感分析时,随着用户数增多,或者将产品推向海外市场时,会出现很多不是中文的评论,因此这些评论也需要标注出相应的情感极性。
虽然从更长时间的尺度看,这些其他语言文本,可以通过人工运营等方式,慢慢积累一定体量的标注数据,然后进行模型训练以及预测。但在早期,只通过人工来给文本 进行标注,非常耗时且浪费人力,不利于产品的快速迭代。所以在初期时,希望通过算法、借助已有语言的标注文本积累,来实现大量文本的自动标注。
相关技术中,都是围绕同一种语言的few shot文本分类或zero shot文本分类,也就是只解决同一种语言标注样本不足的问题,缺乏跨语言的文本分类。
为了解决上述问题,本申请实施例提供了一种文本分类模型的训练方法、文本分类方法、装置、电子设备、计算机可读存储介质及计算机程序产品,能够自动获取跨语言的文本样本,提高文本分类的准确性。
本申请实施例所提供的文本分类模型的训练方法和文本分类方法,可以由终端/服务器独自实现;也可以由终端和服务器协同实现,例如终端独自承担下文所述的文本分类模型的训练方法,或者,终端向服务器发送针对某语言的文本分类请求,服务器根据接收的该某语言的文本分类执行文本分类模型的训练方法,并基于训练后的文本分类模型进行该语言的文本分类任务。
本申请实施例提供的用于文本分类模型训练的电子设备可以是各种类型的终端设备或服务器,其中,服务器可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云计算服务的云服务器;终端可以是智能手机、平板电脑、笔记本电脑、台式计算机、智能音箱、智能手表、车载设备等,但并不局限于此。终端以及服务器可以通过有线或无线通信方式进行直接或间接地连接,本申请在此不做限制。
以服务器为例,例如可以是部署在云端的服务器集群,向用户开放人工智能云服务(AiaaS,AI as a Service),AIaaS平台会把几类常见的AI服务进行拆分,并在云端提供独立或者打包的服务,这种服务模式类似于一个AI主题商城,所有的用户都可以通过应用程序编程接口的方式来接入使用AIaaS平台提供的一种或者多种人工智能服务。
例如,其中的一种人工智能云服务可以为文本分类模型训练服务,即云端的服务器封装有本申请实施例提供的文本分类模型训练的程序。用户通过终端(运行有客户端,例如新闻客户端、阅读客户端等)调用云服务中的文本分类模型训练服务,以使部署在云端的服务器调用封装的文本分类模型训练的程序,基于第一语言的第一文本样本,通过机器翻译模型,获取采用不同于第一语言的第二语言的第二文本样本,并通过第一文本分类模型对第二文本样本进行筛选,通过筛选得到的第二文本样本训练第二文本分类模型,通过训练的第二文本样本进行文本分类,以进行后续新闻应用、阅读应用等,例如,对于新闻应用,文本为英文新闻,通过训练的第二文本分类模型(用于英文的新闻分类)确定各待推荐的新闻的类别,例如娱乐新闻、体育新闻等,从而基于新闻的类别对各待推荐的新闻进行筛选,以获得用于推荐的新闻,并向用户展示用于推荐的新闻,以实现针对性的新闻推荐;对于阅读应用,文本为中文文章,通过训练的第二文本分类模型(用于中文的文章分类)确定各待推荐的文章的类别,例如心灵鸡汤、法律文章、教育文章等,从而基于文章的类别对各待推荐的文章进行筛选,以获得用于推荐的文章,并向用户展示用于推荐的文章,以实现针对性的文章推荐。
参见图1,图1是本申请实施例提供的文本分类系统10的应用场景示意图,终端200通过网络300连接服务器100,网络300可以是广域网或者局域网,又或者是二者的组合。
终端200(运行有客户端,例如新闻客户端)可以被用来获取某语言的待分类文本,例如,开发人员通过终端输入某语言的待分类文本,终端自动获取针对某语言的文本分类请求。
在一些实施例中,终端中运行的客户端中可以植入有文本分类模型训练插件,用以在客户端本地实现文本分类模型的训练方法。例如,终端200获取不同于第一语言的第 二语言的待分类文本后,调用文本分类模型训练插件,以实现文本分类模型的训练方法,通过机器翻译模型获取采用与第一文本样本(采用第一语言)对应的第二文本样本(采用第二语言),并通过第一文本分类模型对第二文本样本进行筛选,通过筛选得到的第二文本样本训练第二文本分类模型,基于训练的第二文本样本进行文本分类,以进行后续新闻应用、阅读应用等。
在一些实施例中,终端200针对某语言的文本分类请求后,调用服务器100的文本分类模型训练接口(可以提供为云服务的形式,即文本分类模型训练服务),服务器100通过机器翻译模型获取采用与第一文本样本(采用第一语言)对应的第二文本样本(采用第二语言),并通过第一文本分类模型对第二文本样本进行筛选,通过筛选得到的第二文本样本训练第二文本分类模型,基于训练的第二文本样本进行文本分类,以进行后续新闻应用、阅读应用等。
下面说明本申请实施例提供的用于文本分类模型训练的电子设备的结构,参见图2,图2是本申请实施例提供的用于文本分类模型训练的电子设备500的结构示意图,以电子设备500是服务器为例说明,图2所示的用于文本分类模型训练的电子设备500包括:至少一个处理器510、存储器550、至少一个网络接口520和用户接口530。电子设备500中的各个组件通过总线系统540耦合在一起。可理解,总线系统540用于实现这些组件之间的连接通信。总线系统540除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图2中将各种总线都标为总线系统540。
处理器510可以是一种集成电路芯片,具有信号的处理能力,例如通用处理器、数字信号处理器(DSP,Digital Signal Processor),或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,其中,通用处理器可以是微处理器或者任何常规的处理器等。
存储器550包括易失性存储器或非易失性存储器,也可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(ROM,Read Only Memory),易失性存储器可以是随机存取存储器(RAM,Random Access Memory)。本申请实施例描述的存储器550旨在包括任意适合类型的存储器。存储器550可选地包括在物理位置上远离处理器510的一个或多个存储设备。
在一些实施例中,存储器550能够存储数据以支持各种操作,这些数据的示例包括程序、模块和数据结构或者其子集或超集,下面示例性说明。
操作系统551,包括用于处理各种基本系统服务和执行硬件相关任务的系统程序,例如框架层、核心库层、驱动层等,用于实现各种基础业务以及处理基于硬件的任务;
网络通信模块553,用于经由一个或多个(有线或无线)网络接口520到达其他计算设备,示例性的网络接口520包括:蓝牙、无线相容性认证(WiFi)、和通用串行总线(USB,Universal Serial Bus)等;
在一些实施例中,本申请实施例提供的文本分类模型的训练装置可以采用软件方式实现,例如,可以是上文所述的终端中的文本分类模型训练插件,可以是上文所述的服务器中文本分类模型训练服务。当然,不局限于此,本申请实施例提供的文本分类模型的训练装置可以提供为各种软件实施例,包括应用程序、软件、软件模块、脚本或代码在内的各种形式。
图2示出了存储在存储器550中的文本分类模型的训练装置555,其可以是程序和插件等形式的软件,例如文本分类模型训练插件,并包括一系列的模块,包括翻译模块5551、第一训练模块5552、筛选模块5553以及第二训练模块5554;其中,翻译模块5551、第一训练模块5552、筛选模块5553、第二训练模块5554用于实现本申请实施例提供的文本分类模型的训练功能。
如前所述,本申请实施例提供的文本分类模型的训练方法可以由各种类型的电子设备实施。参见图3,图3是本申请实施例提供的基于文本分类模型的训练方法的流程示意图,结合图3示出的步骤进行说明。
在下面的步骤中,第二文本分类模型的网络深度大于第一文本分类模型的网络深度,即第二文本分类模型的文本分类能力强于第一文本分类模型的分类能力,因此,用于训练第二文本分类模型所需的文本样本的数量大于用于训练第一文本分类模型所需的文本样本的数量。
在下面的步骤中,第一文本样本采用第一语言,第二文本样本以及第三文本样本采用不同于第一语言的第二语言,例如第一文本样本为中文样本,第二文本样本以及第三文本样本为英文样本。
在步骤101中,对第一语言的多个第一文本样本进行机器翻译处理,得到与多个第一文本样本一一对应的多个第二文本样本。
例如,当开发人员通过终端输入第二语言的文本分类指令,终端自动获取针对第二语言的文本分类请求,并将第二语言的文本分类请求发送至服务器,服务器接收到第二语言的文本分类请求后,从样本库中获取有大量标注的第一文本样本,但是第一文本样本采用的是不同于第二语言的第一语言,并调用机器翻译模型对多个第一文本样本进行机器翻译,以获取与多个第一文本样本一一对应的多个第二文本样本,其中,第二文本样本的类别标注继承对应第一文本样本的类别标注,即不需要进行人工标注,大大节省了人工标注多带来的标注压力。
在步骤102中,基于第二语言的多个第三文本样本以及分别对应的类别标签,训练用于第二语言的第一文本分类模型。
其中,步骤101和步骤102并无明显的先后顺序。在服务器接收到第二语言的文本分类请求后,从样本库中获取有少量标注的第三文本样本,通过多个第三文本样本以及对应的类别标签,训练第一文本分类模型,使得训练后的第一文本分类模型可以基于第二语言进行文本分类。
在一些实施例中,基于第二语言的多个第三文本样本以及分别对应的类别标签,训练用于第二语言的第一文本分类模型,包括:基于第二语言的多个第三文本样本以及分别对应的类别标签,对第一文本分类模型进行第t次训练;通过第t次训练的第一文本分类模型对多个第二文本样本进行基于置信度的第t次筛选处理;基于前t次筛选结果、多个第三文本样本以及分别对应的类别标签,对第一文本分类模型进行第t+1次训练;将第T次训练的第一文本分类模型作为训练后的第一文本分类模型;其中,t为依次递增的正整数、且取值范围满足1≤t≤T-1,T为大于2的整数、且用于表示迭代训练的总次数。
例如,基于第二语言的多个第三文本样本以及分别对应的类别标签,对第一文本分类模型进行迭代训练,以通过逐渐优化的第一文本分类模型筛选出更多优质的第三文本样本,以进行后续的增强训练,对第二文本分类模型进行训练。
如图6所示,基于第二语言的多个第三文本样本以及分别对应的类别标签,对第一文本分类模型进行第1次训练,通过第1次训练的第一文本分类模型对多个第二文本样本进行基于置信度的第1次筛选处理,基于第1次筛选结果、多个第三文本样本以及分别对应的类别标签,对第一文本分类模型进行第2次训练,通过第2次训练的第一文本分类模型对多个第二文本样本中除第1次筛选结果外的第二文本样本进行基于置信度的第2次筛选处理,将前2次筛选结果、多个第三文本样本以及分别对应的类别标签,对第一文本分类模型进行第3次训练,迭代上述训练过程,直至对第一文本分类模型进行第T次训练,将第T次训练的第一文本分类模型作为训练后的第一文本分类模型。
参见图4,图4是本申请实施例提供的文本分类模型的训练方法的一个可选的流程示意图,图4示出图3中的步骤102可以通过图4示出的步骤1021至步骤1023实现:在步骤1021中,通过第一文本分类模型对第二语言的多个第三文本样本进行预测处理,得到多个第三文本样本分别对应的预测类别的置信度;在步骤1022中,基于预测类别的置信度以及第三文本样本的类别标签,构建第一文本分类模型的损失函数;在步骤1023中,更新第一文本分类模型的参数直至损失函数收敛,将损失函数收敛时第一文本分类模型的更新的参数,作为训练后的第一文本分类模型的参数。
例如,基于预测类别的置信度以及第三文本样本的类别标签,确定第一文本分类模型的损失函数的值后,可以判断第一文本分类模型的损失函数的值是否超出预设阈值,当第一文本分类模型的损失函数的值超出预设阈值时,基于第一文本分类模型的损失函数确定第一文本分类模型的误差信号,将误差信息在第一文本分类模型中反向传播,并在传播的过程中更新各个层的模型参数。
这里,对反向传播进行说明,将训练样本数据输入到神经网络模型的输入层,经过隐藏层,最后达到输出层并输出结果,这是神经网络模型的前向传播过程,由于神经网络模型的输出结果与实际结果有误差,则计算输出结果与实际值之间的误差,并将该误差从输出层向隐藏层反向传播,直至传播到输入层,在反向传播的过程中,根据误差调整模型参数的值;不断迭代上述过程,直至收敛。其中,第一文本分类模型属于神经网络模型。
在一些实施例中,通过第一文本分类模型对第二语言的多个第三文本样本进行预测处理,得到多个第三文本样本分别对应的预测类别的置信度,包括:针对多个第三文本样本中的任一第三文本样本执行以下处理:通过第一文本分类模型执行以下处理:对第三文本样本进行编码处理,得到第三文本样本的编码向量;对第三文本样本的编码向量进行融合处理,得到融合向量;对融合向量进行非线性映射处理,得到第三文本样本对应的预测类别的置信度。
例如,第一文本分类模型为快速文本分类模型(fasttext),本申请实施例中的第一文本分类模型并不局限于fasttext,fasttext包括输入层、隐藏层以及输出层,通过少量的第三文本样本可以快速训练fasttext,以使fasttext能够快速进行第二语言的文本分类任务。例如,通过输入层对第三文本样本进行编码,得到第三文本样本的编码向量;再通过隐藏层对第三文本样本的编码向量进行融合,得到融合向量;最后,通过输出层对融合向量进行非线性映射(即通过激活函数(例如softmax)进行映射处理),得到第三文本样本对应的预测类别的置信度。
在一些实施例中,第一文本分类模型包括多个级联的激活层;对融合向量进行非线性映射处理,得到第三文本样本对应的预测类别的置信度,包括:通过多个级联的激活层的第一个激活层,对融合向量进行第一个激活层的映射处理;将第一个激活层的映射结果输出到后续级联的激活层,通过后续级联的激活层继续进行映射处理和映射结果输出,直至输出到最后一个激活层;将最后一个激活层输出的激活结果作为第三文本样本对应的预测类别的置信度。
如图7所示,通过层次softmax进行激活运算,可以避免通过一次性的激活运算,得到预测类别的置信度,而是通过多层的激活运算,从而降低计算复杂度。例如,层次softmax包括T层激活层,每一层激活层都进行一次层次softmax运算,通过第1个激活层对融合向量进行第1个激活层的映射,得到第1个映射结果,将第1个映射结果输出至第2个激活层,通过第2个激活层对第1个映射结果进行第2个激活层的映射,得到第2个映射结果,直至输出到第T个激活层,将第T个激活层输出的激活结果作为第三文本样本对应的预测类别的置信度。其中,T为激活层的总数。
在一些实施例中,对第三文本样本进行编码处理,得到第三文本样本的编码向量,包括:对第三文本样本进行窗口滑动处理,得到多个片段序列;其中,窗口的大小为N,N为自然数;基于词表库对多个片段序列进行映射处理,得到多个片段序列分别对应的序列向量;对多个片段序列分别对应的序列向量进行组合处理,得到第三文本样本的编码向量。
承接上述示例,片段序列包括N个字,则对第三文本样本进行窗口滑动处理,得到多个片段序列,包括:针对第三文本样本中的第i个字执行以下处理:获取第三文本样本中的第i个字至第i+N-1个字;对第i个字至第i+N-1个字进行组合,将组合结果作为片段序列;其中,0<i≤M-N+1,M为第三文本样本的字数,M为自然数,从而为罕见的字生成更好的编码向量,在词库表中,即使字没有出现在训练语料库中,仍然可以从字粒度的窗口构造对应字粒度的编码向量,还可以让第一文本分类模型学习到局部字顺序的部分信息,这样会让第一文本分类模型在训练的时候保持字序信息。
承接上述示例,片段序列包括N个词语,则对第三文本样本进行窗口滑动处理,得到多个片段序列,包括:针对第三文本样本中的第j个词语执行以下处理:获取第三文本样本中的第j个词语至第j+N-1个词语;对第j个词语至第j+N-1个词语进行组合,将组合结果作为片段序列;其中,0<j≤K-N+1,K为第三文本样本中的词语的数量,K为自然数。从而为罕见的词(语句)生成更好的编码向量,在词库表中,即使语句没有出现在训练语料库中,仍然可以从词粒度的窗口构造对应词粒度的编码向量,还可以让第一文本分类模型学习到局部词顺序的部分信息,这样会让第一文本分类模型在训练的时候保持词序信息。
在步骤103中,通过训练后的第一文本分类模型对多个第二文本样本进行基于置信度的筛选处理。
例如,在服务器通过第三文本样本,得到训练后的第一文本分类模型后,可以通过训练后的第一文本分类模型对多个第二文本样本进行基于置信度的筛选处理,以筛选出优质的第二文本样本,以通过优质的第二文本样本训练第二文本分类模型。
在一些实施例中,通过训练后的第一文本分类模型对多个第二文本样本进行基于置信度的筛选处理,包括:针对多个第二文本样本中的任一第二文本样本执行以下处理:通过训练后的第一文本分类模型对第二文本样本进行预测处理,得到第二文本样本对应的多个预测类别的置信度;将第二文本样本对应的第一文本样本的类别标签确定为第二文本样本的类别标签;基于第二文本样本对应的多个预测类别的置信度以及第二文本样本的类别标签,将超出置信度阈值的第二文本样本作为筛选处理得到的第二文本样本。
例如,通过训练后的第一文本分类模型对第二文本样本进行编码处理,得到第二文本样本的编码向量,对第二文本样本的编码向量进行融合处理,得到融合向量,对融合向量进行非线性映射处理,得到第二文本样本对应的多个预测类别的置信度,从第二文本样本对应的多个预测类别中,确定与第二文本样本的类别标签匹配的预测类别,当匹配的预测类别的置信度超出置信度阈值时,将第二文本样本作为筛选处理得到的第二文本样本。
在步骤104中,基于筛选处理得到的第二文本样本,训练用于第二语言的第二文本分类模型。
例如,在服务器通过训练后的第一文本分类模型筛选出大量优质的第二文本样本后,则实现自动构建跨语言的文本样本(即第二语言的第二文本样本,其带有对应的第一文本样本的类别标注,即无需进行人工标注),通过大量优质的第二文本样本对第二文本分类模型进行训练,以使训练后的第二文本分类模型进行准确的基于第二语言的文本分类,提高第二语言的文本分类的准确性。
由于通过本申请实施例的训练方法可以得到充足的第二文本样本用于训练第二文本分类模型,因此,本申请实施例可以仅通过筛选处理得到的第二文本样本,对第二文本分类模型进行训练即可。
其中,在服务器得到训练后的第二文本分类模型后,响应于针对第二语言的文本分类请求,对待分类文本进行文本分类,即通过训练后的第二文本分类模型对该待分类文本进行编码处理,得到待分类文本的编码向量,并对待分类文本的编码向量进行非线性映射,以得到待分类文本对应的类别,还可以通过待分类文本对应的类别进行后续新闻应用、阅读应用等。
参见图5,图5是本申请实施例提供的文本分类模型的训练方法的一个可选的流程示意图,图5示出图3中的步骤104可以通过图5示出的步骤1041至步骤1043实现:在步骤1041中,确定筛选处理得到的第二文本样本在多个类别的分布;在步骤1042中,当筛选处理得到的第二文本样本在多个类别的分布满足分布均衡条件、且在每个类别的数量超出对应的类别数量阈值时,从筛选处理得到的第二文本样本中的每个类别的文本样本中,随机抽取对应类别数量阈值的文本样本以构建训练集;在步骤1043中,基于训练集训练用于第二语言的第二文本分类模型。
例如,在服务器获得大量用于训练第二文本分类模型的第二文本样本后,分析筛选处理得到的第二文本样本在在多个类别的分布,以确定是否满足分布均衡条件,即不同类别的数量的抖动情况,例如使用均方差衡量不同类别的数量的抖动情况,抖动越大,则说明文本样本在多个类别的分布越不均衡。当筛选处理得到的第二文本样本在多个类别的分布满足分布均衡条件、且每个类别的数量超出类别数量阈值,则从筛选处理得到的第二文本样本中的每个类别的文本样本中,抽取对应类别数量阈值的文本样本以构建训练集,从而提高文本分类的精度。
在一些实施例中,基于筛选处理得到的第二文本样本,训练用于第二语言的第二文本分类模型,包括:当筛选处理得到的第二文本样本在多个类别的分布不满足分布均衡条件,针对分布少的类别的第二文本样本进行基于近义词的扩充处理,以使扩充处理得到的第二文本样本在多个类别的分布满足分布均衡条件;基于扩充处理得到的第二文本样本构建训练集;基于训练集训练用于所述第二语言的第二文本分类模型。
当筛选处理得到的第二文本样本在每个类别的数量低于对应的类别数量阈值时,针对对应类别的第二文本样本进行基于近义词的扩充处理,以使扩充处理得到的第二文本样本在每个类别的数量超出对应的类别数量阈值;基于扩充处理得到的第二文本样本构建训练集。
其中,具体的扩充过程如下所示:针对多个第三文本样本以及筛选处理得到的第二文本样本中的任一文本样本执行以下处理:将近义词词典(包括各种近义词之间的对应关系)与文本样本中的词语进行匹配处理,得到与文本样本中的词语对应的匹配词;基于匹配词对文本样本中的词语进行替换处理,得到新的文本样本;将文本样本对应的类别标签作为新的文本样本的类别标签。通过近义词替换的方式,可以大大扩充第二语言的文本样本,以实现对第二文本分类模型的训练。
在一些实施例中,基于筛选处理得到的第二文本样本,训练用于第二语言的第二文本分类模型,包括:基于多个第三文本样本以及筛选处理得到的第二文本样本构建训练集,基于训练集训练用于第二语言的第二文本分类模型。
例如,基于多个第三文本样本以及筛选处理得到的第二文本样本构建训练集,包括:遍历筛选处理得到的第二文本样本的每个类别,执行以下处理:当类别中的第二文本样本的数量低于类别的类别数量阈值时,将从多个第三文本样本中随机抽取类别的第三文本样本补充到类别的第二文本样本中,以更新筛选处理得到的第二文本样本;基于更新 后的筛选处理得到的第二文本样本,构建训练集。
承接上述示例,当在某些类别的文本样本比较少时,或者在某些类别的分布不均衡时,可以通过第三文本样本来补充。例如当类别中的第二文本样本的数量低于类别的类别数量阈值时,则说明该类别的文本样本比较少,可以将从多个第三文本样本中随机抽取该类别的第三文本样本补充到该类别的第二文本样本中,以更新筛选处理得到的第二文本样本,使得第二文本样本中该类别的文本样本更加充足。
在一些实施例中,为了避免通过大量的样本训练第二分类模型造成过拟合的问题,可以通过第二文本分类模型的算力,匹配对应的文本样本数量进行适当的训练。基于筛选处理得到的第二文本样本,训练用于第二语言的第二文本分类模型之前,根据文本分类模型的算力(计算能力)与在单位时间内所能够运算的文本样本的数量的对应关系,确定与训练第二文本分类模型所能够使用的算力匹配的目标样本数量;从基于筛选处理得到的第二文本样本构建的训练集中,筛选出对应目标样本数量的文本样本,以作为训练用于第二语言的第二文本分类模型的样本。
在一些实施例中,基于筛选处理得到的第二文本样本,训练用于第二语言的第二文本分类模型,包括:通过第二文本分类模型对筛选处理得到的第二文本样本进行预测处理,得到筛选处理得到的第二文本样本对应的预测类别;基于筛选处理得到的第二文本样本对应的预测类别以及对应的类别标签,构建第二文本分类模型的损失函数;更新第二文本分类模型的参数直至损失函数收敛,将损失函数收敛时第二文本分类模型的更新的参数,作为训练后的第二文本分类模型的参数。
例如,基于筛选处理得到的第二文本样本对应的预测类别以及对应的类别标签,确定第二文本分类模型的损失函数的值后,可以判断第二文本分类模型的损失函数的值是否超出预设阈值,当第二文本分类模型的损失函数的值超出预设阈值时,基于第二文本分类模型的损失函数确定第二文本分类模型的误差信号,将误差信息在第二文本分类模型中反向传播,并在传播的过程中更新各个层的模型参数。
在一些实施例中,第二文本分类模型包括多个级联的编码器;通过第二文本分类模型对筛选处理得到的第二文本样本进行预测处理,得到筛选处理得到的第二文本样本对应的预测类别,包括:针对筛选处理得到的第二文本样本中的任一文本样本执行以下处理:通过多个级联的编码器的第一个编码器,对文本样本进行第一个编码器的编码处理;将第一个编码器的编码结果输出到后续级联的编码器,通过后续级联的编码器继续进行编码处理和编码结果输出,直至输出到最后一个编码器;将最后一个编码器输出的编码结果作为对应文本样本的编码向量;对文本样本的编码向量进行非线性映射,得到文本样本对应的预测类别。
如图8所示,通过级联的编码器进行编码运算,可以提取丰富的文本样本的特征信息。例如,通过第1个编码器对文本样本进行第1个编码器的编码处理,得到第1个编码结果,将第1个编码结果输出至第2个编码器,通过第2个编码器对第1个编码结果进行第2个编码器的编码,得到第2个编码结果,直至输出到第S个编码器,最后对文本样本的编码向量进行非线性映射,即可得到文本样本对应的预测类别。其中,S为编码器的总数。
承接上述示例,在后续级联的编码器中继续进行编码处理和编码结果输出,包括:通过多个级联的编码器的第y个编码器执行以下处理:对第y-1个编码器的编码结果进行自注意力处理,得到第y个自注意力向量;对第y个自注意力向量以及第y-1个编码器的编码结果进行残差连接处理,得到第y个残差向量;对第y个残差向量进行非线性映射处理,得到第y个映射向量;对第y个映射向量以及第y个残差向量进行残差连接处理,将残差连接的结果作为第y个编码器的编码结果,并将第y个编码器的编码结果 输出到第y+1个编码器;其中,y为依次递增的正整数、且取值范围满足2≤y≤H-1,H为大于2的整数、且用于表示多个级联的编码器的数量。
需要说明的是,获取训练后的第二文本分类模型后,通过训练后的第二文本分类模型进行第二语言的文本分类,文本分类方法如下所示:获取待分类文本;其中,待分类文本采用不同于第一语言的第二语言;通过网络深度大于第一文本分类模型的第二文本分类模型对待分类文本进行编码处理,得到待分类文本的编码向量;对待分类文本的编码向量进行非线性映射,得到待分类文本对应的类别;其中,第二文本分类模型是通过第一文本分类模型筛选得到的第二语言的文本样本训练得到的,第二语言的文本样本是通过对第一语言的文本样本进行机器翻译得到的。
承接上述示例,第二文本分类模型包括多个级联的编码器。针对待分类文本执行以下处理:通过多个级联的编码器的第一个编码器,对待分类文本进行第一个编码器的编码处理;将第一个编码器的编码结果输出到后续级联的编码器,通过后续级联的编码器继续进行编码处理和编码结果输出,直至输出到最后一个编码器;将最后一个编码器输出的编码结果作为对应待分类文本的编码向量;对待分类文本的编码向量进行非线性映射,得到待分类文本对应的类别。
例如,通过级联的编码器进行编码运算,可以提取丰富的待分类文本的特征信息。例如,通过第1个编码器对待分类文本进行第1个编码器的编码处理,得到第1个编码结果,将第1个编码结果输出至第2个编码器,通过第2个编码器对第1个编码结果进行第2个编码器的编码,得到第2个编码结果,直至输出到第S个编码器,最后对待分类文本的编码向量进行非线性映射,即可得到待分类文本对应的类别。其中,S为编码器的总数。
承接上述示例,在后续级联的编码器中继续进行编码处理和编码结果输出,包括:通过多个级联的编码器的第y个编码器执行以下处理:对第y-1个编码器的编码结果进行自注意力处理,得到第y个自注意力向量;对第y个自注意力向量以及第y-1个编码器的编码结果进行残差连接处理,得到第y个残差向量;对第y个残差向量进行非线性映射处理,得到第y个映射向量;对第y个映射向量以及第y个残差向量进行残差连接处理,将残差连接的结果作为第y个编码器的编码结果,并将第y个编码器的编码结果输出到第y+1个编码器;其中,y为依次递增的正整数、且取值范围满足2≤y≤H-1,H为大于2的整数、且用于表示多个级联的编码器的数量。
下面,将说明本申请实施例在一个实际的应用场景中的示例性应用。
文本分类被广泛应用在内容相关的产品中,例如新闻分类、文章分类、意图分类、信息流产品、论坛、社区、电商等等,从而基于文本分类的类别进行文本推荐、感情疏导等。一般情况下,文本分类都是针对某一种语言的文本,例如中文、英文等等,产品需要拓展其他语言业务,例如,将新闻阅读产品从中文市场推广到英文市场,在用户进行新闻阅读时,能够基于英文新闻的标签进行新闻推荐,从而向用户推荐符合用户兴趣的英文新闻;对中文用户的评论进行正负情感分析时,将产品推向海外市场时,在用户进行评论时,能够基于英文评论的标签对用户进行适当的感情疏导,避免用户不断产生负面情绪。
下面结合上述应用场景具体说明本申请实施例提出一种文本分类模型的训练方法、文本分类方法,通过已有的机器翻译模型,来通过A语言的样本增加B语言的样本量。但由于通过算法翻译得到的文本是有一定偏差和错误的,所以采用主动学习的方法,从翻译的文本中挑出高质量的样本,以进行后续的训练。
下面具体说明本申请实施例提出一种文本分类模型的训练方法、文本分类方法方法,该方法包括两个部分,即A)数据准备、B)算法框架以及C)预测:
A)数据准备
本申请实施例针对的是没有大量样本(无标注)的情形,所以无法训练大型的预训练模型来提取文本内容。如图9所示,本申请实施例存在部分文本集A(Text A,A语言的文本集)(包括第一文本样本)和少量文本集B(Text B,B语言的文本集)(包括第三文本样本),其中,Text A和Text B为有类别标注的样本,相对于Text A,Text B只有少量标注,因此占比很小。
其中,将Text A里面的标注样本记做<X_A,Y>,Text B记做<X_B,Y>,其中,X_A表示Text A中的文本,X_B表示Text B中的文本,Text A和Text B的标签是共通的,都用Y表示,例如类别0(Y=0)表示娱乐类型的新闻,类别1(Y=1)表示体育类的新闻,这里的0和1是通用的,跟语言无关。
B)算法框架
其中,本申请实施例中的算法框架包括:1)样本增强、2)主动学习以及3)增强训练。下面具体说明样本增强、主动学习以及增强训练:
1)样本增强
首先,如图10所示,借助的机器翻译模型(用于将A语言翻译为B语言),将Text A里面的每个A语言的文本X_A,都转换成B语言的文本,以形成对应的文本集B1(Text B1,翻译所形成的B语言的文本集)。
通过这种样本增强的方法,得到两类标注文本,一类是原有的、少量人工标注的样本集Text B,其人工标注是非常准确的;一类是通过机器翻译模型进行转换之后得到的、大量标注(其标注与Text A中的标注对应)样本Text B1(包括第二文本样本),Text B1中可能包含噪音、错误等等,没有Text B的内容准确。
2)主动学习
为了能将Text B1里的优质样本过滤出来,采取主动学习的方法,整个过程如图11所示:
步骤1,先用人工标注的Text B,训练出一个弱分类器(第一文本分类模型)(例如fasttext这种浅层分类器),然后将弱分类器作用在Text B1上进行预测,从Text B1中筛选出置信度较高的样本,例如假设置信度阈值是0.8,如果Text B1里的某个样本X_B1预测出来的标签Y=2的置信度是0.87(比0.8大),则认为样本X_B1的类别是2,从而得到带标注的训练样本<X_B1,Y=2>。
步骤2,将这些置信度高的、带标签的样本,构成新的训练样本集(文本集B1',Text B1'),基于Text B1'和Text B,继续训练弱分类器,训练完成后,重复步骤1,将弱分类器作用在Text B1筛选所剩下的样本(所剩下的样本是指从Text B1中挑选出置信度高的样本后所剩下的文本)上。
步骤3,直到预测Text B1中的样本所得到的置信度,无法再高于指定的置信度阈值,即认为Text B1筛选所剩下的样本都是质量较差的样本,此时停止迭代训练。
3)增强训练
如图12所示,将上面步骤得到的Text B'和Text B混合在一起,再训练一个强分类器(第二文本分类模型)(例如深层神经网络(BERT,Bidirectional Encoder Representations from Transformers))。
C)预测
将训练得到的强分类器作为最终的文本分类模型,用于B语言的文本分类。例如,将新闻阅读产品从中文市场推广到英文(B语言)市场时,通过训练得到的强分类器快速地对英文新闻打上相应的标签,在用户进行新闻阅读时,能够基于英文新闻的标签进行新闻推荐,从而向用户推荐符合用户兴趣的英文新闻;对中文用户的评论进行正负情 感分析时,将产品推向海外市场(B语言)时,会出现很多不是中文的评论,即英文评论,通过训练得到的强分类器快速地对英文评论打上相应的情感标签,在用户进行评论时,能够基于英文评论的标签对用户进行适当的感情疏导,避免用户不断产生负面情绪。
综上,本申请实施例文本分类模型的训练方法、文本分类方法通过机器翻译模型,获取采用不同于A语言的B语言的第二文本样本,并通过弱分类器对第二文本样本进行筛选,从而实现自动获取跨语言的文本样本,降低由于缺乏文本样本所带来的压力;并且,通过筛选得到的优质文本样本训练强分类器,使得强分类器能够进行准确的文本分类,提高文本分类的准确性。
至此已经结合本申请实施例提供的服务器的示例性应用和实施,说明本申请实施例提供的文本分类模型的训练方法。本申请实施例还提供文本分类模型的训练装置,实际应用中,文本分类模型的训练装置中的各功能模块可以由电子设备(如终端设备、服务器或服务器集群)的硬件资源,如处理器等计算资源、通信资源(如用于支持实现光缆、蜂窝等各种方式通信)、存储器协同实现。图2示出了存储在存储器550中的文本分类模型的训练装置555,其可以是程序和插件等形式的软件,例如,软件C/C++、Java等编程语言设计的软件模块、C/C++、Java等编程语言设计的应用软件或大型软件系统中的专用软件模块、应用程序接口、插件、云服务等实现方式,下面对不同的实现方式举例说明。
示例一、文本分类模型的训练装置是移动端应用程序及模块
本申请实施例中的文本分类模型的训练装置555可提供为使用软件C/C++、Java等编程语言设计的软件模块,嵌入到基于Android或iOS等系统的各种移动端应用中(以可执行指令存储在移动端的存储介质中,由移动端的处理器执行),从而直接使用移动端自身的计算资源完成相关的信息推荐任务,并且定期或不定期地通过各种网络通信方式将处理结果传送给远程的服务器,或者在移动端本地保存。
示例二、文本分类模型的训练装置是服务器应用程序及平台
本申请实施例中的文本分类模型的训练装置555可提供为使用C/C++、Java等编程语言设计的应用软件或大型软件系统中的专用软件模块,运行于服务器端(以可执行指令的方式在服务器端的存储介质中存储,并由服务器端的处理器运行),服务器使用自身的计算资源完成相关的信息推荐任务。
本申请实施例还可以提供为在多台服务器构成的分布式、并行计算平台上,搭载定制的、易于交互的网络(Web)界面或其他各用户界面(UI,User Interface),形成供个人、群体或单位使用的信息推荐平台(用于推荐列表)等。
示例三、文本分类模型的训练装置是服务器端应用程序接口(API,Application Program Interface)及插件
本申请实施例中的文本分类模型的训练装置555可提供为服务器端的API或插件,以供用户调用,以执行本申请实施例的文本分类模型的训练方法,并嵌入到各类应用程序中。
示例四、文本分类模型的训练装置是移动设备客户端API及插件
本申请实施例中的文本分类模型的训练装置555可提供为移动设备端的API或插件,以供用户调用,以执行本申请实施例的文本分类模型的训练方法。
示例五、文本分类模型的训练装置是云端开放服务
本申请实施例中的文本分类模型的训练装置555可提供为向用户开发的信息推荐云服务,供个人、群体或单位获取推荐列表。
其中,文本分类模型的训练装置555包括一系列的模块,包括翻译模块5551、第一训练模块5552、筛选模块5553、第二训练模块5554。下面继续说明本申请实施例提供 的文本分类模型的训练装置555中各个模块配合实现文本分类模型的训练方案。
翻译模块5551,配置为通过机器翻译模型对第一语言的多个第一文本样本进行机器翻译处理,得到与所述多个第一文本样本一一对应的多个第二文本样本;其中,所述多个第二文本样本采用不同于所述第一语言的第二语言;第一训练模块5552,配置为基于所述第二语言的多个第三文本样本以及分别对应的类别标签,训练用于所述第二语言的第一文本分类模型;筛选模块5553,配置为通过训练后的所述第一文本分类模型对所述多个第二文本样本进行基于置信度的筛选处理;第二训练模块5554,配置为基于所述筛选处理得到的第二文本样本,训练用于所述第二语言的第二文本分类模型;其中,所述第二文本分类模型的网络深度大于所述第一文本分类模型的网络深度。
在一些实施例中,所述第一训练模块5552还配置为基于所述第二语言的多个第三文本样本以及分别对应的类别标签,对所述第一文本分类模型进行第t次训练;通过第t次训练的所述第一文本分类模型对所述多个第二文本样本进行基于置信度的第t次筛选处理;基于前t次筛选结果、所述多个第三文本样本以及分别对应的类别标签,对所述第一文本分类模型进行第t+1次训练;将第T次训练的所述第一文本分类模型作为所述训练后的所述第一文本分类模型;其中,t为依次递增的正整数、且取值范围满足1≤t≤T-1,T为大于2的整数、且用于表示迭代训练的总次数。
在一些实施例中,所述第二训练模块5554还配置为确定所述筛选处理得到的第二文本样本在多个类别的分布;当所述筛选处理得到的第二文本样本在多个类别的分布满足分布均衡条件、且在每个类别的数量超出对应的类别数量阈值时,从所述筛选处理得到的第二文本样本中的每个类别的文本样本中,基于随机抽取对应所述类别数量阈值的文本样本构建训练集;基于所述训练集训练用于所述第二语言的第二文本分类模型。
在一些实施例中,所述第二训练模块5554还配置为当所述筛选处理得到的第二文本样本在多个类别的分布不满足分布均衡条件,针对分布少的类别的第二文本样本进行基于近义词的扩充处理;其中,所述扩充处理得到的第二文本样本在多个类别的分布满足所述分布均衡条件;基于所述扩充处理得到的第二文本样本构建训练集;基于所述训练集训练用于所述第二语言的第二文本分类模型。
在一些实施例中,所述第二训练模块5554还配置为基于所述多个第三文本样本以及所述筛选处理得到的第二文本样本构建训练集,基于所述训练集训练用于所述第二语言的第二文本分类模型。
在一些实施例中,所述第二训练模块5554还配置为遍历所述筛选处理得到的第二文本样本的每个类别,执行以下处理:当所述类别中的第二文本样本的数量低于所述类别的类别数量阈值时,将从所述多个第三文本样本中随机抽取所述类别的第三文本样本补充到所述类别的第二文本样本中,以更新所述筛选处理得到的第二文本样本;基于更新后的所述筛选处理得到的第二文本样本,构建训练集。
在一些实施例中,所述第二训练模块5554还配置为根据文本分类模型的算力与在单位时间内所能够运算的文本样本的数量的对应关系,确定与训练所述第二文本分类模型所能够使用的算力匹配的目标样本数量;从基于所述筛选处理得到的第二文本样本构建的训练集中,筛选出对应所述目标样本数量的文本样本,以作为训练用于所述第二语言的第二文本分类模型的样本。
在一些实施例中,所述第一训练模块5552还配置为通过所述第一文本分类模型对所述第二语言的多个第三文本样本进行预测处理,得到所述多个第三文本样本分别对应的预测类别的置信度;基于所述预测类别的置信度以及所述第三文本样本的类别标签,构建所述第一文本分类模型的损失函数;更新所述第一文本分类模型的参数直至所述损失函数收敛,将所述损失函数收敛时所述第一文本分类模型的更新的参数,作为所述训 练后的所述第一文本分类模型的参数。
在一些实施例中,所述第一训练模块5552还配置为针对所述多个第三文本样本中的任一第三文本样本执行以下处理:通过所述第一文本分类模型执行以下处理:对所述第三文本样本进行编码处理,得到所述第三文本样本的编码向量;对所述第三文本样本的编码向量进行融合处理,得到融合向量;对所述融合向量进行非线性映射处理,得到所述第三文本样本对应的预测类别的置信度。
在一些实施例中,所述第一文本分类模型包括多个级联的激活层;第一训练模块5552还配置为通过所述多个级联的激活层的第一个激活层,对所述融合向量进行所述第一个激活层的映射处理;将所述第一个激活层的映射结果输出到后续级联的激活层,通过所述后续级联的激活层继续进行映射处理和映射结果输出,直至输出到最后一个激活层;将所述最后一个激活层输出的激活结果作为所述第三文本样本对应的预测类别的置信度。
在一些实施例中,所述筛选模块5553还配置为针对所述多个第二文本样本中的任一第二文本样本执行以下处理:通过所述训练后的所述第一文本分类模型对所述第二文本样本进行预测处理,得到所述第二文本样本对应的多个预测类别的置信度;将所述第二文本样本对应的第一文本样本的类别标签确定为所述第二文本样本的类别标签;基于所述第二文本样本对应的多个预测类别的置信度以及所述第二文本样本的类别标签,将超出置信度阈值的第二文本样本作为所述筛选处理得到的第二文本样本。
在一些实施例中,所述第二训练模块5554还配置为通过所述第二文本分类模型对所述筛选处理得到的第二文本样本进行预测处理,得到所述筛选处理得到的第二文本样本对应的预测类别;基于所述筛选处理得到的第二文本样本对应的预测类别以及对应的类别标签,构建所述第二文本分类模型的损失函数;更新所述第二文本分类模型的参数直至所述损失函数收敛,将所述损失函数收敛时所述第二文本分类模型的更新的参数,作为训练后的所述第二文本分类模型的参数。
在一些实施例中,所述第二文本分类模型包括多个级联的编码器;第二训练模块5554还配置为针对所述筛选处理得到的第二文本样本中的任一文本样本执行以下处理:通过所述多个级联的编码器的第一个编码器,对所述文本样本进行所述第一个编码器的编码处理;将所述第一个编码器的编码结果输出到后续级联的编码器,通过所述后续级联的编码器继续进行编码处理和编码结果输出,直至输出到最后一个编码器;将所述最后一个编码器输出的编码结果作为对应所述文本样本的编码向量;对所述文本样本的编码向量进行非线性映射,得到所述文本样本对应的预测类别。
在一些实施例中,所述第二训练模块5554还配置为通过所述多个级联的编码器的第y个编码器执行以下处理:对第y-1个编码器的编码结果进行自注意力处理,得到第y个自注意力向量;对所述第y个自注意力向量以及所述第y-1个编码器的编码结果进行残差连接处理,得到第y个残差向量;对所述第y个残差向量进行非线性映射处理,得到第y个映射向量;对所述第y个映射向量以及所述第y个残差向量进行残差连接处理,将残差连接的结果作为所述第y个编码器的编码结果,并将所述第y个编码器的编码结果输出到第y+1个编码器;其中,y为依次递增的正整数、且取值范围满足2≤y≤H-1,H为大于2的整数、且用于表示所述多个级联的编码器的数量。
其中,本申请实施例还提供一种文本分类装置,文本分类装置包括一系列的模块,包括获取模块以及处理模块。其中,获取模块,配置为获取待分类文本;其中,所述待分类文本采用不同于第一语言的第二语言;处理模块,配置为通过网络深度大于第一文本分类模型的第二文本分类模型对所述待分类文本进行编码处理,得到所述待分类文本的编码向量;对所述待分类文本的编码向量进行非线性映射,得到所述待分类文本对应 的类别;其中,所述第二文本分类模型是通过所述第一文本分类模型筛选得到的第二语言的文本样本训练得到的,所述第二语言的文本样本是通过对所述第一语言的文本样本进行机器翻译得到的。
本申请实施例提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。电子设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该电子设备执行本申请实施例上述的文本分类模型的训练方法,或者文本分类方法。
本申请实施例提供一种存储有可执行指令的计算机可读存储介质,其中存储有可执行指令,当可执行指令被处理器执行时,将引起处理器执行本申请实施例提供的基于人工智能的信息推荐方法,或者文本分类方法,例如,如图3-5示出的文本分类模型的训练方法。
在一些实施例中,计算机可读存储介质可以是FRAM、ROM、PROM、EPROM、EEPROM、闪存、磁表面存储器、光盘、或CD-ROM等存储器;也可以是包括上述存储器之一或任意组合的各种设备。
在一些实施例中,可执行指令可以采用程序、软件、软件模块、脚本或代码的形式,按任意形式的编程语言(包括编译或解释语言,或者声明性或过程性语言)来编写,并且其可按任意形式部署,包括被部署为独立的程序或者被部署为模块、组件、子例程或者适合在计算环境中使用的其它单元。
作为示例,可执行指令可以但不一定对应于文件系统中的文件,可以可被存储在保存其它程序或数据的文件的一部分,例如,存储在超文本标记语言(HTML,Hyper Text Markup Language)文档中的一个或多个脚本中,存储在专用于所讨论的程序的单个文件中,或者,存储在多个协同文件(例如,存储一个或多个模块、子程序或代码部分的文件)中。
作为示例,可执行指令可被部署为在一个计算设备上执行,或者在位于一个地点的多个计算设备上执行,又或者,在分布在多个地点且通过通信网络互连的多个计算设备上执行。
以上所述,仅为本申请的实施例而已,并非用于限定本申请的保护范围。凡在本申请的精神和范围之内所作的任何修改、等同替换和改进等,均包含在本申请的保护范围之内。

Claims (20)

  1. 一种文本分类模型的训练方法,包括:
    对第一语言的多个第一文本样本进行机器翻译处理,得到与所述多个第一文本样本一一对应的多个第二文本样本;
    其中,所述多个第二文本样本采用不同于所述第一语言的第二语言;
    基于所述第二语言的多个第三文本样本以及分别对应的类别标签,训练用于所述第二语言的第一文本分类模型;
    通过训练后的所述第一文本分类模型对所述多个第二文本样本进行基于置信度的筛选处理;
    基于所述筛选处理得到的第二文本样本,训练用于所述第二语言的第二文本分类模型;
    其中,所述第二文本分类模型的网络深度大于所述第一文本分类模型的网络深度。
  2. 根据权利要求1所述的方法,其中,所述基于所述第二语言的多个第三文本样本以及分别对应的类别标签,训练用于所述第二语言的第一文本分类模型,包括:
    基于所述第二语言的多个第三文本样本以及分别对应的类别标签,对所述第一文本分类模型进行第t次训练;
    通过第t次训练的所述第一文本分类模型对所述多个第二文本样本进行基于置信度的第t次筛选处理;
    基于前t次筛选结果、所述多个第三文本样本以及分别对应的类别标签,对所述第一文本分类模型进行第t+1次训练;
    将第T次训练的所述第一文本分类模型作为所述训练后的所述第一文本分类模型;
    其中,t为依次递增的正整数、且取值范围满足1≤t≤T-1,T为大于2的整数、且用于表示迭代训练的总次数。
  3. 根据权利要求1所述的方法,其中,所述基于所述筛选处理得到的第二文本样本,训练用于所述第二语言的第二文本分类模型,包括:
    确定所述筛选处理得到的第二文本样本在多个类别的分布;
    当所述筛选处理得到的第二文本样本在多个类别的分布满足分布均衡条件、且在每个类别的数量超出对应的类别数量阈值时,从所述筛选处理得到的第二文本样本中的每个类别的文本样本中,基于随机抽取对应所述类别数量阈值的文本样本构建训练集;
    基于所述训练集训练用于所述第二语言的第二文本分类模型。
  4. 根据权利要求1所述的方法,其中,所述基于所述筛选处理得到的第二文本样本,训练用于所述第二语言的第二文本分类模型,包括:
    当所述筛选处理得到的第二文本样本在多个类别的分布不满足分布均衡条件,针对分布少的类别的第二文本样本进行基于近义词的扩充处理;
    其中,所述扩充处理得到的第二文本样本在多个类别的分布满足所述分布均衡条件;
    基于所述扩充处理得到的第二文本样本构建训练集;
    基于所述训练集训练用于所述第二语言的第二文本分类模型。
  5. 根据权利要求1所述的方法,其中,所述基于所述筛选处理得到的第二文本样本,训练用于所述第二语言的第二文本分类模型,包括:
    基于所述多个第三文本样本以及所述筛选处理得到的第二文本样本构建训练集,基于所述训练集训练用于所述第二语言的第二文本分类模型。
  6. 根据权利要求5所述的方法,其中,所述基于所述多个第三文本样本以及所述筛选处理得到的第二文本样本构建训练集,包括:
    遍历所述筛选处理得到的第二文本样本的每个类别,执行以下处理:
    当所述类别中的第二文本样本的数量低于所述类别的类别数量阈值时,将从所述多个第三文本样本中随机抽取所述类别的第三文本样本补充到所述类别的第二文本样本中,以更新所述筛选处理得到的第二文本样本;
    基于更新后的所述筛选处理得到的第二文本样本,构建训练集。
  7. 根据权利要求1所述的方法,其中,所述基于所述筛选处理得到的第二文本样本,训练用于所述第二语言的第二文本分类模型之前,所述方法还包括:
    根据文本分类模型的算力与在单位时间内所能够运算的文本样本的数量的对应关系,确定与训练所述第二文本分类模型所能够使用的算力匹配的目标样本数量;
    从基于所述筛选处理得到的第二文本样本构建的训练集中,筛选出对应所述目标样本数量的文本样本,以作为训练用于所述第二语言的第二文本分类模型的样本。
  8. 根据权利要求1所述的方法,其中,所述基于所述第二语言的多个第三文本样本以及分别对应的类别标签,训练用于所述第二语言的第一文本分类模型,包括:
    通过所述第一文本分类模型对所述第二语言的多个第三文本样本进行预测处理,得到所述多个第三文本样本分别对应的预测类别的置信度;
    基于所述预测类别的置信度以及所述第三文本样本的类别标签,构建所述第一文本分类模型的损失函数;
    更新所述第一文本分类模型的参数直至所述损失函数收敛,将所述损失函数收敛时所述第一文本分类模型的更新的参数,作为所述训练后的所述第一文本分类模型的参数。
  9. 根据权利要求8所述的方法,其中,所述通过所述第一文本分类模型对所述第二语言的多个第三文本样本进行预测处理,得到所述多个第三文本样本分别对应的预测类别的置信度,包括:
    针对所述多个第三文本样本中的任一第三文本样本执行以下处理:
    通过所述第一文本分类模型执行以下处理:
    对所述第三文本样本进行编码处理,得到所述第三文本样本的编码向量;
    对所述第三文本样本的编码向量进行融合处理,得到融合向量;
    对所述融合向量进行非线性映射处理,得到所述第三文本样本对应的预测类别的置信度。
  10. 根据权利要求9所述的方法,其中,
    所述第一文本分类模型包括多个级联的激活层;
    所述对所述融合向量进行非线性映射处理,得到所述第三文本样本对应的预测类别的置信度,包括:
    通过所述多个级联的激活层的第一个激活层,对所述融合向量进行所述第一个激活层的映射处理;
    将所述第一个激活层的映射结果输出到后续级联的激活层,通过所述后续级联的激活层继续进行映射处理和映射结果输出,直至输出到最后一个激活层;
    将所述最后一个激活层输出的激活结果作为所述第三文本样本对应的预测类别的置信度。
  11. 根据权利要求1所述的方法,其中,所述通过所述训练后的所述第一文本分类模型对所述多个第二文本样本进行基于置信度的筛选处理,包括:
    针对所述多个第二文本样本中的任一第二文本样本执行以下处理:
    通过所述训练后的所述第一文本分类模型对所述第二文本样本进行预测处理,得到所述第二文本样本对应的多个预测类别的置信度;
    将所述第二文本样本对应的第一文本样本的类别标签确定为所述第二文本样本的类别标签;
    基于所述第二文本样本对应的多个预测类别的置信度以及所述第二文本样本的类别标签,将超出置信度阈值的第二文本样本作为所述筛选处理得到的第二文本样本。
  12. 根据权利要求1所述的方法,其中,所述基于所述筛选处理得到的第二文本样本,训练用于所述第二语言的第二文本分类模型,包括:
    通过所述第二文本分类模型对所述筛选处理得到的第二文本样本进行预测处理,得到所述筛选处理得到的第二文本样本对应的预测类别;
    基于所述筛选处理得到的第二文本样本对应的预测类别以及对应的类别标签,构建所述第二文本分类模型的损失函数;
    更新所述第二文本分类模型的参数直至所述损失函数收敛,将所述损失函数收敛时所述第二文本分类模型的更新的参数,作为训练后的所述第二文本分类模型的参数。
  13. 根据权利要求12所述的方法,其中,
    所述第二文本分类模型包括多个级联的编码器;
    所述通过所述第二文本分类模型对所述筛选处理得到的第二文本样本进行预测处理,得到所述筛选处理得到的第二文本样本对应的预测类别,包括:
    针对所述筛选处理得到的第二文本样本中的任一文本样本执行以下处理:
    通过所述多个级联的编码器的第一个编码器,对所述文本样本进行所述第一个编码器的编码处理;
    将所述第一个编码器的编码结果输出到后续级联的编码器,通过所述后续级联的编码器继续进行编码处理和编码结果输出,直至输出到最后一个编码器;
    将所述最后一个编码器输出的编码结果作为对应所述文本样本的编码向量;
    对所述文本样本的编码向量进行非线性映射,得到所述文本样本对应的预测类别。
  14. 根据权利要求13所述的方法,其中,所述通过所述后续级联的编码器继续进行编码处理和编码结果输出,包括:
    通过所述多个级联的编码器的第y个编码器执行以下处理:
    对第y-1个编码器的编码结果进行自注意力处理,得到第y个自注意力向量;
    对所述第y个自注意力向量以及所述第y-1个编码器的编码结果进行残差连接处理,得到第y个残差向量;
    对所述第y个残差向量进行非线性映射处理,得到第y个映射向量;
    对所述第y个映射向量以及所述第y个残差向量进行残差连接处理,将残差连接的结果作为所述第y个编码器的编码结果,并将所述第y个编码器的编码结果输出到第y+1个编码器;
    其中,y为依次递增的正整数、且取值范围满足2≤y≤H-1,H为大于2的整数、且用于表示所述多个级联的编码器的数量。
  15. 一种文本分类方法,所述方法包括:
    获取待分类文本;
    其中,所述待分类文本采用不同于第一语言的第二语言;
    通过网络深度大于第一文本分类模型的第二文本分类模型对所述待分类文本进行编码处理,得到所述待分类文本的编码向量;
    对所述待分类文本的编码向量进行非线性映射,得到所述待分类文本对应的类别;
    其中,所述第二文本分类模型是通过所述第一文本分类模型筛选得到的第二语言的文本样本训练得到的,所述第二语言的文本样本是通过对所述第一语言的文本样本进行机器翻译得到的。
  16. 一种文本分类模型的训练装置,所述装置包括:
    翻译模块,配置为对第一语言的多个第一文本样本进行机器翻译处理,得到与所述 多个第一文本样本一一对应的多个第二文本样本;其中,所述多个第二文本样本采用不同于所述第一语言的第二语言;
    第一训练模块,配置为基于所述第二语言的多个第三文本样本以及分别对应的类别标签,训练用于所述第二语言的第一文本分类模型;
    筛选模块,用于通过训练后的所述第一文本分类模型对所述多个第二文本样本进行基于置信度的筛选处理;
    第二训练模块,配置为基于所述筛选处理得到的第二文本样本,训练用于所述第二语言的第二文本分类模型;其中,所述第二文本分类模型的网络深度大于所述第一文本分类模型的网络深度。
  17. 一种文本分类装置,所述装置包括:
    获取模块,配置为获取待分类文本;其中,所述待分类文本采用不同于第一语言的第二语言;
    处理模块,配置为通过网络深度大于第一文本分类模型的第二文本分类模型对所述待分类文本进行编码处理,得到所述待分类文本的编码向量;对所述待分类文本的编码向量进行非线性映射,得到所述待分类文本对应的类别;其中,所述第二文本分类模型是通过所述第一文本分类模型筛选得到的第二语言的文本样本训练得到的,所述第二语言的文本样本是通过对所述第一语言的文本样本进行机器翻译得到的。
  18. 一种电子设备,所述电子设备包括:
    存储器,用于存储可执行指令;
    处理器,用于执行所述存储器中存储的可执行指令时,实现权利要求1至14任一项所述的文本分类模型的训练方法,或权利要求15所述的文本分类方法。
  19. 一种计算机可读存储介质,存储有可执行指令,用于被处理器执行时,实现权利要求1至14任一项所述的文本分类模型的训练方法,或权利要求15所述的文本分类方法。
  20. 一种计算机程序产品,包括计算机程序或指令,所述计算机程序或指令被处理器执行时,实现权利要求1至14任一项所述的文本分类模型的训练方法,或权利要求15所述的文本分类方法。
PCT/CN2021/124335 2020-11-04 2021-10-18 文本分类模型的训练方法、文本分类方法、装置、设备、存储介质及计算机程序产品 WO2022095682A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2023514478A JP2023539532A (ja) 2020-11-04 2021-10-18 テキスト分類モデルのトレーニング方法、テキスト分類方法、装置、機器、記憶媒体及びコンピュータプログラム
US17/959,402 US20230025317A1 (en) 2020-11-04 2022-10-04 Text classification model training method, text classification method, apparatus, device, storage medium and computer program product

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011217057.9A CN112214604A (zh) 2020-11-04 2020-11-04 文本分类模型的训练方法、文本分类方法、装置及设备
CN202011217057.9 2020-11-04

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/959,402 Continuation US20230025317A1 (en) 2020-11-04 2022-10-04 Text classification model training method, text classification method, apparatus, device, storage medium and computer program product

Publications (1)

Publication Number Publication Date
WO2022095682A1 true WO2022095682A1 (zh) 2022-05-12

Family

ID=74058181

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/124335 WO2022095682A1 (zh) 2020-11-04 2021-10-18 文本分类模型的训练方法、文本分类方法、装置、设备、存储介质及计算机程序产品

Country Status (4)

Country Link
US (1) US20230025317A1 (zh)
JP (1) JP2023539532A (zh)
CN (1) CN112214604A (zh)
WO (1) WO2022095682A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115033701A (zh) * 2022-08-12 2022-09-09 北京百度网讯科技有限公司 文本向量生成模型训练方法、文本分类方法及相关装置
CN115186670A (zh) * 2022-09-08 2022-10-14 北京航空航天大学 一种基于主动学习的领域命名实体识别方法及系统
CN115329723A (zh) * 2022-10-17 2022-11-11 广州数说故事信息科技有限公司 基于小样本学习的用户圈层挖掘方法、装置、介质及设备
CN115346084A (zh) * 2022-08-15 2022-11-15 腾讯科技(深圳)有限公司 样本处理方法、装置、电子设备、存储介质及程序产品
CN117455421A (zh) * 2023-12-25 2024-01-26 杭州青塔科技有限公司 科研项目的学科分类方法、装置、计算机设备及存储介质

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112214604A (zh) * 2020-11-04 2021-01-12 腾讯科技(深圳)有限公司 文本分类模型的训练方法、文本分类方法、装置及设备
US11934795B2 (en) * 2021-01-29 2024-03-19 Oracle International Corporation Augmented training set or test set for improved classification model robustness
CN113010674B (zh) * 2021-03-11 2023-12-22 平安创科科技(北京)有限公司 文本分类模型封装方法、文本分类方法及相关设备
CN112765359B (zh) * 2021-04-07 2021-06-18 成都数联铭品科技有限公司 一种基于少样本的文本分类方法
CN114462387B (zh) * 2022-02-10 2022-09-02 北京易聊科技有限公司 无标注语料下的句型自动判别方法
CN114328936B (zh) * 2022-03-01 2022-08-30 支付宝(杭州)信息技术有限公司 建立分类模型的方法和装置
CN114911821B (zh) * 2022-04-20 2024-05-24 平安国际智慧城市科技股份有限公司 一种结构化查询语句的生成方法、装置、设备及存储介质
CN116737935B (zh) * 2023-06-20 2024-05-03 青海师范大学 基于提示学习的藏文文本分类方法、装置及存储介质
CN116720005B (zh) * 2023-08-10 2023-10-20 四川大学 一种基于自适应噪声的数据协同对比推荐模型的系统
CN117851601A (zh) * 2024-02-26 2024-04-09 海纳云物联科技有限公司 事件分类模型的训练方法、使用方法、装置及介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103488623A (zh) * 2013-09-04 2014-01-01 中国科学院计算技术研究所 多种语言文本数据分类处理方法
US20190026356A1 (en) * 2015-09-22 2019-01-24 Ebay Inc. Miscategorized outlier detection using unsupervised slm-gbm approach and structured data
CN111813942A (zh) * 2020-07-23 2020-10-23 苏州思必驰信息科技有限公司 实体分类方法和装置
CN111831821A (zh) * 2020-06-03 2020-10-27 北京百度网讯科技有限公司 文本分类模型的训练样本生成方法、装置和电子设备
CN112214604A (zh) * 2020-11-04 2021-01-12 腾讯科技(深圳)有限公司 文本分类模型的训练方法、文本分类方法、装置及设备

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103488623A (zh) * 2013-09-04 2014-01-01 中国科学院计算技术研究所 多种语言文本数据分类处理方法
US20190026356A1 (en) * 2015-09-22 2019-01-24 Ebay Inc. Miscategorized outlier detection using unsupervised slm-gbm approach and structured data
CN111831821A (zh) * 2020-06-03 2020-10-27 北京百度网讯科技有限公司 文本分类模型的训练样本生成方法、装置和电子设备
CN111813942A (zh) * 2020-07-23 2020-10-23 苏州思必驰信息科技有限公司 实体分类方法和装置
CN112214604A (zh) * 2020-11-04 2021-01-12 腾讯科技(深圳)有限公司 文本分类模型的训练方法、文本分类方法、装置及设备

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115033701A (zh) * 2022-08-12 2022-09-09 北京百度网讯科技有限公司 文本向量生成模型训练方法、文本分类方法及相关装置
CN115346084A (zh) * 2022-08-15 2022-11-15 腾讯科技(深圳)有限公司 样本处理方法、装置、电子设备、存储介质及程序产品
CN115186670A (zh) * 2022-09-08 2022-10-14 北京航空航天大学 一种基于主动学习的领域命名实体识别方法及系统
CN115329723A (zh) * 2022-10-17 2022-11-11 广州数说故事信息科技有限公司 基于小样本学习的用户圈层挖掘方法、装置、介质及设备
CN117455421A (zh) * 2023-12-25 2024-01-26 杭州青塔科技有限公司 科研项目的学科分类方法、装置、计算机设备及存储介质
CN117455421B (zh) * 2023-12-25 2024-04-16 杭州青塔科技有限公司 科研项目的学科分类方法、装置、计算机设备及存储介质

Also Published As

Publication number Publication date
JP2023539532A (ja) 2023-09-14
CN112214604A (zh) 2021-01-12
US20230025317A1 (en) 2023-01-26

Similar Documents

Publication Publication Date Title
WO2022095682A1 (zh) 文本分类模型的训练方法、文本分类方法、装置、设备、存储介质及计算机程序产品
CN106997370B (zh) 基于作者的文本分类和转换
CN108984683B (zh) 结构化数据的提取方法、系统、设备及存储介质
US11775761B2 (en) Method and apparatus for mining entity focus in text
US11886480B2 (en) Detecting affective characteristics of text with gated convolutional encoder-decoder framework
WO2020238783A1 (zh) 一种信息处理方法、装置及存储介质
CN110140133A (zh) 机器学习任务的隐式桥接
JP7301922B2 (ja) 意味検索方法、装置、電子機器、記憶媒体およびコンピュータプログラム
CN112528637B (zh) 文本处理模型训练方法、装置、计算机设备和存储介质
CN110083702B (zh) 一种基于多任务学习的方面级别文本情感转换方法
CN111860653A (zh) 一种视觉问答方法、装置及电子设备和存储介质
CN111930915B (zh) 会话信息处理方法、装置、计算机可读存储介质及设备
CN115269786B (zh) 可解释的虚假文本检测方法、装置、存储介质以及终端
CN112668347B (zh) 文本翻译方法、装置、设备及计算机可读存储介质
Sonawane et al. ChatBot for college website
CN113420869B (zh) 基于全方向注意力的翻译方法及其相关设备
CN116977885A (zh) 视频文本任务处理方法、装置、电子设备及可读存储介质
CN112199954B (zh) 基于语音语义的疾病实体匹配方法、装置及计算机设备
CN113919338B (zh) 处理文本数据的方法及设备
CN114330285A (zh) 语料处理方法、装置、电子设备及计算机可读存储介质
CN117521674B (zh) 对抗信息的生成方法、装置、计算机设备和存储介质
CN115204118B (zh) 文章生成方法、装置、计算机设备及存储介质
CN113591493B (zh) 翻译模型的训练方法及翻译模型的装置
Kang et al. Hierarchical attention networks for user profile inference in social media systems
CN116913278B (zh) 语音处理方法、装置、设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21888379

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2023514478

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 260923)

122 Ep: pct application non-entry in european phase

Ref document number: 21888379

Country of ref document: EP

Kind code of ref document: A1