CN117633220A - Language model training method and device, electronic equipment and readable medium - Google Patents

Language model training method and device, electronic equipment and readable medium

Info

Publication number
CN117633220A
Authority
CN
China
Prior art keywords
text
training
category
model
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311508682.2A
Other languages
Chinese (zh)
Inventor
杨志欣 (Yang Zhixin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202311508682.2A
Publication of CN117633220A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The application provides a training method and apparatus for a language model, an electronic device, and a readable medium. The method comprises the following steps: obtaining M candidate samples; for each candidate sample, taking the text identifier of the current candidate sample as a target identifier and obtaining text identifiers of N other candidate samples from the remaining M-1 candidate samples as interference identifiers, where N is greater than 0 and less than a preset number of labels, the preset number of labels being the maximum number of category labels in a category recognition result; performing text splicing on the target identifier, the interference identifiers, and the text description information of each candidate sample to generate a training text, thereby obtaining a training text set, where the target identifier is the target result corresponding to the training text, the training text set comprises a plurality of training texts, and each training text is generated from one candidate sample; and training the language model to be trained on the training text set to obtain a category recognition model. The method enables a text classification task without labeled samples and reduces labor cost.

Description

Language model training method and device, electronic equipment and readable medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a training method and apparatus for a language model, an electronic device, and a readable medium.
Background
Text category recognition is an important task in natural language processing. To improve recognition accuracy, a model usually needs to be trained on suitable samples, and those samples usually need to be labeled before they can be used.
In the related art, data generated in actual business are often used as training data, and these data are manually labeled with categories before being used to train a model.
However, the volume of sample data required for model training is usually large, so the labor cost is high. Moreover, label consistency must be ensured during annotation, and the annotators themselves require unified training, which further increases the cost of sample labeling.
Disclosure of Invention
In view of the above technical problems, the present application provides a training method and apparatus for a language model, an electronic device, and a readable medium, which enable a text classification task without labeled samples and reduce the labor cost of the training process.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned in part by the practice of the application.
According to an aspect of an embodiment of the present application, there is provided a training method of a language model, including:
obtaining M candidate samples, wherein M is an integer greater than 0;
for each candidate sample, acquiring a text identifier of a current candidate sample as a target identifier and acquiring text identifiers in N other candidate samples from M-1 candidate samples as interference identifiers, wherein N is an integer greater than 0 and less than the number of preset labels, and the number of the preset labels is the maximum number of class labels in a class identification result of text class identification;
performing text splicing according to the target identification, the interference identification and the text description information of each candidate sample to generate a training text, and obtaining a training text set, wherein the target identification is a target result corresponding to the training text, the training text set comprises a plurality of training texts, and each training text is correspondingly generated according to one candidate sample;
training the language model to be trained according to the training text set to obtain a category identification model.
According to an aspect of an embodiment of the present application, there is provided a training apparatus for a language model, including:
a sample acquisition module configured to acquire M candidate samples, the M being an integer greater than 0;
The identification acquisition module is configured to acquire a text identification of a current candidate sample as a target identification and acquire text identifications in N other candidate samples from M-1 candidate samples as interference identifications, wherein N is an integer greater than 0 and less than the number of preset labels, and the number of the preset labels is the maximum number of class labels in a class identification result of text class identification;
the text generation module is configured to perform text splicing according to the target identification, the interference identification and the text description information of each candidate sample to generate a training text, so as to obtain a training text set, wherein the target identification is a target result corresponding to the training text, the training text set comprises a plurality of training texts, and each training text is correspondingly generated according to one candidate sample;
and the model training module is configured to train the language model to be trained according to the training text set to obtain a category identification model.
In some embodiments of the present application, based on the above technical solutions, the identifier obtaining module is specifically configured to: obtain any one of the M candidate samples as a specified candidate sample; determine a random number as the number N of interference items, within a value range whose maximum is the number of labels output by the language model to be trained; and obtain N candidate samples from the M-1 candidate samples as the other candidate samples according to the number N of interference items.
In some embodiments of the present application, based on the above technical solutions, the identifier obtaining module is specifically configured to: splice the target identifier and the interference identifiers in random order to obtain an identifier splicing result; if the number of target and interference identifiers is smaller than the total number of categories of the category identification model, add a preset mark to the identifier splicing result according to the difference between the total number of categories and the number of identifiers; and splice the identifier splicing result with the text description information of the specified candidate sample to obtain a training text.
In some embodiments of the present application, based on the above technical solutions, the model training module is further configured to: performing text splicing according to the target identification, the interference identification and text description information of the other candidate samples, generating a noise text and adding the noise text into the training text set, wherein the interference identification is a target result corresponding to the noise text; training the language model to be trained through the training text and the noise text in the training text set to obtain the category identification model.
In some embodiments of the present application, based on the above technical solutions, the model training module is specifically configured to: acquiring two texts and corresponding target results from the training text set; according to preset weighting parameters, carrying out weighted summation on vectors corresponding to the two obtained texts and carrying out weighted summation on vectors corresponding to the two corresponding target results, wherein the obtained results are used as training input samples; training the language model to be trained according to the obtained training input sample to obtain the category identification model.
In some embodiments of the present application, based on the above technical solutions, the model training module is specifically configured to: dividing the text in the training text set into a plurality of text batches; for each text batch, respectively inputting the text batch into a first model to be trained and a second model to be trained to obtain a first training result and a second training result, wherein the first model to be trained and the second model to be trained are obtained through initializing a language model to be trained; acquiring texts with different recognition results in the first training result and the second training result as texts to be updated; selecting texts with preset proportions from the texts to be updated as a first group according to the loss values of the texts to be updated in the first training result, and selecting the texts with the preset proportions from the texts to be updated as a second group according to the loss values of the texts to be updated in the second training result; performing parameter updating on the second model to be trained according to the first group, and performing parameter updating on the first model to be trained according to the second group; and determining the first model to be trained and the second model to be trained as the category identification model.
In some embodiments of the present application, based on the above technical solutions, the model training module is specifically configured to: reconstructing training texts and noise texts in the training text set according to a plurality of preset model target tasks to obtain a plurality of task samples corresponding to each model target task; for each model target task, selecting a sample from the corresponding task samples as a training sample batch; according to the training sample batch, executing prediction of a corresponding model target task in a language model to be trained to obtain a training result of each model target task; and updating model parameters of the language model to be trained according to the training result of each model target task to obtain the category identification model.
According to an aspect of the embodiments of the present application, there is provided a text category recognition method, including:
acquiring a text to be identified and a plurality of text categories, wherein the text to be identified comprises a text identifier and text description information;
splicing according to the text categories and the text description information of the text to be identified to generate an input text;
performing text category recognition on the input text through the category recognition model, and determining a category recognition result of the text to be recognized, wherein the category recognition result is one of a plurality of text categories;
The training process of the category identification model is as follows:
obtaining M candidate samples, wherein M is an integer greater than 0;
for each candidate sample, acquiring a text identifier of a current candidate sample as a target identifier and acquiring text identifiers in N other candidate samples from M-1 candidate samples as interference identifiers, wherein N is an integer greater than 0 and less than the number of preset labels, and the number of the preset labels is the maximum number of class labels in a class identification result of text class identification;
performing text splicing according to the target identification, the interference identification and the text description information of each candidate sample to generate a training text, and obtaining a training text set, wherein the target identification is a target result corresponding to the training text, the training text set comprises a plurality of training texts, and each training text is correspondingly generated according to one candidate sample;
training the language model to be trained according to the training text set to obtain a category identification model.
According to an aspect of an embodiment of the present application, there is provided a text category recognition apparatus including:
the category acquisition module is configured to acquire a text to be identified and a plurality of text categories, wherein the text to be identified comprises a text identifier and text description information;
The text splicing module is configured to splice according to the text categories and the text description information of the text to be identified to generate an input text;
the category identification module is configured to identify the text category of the input text through the category identification model, and determine a category identification result of the text to be identified, wherein the category identification result is one of the text categories;
the training process of the category identification model is as follows:
obtaining M candidate samples, wherein M is an integer greater than 0;
for each candidate sample, acquiring a text identifier of a current candidate sample as a target identifier and acquiring text identifiers in N other candidate samples from M-1 candidate samples as interference identifiers, wherein N is an integer greater than 0 and less than the number of preset labels, and the number of the preset labels is the maximum number of class labels in a class identification result of text class identification;
performing text splicing according to the target identification, the interference identification and the text description information of each candidate sample to generate a training text, and obtaining a training text set, wherein the target identification is a target result corresponding to the training text, the training text set comprises a plurality of training texts, and each training text is correspondingly generated according to one candidate sample;
Training the language model to be trained according to the training text set to obtain a category identification model.
In some embodiments of the present application, based on the above technical solutions, the text splicing module is specifically configured to: perform text splicing on the plurality of text categories and the text description information of the text to be identified to obtain a category splicing result; and, if the number of text categories is smaller than the total number of categories of the category identification model, add a preset mark to the category splicing result according to the difference between the total number of categories and the number of text categories to obtain an input text.
In some embodiments of the present application, based on the above technical solutions, the text splicing module is specifically configured to: according to the text templates corresponding to each text category, performing text conversion on the plurality of text categories to obtain a plurality of category input texts; and performing text splicing on the plurality of category input texts and the text description information of the text to be identified to obtain a category splicing result.
In some embodiments of the present application, based on the above technical solutions, the category identification module is specifically configured to: carrying out text category recognition on the input text through a category recognition model to obtain an output result; if the data dimension of the output result is larger than the category number of the text categories, cutting the output result according to the category number to obtain a category index; and according to the ordering of the text categories in the input text, acquiring the text category corresponding to the category index as a category recognition result of the text to be recognized.
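To make the recognition-side flow concrete, the sketch below assembles an input text from candidate categories and decodes a padded output. This is a minimal illustrative sketch in Python; the "Options: ... Description: ..." prompt layout, the function names, and the stubbed score vector are assumptions, since the application does not fix a concrete input format or model architecture.

```python
def build_input_text(categories, description, n_total_categories, pad="[PAD]"):
    # Splice the candidate text categories with the text description
    # information, padding the option list up to the model's fixed
    # category width so it matches the training-time input format.
    options = list(categories) + [pad] * (n_total_categories - len(categories))
    return "Options: " + " | ".join(options) + " Description: " + description

def decode_category(output_scores, categories):
    # Cut the output back to the real number of categories, then map
    # the highest-scoring index to its text category.
    trimmed = output_scores[: len(categories)]
    best = max(range(len(trimmed)), key=lambda i: trimmed[i])
    return categories[best]

cats = ["catering", "information service", "value-added service"]
print(build_input_text(cats, "Publishes bus and metro schedule updates.", 5))
scores = [0.1, 0.7, 0.15, 0.03, 0.02]    # stubbed model output over 5 option slots
print(decode_category(scores, cats))     # -> "information service"
```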
According to an aspect of the embodiments of the present application, there is provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the training method of the language model as in the above technical solution via execution of the executable instructions.
According to an aspect of the embodiments of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a training method of a language model as in the above technical solution.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions to cause the computer device to perform the training method of the language model provided in the various alternative implementations described above.
In the embodiments of the application, M candidate samples are first obtained. Then, for each candidate sample, the text identifier of the current candidate sample is taken as a target identifier, and text identifiers of N other candidate samples are obtained from the remaining M-1 candidate samples as interference identifiers, where N is an integer greater than 0 and less than the preset number of labels, the preset number of labels being the maximum number of category labels in a category recognition result of text category recognition. Text splicing is performed on the target identifier, the interference identifiers, and the text description information of each candidate sample to generate a training text, yielding a training text set, where the target identifier is the target result corresponding to the training text, the training text set comprises a plurality of training texts, and each training text is generated from one candidate sample. Finally, the language model to be trained is trained on the training text set to obtain a category recognition model. Because training data are formed by splicing text identifiers with text description information, the category recognition model can learn the mapping between the two; the text identifier takes the place of a manually annotated label, so the language model can be optimized on unlabeled data without a manual annotation process. A text classification task without labeled samples is thus realized, and the labor cost of the training process is reduced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a system architecture of a training scheme applied to a language model according to an embodiment of the present application.
FIG. 2 illustrates a flow chart of a method of training a language model according to one embodiment of the present application.
FIG. 3 illustrates a flow chart of a method of training a language model according to one embodiment of the present application.
FIG. 4 illustrates a flow chart of a method of training a language model according to one embodiment of the present application.
Fig. 5 shows a flow chart of a text category recognition method according to one embodiment of the present application.
Fig. 6 is a schematic flow chart of an overall text category recognition flow in an embodiment of the present application.
Fig. 7 schematically shows a block diagram of the training apparatus of the language model in the embodiment of the present application.
Fig. 8 schematically shows a block diagram of the text category recognition device in the embodiment of the present application.
Fig. 9 shows a schematic diagram of a computer system suitable for use in implementing the electronic device of the embodiments of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present application. One skilled in the relevant art will recognize, however, that the aspects of the application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function, and works together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
It should be appreciated that the aspects of the present application may be applied to text category recognition, in particular to identifying the actual business category of a platform account. In various public information or transaction platforms, each account, especially a public-number account, usually has a specific business scope, for example catering, information services, or value-added services. However, an account typically does not directly declare the scope of the information it will provide at registration; rather, the scope of what the account publishes or the services it provides becomes clear gradually during subsequent use. For management and statistics, the platform usually needs to classify accounts according to the content they publish, which requires identifying the specific business category of an account from the information disclosed in it. These business categories are typically predefined by the platform side. The present application can be applied to such scenarios: text category recognition is performed on the texts associated with an account, so that the problem of identifying an account's business category is converted into a text category recognition problem, and a language model that recognizes the category from the account's text information is trained without additionally labeling data.
As described in the background above, related-art text category recognition relies on manually labeled business data, and the volume of samples required makes annotation costly.
Based on the above, the technical solution of the embodiments of the present application proposes a training scheme for a language model. Specifically, as shown in fig. 1, a system architecture 100 to which the training scheme of an embodiment of the present application is applied may include a terminal device 110, a network 120, a server 130, and a server 140. The terminal device 110 may be a smart phone, a tablet, a notebook computer, a smart voice interaction device, a smart home appliance, a vehicle terminal, an aircraft, or the like. The server 130 and the server 140 may be servers providing various services; each may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), big data, and artificial intelligence platforms. The network 120 may be any communication medium capable of providing communication links between the terminal device 110, the server 130, and the server 140, for example a wired or wireless communication link.
The system architecture in the embodiments of the present application may have any number of terminal devices, networks, and servers, as required by the implementation. For example, the server 130 and the server 140 may each be a server group composed of a plurality of server devices. In addition, the technical solution provided in the embodiments of the present application may be applied to the terminal device 110, to the server 130, or implemented by the terminal device 110 and the server 130 together; this application places no particular limitation on it. In the embodiments of the present application, the server 130 may be a server of an information distribution platform or a transaction platform, the server 140 a management server for accounts, and the terminal device 110 a user terminal running client software of the information distribution platform that communicates with the server 130. A user registers a platform account with the server 130 through the terminal device 110, provides account information, and distributes information or provides services through the account, while the server 130 provides platform services such as information distribution to the terminal device 110. The management server 140 may obtain account-related information from the platform server 130 when needed and train a category recognition model in the management server 140. The trained model may be deployed in the server 130 or the server 140 to identify the business category of existing and subsequently registered accounts.
Text category recognition models typically perform text classification tasks, a specific application of natural language processing, which belongs to the field of artificial intelligence. Artificial Intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, pre-trained model technology, operation/interaction systems, and mechatronics. A pre-trained model, also called a large model or foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning. Natural language processing is an important direction in the field of artificial intelligence: it studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing concerns natural language, the language people use in daily life, so it is closely related to linguistics as well as to computer science and mathematics. The pre-trained model, an important technique for model training in the artificial intelligence domain, developed from the large language model (Large Language Model) in the NLP field; through fine-tuning, a large language model can be widely applied to downstream tasks. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, and knowledge graph techniques.
According to the training method of a language model provided by the application, labeled samples are not needed in the training process, and raw data can be used directly for training. The implementation details of the technical solutions of the embodiments of the present application are described in detail below. Fig. 2 shows a flowchart of a training method of a language model according to an embodiment of the present application, applied to an information distribution platform; it may be executed by a device with computing capability, such as the server or a terminal device of the information distribution platform. In the following embodiments, the solution of the present application is described taking as an example identifying the business scope of a public number from its account information on a public-number platform. Referring to fig. 2, the training method of the language model includes at least steps S210 to S240, described in detail as follows:
In step S210, M candidate samples are obtained, where M is an integer greater than 0.
A candidate sample may be public-number disclosure information obtained from historical data, or disclosure information of a part of public numbers randomly selected from the server. A candidate sample is typically text content that includes a text identifier and text description information. The text identifier is information that represents the candidate sample and can reflect the public number's business category; it is usually the account name of the public number. In some embodiments, the text identifier may also be a core word or high-frequency word extracted from the information published by the public number. The text description information is typically supplementary information provided by the public number, such as the public number's introduction and its function fields. A function field is an entry through which a public number provides a service; for example, a public number offering an album-making service usually has function entries related to music, pictures, and clips.
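For concreteness, a candidate sample as described above can be modeled as a small record. This is a hypothetical sketch; the field names and example values are assumptions, as the application only requires that each sample carry a text identifier and text description information.

```python
from dataclasses import dataclass

@dataclass
class CandidateSample:
    # Hypothetical fields: the application only requires a text identifier
    # (e.g. the public number's account name) and text description
    # information (e.g. introduction and function fields).
    text_id: str
    description: str

samples = [
    CandidateSample("Midnight Canteen", "Shares late-night recipes and restaurant reviews."),
    CandidateSample("City Transit Guide", "Publishes bus and metro schedule updates."),
    CandidateSample("Album Maker", "Makes photo albums with music, pictures and clips."),
]
```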
Step S220, for each candidate sample, obtaining a text identifier of the current candidate sample as a target identifier and obtaining text identifiers of N other candidate samples from M-1 candidate samples as interference identifiers, wherein N is an integer greater than 0 and less than a preset number of labels, and the preset number of labels is the maximum number of class labels in a class identification result of text class identification.
For each candidate sample, the training device may take the text identifier of that candidate sample as the target identifier, and obtain text identifiers of N other candidate samples from the remaining M-1 candidate samples as interference identifiers. Specifically, the number of interference identifiers is smaller than the preset number of labels, which is the maximum number of category labels in a category recognition result of text category recognition. This number is typically smaller than the total number of categories that the category recognition model can recognize, where the total number of categories is the maximum number of categories across all training data sets. That is, if the maximum number of categories in the training data set is 100, all training data are divided into 100 categories, but the number of labels contained in the output of the category recognition model is less than 100, and a number is randomly selected between 2 and 100 to serve as the number of interference items. In some embodiments, the training device obtains multiple sets of interference identifiers for each candidate sample, and randomly re-selects the number of interference identifiers each time a set is obtained.
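A minimal sketch of this sampling step, assuming the candidate samples have been reduced to their text identifiers; `n_preset_labels` stands in for the preset number of labels, and the function name is an assumption:

```python
import random

def sample_identifiers(text_ids, current_index, n_preset_labels):
    # Target: the current sample's own text identifier.
    target_id = text_ids[current_index]
    # Interference: N identifiers drawn from the other M-1 samples,
    # with N random and 0 < N < n_preset_labels, as described above.
    others = text_ids[:current_index] + text_ids[current_index + 1:]
    n = random.randint(1, min(n_preset_labels - 1, len(others)))
    return target_id, random.sample(others, n)

ids = ["Midnight Canteen", "City Transit Guide", "Album Maker", "Daily Finance"]
print(sample_identifiers(ids, current_index=0, n_preset_labels=3))
```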
And step S230, performing text splicing according to the target identification, the interference identification and the text description information of each candidate sample to generate a training text, and obtaining a training text set, wherein the target identification is a target result corresponding to the training text, the training text set comprises a plurality of training texts, and each training text is correspondingly generated according to one candidate sample.
The training device performs text splicing on the target identifier of each candidate sample, the corresponding interference identifiers, and the text description information of the candidate sample to generate a training text; all generated training texts form the training text set. Each training text is generated from one candidate sample, and the target identifier of that candidate sample is the target result corresponding to the training text. It will be appreciated that each training text typically has only one target result, i.e., it is generated from one candidate sample, but each candidate sample may be combined with different interference identifiers to generate multiple training texts. The target identifier and the interference identifiers serve as options within the training text; the language model to be trained selects the corresponding option according to the text description information, thereby learning the mapping between text description information and text identifiers.
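The splicing itself can be sketched as follows. The "Options: ... Description: ..." layout is an illustrative assumption; the application does not prescribe a concrete prompt format:

```python
import random

def build_training_text(target_id, interference_ids, description):
    # Shuffle the target and interference identifiers into an option
    # list, then splice the options with the text description (step S230).
    options = [target_id] + list(interference_ids)
    random.shuffle(options)
    text = "Options: " + " | ".join(options) + " Description: " + description
    # The target result is the position of the correct option.
    return text, options.index(target_id)

text, idx = build_training_text(
    "Midnight Canteen",
    ["City Transit Guide", "Album Maker"],
    "Shares late-night recipes and restaurant reviews.",
)
print(text, "-> target option index:", idx)
```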
And step S240, training the language model to be trained according to the training text set to obtain a category recognition model.
The training device trains the language model to be trained on the generated training text set to obtain a category recognition model. During training, the training text set is divided into a training set and a validation set; model parameters of the language model to be trained are adjusted on the training set according to its loss function, the validation set is used to check whether the parameter-adjusted model has reached the convergence condition, and the category recognition model is obtained once the model converges. A pre-trained masked language model may be used as the backbone of the language model to be trained, with an additional classification output layer added on top of the backbone, to complete category recognition of the public-number text. During training, for an input training text the language model outputs the index position of the target identifier within that training text. The output may be a one-hot vector: each option in the training text has a corresponding position, the position corresponding to the recognized option is set to 1, and all other positions are 0. For example, if the output covers 5 options and the third option is the recognized result, the output is [0, 0, 1, 0, 0]. During training, each label in the recognition result corresponds to a text identifier; in actual recognition, however, the data input to the category recognition model consist of text categories rather than text identifiers. Although the category recognition model learns the mapping between text description information and text identifiers during training, the text identifier is the information that best reflects the text's category, and it naturally has a corresponding association with the text category. In actual application, the text categories are input to the category recognition model as options together with the text description information, and the model recognizes the corresponding text category through the learned mapping between description and identifier, thereby completing text category recognition. In the scenario of recognizing a public number's business category, the public number's title serves as the text identifier and reflects the business category, so the data input to the category recognition model can contain all categories the public number may belong to, and the model recognizes the public number's business category among them.
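The one-hot target described above is straightforward; a small sketch reproducing the corrected five-option example:

```python
def one_hot(index, length):
    # 1 at the position of the recognized option, 0 everywhere else.
    vec = [0] * length
    vec[index] = 1
    return vec

print(one_hot(2, 5))  # third of five options -> [0, 0, 1, 0, 0]
```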
The steps of this embodiment and their beneficial effects are as described above: training data formed by splicing text identifiers with text description information let the category recognition model learn the mapping between the two, the text identifier takes the place of a manually annotated label, the language model can be optimized on unlabeled data, a text classification task without labeled samples is realized, and the labor cost of the training process is reduced.
In the embodiment of the present application, other embodiments for refining the technical solution of the embodiment shown in fig. 2 are also provided, and specifically as shown in fig. 3, in the training method of the language model of one embodiment of the present application, the method may include the following steps:
In step S310, M candidate samples are obtained, where M is an integer greater than 0.
Optionally, the implementation details of step S310 are identical to those of step S210 shown in fig. 2, and will not be described again.
Step S320, obtaining any one candidate sample of the M candidate samples as a specified candidate sample;
step S330, determining a random number as the number N of interference items in a value range with the number of labels output by the language model to be trained as a maximum value;
and step S340, according to the number N of the interference items, N candidate samples are obtained from the M-1 candidate samples to serve as other candidate samples.
Specifically, the training device may select any one of the M candidate samples as the specified candidate sample. The selection may be random, or each candidate sample may be selected in turn. The training device then determines a random number as the number N of interference items, within a value range whose maximum is the number of labels output by the language model to be trained. The number of labels output by the language model to be trained is smaller than the number of all text categories in the training data set, i.e., N satisfies 1 ≤ N ≤ N_maxLabel - 1, where N_maxLabel is the maximum number of target identifiers the language model to be trained can predict. Setting this number ensures that the total number of prediction options does not become too long. For example, with 100 categories in total, the maximum number of predictions may be less than 100, for example 10, which keeps the number of predicted options short. After determining the number of interference items N, the training device may obtain N candidate samples from the M-1 candidate samples as the other candidate samples. In practical applications, the number of categories input to the model is typically determined by the actual data set being predicted; in the above example it may be any number between 2 and 100. Randomly selecting the number of negative options reduces the gap between the training process and actual use, which helps improve the recognition accuracy of the scheme.
And step S350, performing text splicing according to the target identification, the interference identification and the text description information of each candidate sample to generate a training text, and obtaining a training text set, wherein the target identification is a target result corresponding to the training text, the training text set comprises a plurality of training texts, and each training text is correspondingly generated according to one candidate sample.
Optionally, the implementation details of step S350 are identical to those of step S230 shown in fig. 2, and will not be described again.
And step S360, training the language model to be trained according to the training text set to obtain a category identification model.
Optionally, the implementation details of step S360 are identical to those of step S240 shown in fig. 2, and will not be described again.
The beneficial effects of this embodiment are as described for the embodiment of fig. 2: the text identifier takes the place of a manually annotated label, so the language model can be optimized on unlabeled data, a text classification task without labeled samples is realized, and the labor cost of the training process is reduced.
In some optional embodiments of the present application, based on the above embodiments, in the process of generating a training text from the target identifier, the interference identifiers, and the text description information of the specified candidate sample, the training device may splice the target identifier and the interference identifiers in random order to obtain an identifier splicing result; if the number of target and interference identifiers is smaller than the total number of categories of the category recognition model, a preset mark is added to the identifier splicing result according to the difference between the total number of categories and the number of identifiers; the identifier splicing result is then spliced with the text description information of the specified candidate sample to obtain the training text. Specifically, the training device performs option filling while generating training texts. As described in the above embodiments, the target identifier and the interference identifiers can be regarded as options in the training text. Since the number of interference identifiers is random, the number of target and interference identifiers may be smaller than the number of options the category recognition model requires in its input text; this number of options is typically the total number of categories across all text types in the training set. If the number of target and interference identifiers is smaller than the total number of categories of the category recognition model, the options in the training text must be filled up to the total number of categories. Filling is done by adding a preset mark, for example "[PAD]". For example, if the number of target and interference identifiers, N+1, is smaller than the total number of categories N_model of the category recognition model, then N_model - (N+1) [PAD] options need to be added. In this embodiment, option filling makes the input of the training process consistent with the input of the recognition process, so no separate training for different option counts is needed, which simplifies the computation and training required to match option counts and improves the training efficiency of the scheme.
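A minimal sketch of the option filling, using the "[PAD]" mark from the example above; the function name is an assumption:

```python
def pad_options(options, n_total_categories, pad_token="[PAD]"):
    # Fill the option list up to the model's total category count so
    # that training inputs and recognition inputs share one width.
    missing = n_total_categories - len(options)
    return options + [pad_token] * max(0, missing)

print(pad_options(["A", "B", "C"], 5))  # -> ['A', 'B', 'C', '[PAD]', '[PAD]']
```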
In the embodiment of the present application, other embodiments for refining the technical solution of the embodiment shown in fig. 2 are also provided, and specifically as shown in fig. 4, in the training method of the language model of one embodiment of the present application, the method may include the following steps:
In step S410, M candidate samples are acquired, where M is an integer greater than 0.
Optionally, the implementation details of step S410 are identical to those of step S210 shown in fig. 2, and will not be described again.
Step S420, for each candidate sample, obtaining a text identifier of the current candidate sample as a target identifier and obtaining text identifiers of N other candidate samples from M-1 candidate samples as interference identifiers, wherein N is an integer greater than 0 and less than a preset number of labels, and the preset number of labels is the maximum number of class labels in a class identification result of text class identification.
Optionally, the implementation details of step S420 are identical to those of step S220 shown in fig. 2, and will not be described again.
And step S430, performing text splicing according to the target identification, the interference identification and the text description information of each candidate sample to generate a training text, and obtaining a training text set, wherein the target identification is a target result corresponding to the training text, the training text set comprises a plurality of training texts, and each training text is correspondingly generated according to one candidate sample.
Optionally, the implementation details of step S430 are identical to those of step S230 shown in fig. 2, and will not be described again.
Step S440, performing text splicing according to the target identification, the interference identification and the text description information of the current candidate sample, generating a noise text and adding the noise text into the training text set, wherein the interference identification is a target result corresponding to the noise text;
and step S450, training the language model to be trained through the training text and the noise text in the training text set to obtain the category identification model.
In this embodiment, the training device may additionally generate noise texts whose target result is a wrong result. Specifically, the training device performs text splicing on the target identifier, the interference identifiers, and the text description information of the current candidate sample to obtain a noise text; the splicing process is the same as in the above embodiments. The difference is that the training device selects one of the interference identifiers as the target result of the generated noise text. That is, when the target identifier, the interference identifiers, and the text description information of the current candidate sample are the same, the contents of the training text and the noise text are the same, but their corresponding target results differ. Since the text description information is the content corresponding to the target identifier, the interference identifier is in fact a wrong target. It will be appreciated that when generating a noise text with the current candidate sample unchanged, other candidate samples may be reselected to obtain the interference identifiers, and the number of interference identifiers may be re-determined. The generated noise texts are added to the training text set, so that the set contains both training texts with correct results and noise texts with wrong results. The training device trains the language model to be trained on the training texts and noise texts in the training text set to obtain the category recognition model. Specifically, the language model to be trained may include a noise recognition module for recognizing noise texts; during training, the training device selects correct training texts through the noise recognition module, trains the type recognition module of the language model with them, and then adjusts the parameters of both modules according to the results to obtain the category recognition model. In other embodiments, the language model to be trained may include one sub-model dedicated to recognizing noise texts and another for type recognition, and the two sub-models are trained together into the category recognition model. Training the language model on generated noise texts improves the trained model's ability to recognize and resist erroneous data, and thereby its robustness.
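A noise text can be sketched by reusing the training-text splicing but labeling a wrong option, in line with step S440; the prompt layout and names remain assumptions:

```python
import random

def build_noise_text(target_id, interference_ids, description):
    # Same splicing as a training text ...
    options = [target_id] + list(interference_ids)
    random.shuffle(options)
    text = "Options: " + " | ".join(options) + " Description: " + description
    # ... but the labeled result is a deliberately wrong option:
    # one of the interference identifiers.
    wrong_id = random.choice(interference_ids)
    return text, options.index(wrong_id)

text, idx = build_noise_text(
    "Midnight Canteen",
    ["City Transit Guide", "Album Maker"],
    "Shares late-night recipes and restaurant reviews.",
)
print(text, "-> (wrong) target option index:", idx)
```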
The beneficial effects of this embodiment are likewise as described for the embodiment of fig. 2: the spliced training texts let the model learn the mapping between text identifiers and text description information, enabling a text classification task without labeled samples and reducing the labor cost of the training process.
In some optional embodiments of the present application, based on the above embodiments, in the process of training the language model to be trained through the training texts and noise texts in the training text set to obtain the category recognition model, the training device may acquire two texts and their corresponding target results from the training text set; perform, according to a preset weighting parameter, a weighted summation of the vectors corresponding to the two texts and a weighted summation of the vectors corresponding to the two target results, taking the obtained results as a training input sample; and train the language model to be trained according to the obtained training input sample to obtain the category recognition model. Specifically, in this embodiment the training device blends two texts from the training text set to perform parameter adjustment; either text may be a correct training text or an erroneous noise text. The training device performs a weighted summation of the vectors of the two texts and of their target results, and then trains on the blended result. For example, assuming the vector corresponding to the target result of the first text is [0, 0, 1], the vector corresponding to the target result of the second text is [1, 0, 0], and the weighting parameter is 0.2, the vector corresponding to the target result of the training input sample is 0.2 × [0, 0, 1] + 0.8 × [1, 0, 0] = [0.8, 0, 0.2]; the same weighting parameter is likewise applied to the two text vectors, yielding the complete training input sample. By combining the texts in the training text set pairwise in this way, training texts and noise texts are blended, so that the trained category recognition model has sufficient generalization capability to cope with noise texts, improving its resistance to interference from erroneous data; a sketch follows below.
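This weighted blending corresponds to mixup-style augmentation. Below is a minimal sketch, assuming texts are represented as embedding vectors and target results as one-hot vectors; the function name and representation choice are illustrative assumptions.

```python
import numpy as np

def mixup_pair(text_vec_a, label_vec_a, text_vec_b, label_vec_b, lam=0.2):
    # Weighted summation of the two text vectors and of the two target
    # vectors with the same weighting parameter lam.
    mixed_text = lam * text_vec_a + (1.0 - lam) * text_vec_b
    mixed_label = lam * label_vec_a + (1.0 - lam) * label_vec_b
    return mixed_text, mixed_label

# The example from the paragraph above, with lam = 0.2:
a = np.array([0.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 0.0])
_, mixed = mixup_pair(a, a, b, b, lam=0.2)
print(mixed)  # [0.8 0.  0.2]
```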
In some optional embodiments of the present application, based on the above embodiments, in the process of training the language model to be trained through the training texts and noise texts in the training text set to obtain the category recognition model, the training device divides the texts in the training text set into a plurality of text batches and, for each text batch, inputs the batch into a first model to be trained and a second model to be trained respectively, obtaining a first training result and a second training result, where both models are obtained by initializing the language model to be trained. The training device then takes the texts whose recognition results differ between the first and second training results as texts to be updated, selects a preset proportion of texts from the texts to be updated as a first group according to their loss values in the first training result, and selects a preset proportion of texts from the texts to be updated as a second group according to their loss values in the second training result. Finally, the training device updates the parameters of the second model to be trained according to the first group, updates the parameters of the first model to be trained according to the second group, and determines the first and second models to be trained as the category recognition model. Specifically, the training device divides the training sample set into a plurality of batches and trains the first and second models to be trained on each batch to obtain the first and second training results. The two models are obtained by initializing the same language model to be trained; that is, they have the same structure, but their initialized parameters usually differ. The training device compares the two training results to find the samples on which they differ, and for these samples obtains their loss values in each model separately. From the texts with differing results, each group is then selected according to loss value; samples with smaller loss values are generally chosen, and the proportion or number of selected samples is predetermined. Finally, the first group selected by the first model to be trained is input into the second model to be trained for parameter updating, and the second group selected by the second model to be trained is input into the first model to be trained for parameter updating. When the first and second models to be trained satisfy the condition for ending training, they serve as the category recognition model. In some embodiments, the sampling rate for selecting texts from the texts to be updated is gradually decreased as the batches progress: for example, a batch threshold may be preset in the training device, the sampling rate decreasing by a fixed ratio before the threshold is reached and a preset fixed sampling rate being used afterwards; alternatively, a fixed sampling rate may be used before the threshold is reached and the sampling rate gradually decreased afterwards.
In this embodiment, the two models mutually select small-loss samples from the training samples on which their results disagree to update each other's parameters, in the manner of co-teaching. This provides relatively reliable training samples as training data and reduces the influence of mislabeled interference data on the model training process; a sketch of one such training step follows.
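The following is a minimal PyTorch-style sketch of one such mutual-update step, under the assumption that both models are classifiers over pre-encoded batches; model_a, model_b and select_ratio are illustrative names, not from the patent.

```python
import torch
import torch.nn.functional as F

def co_train_step(model_a, model_b, opt_a, opt_b, batch_x, batch_y, select_ratio=0.5):
    logits_a = model_a(batch_x)
    logits_b = model_b(batch_x)

    # Texts to be updated: samples on which the two recognition results differ.
    disagree = (logits_a.argmax(dim=-1) != logits_b.argmax(dim=-1)).nonzero(as_tuple=True)[0]
    if disagree.numel() == 0:
        return

    loss_a = F.cross_entropy(logits_a[disagree], batch_y[disagree], reduction="none")
    loss_b = F.cross_entropy(logits_b[disagree], batch_y[disagree], reduction="none")
    k = max(1, int(select_ratio * disagree.numel()))

    # Each model picks its smallest-loss disagreement samples ...
    group_1 = disagree[torch.topk(-loss_a, k).indices]  # selected by model A
    group_2 = disagree[torch.topk(-loss_b, k).indices]  # selected by model B

    # ... and the *other* model is updated on them.
    opt_b.zero_grad()
    F.cross_entropy(model_b(batch_x[group_1]), batch_y[group_1]).backward()
    opt_b.step()

    opt_a.zero_grad()
    F.cross_entropy(model_a(batch_x[group_2]), batch_y[group_2]).backward()
    opt_a.step()
```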
In some optional embodiments of the present application, based on the foregoing embodiments, in the process of training the language model to be trained through the training texts and noise texts in the training text set to obtain the category recognition model, the training device may reconstruct the training texts and noise texts in the training text set according to a plurality of preset model target tasks, obtaining a plurality of task samples corresponding to each model target task; then, for each model target task, select samples from the corresponding task samples as a training sample batch and execute the prediction of the corresponding model target task in the language model to be trained according to the training sample batch, obtaining a training result for each model target task; and finally update the model parameters of the language model to be trained according to the training result of each model target task, obtaining the category recognition model. Specifically, the training device trains the language model to be trained under multiple target tasks and accordingly reconstructs the training texts and noise texts into the data form of each task. For example, the training device may reconstruct them into instruction templates, question-answer pairs, multiple-choice questions, or similar task formats to train the corresponding tasks of the language model to be trained. Each model target task thus has a plurality of task samples, which are further divided into a plurality of training sample batches, and the training device performs one round of model training with each training sample batch. Specifically, each training sample batch is divided into two parts: the first part is input into the language model to be trained to execute the corresponding task, yielding a model parameter update; after the first part of each task's batch has been used for updating, the second part of each batch is used for a second parameter update, and the process moves to the next batch until the model satisfies the condition for ending training (a sketch follows below). In this embodiment, training the language model to be trained with multiple tasks enables it to learn the correct mapping relationship between text description information and text identifiers more accurately during training for text category recognition, improving the accuracy of the trained category recognition model.
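A minimal sketch of the two-pass batch schedule described above; update_fn and task_samples stand in for the model update and the reconstructed per-task sample sets, both hypothetical.

```python
import random

def split_in_two(batch):
    mid = len(batch) // 2
    return batch[:mid], batch[mid:]

def multitask_train(update_fn, task_samples, rounds=100, batch_size=32):
    """update_fn(task, samples) performs one parameter update; task_samples
    maps each model target task to its reconstructed samples."""
    for _ in range(rounds):
        second_halves = {}
        for task, samples in task_samples.items():
            batch = random.sample(samples, min(batch_size, len(samples)))
            first, second = split_in_two(batch)
            update_fn(task, first)          # first update for this task
            second_halves[task] = second
        for task, second in second_halves.items():
            update_fn(task, second)         # second update after every task's first pass
```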
In the solution of the present application, a text category recognition method is also provided. Fig. 5 shows a flowchart of text category recognition according to an embodiment of the present application. The text category recognition method is applied to an information distribution platform and may be executed by a device having computing and processing functions, for example by a server or a terminal device on which the information distribution platform runs. In the following embodiments, the solution of the present application is described by taking as an example identifying the business scope of an official account according to the account information of the official account on an official account platform. Referring to fig. 5, the text category recognition method includes at least steps S510 to S570, described in detail as follows:
in step S510, M candidate samples are acquired, where M is an integer greater than 0.
Optionally, the implementation details of step S510 are identical to those of step S210 shown in fig. 2, and will not be described again.
Step S520, for each candidate sample, obtaining a text identifier of the current candidate sample as a target identifier and obtaining text identifiers of N other candidate samples from M-1 candidate samples as interference identifiers, wherein N is an integer greater than 0 and less than a preset number of labels, and the preset number of labels is the maximum number of class labels in a class identification result of text class identification.
Optionally, the implementation details of step S520 are identical to those of step S220 shown in fig. 2, and will not be described again.
And step S530, performing text splicing according to the target identification, the interference identification and the text description information of each candidate sample to generate a training text, and obtaining a training text set, wherein the target identification is a target result corresponding to the training text, the training text set comprises a plurality of training texts, and each training text is correspondingly generated according to one candidate sample.
Optionally, the implementation details of step S530 are identical to those of step S230 shown in fig. 2, and will not be described again.
And step S540, training the language model to be trained according to the training text set to obtain a category recognition model.
Optionally, the implementation details of step S540 are identical to those of step S240 shown in fig. 2, and will not be described again.
Step S550, obtaining a text to be identified and a plurality of text categories, wherein the text to be identified comprises a text identifier and text description information;
step S560, splicing according to the text description information of the text to be identified and the text categories, and generating an input text;
step S570, performing text category recognition on the input text through the category recognition model, and determining a category recognition result of the text to be recognized, where the category recognition result is one of the text categories.
Specifically, when the category recognition model is applied to perform category recognition, text categories are used instead of text identifiers as the options in the input text. A server on which the model is deployed first acquires a text to be recognized and a plurality of text categories, where the text to be recognized contains a text identifier and text description information. The text categories are typically all the categories to which the text to be recognized may belong, such as value-added services and restaurants; these categories may be specified manually in advance according to the scope from which the texts to be recognized are collected. Taking an official account as an example, the text identifier may be the official account name, and the text description information may be the account introduction and function bar information. The server then performs splicing according to the plurality of text categories and the text description information of the text to be recognized to generate an input text. The splicing process is the same as the splicing of training texts in the training process; the only difference is that, when the model is applied, the official account name is no longer used as an option in the input text, and the text categories are instead spliced with the text description information into the input text. The category recognition model then performs text category recognition on the input text and determines a category recognition result of the text to be recognized, the category recognition result being one of the plurality of text categories; that is, based on the input text, the category recognition model selects one category from the text categories contained in it as the category recognition result. In this embodiment, when category recognition is actually performed, the combination of the categories and the information in the text to be recognized is used as the input text, and the category of the text to be recognized is selected from the input categories by the category recognition model. Text category recognition is thus performed using the mapping relationship between text identifiers and text categories, so that the text identifier bridges the gap between unlabeled data and the text classification task; the data to be recognized can be used directly without additional processing, which helps reduce the complexity of text category recognition and improves execution efficiency.
In the embodiment of the present application, M candidate samples are first obtained; then, for each candidate sample, the text identifier of the current candidate sample is obtained as the target identifier, and the text identifiers of N other candidate samples are obtained from the remaining M-1 candidate samples as interference identifiers, where N is an integer greater than 0 and less than the preset number of labels, the preset number of labels being the maximum number of category labels in a category recognition result of text category recognition; text splicing is performed according to the target identifier, the interference identifiers and the text description information of each candidate sample to generate a training text, thereby obtaining a training text set, where the target identifier is the target result corresponding to the training text, the training text set comprises a plurality of training texts, and each training text is generated from one candidate sample; finally, the language model to be trained is trained according to the training text set to obtain the category recognition model. Because the text identifier and the text description information are spliced into the training data used to train the category recognition model, the model can learn the mapping relationship between text identifiers and text description information. The text identifier thus replaces a manually annotated label, so that the language model can be optimized with unlabeled data and without manual annotation during training, realizing the text classification task without labeled samples and reducing the labor cost of the training process.
In some optional embodiments of the present application, based on the above embodiments, in the process of splicing according to the plurality of text categories and the text description information of the text to be recognized to generate the input text, the server may perform text splicing on the plurality of text categories and the text description information of the text to be recognized to obtain a category splicing result. If the number of text categories is smaller than the total category number of the category recognition model, preset marks are added to the category splicing result according to the difference between the total category number and the number of text categories, obtaining the input text. The process of splicing the input text during recognition is similar to the splicing of training texts and noise texts during training, except that no text identifier needs to serve as the recognition target. The input text also needs to be padded according to the number of options in it and the total category number of the category recognition model, so that the input text has the same format as the training texts. This option filling keeps the input of the training process consistent with that of the recognition process and, compared with preprocessing and filtering the text to be recognized, reduces the processing effort required to generate the input text.
In some optional embodiments of the present application, based on the foregoing embodiments, in the process of performing text splicing on the plurality of text categories and the text description information of the text to be recognized to obtain the category splicing result, the server performs text conversion on the plurality of text categories according to a text template corresponding to each text category to obtain a plurality of category input texts, and then performs text splicing on the category input texts and the text description information of the text to be recognized to obtain the category splicing result. In this embodiment, the server may hold a corresponding text template for each text category and convert the text category according to that template. For example, the text template may be "text about XXX"; upon conversion, the server replaces XXX in the template with the text category to form a category input text. Converting the categories through text templates gives the generated category input texts a closer correspondence to the input texts of the training data, which benefits the accuracy of category recognition.
In some optional embodiments of the present application, based on the above embodiments, in the process of performing text category recognition on the input text through the category recognition model and determining the category recognition result of the text to be recognized, the server may first perform text category recognition on the input text through the category recognition model to obtain an output result. If the data dimension of the output result is larger than the number of the plurality of text categories, the output result is clipped according to that number to obtain a category index, and the text category corresponding to the category index is then taken as the category recognition result of the text to be recognized, according to the ordering of the text categories in the input text. Specifically, the number of categories contained in the input text will typically be less than the dimension of the result vector output by the category recognition model, in which case the result vector needs to be clipped to the number of categories in the input text; the text category corresponding to the category index is then taken as the category recognition result of the text to be recognized. It will be appreciated that if the recognized result does not fall within the clipped range, this indicates that the text to be recognized does not belong to any of the provided text categories. By clipping the result vector, excessive storage occupation caused by overlong category recognition results is avoided and the resource utilization of the scheme is improved. A sketch of this recognition-time processing is given below.
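Putting the last three embodiments together, the following sketch builds the recognition-time input (template conversion plus option filling) and clips the output; the template string, the [PAD] token usage and the splicing format are illustrative assumptions.

```python
import numpy as np

PAD = "[PAD]"

def build_input_text(text_categories, description, n_model, template="text about {}"):
    # Convert each text category with its template, pad the option list up to
    # the model's total category count, then splice with the description.
    options = [template.format(c) for c in text_categories]
    options += [PAD] * (n_model - len(options))          # option filling
    markers = "abcdefghijklmnopqrstuvwxyz"
    joined = " ".join(f"{markers[i]}. {o}" for i, o in enumerate(options))
    return f"{joined} [SEP] {description}"

def predict_category(logits, text_categories):
    # Clip the output to the number of provided categories, then index into
    # the ordering of the categories in the input text.
    n_l = len(text_categories)
    idx = int(np.argmax(logits[:n_l]))
    return text_categories[idx]
```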
The implementation details of the technical solutions of the embodiments of the present application are set forth below by specific examples:
Referring to fig. 6, fig. 6 is a schematic flowchart of the overall text category recognition procedure in an embodiment of the present application. As shown in fig. 6, the overall flow is divided into two parts: a self-supervised learning process and a zero-sample reasoning process. In the self-supervised learning process, the task is reduced to learning the matching relationship between the official account name and the text spliced from the account introduction and function bar, so that the model learns a representation of account text. Specifically, as shown in fig. 6, text extraction and option extraction are performed on the input official account data. In some embodiments, the data is filtered before extraction: text suitable for training is selected and meaningless sentences are deleted, improving data quality. The account name is then taken as the forward option; the account name is the most information-dense item in the whole account text, and the name best reflects the account's business category. After the forward option is obtained, N names are randomly selected from other account names as negative options, where N is a random number with 1 ≤ N ≤ N_maxLabel − 1. N_maxLabel, the maximum number of option labels in the name-prediction task of the self-supervised learning process, is predefined to ensure that the total number of option labels is not too large, and satisfies N_maxLabel ≤ N_model, where N_model is the number of labels of the model output layer. After the positive and negative options are determined, option filling is required: in the example of fig. 6, the special label [PAD] is used to pad the options so that the input formats of the self-supervised learning stage and the zero-sample reasoning stage are consistent. Specifically, if the total number of options after negative sampling is less than N_model, N_model − N − 1 [PAD] options are added. The option list is then shuffled and spliced. Suppose that after shuffling the option list is O = [O_0, O_1, ..., O_{N_model−1}], where the forward option is O_c = O_n; the label of the sample is then L = n. The spliced final input text takes a form such as

x = [CLS] T_0. O_0 [SEP] T_1. O_1 [SEP] ... [SEP] text description [SEP]

where T_i denotes the i-th item of the index indicator list T (e.g., [A, B, C, ...]), [CLS] is a classification marker, and [SEP] is a separator marker. The final text-label pair (x_n, L_n) is thereby generated as a sample. Repeating the above process yields a number of self-supervised learning samples for training; a validation set may be generated in the same way.
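For concreteness, a minimal sketch of the self-supervised sample construction just described; the splicing format mirrors the description above, but the exact template is an assumption.

```python
import random

def build_sample(account_name, other_names, description, n_model, n_max_label):
    n = random.randint(1, n_max_label - 1)             # 1 <= N <= N_maxLabel - 1
    options = [account_name] + random.sample(other_names, n)
    options += ["[PAD]"] * (n_model - n - 1)           # option filling up to N_model
    random.shuffle(options)
    label = options.index(account_name)                # label L = index of forward option
    indicators = [chr(ord("A") + i) for i in range(n_model)]
    body = " [SEP] ".join(f"{t}. {o}" for t, o in zip(indicators, options))
    return f"[CLS] {body} [SEP] {description} [SEP]", label
```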
The model to be trained may employ a pre-trained masked language model as the backbone, with an output layer for classification added on top. The number of labels of the output layer is configured as the maximum class count over all test data sets, denoted N_model, and the model's loss function is cross-entropy loss; a setup sketch follows below.
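A minimal setup sketch using the Hugging Face transformers library; the checkpoint name and the value of N_model are placeholder assumptions, and the patent does not prescribe any particular library.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

N_MODEL = 30  # placeholder: the maximum class count over all test data sets

# A pre-trained masked language model backbone with a classification head
# of N_MODEL labels; fine-tuning uses cross-entropy loss by default.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=N_MODEL
)
```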
The trained model can then be used to recognize the category of a text, i.e. the zero-sample reasoning process. In the zero-sample reasoning stage, reasoning can be performed by converting the input sample into the same format as in the self-supervised learning stage. As shown in fig. 6, the expression of the zero-sample input is similar to that of the self-supervised learning stage, with two main differences: 1) the official account name is no longer used as an option; instead, each category name is converted into an option, either directly using the original label or via a simple template (e.g., "text is about [label]"); 2) the order of the options need not be shuffled.
Since the input and output formats after the two-stage conversion are the same, no further adjustment to the model is required. Because the dimension N_model of the output logits may differ from the number of categories N_L in the data set, the prediction may be out of range (e.g., for a 2-category data set, the model may output 3). When determining the result, the prediction can therefore be made from the first N_L logits:

P = argmax(logits[0:N_L])

where P is the index of the forward option.
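A short illustration of the clipping rule with numpy; the logits values are made up.

```python
import numpy as np

logits = np.array([0.1, 0.4, 2.3])  # N_model = 3, but the data set has N_L = 2 categories
N_L = 2
P = int(np.argmax(logits[0:N_L]))   # clipping prevents the out-of-range prediction 2
print(P)  # 1
```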
It should be noted that although the steps of the methods in the present application are depicted in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
The following describes the implementation of the apparatus of the present application, which may be used to perform the training method of the language model in the above-described embodiments of the present application. Fig. 7 schematically shows a block diagram of the training apparatus of the language model in the embodiment of the present application. As shown in fig. 7, the training apparatus 700 of the language model may mainly include:
a sample acquisition module 710 configured to acquire M candidate samples, the M being an integer greater than 0;
the identifier obtaining module 720 is configured to obtain, for each candidate sample, a text identifier of the current candidate sample as a target identifier and obtain, from M-1 candidate samples, text identifiers in N other candidate samples as interference identifiers, where N is an integer greater than 0 and less than a preset number of tags, and the preset number of tags is the maximum number of category tags in a category recognition result of text category recognition;
the text generation module 730 is configured to perform text stitching according to the target identifier, the interference identifier and the text description information of each candidate sample to generate a training text, so as to obtain a training text set, where the target identifier is a target result corresponding to the training text, the training text set includes a plurality of training texts, and each training text is correspondingly generated according to one candidate sample;
The model training module 740 is configured to train the language model to be trained according to the training text set, and obtain a category recognition model.
In some embodiments of the present application, based on the above technical solutions, the identifier obtaining module 720 is specifically configured to: acquire any one candidate sample among the M candidate samples as a designated candidate sample; determine a random number as the number N of interference items within a value range whose maximum is the number of labels output by the language model to be trained; and acquire, according to the number N of interference items, N candidate samples from the M-1 candidate samples as the other candidate samples.

In some embodiments of the present application, based on the above technical solutions, the identifier obtaining module 720 is specifically configured to: perform out-of-order splicing on the target identifier and the interference identifiers to obtain an identifier splicing result; if the total number of the target identifier and the interference identifiers is smaller than the total category number of the category recognition model, add preset marks to the identifier splicing result according to the difference between the total category number and that number; and splice the identifier splicing result with the text description information of the designated candidate sample to obtain a training text.
In some embodiments of the present application, based on the above technical solutions, the model training module 740 is further configured to: performing text splicing according to the target identification, the interference identification and text description information of the current candidate sample, generating a noise text and adding the noise text into the training text set, wherein the interference identification is a target result corresponding to the noise text; training the language model to be trained through the training text and the noise text in the training text set to obtain the category identification model.
In some embodiments of the present application, based on the above technical solutions, the model training module 740 is specifically configured to: acquiring two texts and corresponding target results from the training text set; according to preset weighting parameters, carrying out weighted summation on vectors corresponding to the two obtained texts and carrying out weighted summation on vectors corresponding to the two corresponding target results, wherein the obtained results are used as training input samples; training the language model to be trained according to the obtained training input sample to obtain the category identification model.
In some embodiments of the present application, based on the above technical solutions, the model training module 740 is specifically configured to: dividing the text in the training text set into a plurality of text batches; for each text batch, respectively inputting the text batch into a first model to be trained and a second model to be trained to obtain a first training result and a second training result, wherein the first model to be trained and the second model to be trained are obtained through initializing a language model to be trained; acquiring texts with different recognition results in the first training result and the second training result as texts to be updated; selecting texts with preset proportions from the texts to be updated as a first group according to the loss values of the texts to be updated in the first training result, and selecting the texts with the preset proportions from the texts to be updated as a second group according to the loss values of the texts to be updated in the second training result; performing parameter updating on the second model to be trained according to the first group, and performing parameter updating on the first model to be trained according to the second group; and determining the first model to be trained and the second model to be trained as the category identification model.
In some embodiments of the present application, based on the above technical solutions, the model training module 740 is specifically configured to: reconstructing training texts and noise texts in the training text set according to a plurality of preset model target tasks to obtain a plurality of task samples corresponding to each model target task; for each model target task, selecting a sample from the corresponding task samples as a training sample batch; according to the training sample batch, executing prediction of a corresponding model target task in a language model to be trained to obtain a training result of each model target task; and updating model parameters of the language model to be trained according to the training result of each model target task to obtain the category identification model.
Another apparatus implementation of the present application is described below, which may be used to perform the text category recognition method in the above-described embodiments of the present application. Fig. 8 schematically shows a block diagram of the text category recognition device in the embodiment of the present application. As shown in fig. 8, the text category recognition apparatus 800 includes:
the category acquisition module 810 is configured to acquire a text to be identified and a plurality of text categories, wherein the text to be identified contains text identification and text description information;
A text splicing module 820 configured to splice according to the text categories and the text description information of the text to be recognized, and generate an input text;
a category recognition module 830 configured to perform text category recognition on the input text through the category recognition model, and determine a category recognition result of the text to be recognized, where the category recognition result is one of the plurality of text categories;
the training process of the category identification model is as follows:
obtaining M candidate samples, wherein M is an integer greater than 0;
for each candidate sample, acquiring a text identifier of a current candidate sample as a target identifier and acquiring text identifiers in N other candidate samples from M-1 candidate samples as interference identifiers, wherein N is an integer greater than 0 and less than the number of preset labels, and the number of the preset labels is the maximum number of class labels in a class identification result of text class identification;
performing text splicing according to the target identification, the interference identification and the text description information of each candidate sample to generate a training text, and obtaining a training text set, wherein the target identification is a target result corresponding to the training text, the training text set comprises a plurality of training texts, and each training text is correspondingly generated according to one candidate sample;
Training the language model to be trained according to the training text set to obtain a category identification model.
In some embodiments of the present application, based on the above technical solutions, the text stitching module 820 is specifically configured to: performing text splicing on the text categories and the text description information of the text to be identified to obtain category splicing results; if the number of the text categories is smaller than the total category number of the category identification model, adding a preset mark to the identification splicing result according to the difference between the total category number and the text category number to obtain an input text.
In some embodiments of the present application, based on the above technical solutions, the text stitching module 820 is specifically configured to: according to the text templates corresponding to each text category, performing text conversion on the plurality of text categories to obtain a plurality of category input texts; and performing text splicing on the plurality of category input texts and the text description information of the text to be identified to obtain a category splicing result.
In some embodiments of the present application, based on the above technical solutions, the category identification module 830 is specifically configured to: carrying out text category recognition on the input text through a category recognition model to obtain an output result; if the data dimension of the output result is larger than the category number of the text categories, cutting the output result according to the category number to obtain a category index; and according to the ordering of the text categories in the input text, acquiring the text category corresponding to the category index as a category recognition result of the text to be recognized.
It should be noted that, the apparatus provided in the foregoing embodiments and the method provided in the foregoing embodiments belong to the same concept, and a specific manner in which each module performs an operation has been described in detail in the method embodiment, which is not described herein again.
Fig. 9 shows a schematic diagram of a computer system suitable for use in implementing the electronic device of the embodiments of the present application.
It should be noted that, the computer system 900 of the electronic device shown in fig. 9 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 9, the computer system 900 includes a central processing unit (Central Processing Unit, CPU) 901, which can execute various appropriate actions and processes according to a program stored in a read-only memory (Read-Only Memory, ROM) 902 or a program loaded from a storage portion 908 into a random access memory (Random Access Memory, RAM) 903. In the RAM 903, various programs and data required for system operation are also stored. The CPU 901, the ROM 902 and the RAM 903 are connected to each other through a bus 904. An input/output (Input/Output, I/O) interface 905 is also connected to the bus 904.

The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse and the like; an output section 907 including a cathode ray tube (Cathode Ray Tube, CRT), a liquid crystal display (Liquid Crystal Display, LCD), a speaker and the like; a storage portion 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN (Local Area Network) card or a modem. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is installed on the drive 910 as needed, so that a computer program read therefrom is installed into the storage portion 908 as needed.
In particular, according to embodiments of the present application, the processes described in the various method flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from the network via the communication portion 909 and/or installed from the removable medium 911. When the computer program is executed by a Central Processing Unit (CPU) 901, various functions defined in the system of the present application are performed.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-only memory (ROM), an erasable programmable Read-only memory (Erasable Programmable Read Only Memory, EPROM), a flash memory, an optical fiber, a portable compact disc Read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit, in accordance with embodiments of the present application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a usb disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (16)

1. A method for training a language model, comprising:
obtaining M candidate samples, wherein M is an integer greater than 0;
for each candidate sample, acquiring a text identifier of a current candidate sample as a target identifier and acquiring text identifiers in N other candidate samples from M-1 candidate samples as interference identifiers, wherein N is an integer greater than 0 and less than the number of preset labels, and the number of the preset labels is the maximum number of class labels in a class identification result of text class identification;
performing text splicing according to the target identification, the interference identification and the text description information of each candidate sample to generate a training text, and obtaining a training text set, wherein the target identification is a target result corresponding to the training text, the training text set comprises a plurality of training texts, and each training text is correspondingly generated according to one candidate sample;
training the language model to be trained according to the training text set to obtain a category identification model.
2. The training method according to claim 1, wherein for each candidate sample, the obtaining the text identifier of the current candidate sample as the target identifier and the text identifiers of N other candidate samples from the M-1 candidate samples as the interference identifiers includes:
Any one candidate sample in the M candidate samples is obtained as a designated candidate sample;
determining a random number as the number N of interference items in a value range with the number of labels output by the language model to be trained as a maximum value;
and acquiring, according to the number N of interference items, N candidate samples from the M-1 candidate samples as the other candidate samples.
3. The method of claim 2, wherein the generating training text from the target identification, the interference identification, and the text description information of the specified candidate sample comprises:
performing out-of-order splicing on the target identifier and the interference identifier to obtain an identifier splicing result;
if the total number of the target identifier and the interference identifiers is smaller than the total category number of the category identification model, adding preset marks to the identifier splicing result according to the difference between the total category number and that number;
and splicing the identification splicing result with the text description information of the appointed candidate sample to obtain a training text.
4. The method of claim 1, wherein before training the language model to be trained from the training text set to obtain the class identification model, the method further comprises:
Performing text splicing according to the target identification, the interference identification and text description information of the current candidate sample, generating a noise text and adding the noise text into the training text set, wherein the interference identification is a target result corresponding to the noise text;
training the language model to be trained according to the training text set to obtain a category identification model, wherein the training comprises the following steps:
training the language model to be trained through the training text and the noise text in the training text set to obtain the category identification model.
5. The method of claim 4, wherein training the language model to be trained by training text and noise text in the training text set to obtain the category recognition model comprises:
acquiring two texts and corresponding target results from the training text set;
according to preset weighting parameters, carrying out weighted summation on vectors corresponding to the two obtained texts and carrying out weighted summation on vectors corresponding to the two corresponding target results, wherein the obtained results are used as training input samples;
training the language model to be trained according to the obtained training input sample to obtain the category identification model.
6. The method according to claim 5, wherein training the language model to be trained by the training text and the noise text in the training text set to obtain the category recognition model comprises:
dividing the text in the training text set into a plurality of text batches;
for each text batch, respectively inputting the text batch into a first model to be trained and a second model to be trained to obtain a first training result and a second training result, wherein the first model to be trained and the second model to be trained are obtained through initializing a language model to be trained;
acquiring texts with different recognition results in the first training result and the second training result as texts to be updated;
selecting texts with preset proportions from the texts to be updated as a first group according to the loss values of the texts to be updated in the first training result, and selecting the texts with the preset proportions from the texts to be updated as a second group according to the loss values of the texts to be updated in the second training result;
performing parameter updating on the second model to be trained according to the first group, and performing parameter updating on the first model to be trained according to the second group;
And determining the first model to be trained and the second model to be trained as the category identification model.
7. The method of claim 4, wherein training the language model to be trained by training text and noise text in the training text set to obtain the category recognition model comprises:
reconstructing training texts and noise texts in the training text set according to a plurality of preset model target tasks to obtain a plurality of task samples corresponding to each model target task;
for each model target task, selecting a sample from the corresponding task samples as a training sample batch;
according to the training sample batch, executing prediction of a corresponding model target task in a language model to be trained to obtain a training result of each model target task;
and updating model parameters of the language model to be trained according to the training result of each model target task to obtain the category identification model.
8. A method for text category recognition, the method comprising:
acquiring a text to be identified and a plurality of text categories, wherein the text to be identified comprises a text identifier and text description information;
Splicing according to the text categories and the text description information of the text to be identified to generate an input text;
performing text category recognition on the input text through the category recognition model, and determining a category recognition result of the text to be recognized, wherein the category recognition result is one of a plurality of text categories;
the training process of the category identification model is as follows:
obtaining M candidate samples, wherein M is an integer greater than 0;
for each candidate sample, acquiring a text identifier of a current candidate sample as a target identifier and acquiring text identifiers in N other candidate samples from M-1 candidate samples as interference identifiers, wherein N is an integer greater than 0 and less than the number of preset labels, and the number of the preset labels is the maximum number of class labels in a class identification result of text class identification;
performing text splicing according to the target identification, the interference identification and the text description information of each candidate sample to generate a training text, and obtaining a training text set, wherein the target identification is a target result corresponding to the training text, the training text set comprises a plurality of training texts, and each training text is correspondingly generated according to one candidate sample;
Training the language model to be trained according to the training text set to obtain a category identification model.
9. The method of claim 8, wherein the concatenating the text description information based on the plurality of text categories and the text to be identified to generate the input text comprises:
performing text splicing on the text categories and the text description information of the text to be identified to obtain category splicing results;
if the number of the text categories is smaller than the total category number of the category identification model, adding preset marks to the category splicing result according to the difference between the total category number and the number of text categories, to obtain the input text.
10. The method according to claim 9, wherein performing text splicing on the plurality of text categories and the text description information of the text to be recognized to obtain a category splicing result includes:
according to the text templates corresponding to each text category, performing text conversion on the plurality of text categories to obtain a plurality of category input texts;
and performing text splicing on the plurality of category input texts and the text description information of the text to be identified to obtain a category splicing result.
11. The method according to claim 8, wherein the text category recognition of the input text by the category recognition model, determining a category recognition result of the text to be recognized, comprises:
carrying out text category recognition on the input text through a category recognition model to obtain an output result;
if the data dimension of the output result is larger than the category number of the text categories, cutting the output result according to the category number to obtain a category index;
and according to the ordering of the text categories in the input text, acquiring the text category corresponding to the category index as a category recognition result of the text to be recognized.
12. A training device for a language model, comprising:
a sample acquisition module configured to acquire M candidate samples, the M being an integer greater than 0;
the identification acquisition module is configured to acquire a text identification of a current candidate sample as a target identification and acquire text identifications in N other candidate samples from M-1 candidate samples as interference identifications, wherein N is an integer greater than 0 and less than the number of preset labels, and the number of the preset labels is the maximum number of class labels in a class identification result of text class identification;
The text generation module is configured to perform text splicing according to the target identification, the interference identification and the text description information of each candidate sample to generate a training text, so as to obtain a training text set, wherein the target identification is a target result corresponding to the training text, the training text set comprises a plurality of training texts, and each training text is correspondingly generated according to one candidate sample;
and the model training module is configured to train the language model to be trained according to the training text set to obtain a category identification model.
13. A text category recognition device, comprising:
a category acquisition module configured to acquire a text to be recognized and a plurality of text categories, wherein the text to be recognized comprises a text identification and text description information;
a text splicing module configured to perform splicing according to the plurality of text categories and the text description information of the text to be recognized to generate an input text;
a category recognition module configured to perform text category recognition on the input text through a category recognition model and determine a category recognition result of the text to be recognized, wherein the category recognition result is one of the plurality of text categories;
wherein the category recognition model is trained as follows:
acquiring M candidate samples, M being an integer greater than 0;
for each candidate sample, acquiring the text identification of the current candidate sample as a target identification and acquiring the text identifications of N other candidate samples from the remaining M-1 candidate samples as interference identifications, wherein N is an integer greater than 0 and less than a preset label number, the preset label number being the maximum number of category labels in a category recognition result of text category recognition;
performing text splicing according to the target identification, the interference identifications and the text description information of each candidate sample to generate a training text, so as to obtain a training text set, wherein the target identification is the target result corresponding to the training text, the training text set comprises a plurality of training texts, and each training text is generated from one candidate sample; and
training a language model to be trained according to the training text set to obtain the category recognition model.
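Again by way of example only (the function name build_input_text and the template are hypothetical), the recognition-side splicing could mirror the training-side format, so that the ordering of the categories in the input text fixes the index-to-category mapping used when the output is decoded:

    def build_input_text(text_categories, description):
        # Splice the candidate text categories with the description information
        # of the text to be recognized; reusing the training template lets the
        # category recognition model treat the categories as its answer options.
        return (
            "Candidate categories: " + ", ".join(text_categories)
            + "\nDescription: " + description
            + "\nAnswer:"
        )

Combined with the truncation sketch above, this gives one hypothetical end-to-end inference path: build the input text, run the category recognition model, cut its output down to the number of categories, and read off the category at the resulting index.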
14. An electronic device, comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform, via execution of the executable instructions, the training method of a language model according to any one of claims 1 to 11.
15. A computer readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the training method of a language model according to any one of claims 1 to 11.
16. A computer program product comprising a computer program, the computer program being stored in a computer readable storage medium, wherein a processor of an electronic device reads and executes the computer program from the computer readable storage medium, causing the electronic device to perform the training method of a language model according to any one of claims 1 to 11.
CN202311508682.2A 2023-11-13 2023-11-13 Language model training method and device, electronic equipment and readable medium Pending CN117633220A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311508682.2A 2023-11-13 2023-11-13 Language model training method and device, electronic equipment and readable medium

Publications (1)

Publication Number Publication Date
CN117633220A 2024-03-01

Family

ID=90026141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311508682.2A Language model training method and device, electronic equipment and readable medium 2023-11-13 2023-11-13 Pending

Country Status (1)

Country Link
CN (1) CN117633220A (en)

Legal Events

Date Code Title Description
PB01 Publication