CN111859951B - Language model training method and device, electronic equipment and readable storage medium - Google Patents

Language model training method and device, electronic equipment and readable storage medium

Info

Publication number
CN111859951B
CN111859951B (application CN202010564362.9A)
Authority
CN
China
Prior art keywords: word, language model, input text, groups, text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010564362.9A
Other languages
Chinese (zh)
Other versions
CN111859951A (en)
Inventor
朱丹翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010564362.9A priority Critical patent/CN111859951B/en
Publication of CN111859951A publication Critical patent/CN111859951A/en
Application granted granted Critical
Publication of CN111859951B publication Critical patent/CN111859951B/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Abstract

The application discloses a language model training method and apparatus, an electronic device, and a readable storage medium, relating to the technical fields of deep learning and natural language processing. The specific implementation scheme is as follows: word segmentation information of an original input text is acquired; character segmentation information and word segmentation information are annotated on each token in the original input text to obtain an input text sample; the input text sample is input into a language model to train the language model. Because a semantic information representation of larger granularity is introduced, the language model's ability to learn word sense information is enhanced and its performance is improved, the universality of the language model is not reduced, and the method is friendlier to downstream sequence labeling tasks.

Description

Language model training method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of computer technology, in particular to the technical fields of deep learning and natural language processing, and specifically to a language model training method and apparatus, an electronic device, and a readable storage medium.
Background
In the field of Chinese natural language processing (Natural Language Processing, NLP), performing self-supervised pre-training of a language model on a large amount of unsupervised text and then fine-tuning the language model with supervised task data is the state-of-the-art language model training technique in the current NLP field.
In the prior art, to prevent the performance of the word segmenter from affecting the training effect of the language model, self-supervised pre-training of the language model is performed at character granularity. This makes it difficult for the language model to learn information of larger semantic granularity, such as words. Word semantics are very important in Chinese language expression, so character-granularity learning may impair the language model's learning of word semantics and thereby degrade its performance.
Disclosure of Invention
Aspects of the present application provide a language model training method, apparatus, electronic device, and readable storage medium, so as to enhance the language model's ability to learn word sense information and improve the performance of the language model.
According to a first aspect, there is provided a training method of a language model, including:
acquiring word segmentation information of an original input text;
annotating character segmentation information and word segmentation information on each token in the original input text to obtain an input text sample;
the input text sample is input into a language model to train the language model.
According to a second aspect, there is provided a training apparatus of a language model, comprising:
the acquisition unit is used for acquiring word segmentation information of the original input text;
the labeling unit is used for annotating character segmentation information and word segmentation information on each token in the original input text to obtain an input text sample;
and the language model is used for receiving the input text sample so as to train based on the input text sample.
According to a third aspect, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method of the aspects and any possible implementation described above.
According to a fourth aspect, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of the aspects and any possible implementation described above.
According to a fifth aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the aspects and any possible implementation described above.
According to the technical scheme, word segmentation information of the original input text is acquired, character segmentation information and word segmentation information are annotated on each token in the original input text to obtain an input text sample, and the input text sample is then input into a language model to train the language model, so that the language model can learn semantic information at word granularity.
In addition, in the prior art the sequence labeling tasks of language model pre-training label each character, which requires the input text to be split by characters; if the input text were split by words instead, the universality of the language model would be reduced, which is unfriendly to downstream sequence labeling tasks. With the technical scheme provided by the application, word segmentation information is additionally introduced without changing the character-level splitting of the original input text, so the semantic learning ability of the language model is improved, its universality is not reduced, and the method is friendlier to downstream sequence labeling tasks than directly splitting by words.
In addition, by adopting the technical scheme provided by the application, the trained language model has better semantic information expression capability, so that the accuracy of the processing result of the NLP task can be effectively improved when the trained language model is used for the subsequent NLP task.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments or in the description of the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present application, and a person skilled in the art can obtain other drawings from them without inventive effort. The drawings are only for better understanding of the present solution and are not to be construed as limiting the present application. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present application;
FIG. 2 is a schematic diagram according to a second embodiment of the present application;
FIG. 3 is a schematic diagram according to a third embodiment of the present application;
FIG. 4 is a schematic diagram according to a fourth embodiment of the present application;
FIG. 5 is a schematic diagram according to a fifth embodiment of the present application;
FIG. 6 is a schematic diagram of an electronic device for implementing a training method for language models of embodiments of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It will be apparent that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that, the terminal in the embodiments of the present application may include, but is not limited to, a mobile phone, a personal digital assistant (Personal Digital Assistant, PDA), a wireless handheld device, a Tablet Computer (Tablet Computer), a personal Computer (Personal Computer, PC), an MP3 player, an MP4 player, a wearable device (e.g., smart glasses, smart watches, smart bracelets, etc.), a smart home device, and other smart devices.
In addition, the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
In the prior art, to prevent the performance of the word segmenter from affecting the training effect of the language model, self-supervised pre-training of the language model is performed at character granularity. This makes it difficult for the language model to learn information of larger semantic granularity, such as words. Word semantics are very important in Chinese language expression, so character-granularity learning may impair the language model's learning of word semantics and thereby degrade its performance.
To address the above problems, the present application provides a language model training method and apparatus, an electronic device, and a readable storage medium, which enhance the language model's ability to learn word sense information and improve the performance of the language model.
Fig. 1 is a schematic diagram according to a first embodiment of the present application.
101. Acquire word segmentation information of the original input text.
102. Annotate character segmentation information and word segmentation information on each token in the original input text to obtain an input text sample.
The character segmentation information identifies each character (token) of the original input text. The word segmentation information is the word segmentation result of the original input text and identifies each word in the original input text.
103. Input the input text sample into a language model to train the language model.
Steps 101 to 103 may constitute an iterative process; the language model is trained by iterating steps 101 to 103 until a preset training completion condition is met, at which point training of the language model is complete.
Optionally, in a possible implementation of this embodiment, the preset training completion condition may be set according to actual requirements; for example, it may be that the number of training iterations of the language model (i.e., the number of times steps 101 to 103 are executed) reaches a first preset threshold, for example 1,000,000.
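As a minimal sketch (function names hypothetical, not the patent's actual implementation), the iterative pre-training loop with the iteration-count completion condition might look like:

```python
def pretrain(get_sample, train_step, max_steps=1_000_000):
    """Iterate steps 101-103 until the preset completion condition is met.

    Here the condition is the one given above: the number of training
    iterations reaching a first preset threshold (e.g. 1,000,000).
    `get_sample` builds an annotated input text sample (steps 101-102);
    `train_step` feeds it to the language model (step 103).
    """
    steps_done = 0
    while steps_done < max_steps:   # preset training completion condition
        sample = get_sample()       # steps 101-102: annotated sample
        train_step(sample)          # step 103: train the language model
        steps_done += 1
    return steps_done
```

In practice `get_sample` and `train_step` would wrap a segmenter and the language model's optimizer step; the loop structure is the only point illustrated here.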
The execution subject of steps 101 to 103 may be, partly or entirely, an application located in the local terminal, a functional unit such as a plug-in or a software development kit (Software Development Kit, SDK) provided in an application located in the local terminal, or a processing engine located in a network-side server; this embodiment places no particular limitation on this.
It will be appreciated that the application may be a native program (native app) installed on the terminal, or may also be a web page program (webApp) of a browser on the terminal, which is not limited in this embodiment.
In this embodiment, the language model is trained with input text samples annotated with character segmentation information and word segmentation information, so that the language model can learn semantic information at word granularity. Since word granularity carries a richer semantic information representation than character granularity, introducing word-granularity semantic information strengthens, on the basis of character-granularity semantic learning, the language model's modeling of word sense information, enhances its ability to learn word sense information, and improves the performance of the language model.
In addition, in the prior art the sequence labeling tasks of language model pre-training label each character, which requires the input text to be split by characters; if the input text were split by words instead, the universality of the language model would be reduced, which is unfriendly to downstream sequence labeling tasks. With the technical scheme provided by the application, word segmentation information is additionally introduced without changing the character-level splitting of the original input text, so the semantic learning ability of the language model is improved, its universality is not reduced, and the method is friendlier to downstream sequence labeling tasks than directly splitting by words.
In addition, by adopting the technical scheme provided by the application, the trained language model has better semantic information expression capability, so that the accuracy of the processing result of the NLP task can be effectively improved when the trained language model is used for the subsequent NLP task.
Optionally, in one possible implementation of this embodiment, in 101 the original input text may be segmented into at least one word, each of the at least one word including at least one character. The characters may include text characters, letters, digits, operation symbols, punctuation marks and other symbols, some functional symbols, and so on. Marking information is then determined according to whether each character is the first character of the word it belongs to, so that the word segmentation information of the original input text includes marking information indicating whether each character in the at least one word is a first character.
In this embodiment, according to the word segmentation result, the marking information of whether each character is the first character of its word is used as the word segmentation information of each word. The word segmentation result can thus be represented without changing the character-level splitting of the original input text, word segmentation information is conveniently introduced, the universality of the language model (e.g., an ERNIE model) is not reduced, and the method is friendlier to downstream sequence labeling tasks than directly splitting by words.
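The first-character marking scheme just described can be sketched as follows (a simplified illustration under the assumptions above, not the patent's actual code):

```python
def char_segment_labels(words):
    """Derive word segmentation information from a word segmentation result:
    each character is marked 'B' if it is the first character of its word,
    otherwise 'I'.  The character-level splitting of the text is unchanged;
    only marking information is added."""
    labels = []
    for word in words:
        for i, _ch in enumerate(word):
            labels.append('B' if i == 0 else 'I')
    return labels
```

For words of lengths 1, 2, 2, 1, 2, 1, 2, 2, 2 and 1 characters (as in the worked example of the second embodiment), this yields the 16-label sequence B BI BI B BI B BI BI BI B.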
Optionally, in one possible implementation of this embodiment, in 101 the original input text includes at least one sentence. Accordingly, in 102, character segmentation information and word segmentation information may be annotated on each token in the original input text, and a sentence identification (sentence embedding) may be annotated on each sentence in the original input text, to obtain the input text sample. The sentence identification identifies which sentence of the original input text the current sentence is.
In this embodiment, in addition to annotating character segmentation information and word segmentation information on each token in the original input text, a sentence identification is annotated on each sentence in the original input text, so that the language model can learn information of larger semantic granularity at the sentence level, further improving its semantic learning and expression ability. Moreover, annotating each sentence in the original input text with a sentence identification can also be used to train sentence-level tasks (e.g., sentence order, sentence distance, sentence logical relationship) on the original input text.
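A minimal sketch of assembling such an input text sample (character tokens, word-boundary marks, and per-sentence identifications); the function and field names are hypothetical, and `segment_words` stands in for any word segmenter:

```python
def build_input_sample(sentences, segment_words):
    """Build an input text sample as described above: every character keeps
    its own token, word segmentation information marks word boundaries
    ('B' for a word's first character, 'I' otherwise), and each character
    carries the identification of the sentence it belongs to.

    `segment_words` is a hypothetical word segmenter: it takes one
    sentence string and returns its list of words."""
    tokens, word_marks, sentence_ids = [], [], []
    for sent_id, sentence in enumerate(sentences):
        for word in segment_words(sentence):
            for i, ch in enumerate(word):
                tokens.append(ch)                          # character-level token
                word_marks.append('B' if i == 0 else 'I')  # word segmentation info
                sentence_ids.append(sent_id)               # sentence identification
    return {"tokens": tokens,
            "word_marks": word_marks,
            "sentence_ids": sentence_ids}
```

A real implementation would map these three parallel sequences to embeddings before feeding the language model; only the annotation layout is illustrated here.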
Alternatively, in one possible implementation of this embodiment, the language model in the foregoing embodiment of the present application may be any language model, for example, a knowledge-enhanced semantic representation (Enhanced Representation from kNowledge IntEgration, ERNIE) model may be used.
The ERNIE model can learn the semantic representation of complete concepts by modeling prior semantic knowledge, such as entity concepts, in massive data. Pre-training the ERNIE model with input text samples annotated with character segmentation information and word segmentation information brings its representation of semantic knowledge units closer to the real world; the ERNIE model is modeled on both character feature input and word feature input, and therefore has strong semantic representation capability. In this embodiment, using the ERNIE model as the language model exploits this strong semantic representation capability to model words, entities, and entity relationships in massive data and to learn the semantic knowledge of the real world, thereby enhancing the semantic representation capability of the model.
Fig. 2 is a schematic diagram according to a second embodiment of the present application, as shown in fig. 2.
First, word segmentation information of the original input text is acquired. This comprises two steps:
Step one: perform word segmentation on the original input text. Suppose the original input text is a Chinese sentence of 16 characters (roughly, "big brother gets up; will big brother move bricks today"); after word segmentation the following words are obtained: (big) (brother) (get up) (la) (breakfast) (big) (brother) (today) (move brick) (no).
Step two: according to the word segmentation, determine whether each character in each word is the first character of that word, and use this to derive each character's marking information.
Assume the marking information of a first character is B and that of a non-first character is I; the word segmentation information of the original input text then comprises the marking information of whether each character in the original input text is a first character.
For the example text above, with words of 1, 2, 2, 1, 2, 1, 2, 2, 2 and 1 characters respectively, the word segmentation information is:
B BI BI B BI B BI BI BI B
Next, character segmentation information (token) and word segmentation information (seg) are annotated on each token in the original input text, obtaining an input text sample.
Then, the input text sample annotated with character segmentation information and word segmentation information is input into the ERNIE model to train the language model. As shown in fig. 2, the word segmentation information is additionally introduced while the text remains split by characters.
Fig. 3 is a schematic diagram according to a third embodiment of the present application.
On the basis of the first embodiment, after training of the language model is completed, the language model can be further optimized through supervised NLP tasks, further improving its prediction performance on NLP tasks.
In this embodiment, optimization of the language model through supervised NLP tasks may be achieved through the following steps:
201. Perform an NLP task with the trained language model to obtain a processing result.
Optionally, in one possible implementation manner of this embodiment, the NLP task may be any one or more of classification, matching, sequence labeling, and the like, which is not limited in this embodiment. Accordingly, the processing result is a processing result of a specific NLP task, such as a classification result, a matching result, a sequence labeling result, and the like.
Optionally, in one possible implementation of this embodiment, in 201 the trained language model is used in combination with other network models for classification, matching, or sequence labeling, for example a convolutional neural network (CNN), a long short-term memory (LSTM) model, or a bag-of-words (BOW) model, to perform the NLP task and obtain a processing result. For example, the other network models perform classification, matching, or sequence labeling based on the output of the language model, obtaining the corresponding classification, matching, or sequence labeling results.
202. Fine-tune the parameter values in the language model according to the difference between the processing result and the labeling result information corresponding to the processing result.
The labeling result information is a correct processing result manually labeled for the NLP task to be performed in advance.
Steps 201 to 202 may constitute an iterative process; the language model is fine-tuned multiple times by iterating steps 201 to 202 until a preset condition is met, at which point fine-tuning of the language model is complete.
Optionally, in a possible implementation of this embodiment, the preset condition may be set according to actual requirements and may include, for example: the difference between the processing result and the labeling result information is smaller than a second preset threshold; and/or the number of fine-tuning iterations of the language model (i.e., the number of times steps 201 to 202 are executed) reaches a third preset threshold.
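The fine-tuning loop with the two stopping conditions just listed can be sketched as follows (names hypothetical; `run_task_step` stands in for one NLP-task pass plus parameter update):

```python
def fine_tune(run_task_step, diff_threshold, max_rounds):
    """Iterate steps 201-202 until a preset condition is met: the
    difference between the processing result and the labeling result
    information drops below a threshold, and/or the number of
    fine-tuning rounds reaches a threshold.  `run_task_step` performs
    one task pass with fine-tuning and returns the current difference."""
    rounds = 0
    diff = float("inf")
    while rounds < max_rounds and diff >= diff_threshold:
        diff = run_task_step()   # steps 201-202: task pass + fine-tuning
        rounds += 1
    return rounds, diff
```

Either condition alone can also be used, matching the "and/or" in the text: set `diff_threshold` to 0 to stop only on the round count, or `max_rounds` very high to stop only on the difference.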
In this embodiment, the parameter values in the language model can be further optimized by the NLP task with the supervision data (i.e., the labeling result information) under the condition of not changing the overall structure of the language model, so that optimization iteration is conveniently performed on the language model according to each NLP task, and the prediction performance of the language model is improved.
Optionally, in one possible implementation manner of this embodiment, in 201, the performing a natural language processing task with the trained language model may include any one or more of the following, for example:
classifying a text to be processed with the trained language model to obtain the class of the text to be processed, for example which of several articles the text is derived from or which emotion type it belongs to, thereby classifying text content; and/or
matching a text to be processed against other texts with the trained language model to obtain the other texts matching the text to be processed, thereby obtaining content or articles whose content matches the text to be processed; and/or
labeling the content of a text to be processed with the trained language model to obtain labeling results for the corresponding content, such as key content information of each part of the text to be processed, thereby realizing sequence labeling of the text content; and/or
predicting the order of sentences in a text to be processed with the trained language model, thereby realizing sentence-ordering tasks; and/or
predicting the semantic distances between sentences in a text to be processed (e.g., adjacent, from the same article, from different articles) with the trained language model, thereby realizing prediction of sentence distance; and/or
predicting the logical relations between sentences in a text to be processed (e.g., causal, progressive, parallel) with the trained language model, thereby realizing prediction of inter-sentence logical relations.
Therefore, the language model obtained by training based on the embodiment can be used for subsequent arbitrary word-level or sentence-level and article-level tasks, and has better processing performance.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
Fig. 4 is a schematic diagram according to a fourth embodiment of the present application. The language model training apparatus 300 of this embodiment may include an acquisition unit 301, a labeling unit 302, and a language model 303. The acquisition unit 301 is configured to acquire word segmentation information of an original input text; the labeling unit 302 is configured to annotate character segmentation information and word segmentation information on each token in the original input text to obtain an input text sample; and the language model 303 is configured to receive the input text sample and train on it.
The execution subject of the training device of the language model of the present embodiment may be an application located in a local terminal, or may be a functional unit such as a plug-in unit or a software development kit (Software Development Kit, SDK) provided in an application located in a local terminal, or may be a processing engine located in a network server, which is not particularly limited in this embodiment.
It will be appreciated that the application may be a native program (native app) installed on the terminal, or may also be a web page program (webApp) of a browser on the terminal, which is not limited in this embodiment.
In this embodiment, the language model is trained with input text samples annotated with character segmentation information and word segmentation information, so that the language model can learn semantic information at word granularity. Since word granularity carries a richer semantic information representation than character granularity, introducing word-granularity semantic information strengthens, on the basis of character-granularity semantic learning, the language model's modeling of word sense information, enhances its ability to learn word sense information, and improves the performance of the language model.
In addition, in the prior art the sequence labeling tasks of language model pre-training label each character, which requires the input text to be split by characters; if the input text were split by words instead, the universality of the language model would be reduced, which is unfriendly to downstream sequence labeling tasks. With the technical scheme provided by the application, word segmentation information is additionally introduced without changing the character-level splitting of the original input text, so the semantic learning ability of the language model is improved, its universality is not reduced, and the method is friendlier to downstream sequence labeling tasks than directly splitting by words.
In addition, because the language model trained with the technical solution of this application has a stronger ability to express semantic information, using it for subsequent NLP tasks can effectively improve the accuracy of the tasks' processing results.
Optionally, in one possible implementation of this embodiment, the acquiring unit 301 is specifically configured to: segment the original input text into at least one word, each word of the at least one word comprising at least one character; and determine, for each character in the at least one word, whether it is the first character of its word. The word segmentation information of the original input text comprises labeling information indicating whether each character in the at least one word is a first character.
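As a minimal sketch of the first-character labeling described above (the function name and the 1/0 label convention are illustrative assumptions, not the patent's own implementation):

```python
# Illustrative sketch: derive character-level word segmentation labels
# for a pre-segmented text. The 1/0 convention (1 = first character of
# a word) is an assumption for illustration.

def segmentation_labels(words):
    """Return the characters of the text and, for each character,
    a label indicating whether it is the first character of its word."""
    chars, labels = [], []
    for word in words:
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append(1 if i == 0 else 0)
    return chars, labels

chars, labels = segmentation_labels(["百度", "是", "搜索引擎"])
# chars  -> ['百', '度', '是', '搜', '索', '引', '擎']
# labels -> [1, 0, 1, 1, 0, 0, 0]
```

Because the labels are attached per character, the original character-level tokenization of the input text is unchanged; the word boundaries are carried only as side information.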
Optionally, in one possible implementation of this embodiment, the original input text includes at least one sentence, and the labeling unit 302 is specifically configured to: label word segmentation information on each character in the original input text, and label a sentence identifier on each sentence in the original input text, to obtain an input text sample.
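A possible sketch of assembling an input text sample that carries both per-character segmentation labels and per-sentence identifiers (the names and the flat-list representation are assumptions for illustration, not the patent's data format):

```python
# Illustrative sketch: build one input text sample from sentences that
# have already been word-segmented. Each character receives a
# segmentation label (1 = first character of its word) and the id of
# the sentence it belongs to.

def build_sample(sentences):
    """sentences: list of sentences, each a list of words."""
    tokens, seg_labels, sent_ids = [], [], []
    for sid, words in enumerate(sentences):
        for word in words:
            for i, ch in enumerate(word):
                tokens.append(ch)
                seg_labels.append(1 if i == 0 else 0)
                sent_ids.append(sid)
    return tokens, seg_labels, sent_ids

tokens, seg_labels, sent_ids = build_sample([["百度", "是"], ["搜索引擎"]])
# sent_ids -> [0, 0, 0, 1, 1, 1, 1]
```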
Optionally, in one possible implementation of this embodiment, the language model 303 in the foregoing embodiments of the present application may be any language model; for example, an ERNIE model may be used.
Optionally, in a possible implementation manner of this embodiment, the language model 303 is further configured to perform a natural language processing task after training is completed, so as to obtain a processing result.
Fig. 5 is a schematic diagram according to a fifth embodiment of the present application. As shown in fig. 5, on the basis of the embodiment shown in fig. 4, the training apparatus 300 for a language model of this embodiment may further include: a fine tuning unit 401, configured to fine-tune the parameter values in the language model 303 according to the difference between the processing result and the labeling result information corresponding to the processing result.
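The fine-tuning step can be illustrated with a deliberately tiny stand-in for the language model: a single scalar parameter adjusted by gradient descent on the gap between the task output and its labeled target. This is only a sketch of the idea; the real model, loss function, and optimizer are not specified by the patent at this level of detail.

```python
# Toy sketch of fine-tuning: nudge a parameter to reduce the squared
# difference between the model's output and the labeled target.
# A scalar linear model stands in for the language model.

def fine_tune(param, examples, lr=0.1, epochs=50):
    """examples: list of (input, labeled target) pairs."""
    for _ in range(epochs):
        for x, y in examples:
            pred = param * x
            grad = 2 * (pred - y) * x  # gradient of (pred - y) ** 2
            param -= lr * grad
    return param

w = fine_tune(0.0, [(1.0, 2.0), (2.0, 4.0)])
# w converges to 2.0, the parameter that fits both examples
```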
Optionally, in one possible implementation of this embodiment, when performing a natural language processing task, the language model 303 is specifically configured to: classify the text to be processed; and/or match the text to be processed with other texts; and/or label content in the text to be processed; and/or predict the order of sentences in the text to be processed; and/or predict semantic distances between sentences in the text to be processed; and/or predict logical relations between sentences in the text to be processed.
It should be noted that the methods in the embodiments corresponding to fig. 1 to fig. 3 may be implemented by the training apparatus of the language model provided in the embodiments of fig. 4 to fig. 5. For details, refer to the relevant content in the embodiments corresponding to fig. 1 to fig. 3, which is not repeated here.
According to embodiments of the present application, there is also provided an electronic device and a non-transitory computer-readable storage medium storing computer instructions.
FIG. 6 is a schematic diagram of an electronic device for implementing the training method of the language model of embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 6, the electronic device includes: one or more processors 501, a memory 502, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI (graphical user interface) on an external input/output device, such as a display device coupled to an interface. In other implementations, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 501 is illustrated in fig. 6.
The memory 502 is a non-transitory computer readable storage medium provided herein. The memory stores instructions executable by at least one processor, so that the at least one processor performs the training method of the language model provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the training method of the language model provided by the present application.
The memory 502 is used as a non-transitory computer readable storage medium, and may be used to store a non-transitory software program, a non-transitory computer executable program, and units, such as program instructions/units (e.g., the acquisition unit 301, the labeling unit 302, and the language model 303 shown in fig. 4) corresponding to the training method of the language model in the embodiment of the present application. The processor 501 executes various functional applications of the server and data processing, i.e., implements the training method of the language model in the above-described method embodiment, by running non-transitory software programs, instructions, and units stored in the memory 502.
Memory 502 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created according to the use of the electronic device implementing the training method of the language model provided in the embodiment of the present application, and the like. In addition, memory 502 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 502 optionally includes memory remotely located with respect to processor 501, which may be connected via a network to an electronic device implementing the training method of the language model provided by embodiments of the present application. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the training method of the language model may further include: an input device 503 and an output device 504. The processor 501, memory 502, input devices 503 and output devices 504 may be connected by a bus or otherwise, for example in fig. 6.
The input device 503 may receive input numeric or character information and generate key signal inputs related to user settings and function control of an electronic device implementing the language model training method provided by embodiments of the present application, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer stick, one or more mouse buttons, a track ball, a joystick, and the like. The output devices 504 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibration motors), among others. The display device may include, but is not limited to, an LCD (liquid crystal display), an LED (light emitting diode) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASIC (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, PLDs (programmable logic devices)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: LAN (local area network), WAN (wide area network), internet and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solution of the embodiments of the present application, the language model is trained on input text samples labeled with word segmentation information, so that the language model can learn semantic information at word granularity. Because word granularity carries a richer semantic representation than character granularity, introducing word-granularity semantic information strengthens the model's modeling of word sense information, enhances its ability to learn such information, and thereby improves the performance of the language model.
In addition, in the prior art, the sequence labeling tasks used in language model pre-training label each character of the input text, which must therefore be split character by character; splitting the input text by word instead would reduce the generality of the language model and is unfriendly to downstream sequence labeling tasks. The technical solution of this application additionally introduces word segmentation information without changing the character-level tokenization of the original input text, so it improves the semantic learning ability of the language model without reducing its generality, and is friendlier to downstream sequence labeling tasks than directly tokenizing by word.
In addition, because the language model trained with the technical solution of this application has a stronger ability to express semantic information, using it for subsequent NLP tasks can effectively improve the accuracy of the tasks' processing results.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved, and are not limited herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (10)

1. A method of training a language model, comprising:
acquiring word segmentation information of an original input text;
labeling word segmentation information on each character in the original input text to obtain an input text sample;
inputting the input text sample into a language model to train the language model; wherein,
the obtaining word segmentation information of the original input text comprises the following steps:
word segmentation is carried out on the original input text to obtain at least one word; each word of the at least one word includes at least one character;
determining, for each character in the at least one word, whether the character is a first character of the word in which it appears; wherein the word segmentation information of the original input text comprises: labeling information indicating whether each character in the at least one word is a first character;
the original input text includes at least one sentence; and the labeling word segmentation information on each character in the original input text to obtain an input text sample comprises:
labeling word segmentation information on each character in the original input text, and labeling a sentence identifier on each sentence in the original input text, to obtain the input text sample.
2. The method of claim 1, wherein the language model comprises a knowledge-enhanced semantic representation ERNIE model.
3. The method of any of claims 1-2, wherein, after the inputting the input text sample into a language model to train the language model, the method further comprises:
performing natural language processing tasks by using the trained language model to obtain a processing result;
and fine tuning the parameter value in the language model according to the difference between the processing result and the labeling result information corresponding to the processing result.
4. The method of claim 3, wherein the performing natural language processing tasks using a trained language model comprises:
classifying the text to be processed by using the trained language model; and/or,
matching the text to be processed with other texts by using the trained language model; and/or,
labeling the content in the text to be processed by using the trained language model; and/or,
predicting the order of sentences in the text to be processed by using the trained language model; and/or,
predicting semantic distances between sentences in the text to be processed by using the trained language model; and/or,
predicting logical relations between sentences in the text to be processed by using the trained language model.
5. A training apparatus for a language model, comprising:
the acquisition unit is used for acquiring word segmentation information of the original input text;
the labeling unit is configured to label word segmentation information on each character in the original input text to obtain an input text sample;
a language model for receiving the input text sample for training based on the input text sample; wherein,
the acquisition unit is particularly used for
segment the original input text to obtain at least one word, wherein each word of the at least one word includes at least one character;
determine, for each character in the at least one word, whether the character is a first character of the word in which it appears; wherein the word segmentation information of the original input text comprises: labeling information indicating whether each character in the at least one word is a first character;
the original input text includes at least one sentence; and the labeling unit is specifically configured to:
label word segmentation information on each character in the original input text, and label a sentence identifier on each sentence in the original input text, to obtain the input text sample.
6. The apparatus of claim 5, wherein the language model comprises a knowledge-enhanced semantic representation ERNIE model.
7. The apparatus of any of claims 5-6, wherein the language model is further configured to perform a natural language processing task after training is completed to obtain a processing result;
the apparatus further comprises:
and the fine tuning unit is used for fine tuning the parameter value in the language model according to the difference between the processing result and the labeling result information corresponding to the processing result.
8. The apparatus of claim 7, wherein, when performing a natural language processing task, the language model is specifically configured to:
classify the text to be processed; and/or,
match the text to be processed with other texts; and/or,
label the content in the text to be processed; and/or,
predict the order of sentences in the text to be processed; and/or,
predict semantic distances between sentences in the text to be processed; and/or,
predict logical relations between sentences in the text to be processed.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.
10. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-4.
CN202010564362.9A 2020-06-19 2020-06-19 Language model training method and device, electronic equipment and readable storage medium Active CN111859951B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010564362.9A CN111859951B (en) 2020-06-19 2020-06-19 Language model training method and device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111859951A CN111859951A (en) 2020-10-30
CN111859951B true CN111859951B (en) 2024-03-26

Family

ID=72987596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010564362.9A Active CN111859951B (en) 2020-06-19 2020-06-19 Language model training method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111859951B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487814B (en) * 2020-11-27 2024-04-02 北京百度网讯科技有限公司 Entity classification model training method, entity classification device and electronic equipment
CN112507101B (en) * 2020-12-18 2024-04-05 北京百度网讯科技有限公司 Method and device for establishing pre-training language model
CN112669816B (en) * 2020-12-24 2023-06-02 北京有竹居网络技术有限公司 Model training method, voice recognition method, device, medium and equipment
CN113011135A (en) * 2021-03-03 2021-06-22 科大讯飞股份有限公司 Arabic vowel recovery method, device, equipment and storage medium
CN113011176A (en) * 2021-03-10 2021-06-22 云从科技集团股份有限公司 Language model training and language reasoning method, device and computer storage medium thereof
CN113220836B (en) * 2021-05-08 2024-04-09 北京百度网讯科技有限公司 Training method and device for sequence annotation model, electronic equipment and storage medium
CN114817469B (en) * 2022-04-27 2023-08-08 马上消费金融股份有限公司 Text enhancement method, training method and training device for text enhancement model
CN116052648B (en) * 2022-08-03 2023-10-20 荣耀终端有限公司 Training method, using method and training system of voice recognition model
CN115600646B (en) * 2022-10-19 2023-10-03 北京百度网讯科技有限公司 Language model training method, device, medium and equipment
CN115688796B (en) * 2022-10-21 2023-12-05 北京百度网讯科技有限公司 Training method and device for pre-training model in natural language processing field
CN115640611B (en) * 2022-11-25 2023-05-23 荣耀终端有限公司 Method for updating natural language processing model and related equipment
CN117744661A (en) * 2024-02-21 2024-03-22 中国铁道科学研究院集团有限公司电子计算技术研究所 Text generation model training method and text generation method based on prompt word engineering

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007087397A (en) * 2005-09-21 2007-04-05 Fujitsu Ltd Morphological analysis program, correction program, morphological analyzer, correcting device, morphological analysis method, and correcting method
CN102929916A (en) * 2012-09-19 2013-02-13 无锡华御信息技术有限公司 Method for backing up document based on document name identification
CN103077164A (en) * 2012-12-27 2013-05-01 新浪网技术(中国)有限公司 Text analysis method and text analyzer
WO2018149326A1 (en) * 2017-02-16 2018-08-23 阿里巴巴集团控股有限公司 Natural language question answering method and apparatus, and server
CN109033080A (en) * 2018-07-12 2018-12-18 上海金仕达卫宁软件科技有限公司 Medical terms standardized method and system based on probability transfer matrix
WO2019147804A1 (en) * 2018-01-26 2019-08-01 Ge Inspection Technologies, Lp Generating natural language recommendations based on an industrial language model
CN110110327A (en) * 2019-04-26 2019-08-09 网宿科技股份有限公司 A kind of text marking method and apparatus based on confrontation study
CN110134949A (en) * 2019-04-26 2019-08-16 网宿科技股份有限公司 A kind of text marking method and apparatus based on teacher's supervision

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160162467A1 (en) * 2014-12-09 2016-06-09 Idibon, Inc. Methods and systems for language-agnostic machine learning in natural language processing using feature extraction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Resume Information Entity Extraction Method Based on Deep Learning; Huang Sheng; Li Wei; Zhang Jian; Computer Engineering and Design; 2018-12-16 (12); full text *
Text Matching Method Combining Pre-trained Models and Language Knowledge Bases; Zhou Yeheng; Shi Jiahan; Xu Ruifeng; Journal of Chinese Information Processing; 2020-02-15 (02); full text *


Similar Documents

Publication Publication Date Title
CN111859951B (en) Language model training method and device, electronic equipment and readable storage medium
CN111539223B (en) Language model training method and device, electronic equipment and readable storage medium
CN111221983B (en) Time sequence knowledge graph generation method, device, equipment and medium
CN111428008B (en) Method, apparatus, device and storage medium for training a model
CN111737994B (en) Method, device, equipment and storage medium for obtaining word vector based on language model
US11556715B2 (en) Method for training language model based on various word vectors, device and medium
US20210383064A1 (en) Text recognition method, electronic device, and storage medium
US11526668B2 (en) Method and apparatus for obtaining word vectors based on language model, device and storage medium
US11663258B2 (en) Method and apparatus for processing dataset
CN111104514B (en) Training method and device for document tag model
CN111079442B (en) Vectorization representation method and device of document and computer equipment
US20210216819A1 (en) Method, electronic device, and storage medium for extracting spo triples
US20210397791A1 (en) Language model training method, apparatus, electronic device and readable storage medium
US20220019736A1 (en) Method and apparatus for training natural language processing model, device and storage medium
CN111783468B (en) Text processing method, device, equipment and medium
CN111680145A (en) Knowledge representation learning method, device, equipment and storage medium
CN110674314A (en) Sentence recognition method and device
CN111611468B (en) Page interaction method and device and electronic equipment
CN111078878B (en) Text processing method, device, equipment and computer readable storage medium
CN111339759A (en) Method and device for training field element recognition model and electronic equipment
CN111522944A (en) Method, apparatus, device and storage medium for outputting information
CN111831814A (en) Pre-training method and device of abstract generation model, electronic equipment and storage medium
CN111126063B (en) Text quality assessment method and device
CN112329453B (en) Method, device, equipment and storage medium for generating sample chapter
CN111241302B (en) Position information map generation method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant