CN113836925A - Training method and device for pre-training language model, electronic equipment and storage medium - Google Patents

Training method and device for pre-training language model, electronic equipment and storage medium

Info

Publication number
CN113836925A
CN113836925A
Authority
CN
China
Prior art keywords
training
entity
language model
training sample
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111089927.3A
Other languages
Chinese (zh)
Other versions
CN113836925B (en)
Inventor
卓安
黄际洲
王晓敏
鲁倪佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111089927.3A priority Critical patent/CN113836925B/en
Publication of CN113836925A publication Critical patent/CN113836925A/en
Application granted granted Critical
Publication of CN113836925B publication Critical patent/CN113836925B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/387 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using geographical or spatial information, e.g. location
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a training method and apparatus for a pre-training language model, an electronic device, and a storage medium, and relates to the field of computer technologies, in particular to natural language processing and deep learning. The specific implementation scheme is as follows: obtaining a pre-training sample, where the pre-training sample includes a pre-training corpus based on map retrieval keywords and target point of interest (poi) information, together with labeling information of the entities and entity types in the pre-training corpus; masking at least some of the entities in the pre-training sample; and performing geographic entity learning on the pre-training language model according to the masked pre-training sample. This scheme enables the pre-training language model to learn geographic entity knowledge and improves the adaptability of the model.

Description

Training method and device for pre-training language model, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, in particular to natural language processing and deep learning, and more particularly to a method and an apparatus for training a pre-training language model, an electronic device, and a storage medium.
Background
A pre-trained model can be pre-trained on a large-scale unlabeled corpus and can learn generic language representations. These representations can be reused for other tasks, avoiding training new models from scratch, and thus improving the training efficiency of subtask models. In recent years, the use of pre-trained language models has achieved good improvements on multiple NLP (Natural Language Processing) tasks.
At present, pre-training language models are mostly obtained by training on corpora from general scenarios. However, the map is a specialized field, and a training set formed from a general corpus is not directly related to NLP tasks in the map domain. As a result, when an existing pre-training language model is applied to the map field, there are domain adaptability problems to a certain extent, such as ambiguous understanding of some queries in actual service scenarios and low tuning efficiency of service models.
Disclosure of Invention
The present disclosure provides a training method and apparatus for a pre-training language model, an electronic device, and a storage medium.
According to a first aspect of the present disclosure, there is provided a training method of a pre-training language model, including:
obtaining a pre-training sample; the pre-training sample comprises a pre-training corpus based on map retrieval keywords and target interest point poi information, and entities in the pre-training corpus and labeling information of entity types;
masking at least part of entities in the pre-training samples;
and performing geographic entity learning on the pre-training language model according to the pre-training sample subjected to the mask.
According to a second aspect of the present disclosure, there is provided a training apparatus for pre-training a language model, comprising:
the device comprises a first acquisition module, a second acquisition module and a control module, wherein the first acquisition module is used for acquiring a pre-training sample; the pre-training sample comprises a pre-training corpus based on map retrieval keywords and target interest point poi information, and entities in the pre-training corpus and labeling information of entity types;
the mask module is used for masking at least part of entities in the pre-training samples;
and the first training module is used for carrying out geographic entity learning on the pre-training language model according to the pre-training sample subjected to the mask.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the first aspect.
According to the technical scheme of the present disclosure, a pre-training sample is obtained that contains a pre-training corpus based on map retrieval keywords and target point of interest (poi) information, together with labeling information of the entities and entity types in the pre-training corpus; some of the entities in the pre-training sample are masked; and the pre-training language model is trained on the masked sample. This enables the pre-training language model to learn geographic entity knowledge, avoids adaptability problems when the model is applied to the map field, improves the tuning efficiency of the model for subsequent tasks, and thus accelerates the implementation of relevant services in the poi field.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flowchart of a training method for pre-training a language model according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of obtaining pre-training samples in an embodiment of the present disclosure;
FIG. 3 is a flowchart of a training method for pre-training a language model according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of a training method for pre-training a language model according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of a training apparatus for pre-training a language model according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of an alternative training apparatus for pre-training a language model according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of an electronic device for implementing a method of training a pre-trained language model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the technical solution of the present disclosure, the acquisition, storage, and application of the personal information of the users involved comply with relevant laws and regulations and do not violate public order and good morals. The personal information of the users involved is acquired, stored, and applied with the users' consent.
It should be noted that the pre-training model can be pre-trained on large-scale unmarked corpus and can learn general language representation. These representations can be used for other tasks, avoiding training new models from scratch, and thus can improve the efficiency of the training of subtask models. In recent years, the use of pre-trained language models has achieved good improvement over multiple NLP tasks.
Because pre-training language models are mostly obtained by training on corpora from general scenarios, existing pre-training language models have domain adaptability problems to a certain extent when applied to the map field. The prior technical schemes mainly have two problems: (1) in terms of training corpora, existing entity recognition technology targets general corpora; the corpus of a general pre-training language model differs greatly from map-service corpora, lacks high-quality geographic knowledge, and cannot help the pre-training language model fuse geographic knowledge; (2) in terms of pre-training tasks, general-domain tasks learn map-scenario corpora of limited quality and quantity, learn the special semantic expressions appearing in map services insufficiently, and fall short on some middle- and long-tail problems of map tasks; in addition, general-domain tasks differ to some degree from the subtasks of the map field, so the domain adaptability problem must be solved before they can be used for map-scenario tasks, which increases the model tuning cost.
Based on the above problems, the present disclosure provides a training method and apparatus for pre-training a language model, an electronic device, and a storage medium. According to the scheme, the pre-training language model can be subjected to geographic entity learning, and the adaptability of the pre-training language model is improved.
Fig. 1 is a flowchart of a training method for pre-training a language model according to an embodiment of the present disclosure. It should be noted that the training method of the pre-training language model provided in the embodiments of the present disclosure may be applied to a training apparatus of the pre-training language model in the embodiments of the present disclosure, and the apparatus may be configured in an electronic device. As shown in fig. 1, the method may include the steps of:
step 101, obtaining a pre-training sample; the pre-training sample comprises a pre-training corpus based on map retrieval keywords and target interest point (point of interest) information, and labeling information of entities and entity types in the pre-training corpus.
It is understood that in order for the pre-trained language model to learn the knowledge of the geographic entity, the pre-trained corpus of the geographic knowledge, and the entity and entity type information in the pre-trained corpus need to be included in the pre-trained sample.
In the embodiment of the present disclosure, the target poi information may be information corresponding to the poi clicked by the user in the search results. That is, the pre-training corpus based on the map search keyword and the target poi information refers to corpus information obtained by combining the keywords actually used by users in the map field with the users' click behavior. Because the actual service scenario is incorporated, the resulting pre-training corpus is strongly correlated with the actual service, which can improve the pre-training effect of the pre-training language model.
As an example, the pre-training corpus based on the map retrieval keyword and the target poi information may be obtained as follows: according to user behavior data in the map field, acquiring information such as the name, address, and type of each poi clicked by users; acquiring the set of map retrieval keywords corresponding to each clicked poi within a preset time range; and, for each clicked poi, splicing the information of the poi with its corresponding map retrieval keyword set to form the pre-training corpus.
As another example, a map search keyword used by a user and poi information clicked by the user based on a search result may be obtained from a map search log; and splicing the obtained map retrieval key words and the target poi information to form a pre-training corpus.
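The following is a minimal sketch of this splicing step. The log format, field names, and separator token are assumptions for illustration and are not specified in this disclosure:

```python
from collections import defaultdict

def build_pretraining_corpus(click_log, poi_database, sep=" [SEP] "):
    """Splice each clicked poi's information with the map retrieval keywords
    that led to it, forming one piece of pre-training corpus per pair.

    click_log: iterable of (keyword, poi_id) pairs from the map retrieval log
        (assumed format).
    poi_database: dict mapping poi_id to a dict with "name", "alias",
        "address" and "type" fields (assumed schema).
    """
    keywords_per_poi = defaultdict(set)
    for keyword, poi_id in click_log:
        keywords_per_poi[poi_id].add(keyword)

    corpus = []
    for poi_id, keywords in keywords_per_poi.items():
        poi = poi_database.get(poi_id)
        if poi is None:  # clicked poi missing from the database, skip it
            continue
        poi_text = sep.join(str(poi.get(f, "")) for f in ("name", "alias", "address", "type"))
        for keyword in sorted(keywords):
            # one piece of pre-training corpus: retrieval keyword + target poi information
            corpus.append(keyword + sep + poi_text)
    return corpus
```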
The entity in the embodiment of the present disclosure refers to a geographic entity, and may include, for example, entity information of geographic locations such as XX province, XX city, XX district and county, and may also include entity information related to poi such as XX company, XX district, XX subway line, and the like. In addition, the entity type refers to a classification of a geographic entity, such as a country, a province, a city, a prefecture, a road, a poi name, a traffic route, a poi type, and the like, and a specific entity classification may be divided according to an application scenario, which is not limited in the present disclosure.
As an example, the implementation manner of obtaining the labeling information of the entity and the entity type in the pre-training corpus may be: and using the entity recognition model to recognize the entity and the entity type of the pre-training corpus so as to obtain the labeling information of the entity and the entity type in the pre-training corpus. The entity identification model may be a model in the prior art, or may be an entity identification model constructed according to an actual scene, which is not limited in this disclosure.
Step 102, masking at least part of the entities in the pre-training samples.
In order to enable the pre-training language model to learn the geographic entity, the embodiment of the present disclosure adopts a mask mode for the entity in the pre-training sample.
In the embodiment of the present disclosure, in order to improve the robustness of model training, for each piece of sample data in the pre-training sample, one or more entities may be randomly selected for masking, and the number of selected entities may also be random. In addition, each piece of sample data in the pre-training sample can be masked multiple times to obtain multiple masked versions, so as to improve the learning effect of the pre-training language model.
It should be noted that the masking may be performed in different manners, and a manner may be selected at random each time. One manner is to mask an entire entity: for example, in "A City B District", where "A City" and "B District" are both entities, masking the first entity gives "[mask] B District". Another manner is to mask only part of an entity, for example masking one character of "A City" to give "A [mask] B District". A further manner is to mask part of an entity and also replace some of its characters at random, for example turning "A City B District" into "A City Road [mask]".
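A minimal sketch of such entity masking is given below; the [mask] token, the particular random choices, and the replacement character pool are illustrative assumptions, not part of the disclosure:

```python
import random

MASK = "[mask]"

def mask_entities(chars, entity_spans):
    """Randomly mask one or more labeled entities in a character sequence.

    chars: list of characters of one piece of pre-training corpus.
    entity_spans: list of (start, end, entity_type) spans from the labeling info.
    Returns the masked character list and the spans that were masked.
    """
    chars = list(chars)
    picked = random.sample(entity_spans, random.randint(1, len(entity_spans)))
    masked_spans = []
    for start, end, etype in picked:
        mode = random.choice(("full", "partial", "partial_with_replace"))
        if mode == "full":                       # mask every character of the entity
            positions = list(range(start, end))
        else:                                    # mask only part of the entity
            positions = random.sample(range(start, end), max(1, (end - start) // 2))
        for pos in positions:
            chars[pos] = MASK
        if mode == "partial_with_replace":       # additionally replace one character at random
            pos = random.choice(range(start, end))
            chars[pos] = random.choice("路街道区城")  # illustrative replacement pool
        masked_spans.append((start, end, etype))
    return chars, masked_spans
```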
And 103, performing geographic entity learning on the pre-training language model according to the pre-training sample subjected to the mask.
Namely, the pre-training sample after being masked is input into a training pre-training language model, so that the pre-training language model outputs predicted entity data, and the pre-training language model is subjected to geographic entity learning according to the difference between the predicted entity data and the real data.
As an implementation manner, the pre-training samples after being masked may be input to a pre-training language model to obtain entity prediction data; performing geographic entity learning on the pre-training language model according to the entity prediction data and the masked entity; the entity prediction data is the prediction result of the pre-training language model on the masked entity according to the context of the pre-training sample after masking and the entity type.
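As a rough sketch of this learning step, assuming a BERT-style masked-language-model backbone from the Hugging Face transformers library (the checkpoint name, and the requirement that the masked and original texts tokenize to equal lengths, are assumptions rather than part of the disclosure):

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")  # assumed backbone
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def geographic_entity_step(masked_text, original_text):
    """One step of geographic entity learning: predict masked entity characters
    from context; only masked positions contribute to the loss.

    masked_text is assumed to use the tokenizer's own mask token.
    """
    inputs = tokenizer(masked_text, return_tensors="pt")
    labels = tokenizer(original_text, return_tensors="pt")["input_ids"]
    labels[inputs["input_ids"] != tokenizer.mask_token_id] = -100  # ignore unmasked positions
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```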
According to the training method of the pre-training language model provided by the embodiment of the present disclosure, a pre-training sample is obtained that contains a pre-training corpus based on map retrieval keywords and target point of interest (poi) information, together with labeling information of the entities and entity types in the pre-training corpus; some of the entities in the pre-training sample are masked; and the pre-training language model is trained on the masked sample. In this way, the pre-training language model can learn geographic entity knowledge, which avoids adaptability problems when the model is applied to the map field, improves the tuning efficiency of the model for subsequent tasks, and thus accelerates the implementation of relevant services in the poi field.
The present disclosure provides yet another embodiment for the manner of obtaining the pre-training samples.
Fig. 2 is a flow chart of obtaining pre-training samples in an embodiment of the present disclosure. As shown in fig. 2, an implementation of obtaining the pre-training samples may include:
step 201, obtaining a plurality of map retrieval keywords and target poi information of each map retrieval keyword according to the map retrieval log and the poi database.
In some embodiments of the present disclosure, the map retrieval keywords initiated by users, and the target poi clicked by the user in the retrieval results corresponding to each keyword, may be obtained from the map retrieval log. In order to broaden the knowledge covered by the obtained information, the target poi information can be obtained from the poi database; the poi information may include the poi name, poi alias, poi address, poi type, and so on, and in some embodiments users' comment data on the target poi may also be used as information of the target poi. It should be noted that only part of the target poi information, such as the poi name and poi address, may be available in the map retrieval log; in this case, the target poi information can be supplemented from the poi database.
Step 202, aiming at each map retrieval keyword, splicing the map retrieval keyword with target poi information of the map retrieval keyword to obtain a pre-training corpus.
That is, each map search keyword is spliced with its target poi information to serve as one piece of data in the pre-training corpus.
Optionally, the map search keyword and the target poi information thereof may be spliced in a preset splicing manner, for example, the map search keyword is inserted into a preset position in the target poi information, and the specific preset position may be determined according to an actual situation. The map search keyword and the target poi information thereof may also be spliced in a random manner, which is not limited in this disclosure.
Step 203, identifying the entity and the entity type in the pre-training corpus to obtain the labeling information of the entity and the entity type in the pre-training corpus, and using the pre-training corpus and the labeling information thereof as a pre-training sample.
In some embodiments of the present disclosure, identifying entities and entity types in the pre-training corpus may be implemented by an entity identification model. The entity recognition model can be an existing model or a model constructed according to an actual application scene, and meanwhile, the model can be trained based on samples of labeled entities and entity types. The entity type refers to an entity type divided for a geographic entity in actual application.
It should be noted that the BiLSTM-CRF model is a named entity recognition model; by learning the entity text features of each entity type, it can recognize the entities and entity types in the text input to the model. The BiLSTM layer processes the input text character by character, predicts a score for each label of each character, and passes the predicted scores to the CRF layer. The CRF layer learns the order dependency information between labels, determines the label corresponding to each character based on the per-character label scores output by the BiLSTM, and outputs the entities contained in the text and the type of each entity.
As an example, a BiLSTM-CRF model may be employed to identify the entities and entity types in the pre-training corpus. Before identification, entity and entity-type labeling can be performed on the map retrieval keywords in the map retrieval log, and the labeled map retrieval keywords are used as training samples for the BiLSTM-CRF model, so that the model learns to recognize geographic entities and entity types. The pre-training corpus is then input to the trained model to obtain the labeling information of the entities and entity types in the pre-training corpus, and the pre-training corpus together with its labeling information is used as a pre-training sample.
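A compact sketch of such a tagger is shown below, using PyTorch with the third-party pytorch-crf package for the CRF layer; the BIO tag scheme and the layer sizes are assumptions for illustration:

```python
import torch.nn as nn
from torchcrf import CRF  # third-party pytorch-crf package (assumed available)

class BiLSTMCRF(nn.Module):
    """Character-level BiLSTM-CRF tagger for geographic entities (BIO labels)."""

    def __init__(self, vocab_size, num_tags, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim // 2,
                              batch_first=True, bidirectional=True)
        self.emission = nn.Linear(hidden_dim, num_tags)  # per-character label scores
        self.crf = CRF(num_tags, batch_first=True)       # learns label-order dependencies

    def loss(self, char_ids, tags, mask):
        scores = self.emission(self.bilstm(self.embedding(char_ids))[0])
        return -self.crf(scores, tags, mask=mask)        # negative log-likelihood

    def decode(self, char_ids, mask):
        scores = self.emission(self.bilstm(self.embedding(char_ids))[0])
        return self.crf.decode(scores, mask=mask)        # best label sequence per sample
```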
In some embodiments of the present disclosure, in order to improve the quality of the pre-training sample, the labeling information of the entities and entity types in the pre-training corpus may first be preprocessed, and the pre-training corpus together with its preprocessed labeling information is used as the pre-training sample. The preprocessing may include operations such as entity merging and text normalization. For example, when certain combinations of entity types occur, they can be merged into one entity with more complete information through entity merging. For another example, when the format of the numbers in an entity is not uniform, the numbers can be unified through a text normalization operation.
Further, in some embodiments of the present disclosure, in order to improve the training efficiency of the pre-training language model, the pre-training corpus may be serialized according to the labeling information of its entities and entity types, and the serialization result is used as the pre-training sample. For labeling information that has undergone the preprocessing operation, the pre-training corpus can be serialized according to the preprocessed labeling information, and the serialization result is used as the pre-training sample.
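For example, a simple serialization of one corpus piece and its labeling information into parallel character and label sequences might look like this (the BIO scheme and the span format are assumptions):

```python
def serialize_sample(text, entity_spans):
    """Serialize a corpus piece and its entity labels into character/label sequences."""
    chars = list(text)
    labels = ["O"] * len(chars)
    for start, end, entity_type in entity_spans:
        labels[start] = "B-" + entity_type
        for i in range(start + 1, end):
            labels[i] = "I-" + entity_type
    return chars, labels

# e.g. serialize_sample("A市B区", [(0, 2, "CITY"), (2, 4, "DISTRICT")])
# returns (['A', '市', 'B', '区'], ['B-CITY', 'I-CITY', 'B-DISTRICT', 'I-DISTRICT'])
```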
According to the training method of the pre-training language model provided by the embodiment of the present disclosure, the pre-training corpus is obtained from the map retrieval log and the poi database based on user click behavior, which is equivalent to obtaining the pre-training corpus from the actual retrieval and click behavior of users on the map. This not only yields high-quality corpus information in the map field, but also makes the sample data strongly correlated with the actual service, which improves the learning efficiency of the pre-training language model. In addition, the entities and entity types of the pre-training corpus are identified, and the pre-training corpus together with the identified labeling information is used as the pre-training sample, so that the model can learn geographic entities more effectively based on entity context and entity type.
In order to further improve the training effect of the model, the present disclosure proposes yet another embodiment.
Fig. 3 is a flowchart of a training method for pre-training a language model according to an embodiment of the present disclosure. As shown in fig. 3, on the basis of the above embodiment, the method may further include:
step 301, performing word replacement processing on a first type entity in a pre-training sample to obtain a processed pre-training sample; wherein, the word replacement comprises the replacement of a character with a similar shape and/or the replacement of a character with a similar pinyin.
It is understood that, in order for the trained model to perform error correction automatically, word-replaced entities may be introduced into the pre-training sample as interference data, so that the pre-training language model learns to avoid the influence of such interference data.
In some implementations of the present disclosure, an entity of the first type is an entity on which word replacement processing can be performed. It should be noted that when word replacement processing is performed on each piece of sample data in the pre-training sample, one or more of the first-type entities of that sample data may be randomly selected for processing, and the number of selected entities may also be random. In addition, the word replacement manner may also be random: a piece of sample data may contain only similar-shape replacements, only similar-pinyin replacements, or both, so as to improve the robustness of model training. Meanwhile, each piece of sample data in the pre-training sample can be subjected to word replacement multiple times to obtain multiple replaced versions, so as to improve the learning effect of the pre-training language model.
The word replacement processing may include replacement with characters of similar shape and/or replacement with characters of similar pinyin. For example, for an entity whose Chinese name means "new world", similar-pinyin replacement may yield character strings meaning "new view", "new time", or "heart world", and similar-shape replacement may yield a string meaning "salary world". In addition, the word replacement processing may also include adjusting the order of characters within an entity. In some embodiments of the present disclosure, the word replacement processing may be performed by randomly selecting from candidate replacement characters for the entity obtained from a preset dictionary.
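A hedged sketch of this replacement step is given below; the confusion tables are tiny illustrative stand-ins for the preset dictionary mentioned above:

```python
import random

# Illustrative confusion tables; a real system would load curated dictionaries.
SIMILAR_PINYIN = {"世": ["视", "时"]}   # similar-pinyin candidates
SIMILAR_SHAPE = {"新": ["薪", "欣"]}    # similar-shape candidates

def word_replace(entity, n_replace=1):
    """Replace up to n_replace characters of an entity with similar-pinyin
    and/or similar-shape characters drawn from the preset tables."""
    chars = list(entity)
    candidates = [i for i, c in enumerate(chars)
                  if c in SIMILAR_PINYIN or c in SIMILAR_SHAPE]
    for i in random.sample(candidates, min(n_replace, len(candidates))):
        tables = [t for t in (SIMILAR_PINYIN, SIMILAR_SHAPE) if chars[i] in t]
        chars[i] = random.choice(random.choice(tables)[chars[i]])
    return "".join(chars)

# e.g. word_replace("新世界") may yield "新视界" (similar pinyin) or "薪世界" (similar shape)
```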
And step 302, performing geographical error correction learning on the pre-training language model subjected to geographical entity learning according to the processed pre-training sample.
It can be understood that the purpose of this step is to make the pre-trained language model learned by the geographic entity avoid the interference of the word replacement process through training, so that it can still predict the entity before the word replacement process, thereby implementing the geographic error correction learning of the model.
As an example, the pre-training language model that has undergone geographic entity learning may perform geographic error correction learning in the following manner: the processed pre-training sample is input to the model, which predicts the word-replaced entities according to the entities and entity types in the context and outputs entity prediction data; a loss value is then calculated from the entity prediction data and the entities before word replacement, and the model parameters are adjusted continuously according to the loss value until the entity prediction results meet expectations, thereby implementing geographic error correction learning for the model.
According to the training method of the pre-training language model provided by the embodiment of the present disclosure, word replacement processing is performed on entities in the pre-training sample, and geographic error correction learning is performed, according to the processed pre-training sample, on the pre-training language model that has undergone geographic entity learning. This allows the pre-training language model to learn geographic knowledge more fully, which further improves the effect of subsequent tasks that use the model.
The present disclosure presents yet another embodiment for the learning of relevance by a pre-trained language model.
FIG. 4 is a flowchart of a training method for pre-training a language model according to an embodiment of the present disclosure. As shown in fig. 4, on the basis of the foregoing embodiment, the method may further include:
step 401, obtaining a correlation training sample according to the map retrieval log.
In some embodiments of the present disclosure, correlation training samples with different correlation levels may be set, and three correlation levels will be described as an example. For example, three levels of strong correlation, weak correlation, and no correlation may be set, wherein the correlation sample may be the correlation between the map search keyword and the poi name. As an example, a poi name clicked by a user in a search result of the map search keyword can be used as a strong relevant poi corresponding to the map search keyword in the map search log based on the map search keyword of each search record; taking the poi name which is not clicked by the user in the retrieval result as the weakly related poi corresponding to the map retrieval keyword; randomly taking out a poi name from a poi database as an irrelevant poi corresponding to the map retrieval keyword; for each map search keyword, a combination of the map search keyword and the poi name of its corresponding three levels is used as one piece of data in the correlation training sample.
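An illustrative sketch of this sample construction follows; the record fields and the numeric level encoding (2 = strongly relevant, 1 = weakly relevant, 0 = irrelevant) are assumptions:

```python
import random

def build_correlation_samples(search_records, poi_database):
    """Build (keyword, poi_name, level) triples with three relevance levels
    from map retrieval records and the poi database."""
    all_poi_names = [p["name"] for p in poi_database.values()]
    samples = []
    for record in search_records:                                   # assumed record structure
        keyword = record["keyword"]
        samples.append((keyword, record["clicked_poi_name"], 2))    # strongly relevant: clicked
        if record["unclicked_poi_names"]:                           # weakly relevant: shown, not clicked
            samples.append((keyword, random.choice(record["unclicked_poi_names"]), 1))
        samples.append((keyword, random.choice(all_poi_names), 0))  # irrelevant: random poi
    return samples
```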
And step 402, performing relevance learning on the pre-trained language model subjected to the geographic error correction learning according to the relevance training sample.
In some embodiments of the present disclosure, the correlation training samples may be input to the pre-training language model that has undergone geographic error correction learning, and the model is trained to learn the different correlation levels in the samples. As an example, based on the correlation training sample in the above example, the sample data is input to the pre-training language model that has undergone geographic error correction learning. For each map search keyword in the sample data, the model predicts, according to the geographic knowledge it has learned, the strongly relevant, weakly relevant, and irrelevant poi names corresponding to that keyword; a loss value is calculated from the prediction results and the poi names at the corresponding levels in the sample data, and the pre-training language model is trained according to the loss value, thereby implementing the model's learning of correlation.
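One plausible realization of this correlation learning is to treat it as three-way sentence-pair classification on top of the already trained model; the classification formulation and the checkpoint loading below are assumptions, not the only possibility described here:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
# In practice the weights would come from the model that has already completed
# geographic entity (and, here, geographic error correction) learning.
rel_model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=3)
rel_optimizer = torch.optim.AdamW(rel_model.parameters(), lr=2e-5)

def correlation_step(keyword, poi_name, level):
    """One correlation-learning step on a (keyword, poi name, level) triple."""
    inputs = tokenizer(keyword, poi_name, return_tensors="pt")  # sentence-pair input
    outputs = rel_model(**inputs, labels=torch.tensor([level]))
    outputs.loss.backward()
    rel_optimizer.step()
    rel_optimizer.zero_grad()
    return outputs.loss.item()
```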
It should be noted that, according to the requirement of the actual scene, the pre-trained language model learned by the geographic entity may also be subjected to correlation learning according to the correlation training sample. That is, for the pre-trained language model that is not subjected to error correction learning, correlation learning may be performed to enhance the training effect of the pre-trained language model. As shown in fig. 4, the method may further include:
and step 403, performing relevance learning on the pre-trained language model which is learned by the geographic entity according to the relevance training sample.
As an example, based on the correlation training sample in the above example, the sample data is input to the pre-training language model that has undergone geographic entity learning. For each map search keyword in the sample data, the model predicts, according to the geographic entity knowledge it has learned, the strongly relevant, weakly relevant, and irrelevant poi names corresponding to that keyword; a loss value is calculated from the prediction results and the poi names at the corresponding levels in the sample data, and the pre-training language model is trained according to the loss value, thereby implementing the model's learning of correlation.
According to the training method of the pre-training language model provided by the embodiment of the present disclosure, a multi-task learning training mode is established for the pre-training language model, and correlation training is performed, according to the correlation training samples, on the pre-training language model that has undergone geographic entity learning. This enhances the model's learning ability in the geographic field and improves the model effect. In addition, the pre-training language model can sequentially undergo geographic entity learning, geographic error correction learning, and correlation learning, so that the map-domain knowledge learned by the model is more sufficient, which enhances the training effect of the pre-training language model and improves the applicability of the model training method.
In order to implement the above embodiments, the present disclosure provides a training apparatus for pre-training a language model.
Fig. 5 is a block diagram illustrating a structure of a training apparatus for pre-training a language model according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus may include:
a first obtaining module 510, configured to obtain a pre-training sample; the pre-training sample comprises a pre-training corpus based on map retrieval keywords and the information of the target interest point poi, and entities in the pre-training corpus and labeling information of entity types;
a mask module 520, configured to mask at least some entities in the pre-training samples;
the first training module 530 is configured to perform geographic entity learning on the pre-training language model according to the pre-training sample after the mask.
In some embodiments of the present disclosure, the first training module 530 is specifically configured to:
inputting the pre-training sample subjected to the mask to a pre-training language model to obtain entity prediction data; the entity prediction data is a prediction result of the pre-training language model on the masked entity according to the context of the pre-training sample after masking and the entity type;
and performing geographic entity learning on the pre-training language model according to the entity prediction data and the masked entity.
As an implementation manner, in the embodiment of the present disclosure, the first obtaining module 510 includes:
an acquisition unit 511 configured to acquire a plurality of map retrieval keywords and target poi information of each map retrieval keyword, according to the map retrieval log and the poi database;
the splicing unit 512 is configured to splice the map retrieval keywords with the target poi information of the map retrieval keywords to obtain a pre-training corpus for each map retrieval keyword;
the identifying unit 513 is configured to identify an entity and an entity type in the pre-training corpus to obtain labeling information of the entity and the entity type in the pre-training corpus, and use the pre-training corpus and the labeling information as a pre-training sample.
Optionally, in some embodiments of the present disclosure, the identifying unit 513 is further configured to:
and carrying out serialization processing on the pre-training corpus according to the labeling information, and taking a serialization processing result as a pre-training sample.
According to the training apparatus for the pre-training language model provided by the embodiment of the present disclosure, a pre-training sample is obtained that contains a pre-training corpus based on map retrieval keywords and target point of interest (poi) information, together with labeling information of the entities and entity types in the pre-training corpus; some of the entities in the pre-training sample are masked; and the pre-training language model is trained on the masked sample. In this way, the pre-training language model can learn geographic entity knowledge, which avoids adaptability problems when the model is applied to the map field, improves the tuning efficiency of the model for subsequent tasks, and thus accelerates the implementation of relevant services in the poi field.
In order to further improve the training effect of the pre-training language model, the application provides another embodiment.
Fig. 6 is a block diagram illustrating a structure of another training apparatus for pre-training a language model according to an embodiment of the present disclosure. As shown in fig. 6, on the basis of the above embodiment, the apparatus further includes:
the replacing module 640 is configured to perform word replacing processing on the first type entity in the pre-training sample to obtain a processed pre-training sample; wherein, the word replacement processing comprises the replacement of similar words and/or the replacement of words with similar pinyin;
and the second training module 650 is configured to perform, according to the processed pre-training sample, geographic error correction learning on the pre-training language model that is subjected to geographic entity learning.
To further improve the training effect of the model, the apparatus may further include:
the second obtaining module 660 is configured to obtain a correlation training sample according to the map retrieval log;
and the third training module 670 is configured to perform correlation learning on the pre-trained language model subjected to the geographic error correction learning according to the correlation training sample.
Further, in some embodiments of the present disclosure, the apparatus may further include:
a third obtaining module 680, configured to obtain a correlation training sample according to the map retrieval log;
the fourth training module 690 is configured to perform correlation learning on the pre-trained language model learned by the geographic entity according to the correlation training sample.
It should be noted that, in some embodiments of the present disclosure, the functions of the second obtaining module 660 and the third obtaining module 680 may be configured in the same functional module, and the functions of the third training module 670 and the fourth training module 690 may also be configured in the same functional module; in practical applications, the configuration may be determined according to the actual application scenario.
The modules 510 to 530 in fig. 5 have the same functional structures as the modules 610 to 630 in fig. 6, and are not described again here.
According to the training apparatus for the pre-training language model provided by the embodiment of the present disclosure, a multi-task learning training mode is established for the pre-training language model, and correlation training is performed, according to the correlation training samples, on the pre-training language model that has undergone geographic entity learning. This enhances the model's learning ability in the geographic field and improves the model effect. In addition, the pre-training language model can sequentially undergo geographic entity learning, geographic error correction learning, and correlation learning, so that the map-domain knowledge learned by the model is more sufficient, which enhances the training effect of the pre-training language model and improves the applicability of the model training method.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 701 performs the respective methods and processes described above, such as a training method of a pre-training language model. For example, in some embodiments, the training method of the pre-trained language model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 708. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto device 700 via ROM 702 and/or communications unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the method of training a pre-trained language model described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the training method of the pre-trained language model.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A method of training a pre-trained language model, comprising:
obtaining a pre-training sample; the pre-training sample comprises a pre-training corpus based on map retrieval keywords and target interest point poi information, and entities in the pre-training corpus and labeling information of entity types;
masking at least part of entities in the pre-training samples;
and performing geographic entity learning on the pre-training language model according to the pre-training sample subjected to the mask.
2. The method of claim 1, wherein the performing geo-entity learning on the pre-trained language model according to the masked pre-trained samples comprises:
inputting the pre-training sample subjected to the mask to a pre-training language model to obtain entity prediction data; the entity prediction data is the prediction result of the pre-training language model for the masked entity according to the context of the masked pre-training sample and the entity type;
and performing geographic entity learning on the pre-training language model according to the entity prediction data and the masked entity.
3. The method of claim 1, wherein the obtaining pre-training samples comprises:
according to a map retrieval log and a poi database, obtaining a plurality of map retrieval keywords and target poi information of each map retrieval keyword;
for each map retrieval keyword, splicing the map retrieval keyword with target poi information of the map retrieval keyword to obtain a pre-training corpus;
and identifying the entity and the entity type in the pre-training corpus to obtain the labeling information of the entity and the entity type in the pre-training corpus, and taking the pre-training corpus and the labeling information as a pre-training sample.
4. The method according to claim 3, wherein the using the pre-training corpus and the labeling information as pre-training samples comprises:
and carrying out serialization processing on the pre-training corpus according to the labeling information, and taking a serialization processing result as a pre-training sample.
5. The method of claim 1, further comprising:
performing word replacement processing on the first type of entity in the pre-training sample to obtain a processed pre-training sample; wherein, the word replacement processing comprises the replacement of similar words and/or the replacement of words with similar pinyin;
and performing geographical error correction learning on the pre-training language model which is subjected to the geographical entity learning according to the processed pre-training sample.
6. The method of claim 5, further comprising:
obtaining a correlation training sample according to the map retrieval log;
and performing relevance learning on the pre-trained language model subjected to the geographic error correction learning according to the relevance training sample.
7. The method of any of claims 1 to 4, further comprising:
obtaining a correlation training sample according to the map retrieval log;
and performing relevance learning on the pre-trained language model which is learned by the geographic entity according to the relevance training sample.
8. A training apparatus for pre-training a language model, comprising:
the device comprises a first acquisition module, a second acquisition module and a control module, wherein the first acquisition module is used for acquiring a pre-training sample; the pre-training sample comprises a pre-training corpus based on map retrieval keywords and target interest point poi information, and entities in the pre-training corpus and labeling information of entity types;
the mask module is used for masking at least part of entities in the pre-training samples;
and the first training module is used for carrying out geographic entity learning on the pre-training language model according to the pre-training sample subjected to the mask.
9. The apparatus of claim 8, wherein the first training module is specifically configured to:
inputting the pre-training sample subjected to the mask to a pre-training language model to obtain entity prediction data; the entity prediction data is the prediction result of the pre-training language model for the masked entity according to the context of the masked pre-training sample and the entity type;
and performing geographic entity learning on the pre-training language model according to the entity prediction data and the masked entity.
10. The apparatus of claim 8, wherein the first obtaining module comprises:
an obtaining unit, configured to obtain, according to a map retrieval log and a poi database, a plurality of map retrieval keywords and the target poi information of each map retrieval keyword;
a concatenation unit, configured to concatenate, for each map retrieval keyword, the map retrieval keyword with its target poi information to obtain a pre-training corpus;
and an identification unit, configured to identify the entities and entity types in the pre-training corpus to obtain labeling information of the entities and entity types, and take the pre-training corpus and the labeling information as a pre-training sample.
11. The apparatus of claim 10, wherein the identification unit is further configured to:
perform serialization processing on the pre-training corpus according to the labeling information, and take the serialization result as the pre-training sample.
12. The apparatus of claim 8, further comprising:
a replacement module, configured to perform word replacement processing on first-type entities in the pre-training sample to obtain a processed pre-training sample, wherein the word replacement processing comprises replacement with similar words and/or replacement with words having similar pinyin;
and a second training module, configured to perform geographic error correction learning, according to the processed pre-training sample, on the pre-training language model that has undergone the geographic entity learning.
13. The apparatus of claim 12, further comprising:
a second obtaining module, configured to obtain a relevance training sample according to the map retrieval log;
and a third training module, configured to perform relevance learning, according to the relevance training sample, on the pre-training language model that has undergone the geographic error correction learning.
14. The apparatus of any of claims 8 to 11, further comprising:
a third obtaining module, configured to obtain a relevance training sample according to the map retrieval log;
and a fourth training module, configured to perform relevance learning, according to the relevance training sample, on the pre-training language model that has undergone the geographic entity learning.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1 to 7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN202111089927.3A 2021-09-16 2021-09-16 Training method and device for pre-training language model, electronic equipment and storage medium Active CN113836925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111089927.3A CN113836925B (en) 2021-09-16 2021-09-16 Training method and device for pre-training language model, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111089927.3A CN113836925B (en) 2021-09-16 2021-09-16 Training method and device for pre-training language model, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113836925A true CN113836925A (en) 2021-12-24
CN113836925B CN113836925B (en) 2023-07-07

Family

ID=78959695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111089927.3A Active CN113836925B (en) 2021-09-16 2021-09-16 Training method and device for pre-training language model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113836925B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114580543A (en) * 2022-03-07 2022-06-03 北京百度网讯科技有限公司 Model training method, interactive log analysis method, device, equipment and medium
CN114861889A (en) * 2022-07-04 2022-08-05 北京百度网讯科技有限公司 Deep learning model training method, target object detection method and device
CN115081453A (en) * 2022-08-23 2022-09-20 北京睿企信息科技有限公司 Named entity identification method and system
CN115346657A (en) * 2022-07-05 2022-11-15 深圳市镜象科技有限公司 Training method and device for improving senile dementia recognition effect by transfer learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111539223A (en) * 2020-05-29 2020-08-14 北京百度网讯科技有限公司 Language model training method and device, electronic equipment and readable storage medium
CN112559885A (en) * 2020-12-25 2021-03-26 北京百度网讯科技有限公司 Method and device for determining training model of map interest point and electronic equipment
US20210103775A1 (en) * 2019-10-08 2021-04-08 International Business Machines Corporation Span selection training for natural language processing

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210103775A1 (en) * 2019-10-08 2021-04-08 International Business Machines Corporation Span selection training for natural language processing
CN111539223A (en) * 2020-05-29 2020-08-14 北京百度网讯科技有限公司 Language model training method and device, electronic equipment and readable storage medium
CN112559885A (en) * 2020-12-25 2021-03-26 北京百度网讯科技有限公司 Method and device for determining training model of map interest point and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Y. Sun et al.: "ERNIE: Enhanced Representation through Knowledge Integration", arXiv.org *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114580543A (en) * 2022-03-07 2022-06-03 北京百度网讯科技有限公司 Model training method, interactive log analysis method, device, equipment and medium
CN114580543B (en) * 2022-03-07 2023-09-29 北京百度网讯科技有限公司 Model training method, interaction log analysis method, device, equipment and medium
CN114861889A (en) * 2022-07-04 2022-08-05 北京百度网讯科技有限公司 Deep learning model training method, target object detection method and device
CN114861889B (en) * 2022-07-04 2022-09-27 北京百度网讯科技有限公司 Deep learning model training method, target object detection method and device
CN115346657A (en) * 2022-07-05 2022-11-15 深圳市镜象科技有限公司 Training method and device for improving senile dementia recognition effect by transfer learning
CN115346657B (en) * 2022-07-05 2023-07-28 深圳市镜象科技有限公司 Training method and device for improving identification effect of senile dementia by utilizing transfer learning
CN115081453A (en) * 2022-08-23 2022-09-20 北京睿企信息科技有限公司 Named entity identification method and system
CN115081453B (en) * 2022-08-23 2022-11-04 北京睿企信息科技有限公司 Named entity identification method and system

Also Published As

Publication number Publication date
CN113836925B (en) 2023-07-07

Similar Documents

Publication Publication Date Title
CN113836925A (en) Training method and device for pre-training language model, electronic equipment and storage medium
CN112560496A (en) Training method and device of semantic analysis model, electronic equipment and storage medium
CN112507706B (en) Training method and device for knowledge pre-training model and electronic equipment
EP4113357A1 (en) Method and apparatus for recognizing entity, electronic device and storage medium
CN112541070B (en) Mining method and device for slot updating corpus, electronic equipment and storage medium
CN114490998B (en) Text information extraction method and device, electronic equipment and storage medium
CN112559885A (en) Method and device for determining training model of map interest point and electronic equipment
CN113053367A (en) Speech recognition method, model training method and device for speech recognition
CN113407610A (en) Information extraction method and device, electronic equipment and readable storage medium
CN112507103A (en) Task type dialogue and model training method, device, equipment and storage medium
CN114244795B (en) Information pushing method, device, equipment and medium
CN114218951B (en) Entity recognition model training method, entity recognition method and device
CN114399772B (en) Sample generation, model training and track recognition methods, devices, equipment and media
CN113408273B (en) Training method and device of text entity recognition model and text entity recognition method and device
CN112270169B (en) Method and device for predicting dialogue roles, electronic equipment and storage medium
CN112699237B (en) Label determination method, device and storage medium
CN112528146B (en) Content resource recommendation method and device, electronic equipment and storage medium
CN113157877A (en) Multi-semantic recognition method, device, equipment and medium
CN112560425A (en) Template generation method and device, electronic equipment and storage medium
CN115292467A (en) Information processing and model training method, apparatus, device, medium, and program product
CN114417862A (en) Text matching method, and training method and device of text matching model
CN113051926A (en) Text extraction method, equipment and storage medium
CN113886543A (en) Method, apparatus, medium, and program product for generating an intent recognition model
CN113807390A (en) Model training method and device, electronic equipment and storage medium
CN113553833A (en) Text error correction method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant