CN113836925B - Training method and device for pre-training language model, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113836925B
Authority
CN
China
Prior art keywords
training
entity
language model
learning
geographic
Prior art date
Legal status
Active
Application number
CN202111089927.3A
Other languages
Chinese (zh)
Other versions
CN113836925A (en)
Inventor
卓安
黄际洲
王晓敏
鲁倪佳
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111089927.3A
Publication of CN113836925A
Application granted
Publication of CN113836925B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/387: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a training method and apparatus for a pre-training language model, an electronic device, and a storage medium, and relates to the field of computer technology, in particular to natural language processing and deep learning. The implementation scheme is as follows: obtaining a pre-training sample, where the pre-training sample includes a pre-training corpus based on map search keywords and target point-of-interest (poi) information, together with labeling information of the entities and entity types in the pre-training corpus; masking at least some of the entities in the pre-training sample; and performing geographic entity learning on the pre-training language model according to the masked pre-training sample. The scheme enables the pre-training language model to learn geographic entity knowledge and improves the adaptability of the model.

Description

Training method and device for pre-training language model, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technology, in particular to the fields of natural language processing and deep learning, and more particularly to a training method and apparatus for a pre-training language model, an electronic device, and a storage medium.
Background
A pre-training model can be pre-trained on large-scale unlabeled corpora to learn generic language representations. These representations can then be reused for other tasks, avoiding training new models from scratch and thus improving the training efficiency of each downstream task model. In recent years, pre-trained language models have brought considerable improvements on many natural language processing (NLP) tasks.
Currently, most pre-training language models are trained on corpora from general scenarios. Maps, however, are a specialized domain, and training sets built from general corpora are not directly related to NLP tasks on maps. As a result, existing pre-training language models suffer from a degree of domain adaptation difficulty when applied to the map domain, such as ambiguous understanding of some user needs in real service scenarios and low tuning efficiency of service models.
Disclosure of Invention
The disclosure provides a training method and device for a pre-training language model, electronic equipment and a storage medium.
According to a first aspect of the present disclosure, there is provided a training method of a pre-training language model, including:
obtaining a pre-training sample; the pre-training sample comprises a pre-training corpus based on map retrieval keywords and target point-of-interest (poi) information, and labeling information of entities and entity types in the pre-training corpus;
Masking at least some entities in the pre-training samples;
and according to the pre-training samples after masking, performing geographic entity learning on the pre-training language model.
According to a second aspect of the present disclosure, there is provided a training apparatus for pre-training a language model, comprising:
the first acquisition module is used for acquiring a pre-training sample; the pre-training sample comprises a pre-training corpus based on map retrieval keywords and target point-of-interest (poi) information, and labeling information of entities and entity types in the pre-training corpus;
a masking module, configured to mask at least part of entities in the pre-training samples;
and the first training module is used for carrying out geographic entity learning on the pre-training language model according to the masked pre-training samples.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect described above.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of the first aspect described above.
According to the technical solution of the present disclosure, a pre-training sample is obtained that includes a pre-training corpus based on map retrieval keywords and target poi information together with labeling information of the entities and entity types in the corpus, some of the entities in the pre-training sample are masked, and the pre-training language model is trained on the masked sample. In this way the model learns geographic entity knowledge, the domain adaptation problem that arises when a pre-training language model is applied to the map domain is avoided, the tuning efficiency of the pre-training language model on downstream tasks is improved, and the deployment of related services in the poi field is accelerated.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a training method for a pre-training language model provided by an embodiment of the present disclosure;
FIG. 2 is a flow chart of obtaining a pre-training sample in an embodiment of the present disclosure;
FIG. 3 is a flowchart of a training method for yet another pre-training language model provided by an embodiment of the present disclosure;
FIG. 4 is a flow chart of a training method of yet another pre-training language model provided by an embodiment of the present disclosure;
FIG. 5 is a block diagram of a training apparatus for pre-training a language model provided in an embodiment of the present disclosure;
FIG. 6 is a block diagram of a training apparatus of another pre-training language model provided by an embodiment of the present disclosure;
FIG. 7 is a block diagram of an electronic device for implementing a training method of a pre-trained language model of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the technical solution of the present disclosure, the collection, storage, and use of the user personal information involved all comply with the relevant laws and regulations and do not violate public order and good morals. The user personal information involved is collected, stored, and used only with the user's consent.
It should be noted that a pre-training model can be pre-trained on large-scale unlabeled corpora to learn generic language representations. These representations can then be reused for other tasks, avoiding training new models from scratch and thus improving the training efficiency of each downstream task model. In recent years, pre-trained language models have brought considerable improvements on multiple NLP tasks.
Because pre-training language models are mostly trained on corpora from general scenarios, existing pre-training language models face a degree of domain adaptation difficulty when applied to the map domain. The prior art mainly has problems in two respects: (1) in terms of training corpora, current entity recognition technology targets general corpora, but the corpora used for general pre-training language models differ greatly from map-service corpora and lack high-quality geographic knowledge, so they cannot help the pre-training language model fuse geographic knowledge; (2) in terms of pre-training tasks, general-domain tasks are limited by the quality and quantity of corpora available for map scenarios, learn the special semantic expressions appearing in map services insufficiently, and handle the long-tail problems of some map tasks poorly; in addition, general-domain tasks differ to some extent from the subtasks of the map domain, so the domain adaptation problem must be overcome when they are used in map scenarios, which increases the model tuning cost.
Based on the above problems, the present disclosure provides a training method and apparatus for a pre-training language model, an electronic device, and a storage medium. This scheme enables the pre-training language model to learn geographic entities and improves the adaptability of the pre-training language model.
Fig. 1 is a flowchart of a training method of a pre-training language model according to an embodiment of the present disclosure. It should be noted that, the training method of the pre-training language model provided by the embodiment of the disclosure may be applied to the training device of the pre-training language model in the embodiment of the disclosure, and the device may be configured in an electronic device. As shown in fig. 1, the method may include the steps of:
Step 101, obtaining a pre-training sample; the pre-training sample comprises a pre-training corpus based on map search keywords and target point-of-interest (poi) information, and labeling information of the entities and entity types in the pre-training corpus.
It will be appreciated that in order for the pre-training language model to learn the knowledge of the geographic entities, the pre-training samples need to include pre-training corpus about the geographic knowledge, as well as entity and entity type information in the pre-training corpus.
In the embodiments of the present disclosure, the target poi information may be the information of a poi that the user clicked in the search results. That is, the pre-training corpus based on map search keywords and target poi information is corpus information obtained by combining the keywords users actually searched in the map domain with the users' click behavior. Because it reflects real business scenarios, the resulting pre-training corpus is strongly correlated with actual business, which improves the pre-training effect of the pre-training language model.
As an example, the pre-training corpus based on map search keywords and target poi information may be obtained as follows: acquiring the names, addresses, types, and other information of the pois clicked by users according to user behavior data in the map domain; acquiring the set of map search keywords corresponding to the same clicked poi within a preset time range; and, for each clicked poi, splicing the information of the poi with the corresponding set of map search keywords to form the pre-training corpus.
As another example, the map search keywords used by users and the information of the pois they clicked in the search results may be acquired from the map search log, and the acquired map search keywords and target poi information may be spliced to form the pre-training corpus.
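The following Python sketch illustrates this second example under simplified assumptions: the search log is reduced to (keyword, clicked poi id) pairs, the poi database is a dictionary keyed by poi id, and the separator token and field names are hypothetical rather than part of the disclosure.

```python
def build_pretraining_corpus(search_log, poi_db, sep=" [SEP] "):
    """search_log: iterable of (keyword, clicked_poi_id) pairs;
    poi_db: dict mapping poi_id -> {"name", "address", "type"}.
    All field names and the separator token are hypothetical."""
    corpus = []
    for keyword, poi_id in search_log:
        poi = poi_db.get(poi_id)
        if poi is None:
            continue  # skip clicks with missing poi info
        # Splice the map search keyword with the clicked (target) poi
        # information into one line of the pre-training corpus.
        corpus.append(keyword + sep + " ".join([poi["name"], poi["address"], poi["type"]]))
    return corpus

# Hypothetical usage:
# build_pretraining_corpus([("XX subway", "poi_1")],
#                          {"poi_1": {"name": "XX Subway Station",
#                                     "address": "B District, A City",
#                                     "type": "transit"}})
```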
An entity in the embodiments of the disclosure refers to a geographic entity, and may include geographic location entities such as XX Province, XX City, and XX County, as well as poi-related entities such as XX Company, XX Residential Community, and XX Subway. An entity type refers to a category of geographic entities, such as country, province, city, county, road, poi name, transit line, or poi type; the specific entity categories may be defined according to the application scenario, which is not limited in this disclosure.
As an example, the labeling information of the entities and entity types in the pre-training corpus may be obtained by performing entity and entity-type recognition on the pre-training corpus with an entity recognition model. The entity recognition model may be an existing model or one constructed for the actual scenario, which is not limited in this disclosure.
Step 102, masking at least some of the entities in the pre-training samples.
To enable the pre-training language model to learn geographic entities, the embodiments of the present disclosure mask entities in the pre-training samples.
In the embodiments of the disclosure, to improve the robustness of model training, one or several entities may be selected at random for masking in each piece of sample data in the pre-training samples, and the number of selected entities may also be random. In addition, each piece of sample data may be masked multiple times to obtain several masked copies, which improves the learning effect of the pre-training language model.
It should be noted that masking can take different forms, and a form may be selected at random each time masking is performed. One form is to mask a whole entity: for example, in "A City B District", where "A City" is one entity and "B District" is another, masking the first entity may yield "[mask] B District". Another form is to mask part of an entity: "A City B District" may become "[mask] City B District" after masking. A third form is to mask part of an entity and also randomly replace characters within it: "A City B District" may become "A City Road [mask]" after masking.
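A minimal sketch of these three masking variants, assuming character-level samples, entity spans given as (start, end) offsets, and a hypothetical [mask] token:

```python
import random

MASK = "[mask]"  # assumed mask token; the disclosure does not fix a token name

def mask_entity(chars, span, mode=None):
    """chars: the sample as a list of characters; span: (start, end) of one
    labeled entity. Sketch of the three masking variants described above:
    mask the whole entity, mask part of it, or mask part of it and also
    replace another character at random."""
    start, end = span
    mode = mode or random.choice(["full", "partial", "partial_replace"])
    out = list(chars)
    if mode == "full":
        for i in range(start, end):
            out[i] = MASK
        return out
    i = random.randrange(start, end)
    out[i] = MASK  # partially mask the entity
    if mode == "partial_replace" and end - start > 1:
        j = random.choice([k for k in range(start, end) if k != i])
        out[j] = random.choice("路城区街")  # hypothetical replacement characters
    return out
```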
Step 103, performing geographic entity learning on the pre-training language model according to the masked pre-training samples.
That is, the masked pre-training samples are input into the pre-training language model to output predicted entity data, and geographic entity learning is performed on the pre-training language model according to the difference between the predicted entity data and the real data.
As one implementation, the masked pre-training samples are input into the pre-training language model to obtain entity prediction data, and geographic entity learning is performed on the pre-training language model according to the entity prediction data and the masked entities. The entity prediction data is the pre-training language model's prediction of the masked entities and their entity types from the context of the masked pre-training sample.
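As an illustration only, one masked-entity learning step could look like the following PyTorch sketch, assuming the model returns per-position vocabulary logits and that the labels carry the original token ids at masked positions and -100 elsewhere; neither interface is prescribed by this disclosure.

```python
import torch
import torch.nn as nn

def geographic_entity_learning_step(model, optimizer, batch):
    """One illustrative training step: predict the masked (geographic) entity
    tokens and update the model from the gap between prediction and the
    original tokens. The batch layout is an assumption: input_ids are the
    masked token ids, and labels hold the original token ids at masked
    positions and -100 (ignored) everywhere else."""
    logits = model(batch["input_ids"])  # assumed shape: (batch, seq_len, vocab)
    loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
    loss = loss_fn(logits.view(-1, logits.size(-1)), batch["labels"].view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```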
With the training method of the pre-training language model provided by the embodiments of the present disclosure, a pre-training sample is obtained that includes a pre-training corpus based on map search keywords and target poi information together with the labeling information of the entities and entity types in the corpus, some of the entities in the pre-training sample are masked, and the pre-training language model is trained on the masked sample. The pre-training language model thus learns geographic entity knowledge, which avoids the domain adaptation problem that arises when a pre-training model is applied to the map domain, improves the tuning efficiency of the pre-training model on downstream tasks, and accelerates the deployment of related services in the poi field.
Yet another embodiment is presented in this disclosure for the manner in which pre-training samples are obtained.
Fig. 2 is a flow chart of obtaining a pre-training sample in an embodiment of the present disclosure. As shown in fig. 2, an implementation of obtaining the pre-training samples may include:
step 201, obtaining a plurality of map search keywords and target poi information of each map search keyword according to the map search log and the poi database.
In some embodiments of the present disclosure, the map search keywords issued by users and the target poi clicked by the user in the search results of each map search keyword may be obtained from the map search log. To broaden the coverage of the acquired information, the target poi information may be obtained from the poi database; the poi information may include the poi name, poi alias, poi address, poi type, and so on, and in some embodiments user comments on the target poi may also be used as target poi information. The map search log may contain only part of the target poi information, such as the poi name and poi address, in which case the target poi information can be supplemented from the poi database.
Step 202, for each map search keyword, splicing the map search keywords with target poi information of the map search keywords to obtain a pre-training corpus.
That is, each map search keyword and the target poi information thereof are spliced to be used as one piece of data in the pre-training corpus.
Alternatively, the map search keyword and the target poi information thereof may be spliced in a preset splicing manner, for example, the map search keyword is inserted into a preset position in the target poi information, and the specific preset position may be determined according to the actual situation. The map search keywords and their target poi information may also be spliced in a random manner, which is not limited by the present disclosure.
Step 203, identifying the entities and entity types in the pre-training corpus to obtain the labeling information of the entities and entity types in the pre-training corpus, and taking the pre-training corpus and its labeling information as the pre-training samples.
In some embodiments of the present disclosure, the entities and entity types in the pre-training corpus may be identified by an entity recognition model. The entity recognition model may be an existing model or a model constructed for the actual application scenario, and it may be trained on samples with labeled entities and entity types. The entity types here are the geographic entity categories defined for the practical application.
It should be noted that the BiLSTM-CRF model is a named entity recognition model that learns the textual characteristics of each entity type and can thus recognize the entities and entity types in the text input to it. The BiLSTM layer processes the input text token by token, predicts a score for each token against each tag, and passes the predicted scores to the CRF layer. Tags correspond to entity types: for example, if an entity type is Organization, the corresponding tags may be B-Organization (the beginning of an organization) and I-Organization (the inside of an organization). Based on the per-token tag scores output by the BiLSTM, the CRF layer learns the sequential dependencies between tags, determines the tag for each token, and outputs the entities contained in the text together with their types.
As one example, a BiLSTM-CRF model may be employed to identify the entities and entity types in the pre-training corpus. Before recognition, entity and entity-type labeling can be carried out on the map search keywords in the map search log, and the labeled map search keywords used as training samples for the BiLSTM-CRF model, so that the model learns to recognize geographic entities and entity types. The pre-training corpus is then input into the trained model to obtain the labeling information of its entities and entity types, and the pre-training corpus and its labeling information are taken as the pre-training samples.
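For illustration, the sketch below shows how per-character BIO tags, as a BiLSTM-CRF tagger would emit them, can be converted into entity and entity-type labeling information; the tag names and the character-level granularity are assumptions.

```python
def bio_to_entities(chars, tags):
    """Convert per-character BIO tags (e.g. B-City, I-City, O) into
    (entity_text, entity_type, start, end) labeling information."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes the last entity
        if i == len(chars) or not tag.startswith("I-"):
            if start is not None:
                entities.append(("".join(chars[start:i]), etype, start, i))
                start, etype = None, None
        if i < len(chars) and tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities

# Example: "A市B区" labeled as two geographic entities
print(bio_to_entities(list("A市B区"),
                      ["B-City", "I-City", "B-District", "I-District"]))
# -> [('A市', 'City', 0, 2), ('B区', 'District', 2, 4)]
```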
In some embodiments of the present disclosure, to improve the quality of the pre-training sample, the labeling information of the entities and entity types in the pre-training corpus may be preprocessed, and the pre-training corpus together with its preprocessed labeling information may be taken as the pre-training sample. The preprocessing may include operations such as entity merging and text normalization: for example, when certain combinations of entity types appear, they may be merged into a more complete entity; and when the number formats within entities are inconsistent, they may be unified through text normalization.
Further, in some embodiments of the present disclosure, to improve the training efficiency of the pre-training language model, the pre-training corpus may be serialized according to the labeling information of its entities and entity types, and the serialization result used as the pre-training sample. The labeling information may also be preprocessed first, the pre-training corpus serialized according to the preprocessed labeling information, and the serialization result used as the pre-training sample.
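A simple sketch of such serialization, reusing the (text, type, start, end) span format from the labeling sketch above and assuming hypothetical character and entity-type vocabularies with one id per character position:

```python
def serialize_sample(chars, entities, char_vocab, type_vocab):
    """Turn one labeled corpus line into parallel id sequences: character ids
    plus an entity-type id per position. Both vocabularies, the [UNK] token,
    and the "O" (no entity) type are assumptions for illustration."""
    char_ids = [char_vocab.get(c, char_vocab["[UNK]"]) for c in chars]
    type_ids = [type_vocab["O"]] * len(chars)
    for _text, etype, start, end in entities:
        for i in range(start, end):
            type_ids[i] = type_vocab.get(etype, type_vocab["O"])
    return {"char_ids": char_ids, "entity_type_ids": type_ids}
```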
With the training method of the pre-training language model provided by the embodiments of the present disclosure, the pre-training corpus is obtained from the map search log and the poi database based on users' click behavior, which amounts to building the corpus from the search and click behavior of real map users. This not only yields high-quality corpus information in the map domain, but also makes the sample data strongly correlated with actual business, which improves the learning efficiency of the pre-training language model. In addition, entities and entity types are recognized in the pre-training corpus, and the corpus together with the recognized labeling information is used as the pre-training sample, so that the model can learn geographic entities more effectively from entity context and entity types.
To further enhance the training effect of the model, the present disclosure proposes yet another embodiment.
FIG. 3 is a flowchart of a training method for a pre-training language model according to an embodiment of the present disclosure. As shown in fig. 3, on the basis of the above embodiment, the method may further include:
Step 301, performing word replacement processing on first-type entities in the pre-training sample to obtain a processed pre-training sample; wherein the word replacement includes replacement with shape-similar characters and/or replacement with pinyin-similar characters.
It will be appreciated that, to enable the trained model to perform error correction automatically, character replacements can be injected into the pre-training sample as interference data, so that the pre-training language model learns to overcome the influence of such interference.
In some implementations of the present disclosure, a first-type entity is an entity on which word replacement can be performed. When applying word replacement to each piece of sample data in the pre-training sample, one or more of its first-type entities may be selected at random, and the number of selected entities may also be random. The replacement mode can likewise be random: a piece of sample data may use only shape-similar replacements, only pinyin-similar replacements, or both at the same time, which improves the robustness of model training. Each piece of sample data may also undergo word replacement multiple times to obtain several processed copies, which improves the learning effect of the pre-training language model.
The word replacement processing may include replacement with shape-similar characters and/or replacement with pinyin-similar characters. For example, for the entity "新世界" ("New World"), pinyin-similar replacement may produce strings such as "心世界" or "新时界" that sound the same or nearly the same but are written differently, and shape-similar replacement may produce strings such as "薪世界". In addition, the word replacement processing may also include adjusting the order of characters within an entity, for example swapping two adjacent characters. In some embodiments of the present disclosure, word replacement is performed by randomly selecting among candidate replacement characters for the entity obtained from a preset dictionary.
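The word replacement step could be sketched as follows, with the shape-similar and pinyin-similar confusion dictionaries treated as hypothetical preset inputs:

```python
import random

def corrupt_entity(entity, shape_confusion, pinyin_confusion):
    """Sketch of the word replacement step: swap one character of an entity
    for a shape-similar or pinyin-similar character drawn from preset
    confusion dictionaries (both treated here as hypothetical inputs mapping
    a character to a list of candidate replacements)."""
    chars = list(entity)
    i = random.randrange(len(chars))
    table = random.choice([shape_confusion, pinyin_confusion])
    candidates = table.get(chars[i], [])
    if candidates:
        chars[i] = random.choice(candidates)  # replace with a confusable character
    return "".join(chars)

# Hypothetical usage: corrupt_entity("新世界", {}, {"新": ["心", "薪"]})
```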
Step 302, performing geographic error correction learning on the pre-training language model subjected to geographic entity learning, according to the processed pre-training sample.
It will be appreciated that the purpose of this step is to train the pre-training language model that has undergone geographic entity learning to resist the interference introduced by word replacement, so that it can still predict the entity as it was before word replacement, thereby learning geographic error correction.
As an example, for the pre-training language model that has undergone geographic entity learning, geographic error correction learning may be implemented as follows: the processed pre-training sample is input into the model, which predicts the word-replaced entities from the context entities and entity types and outputs entity prediction data; a loss value is calculated from the entity prediction data and the entities before word replacement, and the model parameters are adjusted continuously according to the loss value until the entity predictions meet expectations, thereby realizing geographic error correction learning.
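As a hedged illustration of how such training targets could be prepared, the sketch below pairs a word-replaced sample with its original characters so that only the replaced positions contribute to the correction loss; the -100 ignore index and the vocabulary interface are assumptions, and the same cross-entropy step shown for geographic entity learning can then be reused.

```python
IGNORE_INDEX = -100  # assumed ignore index for positions that were not replaced

def build_correction_targets(original_chars, corrupted_chars, char_vocab):
    """Pair a word-replaced sample with its original characters: every changed
    position becomes a prediction target (the original character id), and
    unchanged positions carry the ignore index so they do not contribute to
    the loss. The character vocabulary and [UNK] token are assumptions."""
    labels = [
        char_vocab.get(orig, char_vocab["[UNK]"]) if orig != corr else IGNORE_INDEX
        for orig, corr in zip(original_chars, corrupted_chars)
    ]
    return {"input_chars": corrupted_chars, "labels": labels}
```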
With the training method of the pre-training language model provided by the embodiments of the present disclosure, word replacement processing is applied to entities in the pre-training sample, and geographic error correction learning is performed on the pre-training language model that has undergone geographic entity learning according to the processed pre-training sample, so that the pre-training language model learns geographic knowledge more fully and the effect of downstream tasks that use the model is further improved.
The present disclosure further presents an embodiment for correlation learning of the pre-training language model.
FIG. 4 is a flow chart of a training method for yet another pre-training language model provided by an embodiment of the present disclosure. As shown in fig. 4, on the basis of the above embodiment, the method may further include:
Step 401, obtaining correlation training samples according to the map retrieval log.
In some embodiments of the present disclosure, correlation training samples of different correlation levels may be constructed; three correlation levels are described here as an example. For instance, three levels of strong correlation, weak correlation, and no correlation may be set, where a correlation sample captures the correlation between a map search keyword and a poi name. As an example, in the map search log, for each map search keyword in the search records: the poi names the user clicked in the search results of that keyword are taken as the strongly correlated pois of the keyword; the poi names shown in the search results but not clicked are taken as the weakly correlated pois; a poi name drawn at random from the poi database is taken as an uncorrelated poi; and, for each map search keyword, the combination of the keyword and its poi names at these three levels forms one piece of data in the correlation training samples.
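The construction of three-level correlation samples could be sketched as follows, with the log and database structures simplified to hypothetical Python objects:

```python
import random

def build_correlation_samples(search_log, poi_db):
    """Sketch of assembling three-level correlation samples. The structures
    are hypothetical: search_log is an iterable of (keyword, clicked_poi_names,
    shown_but_unclicked_poi_names) and poi_db is a list of all poi names."""
    samples = []
    for keyword, clicked, unclicked in search_log:
        if not clicked or not unclicked:
            continue  # need both levels from the log
        samples.append({
            "keyword": keyword,
            "strong": random.choice(clicked),       # clicked result: strongly correlated
            "weak": random.choice(unclicked),       # shown but not clicked: weakly correlated
            "uncorrelated": random.choice(poi_db),  # random poi from the database
        })
    return samples
```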
Step 402, performing correlation learning on the pre-training language model subjected to geographic error correction learning, based on the correlation training samples.
In some embodiments of the present disclosure, the correlation training samples may be input into the pre-training language model that has undergone geographic error correction learning, and the model learns from the different correlation levels in the samples. As an example, with the correlation training samples described above, the sample data is input into the model; for each map search keyword in the sample data, the model predicts the strongly correlated, weakly correlated, and uncorrelated poi names of that keyword based on the geographic knowledge it has learned; a loss value is computed from the predictions and the poi names of the corresponding levels in the sample data; and the pre-training language model is trained according to the loss value, so that the model learns correlation.
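The disclosure only states that a loss is computed from the predicted and labeled correlation levels; one common way to realize a three-level objective, shown below purely as an assumption, is a pairwise margin-ranking loss over relevance scores.

```python
import torch
import torch.nn as nn

def correlation_learning_step(model, optimizer, batch, margin=0.1):
    """Illustrative pairwise ranking step over the three correlation levels:
    the strongly correlated poi name should score higher than the weakly
    correlated one, which should score higher than the uncorrelated one.
    The scoring interface model(keyword, poi_name) -> tensor of scores and
    the margin value are assumptions, not part of the disclosure."""
    s_strong = model(batch["keyword"], batch["strong"])
    s_weak = model(batch["keyword"], batch["weak"])
    s_uncorrelated = model(batch["keyword"], batch["uncorrelated"])
    rank = nn.MarginRankingLoss(margin=margin)
    ones = torch.ones_like(s_strong)
    loss = rank(s_strong, s_weak, ones) + rank(s_weak, s_uncorrelated, ones)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```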
It should be noted that, depending on the needs of the actual scenario, the correlation training samples may also be used to perform correlation learning directly on the pre-training language model that has undergone geographic entity learning. That is, correlation learning can also be performed on a pre-training language model that has not undergone error correction learning, to enhance its training effect. As shown in fig. 4, the method may further include:
Step 403, performing correlation learning on the pre-training language model subjected to geographic entity learning, according to the correlation training samples.
As an example, with the correlation training samples described above, the sample data is input into the pre-training language model that has undergone geographic entity learning; for each map search keyword in the sample data, the model predicts the strongly correlated, weakly correlated, and uncorrelated poi names of that keyword based on the geographic entity knowledge it has learned; a loss value is computed from the predictions and the poi names of the corresponding levels in the sample data; and the pre-training language model is trained according to the loss value, so that the model learns correlation.
With the training method of the pre-training language model provided by the embodiments of the present disclosure, a multi-task learning scheme is built for the pre-training language model: according to the correlation training samples, correlation training is performed on the pre-training language model that has undergone geographic entity learning, which strengthens the model's ability to learn the geographic domain and improves the model effect. In addition, the pre-training language model can undergo geographic entity learning, geographic error correction learning, and correlation learning in sequence, so that it learns map-domain knowledge more fully, the training effect is enhanced, and the applicability of the model training method is improved.
In order to implement the above embodiment, the present disclosure proposes a training apparatus for pre-training a language model.
Fig. 5 is a block diagram of a training device for pre-training a language model according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus may include:
a first obtaining module 510, configured to obtain a pre-training sample; the pre-training sample comprises pre-training corpus based on map retrieval keywords and target interest point poi information, and labeling information of entities and entity types in the pre-training corpus;
masking module 520 for masking at least some entities in the pre-training samples;
the first training module 530 is configured to learn the geographic entity of the pre-training language model according to the masked pre-training samples.
In some embodiments of the present disclosure, the first training module 530 is specifically configured to:
inputting the masked pre-training samples into the pre-training language model to obtain entity prediction data; the entity prediction data is a prediction result of the pre-training language model for the masked entities and their entity types according to the context of the masked pre-training samples;
and according to the entity prediction data and the masked entity, performing geographic entity learning on the pre-training language model.
As an implementation manner, in an embodiment of the present disclosure, the first obtaining module 510 includes:
an acquisition unit 511 for acquiring a plurality of map retrieval keywords and target poi information of each map retrieval keyword from the map retrieval log and poi database;
a stitching unit 512, configured to stitch, for each map search keyword, the map search keyword and the target poi information of the map search keyword to obtain a pre-training corpus;
the recognition unit 513 is configured to recognize the entity and the entity type in the pre-training corpus, obtain labeling information of the entity and the entity type in the pre-training corpus, and use the pre-training corpus and the labeling information as a pre-training sample.
Optionally, in some embodiments of the present disclosure, the identifying unit 513 is further configured to:
and carrying out serialization processing on the pre-training corpus according to the labeling information, and taking the serialization processing result as a pre-training sample.
With the training apparatus for the pre-training language model provided by the embodiments of the present disclosure, a pre-training sample is obtained that includes a pre-training corpus based on map search keywords and target poi information together with the labeling information of the entities and entity types in the corpus, some of the entities in the pre-training sample are masked, and the pre-training language model is trained on the masked sample. The pre-training language model thus learns geographic entity knowledge, which avoids the domain adaptation problem that arises when a pre-training model is applied to the map domain, improves the tuning efficiency of the pre-training model on downstream tasks, and accelerates the deployment of related services in the poi field.
To further enhance the training effect of the pre-training language model, a further embodiment is presented.
Fig. 6 is a block diagram of another training apparatus for pre-training a language model according to an embodiment of the present disclosure. As shown in fig. 6, on the basis of the above embodiment, the apparatus further includes:
a replacing module 640, configured to perform word replacement processing on a first type entity in the pre-training sample, so as to obtain a processed pre-training sample; wherein the word replacement processing comprises the replacement of the shape-similar word and/or the replacement of the pinyin-similar word;
and the second training module 650 is configured to perform geographic error correction learning on the pre-trained language model subjected to geographic entity learning according to the processed pre-training samples.
For further training effects of the model, the apparatus may further comprise:
a second obtaining module 660, configured to obtain a correlation training sample according to the map retrieval log;
and a third training module 670, configured to perform correlation learning on the pre-training language model subjected to geographic error correction learning according to the correlation training samples.
Further, in some embodiments of the present disclosure, the apparatus may further include:
a third obtaining module 680, configured to obtain a correlation training sample according to the map retrieval log;
and a fourth training module 690, configured to perform correlation learning on the pre-training language model that has undergone geographic entity learning, according to the correlation training samples.
It should be noted that, in some embodiments of the present disclosure, the functions of the second obtaining module 660 and the third obtaining module 680 may be configured in the same functional module, and the functions of the third training module 670 and the fourth training module 690 may also be configured in the same functional module, which may be determined according to the actual application scenario during actual application.
The modules 510 to 530 in fig. 5 have the same functional structure as the modules 610 to 630 in fig. 6, and are not described herein.
With the training apparatus for the pre-training language model provided by the embodiments of the present disclosure, a multi-task learning scheme is built for the pre-training language model: according to the correlation training samples, correlation training is performed on the pre-training language model that has undergone geographic entity learning, which strengthens the model's ability to learn the geographic domain and improves the model effect. In addition, the pre-training language model can undergo geographic entity learning, geographic error correction learning, and correlation learning in sequence, so that it learns map-domain knowledge more fully, the training effect is enhanced, and the applicability of the model training method is improved.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the apparatus 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the respective methods and processes described above, for example, a training method of a pre-training language model. For example, in some embodiments, the training method of the pre-trained language model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into RAM 703 and executed by computing unit 701, one or more steps of the training method of the pre-trained language model described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the training method of the pre-trained language model in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

1. A method of training a pre-trained language model, comprising:
obtaining a pre-training sample; the pre-training sample comprises pre-training corpus based on map retrieval keywords and target interest point poi information, and labeling information of entities and entity types in the pre-training corpus, wherein the pre-training corpus comprises corpus information obtained by combining keywords actually retrieved by users in the map field and clicking behaviors of the users, and the entity types comprise classifications of geographic entities;
Masking at least some entities in the pre-training samples;
according to the pre-training samples after masking, carrying out geographic entity learning on the pre-training language model;
the obtaining the pre-training sample includes:
acquiring a plurality of map retrieval keywords and target poi information of each map retrieval keyword according to a map retrieval log and a poi database;
for each map search keyword, splicing the map search keyword with target poi information of the map search keyword to obtain a pre-training corpus;
identifying the entity and entity type in the pre-training corpus to obtain labeling information of the entity and entity type in the pre-training corpus, and taking the pre-training corpus and the labeling information as pre-training samples;
the performing geographic entity learning on the pre-training language model according to the pre-training sample after masking includes:
inputting the masked pre-training samples into the pre-training language model to obtain entity prediction data; wherein the entity prediction data is a prediction result of the pre-training language model for the masked entities and their entity types according to the context of the masked pre-training samples;
Performing geographic entity learning on the pre-training language model according to the entity prediction data and the masked entity;
performing word replacement processing on the first type entity in the pre-training sample to obtain a processed pre-training sample; wherein the word replacement processing comprises the replacement of a shape-similar word and/or the replacement of a pinyin-similar word;
performing geographic error correction learning on the pre-training language model subjected to geographic entity learning according to the processed pre-training sample, wherein the processed pre-training sample is input into the pre-training language model subjected to geographic entity learning, and the model predicts word replacement entities according to the entities and entity types of the context and outputs entity prediction data; and calculating a loss value according to the entity prediction data and the entity before word replacement processing, and continuously adjusting model parameters according to the loss value until the entity prediction result meets the expectation, and realizing the geographic error correction learning of the model.
2. The method of claim 1, wherein the taking the pre-training corpus and the labeling information as pre-training samples comprises:
and carrying out serialization processing on the pre-training corpus according to the labeling information, and taking a serialization processing result as a pre-training sample.
3. The method of claim 1, further comprising:
acquiring a correlation training sample according to the map retrieval log;
and according to the correlation training sample, performing correlation learning on the pre-training language model subjected to the geographic error correction learning.
4. The method of any of claims 1-2, further comprising:
acquiring a correlation training sample according to the map retrieval log;
and performing correlation learning on the pre-trained language model learned by the geographic entity according to the correlation training sample.
5. A training apparatus for pre-training a language model, comprising:
the first acquisition module is used for acquiring a pre-training sample; the pre-training sample comprises pre-training corpus based on map retrieval keywords and target interest point poi information, and labeling information of entities and entity types in the pre-training corpus, wherein the pre-training corpus comprises corpus information obtained by combining keywords actually retrieved by users in the map field and clicking behaviors of the users, and the entity types comprise classifications of geographic entities;
a masking module, configured to mask at least part of entities in the pre-training samples;
The first training module is used for carrying out geographic entity learning on the pre-training language model according to the masked pre-training samples;
the first training module is specifically configured to:
inputting the masked pre-training samples into the pre-training language model to obtain entity prediction data; wherein the entity prediction data is a prediction result of the pre-training language model for the masked entities and their entity types according to the context of the masked pre-training samples;
performing geographic entity learning on the pre-training language model according to the entity prediction data and the masked entity;
the first acquisition module includes:
an acquisition unit, configured to acquire a plurality of map search keywords and target poi information of each of the map search keywords according to a map search log and a poi database;
the splicing unit is used for splicing the map search keywords with the target poi information of the map search keywords aiming at each map search keyword to obtain a pre-training corpus;
the recognition unit is used for recognizing the entity and the entity type in the pre-training corpus, obtaining the labeling information of the entity and the entity type in the pre-training corpus, and taking the pre-training corpus and the labeling information as pre-training samples;
The replacement module is used for carrying out word replacement processing on the first type entity in the pre-training sample to obtain a processed pre-training sample; wherein the word replacement processing comprises the replacement of a shape-similar word and/or the replacement of a pinyin-similar word;
the second training module is used for carrying out geographic error correction learning on the pre-training language model subjected to geographic entity learning according to the processed pre-training sample, wherein the processed pre-training sample is input into the pre-training language model subjected to geographic entity learning, and the model predicts word replacement entities according to the entities and entity types of the contexts and outputs entity prediction data; and calculating a loss value according to the entity prediction data and the entity before word replacement processing, and continuously adjusting model parameters according to the loss value until the entity prediction result meets the expectation, and realizing the geographic error correction learning of the model.
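To make the word replacement processing of the replacement module and the error-correction objective of the second training module more concrete, the hypothetical sketch below corrupts a geographic entity with a shape-similar or pinyin-similar character; the model that has undergone geographic entity learning is then trained to recover the original entity from its context. The tiny confusion tables are toy examples, not the actual resources of the patent.

# Hypothetical sketch of shape-/pinyin-similar character replacement for error-correction learning.
import random

# Toy confusion tables -- a real system would use much larger resources.
SHAPE_SIMILAR = {"海": "侮", "淀": "绽"}    # visually similar characters
PINYIN_SIMILAR = {"淀": "店", "园": "圆"}   # characters with similar pinyin

def corrupt_entity(entity: str, mode: str = "pinyin") -> str:
    """Replace one character of a first-type (geographic) entity."""
    table = SHAPE_SIMILAR if mode == "shape" else PINYIN_SIMILAR
    chars = list(entity)
    candidates = [i for i, c in enumerate(chars) if c in table]
    if candidates:
        i = random.choice(candidates)
        chars[i] = table[chars[i]]
    return "".join(chars)

original = "海淀公园"
corrupted = corrupt_entity(original, mode="pinyin")   # e.g. "海店公园"
print(original, "->", corrupted)
# Training target: given the corrupted sample and its context, the model predicts the
# original entity; the loss between its prediction and `original` drives the
# geographic error correction learning.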
6. The apparatus of claim 5, wherein the recognition unit is further configured to:
perform serialization processing on the pre-training corpus according to the labeling information, and take the serialization result as the pre-training sample.
7. The apparatus of claim 5, further comprising:
a second acquisition module, configured to acquire a relevance training sample according to the map search log; and
a third training module, configured to perform relevance learning, according to the relevance training sample, on the pre-training language model that has undergone geographic error correction learning.
8. The apparatus of any of claims 5 to 6, further comprising:
a third acquisition module, configured to acquire a relevance training sample according to the map search log; and
a fourth training module, configured to perform relevance learning, according to the relevance training sample, on the pre-training language model that has undergone geographic entity learning.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 4.
10. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1 to 4.
CN202111089927.3A 2021-09-16 2021-09-16 Training method and device for pre-training language model, electronic equipment and storage medium Active CN113836925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111089927.3A CN113836925B (en) 2021-09-16 2021-09-16 Training method and device for pre-training language model, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113836925A (en) 2021-12-24
CN113836925B (en) 2023-07-07

Family

ID=78959695


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114580543B (en) * 2022-03-07 2023-09-29 北京百度网讯科技有限公司 Model training method, interaction log analysis method, device, equipment and medium
CN114861889B (en) * 2022-07-04 2022-09-27 北京百度网讯科技有限公司 Deep learning model training method, target object detection method and device
CN115346657B (en) * 2022-07-05 2023-07-28 深圳市镜象科技有限公司 Training method and device for improving identification effect of senile dementia by utilizing transfer learning
CN115081453B (en) * 2022-08-23 2022-11-04 北京睿企信息科技有限公司 Named entity identification method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11556712B2 (en) * 2019-10-08 2023-01-17 International Business Machines Corporation Span selection training for natural language processing
CN111539223B (en) * 2020-05-29 2023-08-18 北京百度网讯科技有限公司 Language model training method and device, electronic equipment and readable storage medium
CN112559885B (en) * 2020-12-25 2024-01-12 北京百度网讯科技有限公司 Training model determining method and device for map interest points and electronic equipment


Similar Documents

Publication Publication Date Title
CN113836925B (en) Training method and device for pre-training language model, electronic equipment and storage medium
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN113553414A (en) Intelligent dialogue method and device, electronic equipment and storage medium
CN113053367A (en) Speech recognition method, model training method and device for speech recognition
CN112949818A (en) Model distillation method, device, equipment and storage medium
CN117114063A (en) Method for training a generative large language model and for processing image tasks
CN114399772B (en) Sample generation, model training and track recognition methods, devices, equipment and media
CN114970540A (en) Method and device for training text audit model
CN113190746B (en) Recommendation model evaluation method and device and electronic equipment
CN114490985A (en) Dialog generation method and device, electronic equipment and storage medium
CN112699237B (en) Label determination method, device and storage medium
CN112270169B (en) Method and device for predicting dialogue roles, electronic equipment and storage medium
CN112528146A (en) Content resource recommendation method and device, electronic equipment and storage medium
CN111611364A (en) Intelligent response method, device, equipment and storage medium
CN116383382A (en) Sensitive information identification method and device, electronic equipment and storage medium
CN115035890A (en) Training method and device of voice recognition model, electronic equipment and storage medium
CN113807390A (en) Model training method and device, electronic equipment and storage medium
CN113051926A (en) Text extraction method, equipment and storage medium
CN113705206B (en) Emotion prediction model training method, device, equipment and storage medium
CN115033701B (en) Text vector generation model training method, text classification method and related device
CN115965018B (en) Training method of information generation model, information generation method and device
CN113204667B (en) Method and device for training audio annotation model and audio annotation
CN116244413B (en) New intention determining method, apparatus and storage medium
US20240221727A1 (en) Voice recognition model training method, voice recognition method, electronic device, and storage medium
CN115879446B (en) Text processing method, deep learning model training method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant