WO2022141878A1 - End-to-end language model pre-training method, system, device and storage medium - Google Patents

End-to-end language model pre-training method, system, device and storage medium

Info

Publication number
WO2022141878A1
Authority
WO
WIPO (PCT)
Prior art keywords
knowledge
existing
fragments
training
fragment
Prior art date
Application number
PCT/CN2021/084283
Other languages
English (en)
French (fr)
Inventor
谯轶轩
陈浩
高鹏
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022141878A1 publication Critical patent/WO2022141878A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Definitions

  • the present application relates to a language model pre-training method, in particular to an end-to-end language model pre-training method, system, device and storage medium.
  • pre-trained models have moved natural language processing from its original stage of manual parameter tuning that relied on ML experts into a stage of large-scale, reproducible industrial deployment.
  • pre-trained models have extended from monolingual to multilingual and multimodal tasks.
  • Pretraining obtains task-independent pretrained models from large-scale data through self-supervised learning.
  • the pre-trained model is an application of transfer learning: it uses almost unlimited text to learn a context-dependent representation of each member of the input sentence, implicitly learning general syntactic and semantic knowledge.
  • pretrained models achieve state-of-the-art results on almost all NLP tasks.
  • this pre-training + fine-tuning mechanism is highly scalable: to support a new task, only the labeled data of that task is needed for fine-tuning, which an ordinary engineer can carry out.
  • the third key technique is fine-tuning, which adjusts the parameters of the pre-trained network using the annotated samples of the downstream task.
  • mainstream pre-training language modeling methods inject the large amount of knowledge contained in a data set into the parameters of the model itself by pre-training on large-scale data sets, and achieve very good performance by fine-tuning on downstream tasks such as question answering. Many subsequent improvements have further boosted performance from two angles: 1. training on larger data sets; 2. using models with a larger number of parameters.
  • the present application provides an end-to-end language model pre-training method, system, equipment and storage media.
  • An end-to-end language model pre-training method includes:
  • according to a preset knowledge-similarity judgment rule, retrieving from an existing knowledge base existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment;
  • splicing the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment; performing mask processing on the spliced knowledge fragment;
  • using the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training.
  • the existing knowledge base is an existing external knowledge base and/or a collection of the knowledge fragments in the language model pre-training samples other than the input knowledge fragment.
  • An end-to-end language model pre-training system includes:
  • a retrieval enhancement module, which, according to the preset knowledge-similarity judgment rule, retrieves from the existing knowledge base existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment;
  • a splicing module, which splices the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment;
  • a masking module, which performs mask processing on the spliced knowledge fragment;
  • a pre-training module, which uses the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training.
  • An electronic device comprising:
  • a memory storing instructions executable by at least one processor, the instructions, when executed by the at least one processor, implementing:
  • according to the preset knowledge-similarity judgment rule, retrieving from the existing knowledge base existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment; splicing the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment; performing mask processing on the spliced knowledge fragment;
  • using the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training.
  • a computer-readable storage medium storing executable computer instructions which, when executed, implement:
  • according to the preset knowledge-similarity judgment rule, retrieving from the existing knowledge base existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment; splicing the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment; performing mask processing on the spliced knowledge fragment;
  • using the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training.
  • compared with the prior art, the embodiments of the present application have the following beneficial effects: the present application uses a preset similarity judgment rule to retrieve similar existing knowledge fragments from an existing knowledge base, thereby reducing the model's demand for parameters during training; the input knowledge fragment and the retrieved existing knowledge fragments are spliced to obtain a spliced knowledge fragment, so the existing knowledge fragments of the existing knowledge base can be fully utilized and the efficiency of language model training is improved; by masking the spliced knowledge fragment and using it as the input of language model pre-training, the language model can, on the basis of retrieval enhancement, combine the information that is truly valuable to it found through retrieval, namely the existing knowledge fragments with similar knowledge, and then learn through prediction training.
  • this realizes the transition from training that relies on direct, static input of parameters to training whose input changes dynamically according to retrieval, which is more in line with how knowledge itself is learned;
  • the above method opens up another language model training paradigm besides enlarging the data set or the model, which reduces the cost of offline training and online deployment of the model.
  • Fig. 2 is a block diagram of the end-to-end language model pre-training system described in an example of this application;
  • Fig. 3 is an architecture diagram of the retrieval-enhanced pre-trained language model described in an example of this application;
  • Fig. 5 is a structural block diagram of the data processing module described in an example of this application;
  • Fig. 7 is a structural block diagram of the retrieval enhancement module described in an example of this application;
  • Fig. 9 is a flowchart of the splicing method described in an example of this application;
  • Fig. 10 is a structural block diagram of the splicing mask module described in an example of this application.
  • the application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, elements, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote computer storage media including storage devices.
  • module refers to related entities applied to a computer, such as hardware, a combination of hardware and software, software or software in execution, and the like.
  • an element may be, but is not limited to, a process running on a processor, a processor, an object, an executable element, a thread of execution, a program, and/or a computer.
  • an application program or script program running on the server, and the server can be a component.
  • One or more elements may reside within a process and/or thread of execution, and an element may be localized on one computer and/or distributed between two or more computers and may be executed from various computer-readable media.
  • Elements may also communicate by way of local and/or remote processes by means of a signal having one or more data packets, for example a signal carrying data from one element interacting with another element in a local system or a distributed system, and/or interacting with other systems across a network such as the Internet.
  • BERT stands for Bidirectional Encoder Representation from Transformers, i.e., the encoder of a bidirectional Transformer; the encoder is used because the decoder cannot see the information that is to be predicted.
  • the main innovations of the model lie in the pre-training method, namely using two approaches, the masked language model (Masked LM) and next sentence prediction, to capture word-level and sentence-level semantics (representations) respectively.
  • the BERT model pre-training stage consists of two unsupervised prediction tasks: masked language model and next sentence prediction.
  • Masked language model (MLM): owing to the bidirectionality of the BERT model and the multi-layer self-attention mechanism it uses, in order to train deep bidirectional representations, a certain percentage of the input tokens (15% in the paper) are simply masked at random, and the model then predicts those masked tokens.
  • Next sentence prediction (NSP): to train a model that understands sentence relationships and the semantic relationships between words, the BERT model additionally pre-trains a binarized next-sentence prediction task, which can easily be generated from any text corpus.
  • this application introduces a retrieval enhancement module or method on the basis of the mainstream pre-trained language model BERT and performs end-to-end learning, thereby giving the model the ability to find related or complementary knowledge for existing knowledge and relieving the pressure on the number of model parameters,
  • so that a model with a smaller number of parameters can be used for later deployment; and through end-to-end learning the retrieval enhancement module itself can be continuously updated, so that it acquires the ability to retrieve within a specific field, which is of great value in fields where samples are scarce; when analyzing a specific task, we can judge whether the model's own understanding is wrong by observing the knowledge returned by the retrieval enhancement module, thus improving the interpretability of the model.
  • end-to-end learning means obtaining the result at the output end directly from the data at the input end, i.e., the raw data is fed directly into the language model to obtain the final result, without feature preprocessing and feature extraction.
  • here, the spliced knowledge fragment is masked and used directly as the pre-training input without feature preprocessing or extraction, and the final result is obtained directly through prediction training.
  • Step 101: retrieving, from an existing knowledge base, existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment;
  • this step is the retrieval enhancement step, in which one or more similar existing knowledge fragments can be selected according to the preset judgment rule; the preset judgment rule allows the target objects to be obtained quickly from the existing knowledge base,
  • which speeds up the determination of the input objects and reduces the demand for parameters.
  • Step 102: splicing the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment;
  • the input knowledge fragment and the existing knowledge fragments obtained through retrieval are fused by splicing, which greatly increases the controllable dimensions of the input parameters and enables many combinations of input and retrieval;
  • a change in the retrieval result changes the splicing result, which further provides a fission-like expansion of the language model's inputs even when the actual input remains stable;
  • Step 103: performing mask processing on the spliced knowledge fragment;
  • Step 104: using the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training.
  • mask processing turns the spliced knowledge fragment into the input of language model pre-training, so that the preset goal can be reached through continuous repeated training and the end-to-end language model pre-training can be completed.
  • the corresponding retrieval-enhanced pre-trained language model can be obtained through the above method.
  • after the retrieval enhancement of similar existing knowledge fragments obtained through retrieval as described in step 101 is integrated into the BERT model for end-to-end learning, only a single input knowledge fragment is needed: by retrieving existing knowledge fragments it can be used together with them as the input, making full use of the existing knowledge base without having to expand the volume of parameters, which greatly weakens the model's craving for parameters and significantly strengthens the model's ability to learn and digest knowledge.
  • splicing the existing knowledge fragments with the input knowledge fragment, masking the result and using it as input makes the retrieval-enhanced pre-trained language model learn how to use external knowledge by first finding the information that is truly valuable to itself and then learning from it, realizing the transition from static to dynamic, which is more in line with how knowledge itself is learned.
  • the existing knowledge base serving as the retrieval object is in fact the existing external knowledge base, or the collection of all knowledge fragments in the language model pre-training samples other than the input knowledge fragment, or the union of the two;
  • retrieving existing knowledge fragments from it can greatly reduce the dependence on the scale and parameters of the input samples fed directly into pre-training, and makes full use of the knowledge fragments in the existing knowledge base,
  • ensuring the effectiveness and efficiency of language model pre-training.
  • when the existing knowledge base is the existing external knowledge base,
  • before the existing knowledge fragments similar to the input knowledge fragment are retrieved from the existing knowledge base,
  • the method also includes the following preliminary preprocessing to obtain vector representations, as shown in FIG. 4.
  • Step 401: segmenting each target article in the external knowledge base to obtain knowledge fragments whose length is smaller than a set number of words; that is, on the premise that each knowledge fragment contains as many complete sentences as possible, making the length of each knowledge fragment smaller than the set number of words;
  • Step 402: converting each knowledge fragment obtained by segmentation into a corresponding vector representation, which contains all the semantic information of the corresponding knowledge fragment;
  • Step 403: building an index over the vector representations corresponding to all knowledge fragments in the external knowledge base, completing the preliminary preprocessing of the external knowledge base.
  • the specific processing of the external knowledge base is described by taking a commonly used external knowledge base as an example.
  • an external knowledge database is built from the publicly available English Wikipedia article data and preprocessed with the WikicorpusTextFormatting.py script officially provided by NVIDIA, extracting the main content information of each Wikipedia article, recorded as Database = {d1, d2, d3, …, di, …, dN}.
  • each article di is divided, in order from front to back, into m small fragments, so that the length of each small fragment pij is smaller than 128 words.
  • the principle for choosing split points is to include as many complete sentences as possible, which facilitates later processing by the BERT model.
  • the fragments obtained from all articles are shuffled and merged, abbreviated as Database = {s1, s2, s3, …, sK};
  • each fragment comes from the segmentation of some article; after merging there are K knowledge fragments sk in total, which constitute the model's final knowledge fragment base, i.e., the existing external knowledge base.
  • each knowledge fragment si is passed through the BERT model to obtain a 512-dimensional vector representation hi, which contains all the semantic information of that knowledge fragment.
  • an index is built over the vector representations hi of all knowledge fragments through Microsoft's open-source MIPS (maximum inner product search) framework.
  • the purpose is that, given a knowledge fragment (query), the k related knowledge fragments closest to it at the semantic level (in meaning) in the knowledge fragment base can be returned efficiently and quickly.
  • this closeness relationship is embodied specifically as similarity or complementarity.
  • when the existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment are retrieved from the existing knowledge base, the above similarity can be judged and selected through the following steps to obtain the retrieval result, as shown in Figure 6, which specifically includes:
  • Step 601: computing a vector inner product between the vector representation of the input knowledge fragment and the vector representation of each existing knowledge fragment;
  • Step 602: taking the obtained vector inner products as relevance scores and sorting them from largest to smallest;
  • Step 603: selecting from the sorted list the target vector inner products that meet a set threshold, the existing knowledge fragments corresponding to those inner products being the similar existing knowledge fragments.
  • the vector representation of the knowledge fragment (query) obtained through BERT forms a vector inner product with each related knowledge fragment vector hi, i.e., score = BERT(query)ᵀ·hi; the inner product measures, for the given query (knowledge fragment), the relevance of each related knowledge fragment hi to it.
  • we sort by the size of the relevance score and select the c knowledge fragments closest to the given knowledge.
  • for memory reasons we set the corresponding threshold c = 3, i.e., the three most similar or complementary knowledge fragments are taken as the retrieval result under this constraint, that is, as the output of the retrieval enhancement module.
  • the above description covers only the first part of the pre-training method described in the present application.
  • once the pre-training input has been obtained with minimal consumption and parameters, the result still needs to be fused with the pre-trained language model,
  • i.e., the input knowledge fragment and the retrieved existing knowledge fragments are spliced, and the resulting spliced knowledge fragment is used as the input of language model pre-training and masked; integrated into the specific pre-training model, the enhanced language model proposed in this application
  • performs its final training based on only one of the two tasks in the BERT model, namely the masked language modeling task (MLM).
  • MLM: masked language modeling task.
  • the resulting masked language modeling model is shown in Figure 8: BERT language model pre-training is performed after masking within the spliced knowledge fragment.
  • [sep] characters are used to separate different sentences.
  • the example in Figure 8 assumes that the first knowledge fragment has 10 words and the second knowledge fragment has 16 words; different sentences within each knowledge fragment are also separated by [sep] characters.
  • the MLM task first randomly replaces some words in a paragraph (here two knowledge fragments; a paragraph may contain multiple knowledge fragments) with the mask token [mask]; here these are word 2 in the first knowledge fragment and word 15 in the second knowledge fragment. The model predicts each masked word from its context, i.e., the unmasked part of the paragraph,
  • that is, everything except word 2 in the first knowledge fragment and word 15 in the second knowledge fragment, thereby pre-training the language model.
  • in the original prior-art method, a paragraph that may contain multiple knowledge fragments is composed of fragments that are themselves contiguous in the article di, i.e., an essentially continuous passage,
  • whereas the present application can dynamically adjust the selection of subsequent knowledge fragments according to the content of the first knowledge fragment, which is more flexible.
  • in the original method, the context information the model can use for a masked word is fixed and unchanging throughout the entire training process.
  • in the present application, the parameters of the retrieval enhancement are also continuously updated; that is, for the first knowledge fragment, the existing knowledge fragments paired with it keep changing during training, so for each masked word
  • the context information available during training also keeps changing along with the most relevant knowledge fragments.
  • the input knowledge fragment and the retrieved existing knowledge fragments are spliced to obtain a spliced knowledge fragment; that is, the input knowledge fragment
  • and the retrieved existing knowledge fragments are spliced as shown in Figure 9, as follows:
  • Step 901: splicing the input knowledge fragment and the retrieved existing knowledge fragments in sequence after a first character;
  • Step 902: placing a second character between the input knowledge fragment and each retrieved existing knowledge fragment;
  • the first character is used as the initial identification of the entire splicing knowledge fragment
  • the second character is used as a separation identifier between knowledge fragments.
  • on the basis of the above BERT processing, the present application incorporates retrieval enhancement, so the objects being spliced change.
  • compared with the multiple knowledge fragments the original method requires as input, the present application can obtain multiple similar knowledge fragments from a single input knowledge fragment, thereby reducing the demand for input parameters.
  • specifically, for each knowledge fragment si in the fragment base, retrieval enhancement is used to return the three knowledge fragments si1, si2, si3 whose knowledge is closest to it, and they are spliced as Input = [cls] si [sep] si1 [sep] si2 [sep] si3 [sep];
  • the [cls] character serves as the first character, marking the start of the entire input, and knowledge fragments are separated by the second character [sep]; the same masking method as the BERT model is then applied, i.e., 15% of the words in the spliced input are masked at random and represented by the [mask] character.
  • the masked language modeling task of the BERT model is used to predict all the words masked in the above input, thereby pre-training the language model. Taking the i-th masked word mi as an example, the vector representation mi of each masked word is obtained and the cross-entropy with the one-hot encoding yi of the true word is computed:
  • K denotes the number of masked words, i.e., the number of samples; there are K samples in total;
  • (l) denotes the l-th dimension of the vector;
  • the one-hot encoding is the representation in which the position of the true word in the vector is 1 and all other positions are 0.
  • the stochastic gradient descent (SGD) algorithm is used, and the pre-trained language model is built in the pytorch framework to predict the masks, while the parameters of the retrieval enhancement module and the BERT module are updated at the same time.
  • SGD: stochastic gradient descent.
  • gradient descent is performed with one sample per iteration.
  • because stochastic gradient descent uses only one sample per iteration, training is very fast, which reduces computational cost and at the same time meets this application's requirements regarding changing samples and very few parameters.
  • the inputted knowledge fragments and the retrieved existing knowledge fragments are spliced, and the spliced knowledge fragments are masked.
  • the splicing and masking methods of the BERT model can be directly used.
  • the retrieval enhancement module is integrated into the BERT model for end-to-end learning, which greatly weakens the model's thirst for parameters and significantly enhances the model's ability to learn and digest knowledge.
  • after training of the retrieval-enhanced language model through end-to-end learning is completed, it has significant advantages on tasks such as open-domain question answering (open-domain QA), which require first finding, among massive amounts of information, the knowledge fragments that contain the answer before answering the question.
  • through retrieval enhancement, this application integrates all stages of such tasks, which greatly simplifies the task pipeline and reduces its difficulty; it opens up another language model training paradigm besides enlarging the data set or the model, and reduces the cost of offline training and online deployment of the model.
  • corresponding to the flows of the above methods, an end-to-end language model pre-training system is provided, as shown in FIG. 2, which includes:
  • the retrieval enhancement module 201, which retrieves from the existing knowledge base existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment;
  • the splicing module 202, which splices the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment;
  • the masking module 203, which performs mask processing on the spliced knowledge fragment;
  • the pre-training module 204, which uses the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training.
  • when the existing knowledge base is the existing external knowledge base,
  • before the retrieval enhancement module retrieves from the existing knowledge base the existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment,
  • the external knowledge base is given the following preliminary preprocessing by the data processing module; as shown in Figure 5, the data processing module includes the following units:
  • Segmentation unit 501 for segmenting each target article in the external knowledge base to obtain knowledge fragments whose length is less than the set number of words;
  • the vector representation unit 502 converts each said knowledge segment obtained by segmentation into a corresponding vector representation, and the vector representation contains all semantic information of the corresponding knowledge segment;
  • the indexing unit 503 establishes an index on the vector representations corresponding to all knowledge segments in the external knowledge base, and completes the preliminary preprocessing of the external knowledge base.
  • when the retrieval enhancement module performs retrieval according to the preset knowledge-similarity judgment rule, the existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment are retrieved from the existing knowledge base; as shown in Figure 7, this specifically involves the following units:
  • a vector inner product unit 701 which performs a vector inner product between the input vector representation of the knowledge fragment and the vector representation of the existing knowledge fragment;
  • the sorting unit 702 uses the obtained inner product of the vectors as the score of the correlation, and sorts from large to small;
  • the selection unit 703 selects the target vector inner product that meets the set threshold from the sorting, and obtains that the existing knowledge segment corresponding to the target vector inner product is a similar existing knowledge segment.
  • when the splicing unit splices the input knowledge fragment with the retrieved existing knowledge fragments,
  • the spliced knowledge fragment is obtained; as shown in Figure 10, the splicing is performed by the following units:
  • the initial splicing unit 1001 sequentially splices the inputted knowledge fragment and the retrieved existing knowledge fragment after the first character; the first character is used as the initial identification of the entire spliced knowledge fragment;
  • the separation and splicing unit 1002 respectively sets a second character between the inputted knowledge segment and the retrieved existing knowledge segment; the second character is used as a separation identifier between knowledge segments.
  • An electronic device applying end-to-end language model pre-training includes: at least one processor; and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one
  • processor, and the instructions, when executed by the at least one processor, implement:
  • according to the preset knowledge-similarity judgment rule, retrieving from the existing knowledge base existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment; splicing the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment; performing mask processing on the spliced knowledge fragment;
  • using the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training.
  • An embodiment of the present application further provides a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile and stores executable computer instructions which, when executed, implement:
  • according to the preset knowledge-similarity judgment rule, retrieving from the existing knowledge base existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment; splicing the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment; performing mask processing on the spliced knowledge fragment; using the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training.
  • the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • computer-usable storage media including, but not limited to, disk storage, CD-ROM, optical storage, etc.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means,
  • and the instruction means implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An end-to-end language model pre-training method, system, device and storage medium. The method includes: according to a preset knowledge-similarity judgment rule, retrieving from an existing knowledge base existing knowledge fragments whose knowledge is similar to that of an input knowledge fragment (101); splicing the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment (102); performing mask processing on the spliced knowledge fragment (103); and using the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training (104). By using the preset similarity judgment rule to retrieve similar existing knowledge fragments from the existing knowledge base, the method reduces the model's demand for parameters during training, enables the language model to exploit external knowledge on the basis of retrieval enhancement, and improves the efficiency of language model training.

Description

End-to-end language model pre-training method, system, device and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on December 28, 2020 under application No. 202011587439.0 and entitled "End-to-end language model pre-training method, system, device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to language model pre-training methods, and in particular to an end-to-end language model pre-training method, system, device and storage medium.
Background
Capabilities such as question answering, search, summarization, dialogue and chat, together with the ability to reason and make decisions using knowledge and common sense, support scenarios such as customer service, diagnosis, law and teaching. Natural language understanding is regarded as the jewel in the crown of AI; once a breakthrough is made, it will greatly advance the deployment of AI in many important scenarios.
Pre-trained models have moved natural language processing from its original stage of manual parameter tuning that relied on ML experts into a stage of large-scale, reproducible industrial deployment. Moreover, pre-trained models have extended from monolingual to multilingual and multimodal tasks. Pre-training obtains task-independent pre-trained models from large-scale data through self-supervised learning.
Pre-trained models are needed for several reasons. First, a pre-trained model is an application of transfer learning: it uses almost unlimited text to learn a context-dependent representation of each member of an input sentence, implicitly learning general syntactic and semantic knowledge. Second, it can transfer knowledge learned from the open domain to downstream tasks to improve low-resource tasks, which is also very beneficial for low-resource language processing. Third, pre-trained models have achieved state-of-the-art results on almost all NLP tasks. Finally, the pre-training + fine-tuning mechanism is highly scalable: to support a new task, one only needs to fine-tune with the labeled data of that task, which an ordinary engineer can accomplish.
Pre-training involves three key techniques: the first is the Transformer, used to encode or decode an input sentence or paragraph; the second is self-supervised learning, used to learn context-dependent representations of words; and the third is fine-tuning, which aims to adjust the parameters of the pre-trained network using annotated samples.
In recent years, mainstream pre-training language modeling methods inject the large amount of knowledge contained in a data set into the parameters of the model itself by pre-training on large-scale data sets, and achieve very good performance by fine-tuning on downstream tasks such as question answering. Many subsequent improvements have further boosted performance from two angles: 1. training on even larger data sets; 2. adopting models with an even larger number of parameters.
Technical Problem
In summary, the inventors have realized that in practical scenarios collecting large-scale, high-quality samples is extremely expensive; in some fields one can only borrow and absorb related knowledge from samples in other logically or conceptually similar fields. Models with huge numbers of parameters are also extremely resource-intensive in actual deployment and use: on the one hand they increase the load on servers, and on the other hand the cost of training and fine-tuning them is enormous. Furthermore, even with such a huge number of parameters, a model still cannot store in its own parameters all the knowledge contained in the samples; on the one hand we cannot know how much knowledge the data samples actually contain, and on the other hand we cannot modify or add to the knowledge the model has already learned, which brings many insurmountable difficulties to later visual analysis and interpretability.
Technical Solution
To address the problems in the prior art that language model pre-training requires a huge number of parameters and has low training efficiency, while also increasing the deployment cost of language models, the present application provides an end-to-end language model pre-training method, system, device and storage medium.
The present application is implemented through the following technical solutions:
An end-to-end language model pre-training method, including:
according to a preset knowledge-similarity judgment rule, retrieving from an existing knowledge base existing knowledge fragments whose knowledge is similar to that of an input knowledge fragment;
splicing the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment;
performing mask processing on the spliced knowledge fragment;
using the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training.
Preferably, the existing knowledge base is an existing external knowledge base and/or a collection of the knowledge fragments in the language model pre-training samples other than the input knowledge fragment.
An end-to-end language model pre-training system, including:
a retrieval enhancement module, which, according to a preset knowledge-similarity judgment rule, retrieves from an existing knowledge base existing knowledge fragments whose knowledge is similar to that of an input knowledge fragment;
a splicing module, which splices the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment;
a masking module, which performs mask processing on the spliced knowledge fragment;
a pre-training module, which uses the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training.
An electronic device, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, implement:
according to a preset knowledge-similarity judgment rule, retrieving from an existing knowledge base existing knowledge fragments whose knowledge is similar to that of an input knowledge fragment;
splicing the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment;
performing mask processing on the spliced knowledge fragment;
using the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training.
A computer-readable storage medium storing executable computer instructions which, when executed, implement:
according to a preset knowledge-similarity judgment rule, retrieving from an existing knowledge base existing knowledge fragments whose knowledge is similar to that of an input knowledge fragment;
splicing the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment;
performing mask processing on the spliced knowledge fragment;
using the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training.
Beneficial Effects
Compared with the prior art, the embodiments of the present application have the following beneficial effects: the present application uses a preset similarity judgment rule to retrieve similar existing knowledge fragments from an existing knowledge base, which reduces the model's demand for parameters during training; the input knowledge fragment and the retrieved existing knowledge fragments are spliced to obtain a spliced knowledge fragment, so that the existing knowledge fragments in the existing knowledge base can be fully utilized and the efficiency of language model training is improved; by performing mask processing on the spliced knowledge fragment and using it as the input of language model pre-training, the language model can, on the basis of retrieval enhancement, first find through retrieval the information that is truly valuable to it, namely the existing knowledge fragments with similar knowledge, and then learn through prediction training, thereby realizing the transition from training that relies on direct, static input of parameters to training whose input changes dynamically according to retrieval, which is also more in line with how knowledge itself is learned; the method described in this application opens up another language model training paradigm besides enlarging the data set or enlarging the model, lowering the cost of offline training and online deployment of the model.
Brief Description of the Drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the end-to-end language model pre-training method described in an example of the present application;
Fig. 2 is a block diagram of the end-to-end language model pre-training system described in an example of the present application;
Fig. 3 is an architecture diagram of the retrieval-enhanced pre-trained language model described in an example of the present application;
Fig. 4 is a flowchart of the preliminary knowledge base preprocessing method described in an example of the present application;
Fig. 5 is a structural block diagram of the data processing module described in an example of the present application;
Fig. 6 is a flowchart of the similarity judgment method described in an example of the present application;
Fig. 7 is a structural block diagram of the retrieval enhancement module described in an example of the present application;
Fig. 8 is an architecture diagram of the existing masked language modeling model described in an example of the present application;
Fig. 9 is a flowchart of the splicing method described in an example of the present application;
Fig. 10 is a structural block diagram of the splicing mask module described in an example of the present application.
Embodiments of the Invention
The present application is further described in detail below with reference to specific embodiments; the description explains the present application and does not limit it.
To make the objectives, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
It should be noted that, provided there is no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.
The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, elements, data structures, etc. that perform particular tasks or implement particular abstract data types. The present application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
In the present application, "module", "apparatus", "system" and the like refer to entities applied to a computer, such as hardware, a combination of hardware and software, software, or software in execution. In detail, for example, an element may be, but is not limited to, a process running on a processor, a processor, an object, an executable element, a thread of execution, a program and/or a computer. Also, an application program or a script program running on a server, or the server itself, may be an element. One or more elements may reside within a process and/or thread of execution, and an element may be localized on one computer and/or distributed between two or more computers and may be executed from various computer-readable media. Elements may also communicate through local and/or remote processes by means of a signal having one or more data packets, for example a signal carrying data from one element interacting with another element in a local system or in a distributed system, and/or interacting with other systems through a network such as the Internet.
Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include" and "comprise" cover not only the listed elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
For end-to-end language model pre-training, those skilled in the art generally apply existing training methods such as the BERT model. BERT stands for Bidirectional Encoder Representation from Transformers, i.e., the encoder of a bidirectional Transformer; the encoder is used because the decoder cannot see the information that is to be predicted. The main innovations of the model lie in the pre-training method, namely using two approaches, the masked language model (Masked LM) and next sentence prediction, to capture word-level and sentence-level semantics (representations) respectively.
The BERT pre-training stage consists of two unsupervised prediction tasks: the masked language model and next sentence prediction. Masked language model (MLM): owing to the bidirectionality of the BERT model and the multi-layer self-attention mechanism it uses, in order to train deep bidirectional representations, a certain percentage of the input tokens (15% in the paper) are simply masked at random and the masked tokens are then predicted. Next sentence prediction (NSP): to train a model that understands sentence relationships and the semantic relationships between words, the BERT model additionally pre-trains a binarized next-sentence prediction task, which can easily be generated from any text corpus.
In actual use and prediction, however, improving the pre-training effect of existing end-to-end language models requires either a larger training set or a model with an even larger number of parameters, with the result that the knowledge the model has already learned cannot later be modified or extended, and visual analysis and practical deployment are extremely resource-intensive.
In fact, rather than having the language model learn all the knowledge in the samples at once and store it in its parameters, it is better to teach the model how to first look for useful knowledge when it is needed and then learn from it, which is also more in line with how knowledge is actually learned.
Therefore, the present application introduces a retrieval enhancement module or method on the basis of the mainstream pre-trained language model BERT and performs end-to-end learning, thereby giving the model the ability to find related or complementary knowledge for existing knowledge and relieving the pressure on the number of model parameters, so that a model with a smaller number of parameters can be used for later deployment. Moreover, through end-to-end learning, the retrieval enhancement module itself can be continuously learned and updated, so that it acquires the ability to retrieve within a specific field, which is extremely valuable for fields where samples are scarce. When analyzing a specific task, we can judge whether the model's own understanding is wrong by observing the knowledge returned by the retrieval enhancement module, which improves the interpretability of the model.
Here, end-to-end learning means obtaining the result at the output end directly from the data at the input end, i.e., without feature preprocessing and feature extraction the raw data is fed directly into the language model to obtain the final result. In this preferred example, the spliced knowledge fragment is masked and used directly as the pre-training input without feature preprocessing or extraction, and the final result is obtained directly through prediction training.
Therefore, in this preferred example, taking the pre-training method of the BERT model as an example, based on the mainstream pre-trained language model method BERT and following the idea of integrating retrieval enhancement with language model pre-training, the overall flow of the end-to-end language model pre-training method of the present application is shown in Fig. 1 and may include:
Step 101: according to a preset knowledge-similarity judgment rule, retrieving from an existing knowledge base existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment;
In this example, this step is the retrieval enhancement step, in which one or more similar existing knowledge fragments may be selected according to the preset judgment rule. Using the preset judgment rule, the target objects can be obtained quickly from the existing knowledge base, which speeds up the determination of the input objects and at the same time reduces the demand for parameters.
Step 102: splicing the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment;
In this example, the input knowledge fragment and the existing knowledge fragments obtained through retrieval are fused by splicing, which greatly increases the controllable dimensions of the input parameters and enables many combinations of input and retrieval: a change in the input brings a change in the retrieval result, which in turn changes the splicing result, further providing a fission-like expansion of the language model's inputs even when the actual input remains stable;
Step 103: performing mask processing on the spliced knowledge fragment;
Step 104: using the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training. In this example, mask processing turns the spliced knowledge fragment into the object, i.e. the input, of language model pre-training, so that the preset goal can be reached through continuous repeated training and the end-to-end language model pre-training can be completed.
Through the above method, the corresponding retrieval-enhanced pre-trained language model can be obtained. As shown in Fig. 3, after the retrieval enhancement of similar existing knowledge fragments obtained through retrieval as described in step 101 is integrated into the BERT model for end-to-end learning, only a single input knowledge fragment is needed: by retrieving existing knowledge fragments, they are used together as the input, making full use of the existing knowledge base without having to expand the volume of parameters, which greatly weakens the model's craving for parameters and significantly strengthens the model's ability to learn and digest knowledge. Splicing the existing knowledge fragments with the input knowledge fragment, masking the result and using it as input makes the retrieval-enhanced pre-trained language model learn how to use external knowledge by first finding the information that is truly valuable to itself and then learning from it, realizing the transition from static to dynamic, which is also more in line with how knowledge itself is learned.
In an embodiment of the present application, the existing knowledge base serving as the retrieval object in the above steps is in fact the existing external knowledge base, or the collection of all knowledge fragments in the language model pre-training samples other than the input knowledge fragment, or the union of the two. Retrieving existing knowledge fragments from it can greatly reduce the dependence on the scale and parameters of the input samples when they are fed directly into pre-training, and can also make full use of the knowledge fragments in the existing knowledge base, ensuring the effectiveness and efficiency of language model pre-training.
In an embodiment of the present application, when the existing knowledge base is the existing external knowledge base, before the existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment are retrieved from the existing knowledge base according to the preset knowledge-similarity judgment rule, the method further includes the following preliminary preprocessing to obtain vector representations, as shown in Fig. 4.
Step 401: segmenting each target article in the external knowledge base to obtain knowledge fragments whose length is smaller than a set number of words; that is, on the premise that each knowledge fragment contains as many complete sentences as possible, making the length of each knowledge fragment smaller than the set number of words;
Step 402: converting each knowledge fragment obtained by segmentation into a corresponding vector representation, which contains all the semantic information of the corresponding knowledge fragment;
Step 403: building an index over the vector representations corresponding to all knowledge fragments in the external knowledge base, completing the preliminary preprocessing of the external knowledge base.
The specific processing of the external knowledge base is explained by taking a commonly used external knowledge base as an example. An external knowledge database is built from the publicly available English Wikipedia article data and preliminarily preprocessed with the WikicorpusTextFormatting.py script officially provided by NVIDIA, extracting the main content information of each Wikipedia article, recorded as:
Database = {d1, d2, d3, …, di, …, dN}
There are N articles in total, and di denotes the main content information of the i-th article. Since the articles differ in length, each article di is processed further:
di = {pi1, pi2, …, pij, …, pim};
The specific procedure is as follows: each article di is split, in order from front to back, into m small fragments, so that the length of each small fragment pij is smaller than 128 words; the principle for choosing split points is to include as many complete sentences as possible, which facilitates later processing by the BERT model. The fragments obtained from all articles are shuffled and merged, abbreviated as:
Database = {s1, s2, s3, …, sK};
where each fragment pi comes from the segmentation of some article; after processing and merging there are K knowledge fragments sk in total, which constitute the model's final knowledge fragment base, i.e., the existing external knowledge base.
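As an illustration of this segmentation and merging procedure, the following sketch splits each article at sentence boundaries into fragments of fewer than 128 words and then shuffles and merges the fragments of all articles into the flat fragment base {s1, …, sK}. It is a minimal reading of the rule described above, with a naive period-based sentence splitter assumed in place of whatever splitter the patent actually uses:

    import random
    import re

    def split_article(article: str, max_words: int = 128):
        """Split one article d_i into fragments p_ij of fewer than max_words words,
        cutting only at sentence boundaries so fragments keep complete sentences."""
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", article) if s.strip()]
        fragments, current, length = [], [], 0
        for sent in sentences:
            n = len(sent.split())
            if current and length + n >= max_words:   # close the current fragment
                fragments.append(" ".join(current))
                current, length = [], 0
            current.append(sent)
            length += n
        if current:
            fragments.append(" ".join(current))
        return fragments

    def build_fragment_base(articles, seed: int = 0):
        """Split every article, then shuffle and merge all fragments into the
        knowledge fragment base {s_1, ..., s_K}."""
        base = [frag for article in articles for frag in split_article(article)]
        random.Random(seed).shuffle(base)             # scramble fragments across articles
        return base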
Each knowledge fragment si is passed through the BERT model to obtain a 512-dimensional vector representation hi, which contains all the semantic information of that knowledge fragment;
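One possible realization of this fragment-to-vector step, using the Hugging Face transformers library, is sketched below. The patent does not say how a single vector is pooled from BERT, so mean pooling is assumed, and because bert-base hidden states are 768-dimensional, a linear projection down to the 512 dimensions stated here is added as a further assumption:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")
    project = torch.nn.Linear(encoder.config.hidden_size, 512)   # 768 -> 512 (assumed)

    @torch.no_grad()
    def encode_with_bert(fragment: str) -> torch.Tensor:
        """Map a knowledge fragment s_i to a single vector h_i."""
        inputs = tokenizer(fragment, truncation=True, max_length=256, return_tensors="pt")
        hidden = encoder(**inputs).last_hidden_state      # (1, seq_len, 768)
        pooled = hidden.mean(dim=1).squeeze(0)            # mean pooling over tokens (assumed)
        return project(pooled)                            # 512-dimensional h_i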
An index is built over the vector representations hi of all knowledge fragments using Microsoft's open-source MIPS (maximum inner product search) framework; the purpose is that, given a knowledge fragment (query), the k related knowledge fragments in the fragment base that are closest to it at the semantic level (in meaning) can be returned efficiently and quickly.
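The text names Microsoft's open-source MIPS framework for this index without showing its interface, so the sketch below substitutes a brute-force maximum-inner-product search over the fragment vectors h_i; it reuses the hypothetical encode_with_bert helper from the previous sketch:

    import numpy as np

    class InnerProductIndex:
        """Toy stand-in for a MIPS index over the fragment vectors h_i."""

        def __init__(self, fragment_vectors: np.ndarray):
            self.vectors = fragment_vectors               # shape (K, 512), one row per s_k

        def search(self, query_vector: np.ndarray, k: int = 3):
            scores = self.vectors @ query_vector          # inner product with every fragment
            top = np.argsort(-scores)[:k]                 # indices of the k largest scores
            return top.tolist(), scores[top].tolist()

    # Assumed usage:
    #   H = np.stack([encode_with_bert(s).numpy() for s in fragment_base])   # (K, 512)
    #   index = InnerProductIndex(H)
    #   ids, scores = index.search(encode_with_bert(query).numpy(), k=3)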
After the preliminary preprocessing is completed, similarity can be judged conveniently on the basis of the preprocessed data, and this similarity relationship is embodied specifically as similarity or complementarity.
In an embodiment of the present application, when the existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment are retrieved from the existing knowledge base according to the preset knowledge-similarity judgment rule, the above similarity relationship can be judged and selected through the following steps to obtain the retrieval result, as shown in Fig. 6, which specifically includes the following steps:
Step 601: computing a vector inner product between the vector representation of the input knowledge fragment and the vector representation of each existing knowledge fragment;
Step 602: taking the obtained vector inner products as relevance scores and sorting them from largest to smallest;
Step 603: selecting from the sorted list the target vector inner products that meet a set threshold, the existing knowledge fragments corresponding to the target vector inner products being the similar existing knowledge fragments.
Continuing with the above example, the similarity score between the knowledge fragment (query) and each related knowledge fragment is given by:
score = BERT(query)ᵀ·hi
That is, the vector representation of the query knowledge fragment obtained through BERT forms a vector inner product with each related knowledge fragment vector hi; this inner product measures, for the given query (knowledge fragment), the relevance of each related knowledge fragment hi to it. We sort by the size of the relevance score and select the c knowledge fragments closest to the given knowledge. In this application, for memory reasons, we set the corresponding threshold c = 3, i.e., the three most similar or complementary knowledge fragments are taken as the retrieval result under this constraint, that is, as the output of the retrieval enhancement module.
In an embodiment of the present application, the above description covers only the first part of the pre-training method described in this application. Once the pre-training input has been obtained with minimal consumption and parameters, the result still needs to be fused with the pre-trained language model: the input knowledge fragment and the retrieved existing knowledge fragments are spliced, and the resulting spliced knowledge fragment is used as the input of language model pre-training and masked. Integrated into the specific pre-training model, the enhanced language model proposed in this application performs its final training based on only one of the two tasks in the BERT model, namely the masked language modeling (MLM) task. The resulting masked language modeling model is shown in Fig. 8: BERT language model pre-training is performed after masking ([mask]) within the spliced knowledge fragment.
In this preferred example, different sentences are separated by the [sep] character. The example in Fig. 8 assumes that the first knowledge fragment has 10 words and the second knowledge fragment has 16 words, and different sentences within each knowledge fragment are also separated by [sep] characters. In short, the MLM task first randomly replaces some words in a paragraph (here referring to two knowledge fragments; a paragraph may contain multiple knowledge fragments) with the mask token [mask]; here these are word 2 in the first knowledge fragment and word 15 in the second knowledge fragment. The model predicts each masked word from its context, i.e., the unmasked part of the paragraph, that is, everything except word 2 in the first knowledge fragment and word 15 in the second knowledge fragment, thereby pre-training the language model.
In the original prior-art method described above, a paragraph that may contain multiple knowledge fragments is composed of multiple knowledge fragments that are themselves contiguous in the article di, i.e., an essentially continuous passage, whereas the present application can dynamically adjust the selection of subsequent knowledge fragments according to the content of the first knowledge fragment, which is more flexible.
Moreover, for a given paragraph, the context information the model can use for a masked word in the original prior-art method is fixed and unchanging throughout the entire training process, whereas in the present application the parameters of the retrieval enhancement are also continuously updated; that is, for the first knowledge fragment, the existing knowledge fragments paired with it keep changing during training, so the context information each masked word can use during training also keeps changing along with the most relevant knowledge fragments.
When pre-training is performed in this preferred example, first, the input knowledge fragment and the retrieved existing knowledge fragments are spliced to obtain a spliced knowledge fragment as described in this application; that is, the input knowledge fragment and the retrieved existing knowledge fragments are spliced as shown in Fig. 9, as follows:
Step 901: splicing the input knowledge fragment and the retrieved existing knowledge fragments in sequence after a first character;
Step 902: placing a second character between the input knowledge fragment and each retrieved existing knowledge fragment;
The first character serves as the start identifier of the entire spliced knowledge fragment;
The second character serves as the separator between knowledge fragments.
On the basis of the BERT model processing described above, the present application incorporates retrieval enhancement, so the objects being spliced change: compared with the multiple knowledge fragments that the original prior-art method requires as input, the present application can obtain multiple similar knowledge fragments from a single input knowledge fragment, thereby reducing the demand for input parameters. Specifically, for each knowledge fragment in the knowledge fragment base of K fragments, taking si as an example, retrieval enhancement is used to return the three knowledge fragments si1, si2, si3 whose knowledge is closest to it;
si is spliced together with si1, si2, si3 in the same way as in the BERT method, namely:
Input = [cls] si [sep] si1 [sep] si2 [sep] si3 [sep];
where the [cls] character, as the first character, marks the start of the entire input, and knowledge fragments are separated by the second character [sep];
The same masking method as the BERT model is adopted, i.e., 15% of the words in the spliced input above are masked at random, and the masked words are represented by the [mask] character.
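A minimal sketch of the splicing and masking convention just described: the input fragment si and its three retrieved fragments are joined as [cls] si [sep] si1 [sep] si2 [sep] si3 [sep], and 15% of the ordinary (non-special) tokens are replaced by [mask]. Whitespace tokenization is an assumption made for brevity; in practice the BERT tokenizer would be used:

    import random

    SPECIAL = {"[cls]", "[sep]", "[mask]"}

    def splice(si: str, retrieved):
        """Join the input fragment with its retrieved fragments:
        [cls] si [sep] si1 [sep] si2 [sep] si3 [sep]"""
        tokens = ["[cls]"] + si.split() + ["[sep]"]
        for frag in retrieved:
            tokens += frag.split() + ["[sep]"]
        return tokens

    def mask_tokens(tokens, rate: float = 0.15, seed=None):
        """Randomly replace `rate` of the ordinary tokens with [mask];
        returns the masked sequence and the original words at masked positions."""
        rng = random.Random(seed)
        masked, labels = [], {}
        for pos, tok in enumerate(tokens):
            if tok not in SPECIAL and rng.random() < rate:
                labels[pos] = tok
                masked.append("[mask]")
            else:
                masked.append(tok)
        return masked, labels

    # Example: tokens = splice(si, [si1, si2, si3]); masked, labels = mask_tokens(tokens, seed=0)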
In an embodiment of the present application, when the masked spliced knowledge fragment is used as the input of language model pre-training for prediction training, on the basis of the splicing and masking example above, the masked language modeling task of the BERT model is used to predict all the masked words in the above input, thereby pre-training the language model. Taking the i-th masked word mi as an example, the vector representation mi of each masked word is obtained and the cross-entropy with the one-hot encoding yi of the true word is computed:
Figure PCTCN2021084283-appb-000001
where K denotes the number of masked words, i.e., the number of samples (there are K samples in total); (l) denotes the l-th dimension of the vector; and the one-hot encoding is the representation in which the position of the true word in the vector is 1 and all other positions are 0.
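The equation referenced above as Figure PCTCN2021084283-appb-000001 is not reproduced in this text. Under the stated definitions (K masked words treated as samples, prediction mi for the i-th masked word, one-hot encoding yi of the true word, and (l) indexing the l-th dimension), a standard cross-entropy consistent with this description would look as follows; the exact averaging convention used in the patent is an assumption:

    \mathcal{L} \;=\; -\frac{1}{K}\sum_{i=1}^{K}\sum_{l} y_i^{(l)}\,\log m_i^{(l)}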
Finally, based on the cross-entropy loss function, the stochastic gradient descent (SGD) algorithm is used and the pre-trained language model is built in the pytorch framework to predict the masks, while the parameters of the retrieval enhancement module and the BERT module are updated at the same time. With stochastic gradient descent, gradient descent is performed with one sample per iteration; because only one sample is used per iteration, training is very fast, which reduces computational cost and at the same time meets this application's requirements regarding changing samples and very few parameters.
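A schematic PyTorch step consistent with this paragraph: one sample per SGD iteration, cross-entropy over the masked positions, and a single optimizer covering both the retrieval encoder's and BERT's parameters so that both are updated. The text does not spell out how gradients reach the retrieval module; weighting the per-splice MLM losses by the softmax of the retrieval scores (a REALM-style choice) is used here as an assumption, and bert_mlm stands for any module that maps token ids to per-token vocabulary logits:

    import torch
    from torch import nn

    def train_step(query_vec, frag_vecs, spliced_samples, bert_mlm, optimizer):
        # query_vec: (d,) embedding of the input fragment from the retrieval encoder
        # frag_vecs: (c, d) embeddings of the c retrieved fragments (c = 3 above)
        # spliced_samples: list of c (input_ids, labels) pairs; labels hold the true
        # token ids at masked positions and -100 elsewhere (ignored by the loss)
        scores = frag_vecs @ query_vec                # inner-product relevance scores
        weights = torch.softmax(scores, dim=0)        # lets gradients reach the retriever
        loss = query_vec.new_zeros(())
        for w, (input_ids, labels) in zip(weights, spliced_samples):
            logits = bert_mlm(input_ids)              # (1, seq_len, vocab_size)
            mlm = nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
            )
            loss = loss + w * mlm
        optimizer.zero_grad()
        loss.backward()                               # one-sample SGD iteration
        optimizer.step()
        return loss.item()

    # Assumed setup, with one SGD optimizer over both modules:
    #   optimizer = torch.optim.SGD(
    #       list(retrieval_encoder.parameters()) + list(bert_mlm.parameters()), lr=1e-4)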
In an embodiment of the present application, apart from the specific splicing and masking method above, when the input knowledge fragment is spliced with the retrieved existing knowledge fragments and the spliced knowledge fragment is masked, the splicing and masking methods of the BERT model can be used directly.
In the present application, after the retrieval enhancement module is integrated into the BERT model for end-to-end learning, the model's craving for parameters is greatly weakened and its ability to learn and digest knowledge is significantly strengthened. Once the end-to-end retrieval-enhanced language model has been trained, it has significant advantages on tasks such as open-domain question answering (open-domain QA), which require first finding, among massive amounts of information, the knowledge fragments containing the answer before answering the question. Through retrieval enhancement, the present application integrates all stages of such tasks, which greatly simplifies the task pipeline and lowers its difficulty; it opens up another language model training paradigm besides enlarging the data set or the model, and lowers the cost of offline training and online deployment of the model.
In an embodiment of the present application, corresponding to the flows of the above methods, an end-to-end language model pre-training system is provided, as shown in Fig. 2, which includes:
a retrieval enhancement module 201, which, according to the preset knowledge-similarity judgment rule, retrieves from the existing knowledge base existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment;
a splicing module 202, which splices the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment;
a masking module 203, which performs mask processing on the spliced knowledge fragment;
a pre-training module 204, which uses the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training.
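To make the division of labor among modules 201-204 concrete, a minimal structural sketch follows; the class and method names are illustrative only and are not taken from the patent:

    from dataclasses import dataclass
    from typing import Callable, Dict, List, Tuple

    @dataclass
    class EndToEndPretrainingSystem:
        """Structural sketch of the four modules described above."""
        retrieve: Callable[[str], List[str]]                    # retrieval enhancement module 201
        splice: Callable[[str, List[str]], List[str]]           # splicing module 202
        mask: Callable[[List[str]], Tuple[List[str], Dict]]     # masking module 203
        pretrain_step: Callable[[List[str], Dict], float]       # pre-training module 204

        def step(self, knowledge_fragment: str) -> float:
            similar = self.retrieve(knowledge_fragment)          # similar existing fragments
            spliced = self.splice(knowledge_fragment, similar)   # spliced knowledge fragment
            masked, labels = self.mask(spliced)                  # mask processing
            return self.pretrain_step(masked, labels)            # prediction training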
When the existing knowledge base is the existing external knowledge base, before the retrieval enhancement module retrieves from the existing knowledge base, according to the preset knowledge-similarity judgment rule, the existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment, the external knowledge base is given the following preliminary preprocessing by the data processing module; as shown in Fig. 5, the data processing module includes the following units:
a segmentation unit 501, which segments each target article in the external knowledge base to obtain knowledge fragments whose length is smaller than a set number of words;
a vector representation unit 502, which converts each knowledge fragment obtained by segmentation into a corresponding vector representation, the vector representation containing all the semantic information of the corresponding knowledge fragment;
an indexing unit 503, which builds an index over the vector representations corresponding to all knowledge fragments in the external knowledge base, completing the preliminary preprocessing of the external knowledge base.
When the retrieval enhancement module performs retrieval according to the preset knowledge-similarity judgment rule, the existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment are retrieved from the existing knowledge base; as shown in Fig. 7, this specifically involves the following units:
a vector inner product unit 701, which computes a vector inner product between the vector representation of the input knowledge fragment and the vector representation of each existing knowledge fragment;
a sorting unit 702, which takes the obtained vector inner products as relevance scores and sorts them from largest to smallest;
a selection unit 703, which selects from the sorted list the target vector inner products that meet a set threshold, the existing knowledge fragments corresponding to the target vector inner products being the similar existing knowledge fragments.
When the splicing unit splices the input knowledge fragment with the retrieved existing knowledge fragments to obtain the spliced knowledge fragment, as shown in Fig. 10, the splicing is performed by the following units:
an initial splicing unit 1001, which splices the input knowledge fragment and the retrieved existing knowledge fragments in sequence after a first character, the first character serving as the start identifier of the entire spliced knowledge fragment;
a separating-and-splicing unit 1002, which places a second character between the input knowledge fragment and each retrieved existing knowledge fragment, the second character serving as the separator between knowledge fragments.
An embodiment of the present application provides an electronic device applying end-to-end language model pre-training, which includes: at least one processor; and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, implement:
according to a preset knowledge-similarity judgment rule, retrieving from an existing knowledge base existing knowledge fragments whose knowledge is similar to that of an input knowledge fragment;
splicing the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment;
performing mask processing on the spliced knowledge fragment;
using the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training.
An embodiment of the present application further provides a computer-readable storage medium; the computer-readable storage medium may be non-volatile or volatile, and stores executable computer instructions which, when executed, implement:
according to a preset knowledge-similarity judgment rule, retrieving from an existing knowledge base existing knowledge fragments whose knowledge is similar to that of an input knowledge fragment;
splicing the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment;
performing mask processing on the spliced knowledge fragment;
using the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training.
Those skilled in the art will understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific implementations of the present application may still be modified or equivalently replaced, and any modification or equivalent replacement that does not depart from the spirit and scope of the present application shall be covered by the protection scope of the claims of the present application.

Claims (20)

  1. An end-to-end language model pre-training method, comprising:
    according to a preset knowledge-similarity judgment rule, retrieving from an existing knowledge base existing knowledge fragments whose knowledge is similar to that of an input knowledge fragment;
    splicing the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment;
    performing mask processing on the spliced knowledge fragment;
    using the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training.
  2. The end-to-end language model pre-training method according to claim 1, wherein the existing knowledge base is an existing external knowledge base and/or a collection of the knowledge fragments in the language model pre-training samples other than the input knowledge fragment.
  3. The end-to-end language model pre-training method according to claim 2, wherein, when the existing knowledge base is the existing external knowledge base, before retrieving from the existing knowledge base, according to the preset knowledge-similarity judgment rule, the existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment, the method further comprises:
    segmenting each target article in the external knowledge base to obtain knowledge fragments whose length is smaller than a set number of words;
    converting each knowledge fragment obtained by segmentation into a corresponding vector representation, the vector representation containing all the semantic information of the corresponding knowledge fragment;
    building an index over the vector representations corresponding to all knowledge fragments in the external knowledge base, completing the preliminary preprocessing of the external knowledge base.
  4. The end-to-end language model pre-training method according to claim 1, wherein retrieving, according to the preset knowledge-similarity judgment rule, from the existing knowledge base the existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment comprises:
    computing a vector inner product between the vector representation of the input knowledge fragment and the vector representation of each existing knowledge fragment;
    taking the obtained vector inner products as relevance scores and sorting them from largest to smallest;
    selecting from the sorted list the target vector inner products that meet a set threshold, the existing knowledge fragments corresponding to the target vector inner products being the similar existing knowledge fragments.
  5. The end-to-end language model pre-training method according to claim 1, wherein splicing the input knowledge fragment with the retrieved existing knowledge fragments to obtain the spliced knowledge fragment comprises:
    splicing the input knowledge fragment and the retrieved existing knowledge fragments in sequence after a first character;
    placing a second character between the input knowledge fragment and each retrieved existing knowledge fragment;
    the first character serving as the start identifier of the entire spliced knowledge fragment;
    the second character serving as the separator between knowledge fragments.
  6. The end-to-end language model pre-training method according to claim 1, wherein, when the masked spliced knowledge fragment is used as the input of language model pre-training to perform prediction training, the cross-entropy of the prediction training is expressed as follows:
    Figure PCTCN2021084283-appb-100001
    where K denotes the number of masked words, i.e., the number of samples; (l) denotes the l-th dimension of the vector; mi denotes the i-th masked word; and yi is the encoding corresponding to the true word.
  7. The end-to-end language model pre-training method according to claim 1, wherein the splicing of the input knowledge fragment with the retrieved existing knowledge fragments and the mask processing of the spliced knowledge fragment both use the splicing and masking methods of the BERT model.
  8. An end-to-end language model pre-training system, comprising:
    a retrieval enhancement module, which, according to a preset knowledge-similarity judgment rule, retrieves from an existing knowledge base existing knowledge fragments whose knowledge is similar to that of an input knowledge fragment;
    a splicing module, which splices the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment;
    a masking module, which performs mask processing on the spliced knowledge fragment;
    a pre-training module, which uses the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training.
  9. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, implement:
    according to a preset knowledge-similarity judgment rule, retrieving from an existing knowledge base existing knowledge fragments whose knowledge is similar to that of an input knowledge fragment;
    splicing the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment;
    performing mask processing on the spliced knowledge fragment;
    using the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training.
  10. The electronic device according to claim 9, wherein the existing knowledge base is an existing external knowledge base and/or a collection of the knowledge fragments in the language model pre-training samples other than the input knowledge fragment.
  11. The electronic device according to claim 10, wherein, when the existing knowledge base is the existing external knowledge base, before retrieving from the existing knowledge base, according to the preset knowledge-similarity judgment rule, the existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment, the following is further included:
    segmenting each target article in the external knowledge base to obtain knowledge fragments whose length is smaller than a set number of words;
    converting each knowledge fragment obtained by segmentation into a corresponding vector representation, the vector representation containing all the semantic information of the corresponding knowledge fragment;
    building an index over the vector representations corresponding to all knowledge fragments in the external knowledge base, completing the preliminary preprocessing of the external knowledge base.
  12. The electronic device according to claim 9, wherein retrieving, according to the preset knowledge-similarity judgment rule, from the existing knowledge base the existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment comprises:
    computing a vector inner product between the vector representation of the input knowledge fragment and the vector representation of each existing knowledge fragment;
    taking the obtained vector inner products as relevance scores and sorting them from largest to smallest;
    selecting from the sorted list the target vector inner products that meet a set threshold, the existing knowledge fragments corresponding to the target vector inner products being the similar existing knowledge fragments.
  13. The electronic device according to claim 9, wherein splicing the input knowledge fragment with the retrieved existing knowledge fragments to obtain the spliced knowledge fragment comprises:
    splicing the input knowledge fragment and the retrieved existing knowledge fragments in sequence after a first character;
    placing a second character between the input knowledge fragment and each retrieved existing knowledge fragment;
    the first character serving as the start identifier of the entire spliced knowledge fragment;
    the second character serving as the separator between knowledge fragments.
  14. The electronic device according to claim 9, wherein, when the masked spliced knowledge fragment is used as the input of language model pre-training to perform prediction training, the cross-entropy of the prediction training is expressed as follows:
    Figure PCTCN2021084283-appb-100002
    where K denotes the number of masked words, i.e., the number of samples; (l) denotes the l-th dimension of the vector; mi denotes the i-th masked word; and yi is the encoding corresponding to the true word.
  15. The electronic device according to claim 9, wherein the splicing of the input knowledge fragment with the retrieved existing knowledge fragments and the mask processing of the spliced knowledge fragment both use the splicing and masking methods of the BERT model.
  16. A computer-readable storage medium storing executable computer instructions which, when executed, implement:
    according to a preset knowledge-similarity judgment rule, retrieving from an existing knowledge base existing knowledge fragments whose knowledge is similar to that of an input knowledge fragment;
    splicing the input knowledge fragment with the retrieved existing knowledge fragments to obtain a spliced knowledge fragment;
    performing mask processing on the spliced knowledge fragment;
    using the masked spliced knowledge fragment as the input of language model pre-training to perform prediction training, completing the end-to-end language model pre-training.
  17. The computer-readable storage medium according to claim 16, wherein the existing knowledge base is an existing external knowledge base and/or a collection of the knowledge fragments in the language model pre-training samples other than the input knowledge fragment.
  18. The computer-readable storage medium according to claim 17, wherein, when the existing knowledge base is the existing external knowledge base, before retrieving from the existing knowledge base, according to the preset knowledge-similarity judgment rule, the existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment, the following is further included:
    segmenting each target article in the external knowledge base to obtain knowledge fragments whose length is smaller than a set number of words;
    converting each knowledge fragment obtained by segmentation into a corresponding vector representation, the vector representation containing all the semantic information of the corresponding knowledge fragment;
    building an index over the vector representations corresponding to all knowledge fragments in the external knowledge base, completing the preliminary preprocessing of the external knowledge base.
  19. The computer-readable storage medium according to claim 16, wherein retrieving, according to the preset knowledge-similarity judgment rule, from the existing knowledge base the existing knowledge fragments whose knowledge is similar to that of the input knowledge fragment comprises:
    computing a vector inner product between the vector representation of the input knowledge fragment and the vector representation of each existing knowledge fragment;
    taking the obtained vector inner products as relevance scores and sorting them from largest to smallest;
    selecting from the sorted list the target vector inner products that meet a set threshold, the existing knowledge fragments corresponding to the target vector inner products being the similar existing knowledge fragments.
  20. The computer-readable storage medium according to claim 16, wherein splicing the input knowledge fragment with the retrieved existing knowledge fragments to obtain the spliced knowledge fragment comprises:
    splicing the input knowledge fragment and the retrieved existing knowledge fragments in sequence after a first character;
    placing a second character between the input knowledge fragment and each retrieved existing knowledge fragment;
    the first character serving as the start identifier of the entire spliced knowledge fragment;
    the second character serving as the separator between knowledge fragments.
PCT/CN2021/084283 2020-12-28 2021-03-31 端到端的语言模型预训练方法、系统、设备及存储介质 WO2022141878A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011587439.0A CN112699216A (zh) 2020-12-28 2020-12-28 端到端的语言模型预训练方法、系统、设备及存储介质
CN202011587439.0 2020-12-28

Publications (1)

Publication Number Publication Date
WO2022141878A1 true WO2022141878A1 (zh) 2022-07-07

Family

ID=75511469

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084283 WO2022141878A1 (zh) 2020-12-28 2021-03-31 端到端的语言模型预训练方法、系统、设备及存储介质

Country Status (2)

Country Link
CN (1) CN112699216A (zh)
WO (1) WO2022141878A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116204642A (zh) * 2023-03-06 2023-06-02 上海阅文信息技术有限公司 数字阅读中角色隐式属性智能识别分析方法、系统和应用
CN116245197A (zh) * 2023-02-21 2023-06-09 北京数美时代科技有限公司 一种提升语言模型的训练速率的方法、系统、介质及设备
CN116719911A (zh) * 2023-08-10 2023-09-08 成都不烦智能科技有限责任公司 自动化流程生成方法、装置、设备及存储介质
CN117312928A (zh) * 2023-11-28 2023-12-29 南京网眼信息技术有限公司 一种基于aigc识别用户设备信息的方法及系统
WO2024031891A1 (zh) * 2022-08-10 2024-02-15 浙江大学 知识表征解耦的分类模型的微调方法、装置和应用
CN117933401A (zh) * 2024-03-22 2024-04-26 北京大学 基于大语言模型投机采样推理的加速器硬件及加速方法
CN118313411A (zh) * 2024-06-07 2024-07-09 浙江实在智能科技有限公司 基于大模型检索增强生成的rpa智能体及其交互方法

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673702B (zh) * 2021-07-27 2022-07-29 北京师范大学 一种预训练语言模型的评测方法、装置以及存储介质
CN114625861B (zh) * 2022-05-11 2022-09-06 之江实验室 改进Transformer融入知识的端到端对话方法
CN115640520B (zh) * 2022-11-07 2023-07-14 北京百度网讯科技有限公司 跨语言跨模态模型的预训练方法、设备和存储介质
CN116501859B (zh) * 2023-06-26 2023-09-01 中国海洋大学 基于冰箱领域的段落检索方法、设备和介质
CN118114743B (zh) * 2024-04-29 2024-09-13 支付宝(杭州)信息技术有限公司 医疗模型预训练的方法、装置、电子设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180373682A1 (en) * 2017-05-19 2018-12-27 salesforce.come, inc, Natural language processing using context-specific word vectors
CN111563166A (zh) * 2020-05-28 2020-08-21 浙江学海教育科技有限公司 一种针对数学问题分类的预训练模型方法
CN111680145A (zh) * 2020-06-10 2020-09-18 北京百度网讯科技有限公司 知识表示学习方法、装置、设备以及存储介质
CN111859982A (zh) * 2020-06-19 2020-10-30 北京百度网讯科技有限公司 语言模型的训练方法、装置、电子设备及可读存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180373682A1 (en) * 2017-05-19 2018-12-27 salesforce.come, inc, Natural language processing using context-specific word vectors
CN111563166A (zh) * 2020-05-28 2020-08-21 浙江学海教育科技有限公司 一种针对数学问题分类的预训练模型方法
CN111680145A (zh) * 2020-06-10 2020-09-18 北京百度网讯科技有限公司 知识表示学习方法、装置、设备以及存储介质
CN111859982A (zh) * 2020-06-19 2020-10-30 北京百度网讯科技有限公司 语言模型的训练方法、装置、电子设备及可读存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KELVIN GUU; KENTON LEE; ZORA TUNG; PANUPONG PASUPAT; MING-WEI CHANG: "REALM: Retrieval-Augmented Language Model Pre-Training", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 10 February 2020 (2020-02-10), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081604391 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024031891A1 (zh) * 2022-08-10 2024-02-15 浙江大学 知识表征解耦的分类模型的微调方法、装置和应用
CN116245197A (zh) * 2023-02-21 2023-06-09 北京数美时代科技有限公司 一种提升语言模型的训练速率的方法、系统、介质及设备
CN116245197B (zh) * 2023-02-21 2023-11-07 北京数美时代科技有限公司 一种提升语言模型的训练速率的方法、系统、介质及设备
CN116204642A (zh) * 2023-03-06 2023-06-02 上海阅文信息技术有限公司 数字阅读中角色隐式属性智能识别分析方法、系统和应用
CN116204642B (zh) * 2023-03-06 2023-10-27 上海阅文信息技术有限公司 数字阅读中角色隐式属性智能识别分析方法、系统和应用
CN116719911A (zh) * 2023-08-10 2023-09-08 成都不烦智能科技有限责任公司 自动化流程生成方法、装置、设备及存储介质
CN116719911B (zh) * 2023-08-10 2023-10-31 成都不烦智能科技有限责任公司 自动化流程生成方法、装置、设备及存储介质
CN117312928A (zh) * 2023-11-28 2023-12-29 南京网眼信息技术有限公司 一种基于aigc识别用户设备信息的方法及系统
CN117312928B (zh) * 2023-11-28 2024-02-13 南京网眼信息技术有限公司 一种基于aigc识别用户设备信息的方法及系统
CN117933401A (zh) * 2024-03-22 2024-04-26 北京大学 基于大语言模型投机采样推理的加速器硬件及加速方法
CN117933401B (zh) * 2024-03-22 2024-06-07 北京大学 基于大语言模型投机采样推理的加速器硬件及加速方法
CN118313411A (zh) * 2024-06-07 2024-07-09 浙江实在智能科技有限公司 基于大模型检索增强生成的rpa智能体及其交互方法

Also Published As

Publication number Publication date
CN112699216A (zh) 2021-04-23

Similar Documents

Publication Publication Date Title
WO2022141878A1 (zh) 端到端的语言模型预训练方法、系统、设备及存储介质
CN111738003B (zh) 命名实体识别模型训练方法、命名实体识别方法和介质
CN113268995B (zh) 中文学术关键词抽取方法、装置和存储介质
CN111737496A (zh) 一种电力设备故障知识图谱构建方法
CN112100356A (zh) 一种基于相似性的知识库问答实体链接方法及系统
CN110020438A (zh) 基于序列识别的企业或组织中文名称实体消歧方法和装置
CN113591483A (zh) 一种基于序列标注的文档级事件论元抽取方法
CN113190656B (zh) 一种基于多标注框架与融合特征的中文命名实体抽取方法
CN111414481A (zh) 基于拼音和bert嵌入的中文语义匹配方法
CN109670050A (zh) 一种实体关系预测方法及装置
CN111145914B (zh) 一种确定肺癌临床病种库文本实体的方法及装置
Song et al. Classification of traditional chinese medicine cases based on character-level bert and deep learning
CN114398900A (zh) 一种基于RoBERTa模型的长文本语义相似度计算方法
CN114298055B (zh) 基于多级语义匹配的检索方法、装置、计算机设备和存储介质
CN113609267B (zh) 基于GCNDT-MacBERT神经网络框架的话语关系识别方法及系统
Yan et al. Implicit emotional tendency recognition based on disconnected recurrent neural networks
Jiang et al. A hierarchical model with recurrent convolutional neural networks for sequential sentence classification
CN113051886B (zh) 一种试题查重方法、装置、存储介质及设备
Cui et al. A chinese text classification method based on bert and convolutional neural network
CN114238649A (zh) 一种常识概念增强的语言模型预训练方法
Xue et al. A method of chinese tourism named entity recognition based on bblc model
Ronghui et al. Application of Improved Convolutional Neural Network in Text Classification.
CN114386425B (zh) 用于对自然语言文本内容进行处理的大数据体系建立方法
CN115630140A (zh) 一种基于文本特征融合的英语阅读材料难度判断的方法
Putra et al. Textual Entailment Technique for the Bahasa Using BiLSTM

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21912648

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21912648

Country of ref document: EP

Kind code of ref document: A1