WO2024074100A1 - Natural language processing and model training method, apparatus, device, and storage medium


Info

Publication number
WO2024074100A1
Authority
WO
WIPO (PCT)
Prior art keywords: information, query, machine learning, text, learning model
Application number
PCT/CN2023/121267
Other languages
English (en)
French (fr)
Inventor
徐蔚文
李昕
张雯轩
邴立东
司罗
Original Assignee
阿里巴巴达摩院(杭州)科技有限公司
Application filed by 阿里巴巴达摩院(杭州)科技有限公司
Publication of WO2024074100A1

Classifications

    • G (PHYSICS) > G06 (COMPUTING; CALCULATING OR COUNTING) > G06F (ELECTRIC DIGITAL DATA PROCESSING)
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor > G06F 16/30 of unstructured textual data > G06F 16/33 Querying
    • G06F 40/00 Handling natural language data > G06F 40/20 Natural language analysis > G06F 40/205 Parsing > G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/20 Natural language analysis > G06F 40/279 Recognition of textual entities > G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking > G06F 40/295 Named entity recognition
    • G06N (COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS) > G06N 20/00 Machine learning

Definitions

  • the present disclosure relates to the field of information technology, and in particular to a natural language processing, model training method, device, equipment and storage medium.
  • natural language understanding tasks can be performed by machines, but this requires a machine learning model that can process natural language. Since target task data for natural language understanding tasks is scarce, the machine learning model is usually pre-trained before being trained with a small amount of target task data.
  • the current goal of pre-training is different from that of natural language understanding tasks.
  • the current goal of pre-training is to allow the machine learning model to recover contaminated text.
  • the goal of the natural language understanding task is to solve specific problems, such as identifying named entities, completing extractive question answering, performing sentiment analysis, completing multiple-choice question answering, etc. Therefore, if the current pre-training method is used to pre-train the machine learning model, the pre-trained machine learning model cannot be directly used to process natural language understanding tasks, and it is difficult to calibrate the pre-trained machine learning model with a small amount of target task data, so the machine learning model is still inaccurate after fine-tuning.
  • the present disclosure provides a natural language processing, model training method, device, equipment and storage medium to improve the accuracy of the fine-tuned machine learning model.
  • In a first aspect, an embodiment of the present disclosure provides a model training method, including: acquiring first target information marked by a hyperlink; acquiring a first query corresponding to the first target information from a homepage article of the first target information, and acquiring at least one first context information of the first target information from at least one referenced article of the first target information; pre-training a machine learning model according to the first target information, the first query, and the at least one first context information to obtain a pre-trained machine learning model; and
  • determining second target information from the sample text provided by the natural language understanding task, generating a second query corresponding to the second target information, and using the sample text, the second query and the second target information to train the pre-trained machine learning model.
  • In a second aspect, an embodiment of the present disclosure provides a natural language processing method, including: acquiring a target text; determining query information according to the natural language understanding task corresponding to the target text; and
  • using the query information and the target text as inputs of a machine learning model, so that the machine learning model outputs an answer in the target text corresponding to the query information, wherein the machine learning model is trained according to the model training method described above.
  • an embodiment of the present disclosure provides a model training device, including:
  • a first acquisition module used to acquire first target information marked by the hyperlink
  • a second acquisition module is used to acquire a first query corresponding to the first target information from a homepage article of the first target information, and acquire at least one first context information of the first target information from at least one referenced article of the first target information;
  • a pre-training module used to pre-train the machine learning model according to the first target information, the first query, and the at least one first context information to obtain a pre-trained machine learning model
  • a fine-tuning module is used to determine the second target information from the sample text provided by the natural language understanding task, generate a second query corresponding to the second target information, and use the sample text, the second query and the second target information to train the pre-trained machine learning model.
  • an embodiment of the present disclosure provides a natural language processing device, including:
  • An acquisition module is used to acquire the target text
  • a determination module used to determine query information according to the natural language understanding task corresponding to the target text
  • An input module is used to use the query information and the target text as inputs of a machine learning model, so that the machine learning model outputs an answer in the target text corresponding to the query information, and the machine learning model is trained according to the model training method described above.
  • an embodiment of the present disclosure provides an electronic device, including: a memory and a processor, wherein
  • a computer program is stored in the memory and is configured to be executed by the processor to implement the method as described in the first aspect or the second aspect.
  • an embodiment of the present disclosure provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program is executed by a processor to implement the method described in the first aspect or the second aspect.
  • the natural language processing, model training method, apparatus, device and storage medium provided by the embodiments of the present disclosure obtain the first target information marked by the hyperlink as the answer, obtain the first query corresponding to the first target information from the homepage article of the first target information, and obtain at least one first context information of the first target information from at least one referenced article of the first target information, so that the first target information, the first query and each first context information can constitute a triple in the style of machine reading comprehension. Furthermore, the machine learning model is pre-trained according to each triple, so that the pre-trained machine learning model can seamlessly and naturally handle a variety of natural language understanding tasks under the machine reading comprehension paradigm.
  • Moreover, the data format used for model training in the pre-training phase is consistent with the data format used for model training in the fine-tuning phase, both of which include triples of answer, query, and context information, making the goals of pre-training and fine-tuning the same, so that the pre-training phase and the fine-tuning phase can be seamlessly connected.
  • the pre-trained machine learning model can be calibrated with a small amount of target task data, so that the general knowledge learned in the pre-training phase can be smoothly transferred to the fine-tuned machine learning model, and the accuracy of the fine-tuned machine learning model is guaranteed.
  • FIG1 is a schematic diagram of the differences between MLM, S2S, and MRC in the pre-training and fine-tuning stages provided by an embodiment of the present disclosure
  • FIG2 is a flow chart of a model training method provided by an embodiment of the present disclosure.
  • FIG3 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure.
  • FIG4 is a schematic diagram of home page articles and cited articles provided by an embodiment of the present disclosure.
  • FIG5 is a schematic diagram of a PMR provided by another embodiment of the present disclosure.
  • FIG6 is a flow chart of a model training method provided by another embodiment of the present disclosure.
  • FIG7 is a schematic diagram of a probability matrix provided by another embodiment of the present disclosure.
  • FIG8 is a schematic diagram of a probability matrix provided by another embodiment of the present disclosure.
  • FIG9 is a schematic diagram of a probability matrix provided by another embodiment of the present disclosure.
  • FIG10 is a flow chart of a model training method provided by another embodiment of the present disclosure.
  • FIG11 is a schematic diagram of the structure of a model training device provided in an embodiment of the present disclosure.
  • FIG12 is a schematic diagram of the structure of a natural language processing device provided in an embodiment of the present disclosure.
  • FIG. 13 is a schematic diagram of the structure of an electronic device embodiment provided by an embodiment of the present disclosure.
  • the machine learning model is usually pre-trained. For example, a machine learning model is pre-trained on a large amount of low-cost data through a certain pre-training method to obtain a pre-trained model (Pre-trained Models), so that the pre-trained model can learn the commonalities in a large amount of low-cost data and obtain general knowledge.
  • the pre-trained model is fine-tuned with a small amount of target task data, so as to transfer the general knowledge to the fine-tuned machine learning model, and the fine-tuned machine learning model can handle the target task well, such as natural language understanding tasks.
  • Natural Language Understanding (NLU) is a general term for the ideas, methods and tasks that support machine understanding of text data.
  • the current goal of pre-training is different from that of the natural language understanding task.
  • the current goal of pre-training is to allow the machine learning model to recover contaminated text.
  • the goal of the natural language understanding task is to solve specific problems, such as identifying named entities, completing extractive question answering, performing sentiment analysis, completing multiple-choice question answering, etc. Therefore, if the current pre-training method is used to pre-train the machine learning model, the pre-trained machine learning model cannot be directly used to process the natural language understanding task, and it is difficult to calibrate the pre-trained machine learning model with a small amount of target task data, resulting in the fine-tuned machine learning model still not being accurate enough.
  • Step (1) automatically replace part of the text in the input text with special characters such as [MASK], and feed the input text containing special characters into the encoder.
  • Step (2) restore the text based on the contextual text representation of the replaced text.
  • This pre-training scheme is known as the Masked Language Model (MLM).
  • the encoder can be a bidirectional encoder based on a transformer (Bidirectional Encoder Representation from Transformer, BERT) or a robustly optimized BERT (A Robustly Optimized BERT, RoBERTa).
  • the language model output layer restores "invented” based on the contextual text representation of the replaced text.
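  • As an illustration of steps (1) and (2) above, the following is a minimal sketch of MLM-style text corruption; the whitespace tokenization, the masking rate and the seed are illustrative assumptions rather than part of this disclosure.

```python
import random

MASK = "[MASK]"

def corrupt_for_mlm(tokens, mask_rate=0.15, seed=0):
    """Replace a fraction of tokens with [MASK]; the model must restore them.

    Returns the corrupted sequence and the restoration targets
    (position -> original token) used as training labels.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted.append(MASK)
            targets[i] = tok          # the encoder's output layer must recover this token
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = "someone invented silicon technology".split()
corrupted, targets = corrupt_for_mlm(tokens, mask_rate=0.5)
# with this seed: ['someone', 'invented', '[MASK]', '[MASK]'] and {2: 'silicon', 3: 'technology'}
print(corrupted, targets)
```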
  • This pre-training scheme can be pre-trained in a large-scale text corpus. Furthermore, when fine-tuning the pre-trained machine learning model, it is necessary to add an additional randomly initialized task-related module to achieve the goal of downstream task classification.
  • NER: named entity recognition; NER Layer: the named entity recognition output layer shown in Figure 1.
  • the multi-classifier is randomly initialized and task-related, so it can only be fine-tuned with the named entity recognition data. If the named entity recognition data is limited, it is difficult to obtain a good fine-tuning effect; that is, it is difficult to calibrate the pre-trained machine learning model with a small amount of target task data (such as the named entity recognition data), resulting in the fine-tuned machine learning model still not being accurate enough.
  • the fine-tuned machine learning model refers to the machine learning model obtained by calibrating the pre-trained machine learning model through a small amount of target task data, that is, retraining or fine-tuning.
  • EQA Layer: the Extractive Question Answering output layer shown in Figure 1.
  • Extractive Question Answering refers to extracting the corresponding answer from the relevant text according to the given question.
  • the question shown in Figure 1 is "Who is Xiao Zhang's father?"
  • the relevant text is "Xiao Zhang's father is Lao Zhang”
  • the answer is "Lao Zhang” and the position information of "Lao Zhang” in the relevant text.
  • "14, 15" shown in Figure 1 means that "Lao Zhang" is located at the positions of the 14th and 15th characters.
  • each word in the question and each word in the related text are sorted uniformly.
  • step (1) contaminate some text segments in the input text, for example, replace the text segments with some special characters, such as [X], [Y], and then send them to the encoder.
  • step (2) Based on the text representation output by the encoder, the text segments contaminated by special characters are restored separately through the decoder.
  • Sequence-to-Sequence (S2S) is a model paradigm for natural language processing: given a text input, the machine learning model needs to output text related to the corresponding generation task. For example, as shown in Figure 1, the text segment replaced by the special character [X] is "invented silicon", and T5 includes an encoder and a decoder.
  • This method can convert various types of downstream natural language understanding tasks into text generation tasks respectively, so there is no need to add task-related modules.
  • the output text is the recovered text segment, which still has data format differences from the natural language input and natural language output of the real downstream task.
  • the natural language input is "[spot] person [spot] location [text] someone will fly to city A" as shown in Figure 1, and the natural language output is "person: someone, location: city A”.
  • For example, if the downstream task is an extractive question-answering task, the natural language input is "Who is Xiao Zhang's father? Xiao Zhang's father is Lao Zhang", and the natural language output is "Lao Zhang".
  • the embodiments of the present disclosure provide a model training method, which includes pre-training the machine learning model using the machine reading comprehension (MRC) paradigm pre-training (MRC-style pre-training) shown in Figure 1.
  • the method also includes fine-tuning the pre-trained machine learning model.
  • the MRC paradigm is specifically a model paradigm for natural language processing, whose input includes two parts: a query and related context text (context), and the output is the position of some answers in the context text, so that the answer can satisfy the input query.
  • the query is "It is a chemical component of semiconductors, and the chemical symbol is Si”
  • the context text is "Someone invented silicon technology”
  • the answer is the position of "silicon” in the context text "24, 24", indicating that "silicon” is located at the position of the 24th word, wherein each word in the query and each word in the context text are uniformly sorted.
  • In the embodiments of the present disclosure, the input and output of the machine learning model in the fine-tuning stage are in the same data format as the input and output of the machine learning model in the pre-training stage.
  • the input of the machine learning model is the query and context text, and the output is the answer.
  • the input of the machine learning model is also the query and context text, where the query is "People?" as shown in Figure 1, the context text is "Someone will fly to City A” as shown in Figure 1, and the output of the machine learning model is the answer, that is, the position "2, 3" of "Someone” in the context text.
  • the input of the machine learning model is also the query and context text, where the query is "Who is Xiao Zhang's father?" as shown in Figure 1, the context text is "Xiao Zhang's father is Lao Zhang” as shown in Figure 1, and the output of the machine learning model is the answer, that is, the position "14, 15" of "Lao Zhang” in the context text.
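  • To make the shared format concrete, the following sketch (not part of this disclosure) expresses the examples above as (query, context, answer-span) triples; the field names are illustrative assumptions, and the span positions follow the joint numbering of query and context text described above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class MRCTriple:
    query: str                       # natural-language query
    context: str                     # context text searched for the answer
    span: Optional[Tuple[int, int]]  # (start, end) positions of the answer, or None if no answer

# Pre-training example (Figure 1): the anchor "silicon" is the answer.
pretrain = MRCTriple(
    query="It is a chemical component of semiconductors, and the chemical symbol is Si",
    context="Someone invented silicon technology",
    span=(24, 24),                   # "silicon" is the 24th unit when query and context are numbered jointly
)

# Fine-tuning examples keep exactly the same format.
ner = MRCTriple(query="People?", context="Someone will fly to City A", span=(2, 3))
eqa = MRCTriple(query="Who is Xiao Zhang's father?", context="Xiao Zhang's father is Lao Zhang", span=(14, 15))
```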
  • FIG2 is a flow chart of a model training method provided in an embodiment of the present disclosure.
  • the method can be performed by a model training device, which can be implemented in software and/or hardware, and the device can be configured in an electronic device, such as a server or a terminal, wherein the terminal specifically includes a mobile phone, a computer or a tablet computer, etc.
  • the model training method described in this embodiment can be applicable to an application scenario as shown in FIG3.
  • the application scenario includes a terminal 31 and a server 32, wherein the server 32 can use the model training method described in this embodiment to pre-train and fine-tune the machine learning model.
  • the server 32 can provide services to the terminal 31 according to the fine-tuned machine learning model.
  • the terminal 31 can send a query and context text to the server 32, and the server 32 can input the query and context text into the fine-tuned machine learning model, so that the fine-tuned machine learning model can output an answer, and further, the server 32 can feed back the answer to the terminal 31.
  • the terminal 31 can send the context text to the server 32, and the server 32 generates a query according to the specific requirements of the natural language understanding task, and inputs the query and the context text into the fine-tuned machine learning model, so that the fine-tuned machine learning model can output an answer.
  • the server 32 can also deploy the fine-tuned machine learning model on the terminal 31, so that the terminal 31 can perform the natural language understanding task through the fine-tuned machine learning model.
  • the method is described in detail below in conjunction with Figure 3. As shown in Figure 2, the specific steps of the method are as follows:
  • S201 Acquire first target information marked by a hyperlink.
  • the anchor marked by the hyperlink can be used as the answer to the machine reading comprehension.
  • the anchor marked by the hyperlink can be recorded as the first target information.
  • the hyperlink can be a hyperlink in a web page.
  • the web page can be a web page in Wikipedia.
  • S202 Acquire a first query corresponding to the first target information from a homepage article of the first target information, and acquire at least one first context information of the first target information from at least one referenced article of the first target information.
  • an anchor can link to two types of articles, one type of article is the home page article, and the other type of article is the reference article.
  • the home page article (Home Article) is used to explain the anchor in detail.
  • "Silicon” is an anchor marked by a hyperlink, and the home page article explains silicon in detail.
  • an anchor may be linked to one or more reference articles (Reference Article).
  • the "Silicon" link shown in Figure 4 has two reference articles, namely Reference Article 41 and Reference Article 42.
  • Reference Article 41 is an article introducing semiconductors
  • Reference Article 42 is an article introducing integrated circuits.
  • the query corresponding to the anchor can be obtained from the homepage article, and the query can be recorded as the first query.
  • the context information corresponding to the anchor can also be obtained from each cited article, and the context information is recorded as the first context information. At least one piece of context information can be obtained from a cited article.
  • obtaining a first query corresponding to the first target information from a homepage article of the first target information includes: taking at least one first sentence in the homepage article of the first target information as the first query corresponding to the first target information.
  • the first T sentences in the homepage article can be used as the query corresponding to the anchor, where T is greater than or equal to 1.
  • the first two sentences in the homepage article are used as the query corresponding to the anchor.
  • obtaining at least one first context information of the first target information from at least one cited article of the first target information includes: for each cited article in the at least one cited article of the first target information, determining a sentence including the first target information from the cited article; and taking at least one sentence before and after the sentence in the cited article and the sentence together as a first context information.
  • For example, for a cited article, a sentence containing the anchor (i.e., "silicon") is located, and the W sentences before and after that sentence, together with the sentence itself (i.e., 2W+1 sentences), are taken as one piece of context information, where W is greater than or equal to 1. That is, for a cited article, if there are N sentences in the cited article that contain the anchor, then N pieces of context information can be obtained from the cited article.
  • context information 1 is extracted from the cited article 41, and context information 1 corresponds to answer 1
  • context information 2 is extracted from the cited article 41, and context information 2 corresponds to answer 2.
  • (query, context information 1, answer 1) can constitute a triple
  • (query, context information 2, answer 2) can constitute another triple.
  • negative data is also data in the form of triples, and the triples also include the query, context information, and answer as described above, but the query and context information do not match, that is, the answer corresponding to the query does not exist in the context information.
  • the answer in the triple can be directly assigned a special mark such as null or 0.
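  • The following is a minimal sketch of how the positive and negative pre-training data described in S201 and S202 could be assembled: the first T sentences of the home page article serve as the query, a window of 2W+1 sentences around each anchor occurrence in a cited article serves as the context information, and a mismatched query/context pair with a null answer serves as negative data. The function names and the sentence-level data layout are illustrative assumptions.

```python
def build_triples(anchor, home_sentences, cited_articles, T=2, W=1):
    """Build MRC-style (query, context, answer) triples for one hyperlink anchor.

    home_sentences: list of sentences from the anchor's home page article.
    cited_articles: list of articles, each a list of sentences, that reference the anchor.
    """
    query = " ".join(home_sentences[:T])          # first T sentences of the home page article
    triples = []
    for sentences in cited_articles:
        for i, sent in enumerate(sentences):
            if anchor not in sent:
                continue
            lo, hi = max(0, i - W), min(len(sentences), i + W + 1)
            context = " ".join(sentences[lo:hi])  # 2W+1 sentences around the anchor occurrence
            triples.append({"query": query, "context": context, "answer": anchor})
    return triples

def make_negative(triple, unrelated_context):
    """Pair the query with an unrelated context; the answer is marked as null."""
    return {"query": triple["query"], "context": unrelated_context, "answer": None}
```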
  • S203 Pre-train the machine learning model according to the first target information, the first query and the at least one first context information to obtain a pre-trained machine learning model.
  • this embodiment provides a unified pre-trained reader (Pre-trained Machine Reader, PMR), which can be a machine learning model.
  • the pre-trained reader is pre-trained by the positive data and negative data as described above to obtain a pre-trained PMR, which can be recorded as a pre-trained machine learning model.
  • the query and context information in the triple are used as inputs of the PMR, so that the PMR outputs an answer according to the query and context information.
  • the output answer and the standard answer in the triplet are used to train the PMR, that is, the parameters of the PMR are iteratively updated.
  • the parameters of the PMR can be iteratively updated once according to each triplet. When the number of iterative updates reaches a preset number, or the parameters of the PMR tend to be stable, it can be determined that the pre-training of the PMR is completed.
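  • A high-level sketch of the iterative updating just described is given below; the pmr object and its predict_span and update methods are hypothetical names standing in for the PMR and its parameter update, and training stops after a preset number of updates as stated above.

```python
def pretrain(pmr, triples, max_updates=100_000):
    """Iteratively update the PMR parameters, one update per triple."""
    updates = 0
    for triple in triples:
        # the query and context information are the model inputs; the stored answer is the standard answer
        predicted = pmr.predict_span(triple["query"], triple["context"])
        pmr.update(predicted, triple["answer"])   # compare with the standard answer and update parameters
        updates += 1
        if updates >= max_updates:                # or stop once the parameters tend to be stable
            break
    return pmr
```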
  • S204 Determine second target information from the sample text provided by the natural language understanding task, generate a second query corresponding to the second target information, and use the sample text, the second query and the second target information to train the pre-trained machine learning model.
  • the natural language understanding task can be divided into several types.
  • the task usually provides a sample text.
  • the answer is determined from the sample text, and the answer can be used as the second target information.
  • a corresponding query can be generated, and the query can be recorded as the second query.
  • The sample text, the second query and the second target information are taken as a triple; that is, the sample text here can be regarded as the context information described above, and the triple can serve as the target task data described above.
  • the pre-trained PMR can be trained, that is, fine-tuned, corrected or calibrated through the triple.
  • the second query and the sample text are used as the input of the pre-trained PMR, so that the pre-trained PMR outputs an answer. Further, the pre-trained PMR is fine-tuned according to the answer and the second target information output by the pre-trained PMR.
  • If the natural language understanding task is a word-level extraction task with fixed task labels, such as a named entity recognition task, a query can be generated according to each task label, wherein each type of entity can correspond to a task label, and the entity types can include place names, person names, organization names, proper nouns, etc.
  • the sample text provided by the named entity recognition task is "someone will fly to city A”
  • various types of entities are pre-labeled from the sample text, for example, "someone” is a person type entity, "city A" is a place name type entity, etc.
  • the person type and the place name type can correspond to different task labels.
  • a query "please find out the entity related to the person in the following text" is generated. Further, the sample text and the query are used as inputs of the pre-trained PMR, so that the pre-trained PMR retrieves the answer corresponding to the query from the sample text. Assuming that the pre-trained PMR outputs an answer, further, the pre-trained PMR is fine-tuned according to the answer output by the pre-trained PMR and the standard answer, namely "someone". For another example, for the task label corresponding to the place name type, a query "please find out the entity related to the place name in the following text" is generated.
  • the sample text and the query are used as inputs of the pre-trained PMR, so that the pre-trained PMR retrieves the answer corresponding to the query from the sample text. Assume that the pre-trained PMR outputs an answer. Further, the pre-trained PMR is fine-tuned according to the answer output by the pre-trained PMR and the standard answer, i.e., "City A".
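  • A minimal sketch of the conversion just described for named entity recognition is given below: each fixed task label is turned into a query, and the sample text with its labeled entity serves as the context information and standard answer. The label-to-query templates follow the example queries above; the function and variable names are illustrative assumptions.

```python
LABEL_QUERIES = {
    "person": "please find out the entity related to the person in the following text",
    "place name": "please find out the entity related to the place name in the following text",
}

def ner_to_mrc(sample_text, labeled_entities):
    """Convert an NER sample into one MRC triple per task label.

    labeled_entities: mapping from label to the gold entity text, e.g.
    {"person": "someone", "place name": "city A"}.
    """
    triples = []
    for label, query in LABEL_QUERIES.items():
        triples.append({
            "query": query,
            "context": sample_text,
            "answer": labeled_entities.get(label),   # None when the label has no entity in the text
        })
    return triples

print(ner_to_mrc("someone will fly to city A",
                 {"person": "someone", "place name": "city A"}))
```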
  • If the natural language understanding task is a word-level extraction task driven by natural language questions, such as an extractive question-answering task, at least one question is generated based on the sample text as the query. For example, if the sample text is "Xiao Ming was born in country B", the questions may be "Who appeared in country B", "Where was Xiao Ming born", etc.
  • the query and the sample text are used as the input of the pre-trained PMR, so that the pre-trained PMR retrieves the answer corresponding to the query in the sample text, and further, the pre-trained PMR is fine-tuned based on the answer output by the pre-trained PMR and the standard answer, namely "Xiao Ming".
  • If the natural language understanding task is a sequence-level classification task with fixed task labels, such as a sentiment analysis task, each sentiment can correspond to a task label; for example, "positive sentiment" and "negative sentiment" correspond to different task labels, and a query is generated according to each task label.
  • the query generated for "positive sentiment” is "the text below represents positive sentiment”
  • the sample text is "Xiao Ming is very happy today”
  • the query and the sample text are used as the input of the pre-trained PMR, so that the pre-trained PMR determines whether the query and the sample text are related. If so, it means that the result of sentiment analysis is "positive sentiment”.
  • the query generated for "negative sentiment” is "the text below represents negative sentiment"
  • the query and the sample text are used as the input of the pre-trained PMR, so that the pre-trained PMR determines whether the query and the sample text are related. If not, it means that "negative sentiment" cannot be used as the result of sentiment analysis.
  • the natural language understanding task is a sequence-level classification task based on natural language questions on multiple options, such as the Multi-choice Question Answering (MCQA) task, that is, according to the given question and relevant reference information, the correct option is selected from multiple options.
  • the relevant reference information is a given passage in reading comprehension
  • the question is a question about the passage
  • the multiple options are multiple options corresponding to the question.
  • the question and an option can be used as a query together, and the query and the relevant reference information can be used as the input of the pre-trained PMR, so that the pre-trained PMR determines whether the query and the relevant reference information are related. If so, it means that the correct answer to the question is the option in the query. If not, it means that the correct answer to the question is not the option in the query, and the correct answer needs to be further determined.
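  • The two sequence-level cases above (sentiment analysis and multiple-choice question answering) can both be cast as deciding whether a constructed query is relevant to the reference text, as in the following sketch; the pmr.is_relevant predicate is a hypothetical name for the model's query-context relevance judgment.

```python
def classify_sentiment(pmr, sample_text, label_queries):
    """Return the sentiment label whose query the model judges relevant to the text."""
    # e.g. {"positive sentiment": "the text below represents positive sentiment", ...}
    for label, query in label_queries.items():
        if pmr.is_relevant(query, sample_text):
            return label
    return None

def answer_multiple_choice(pmr, question, options, reference_text):
    """Return the option whose (question + option) query the model judges relevant."""
    for option in options:
        query = f"{question} {option}"                # the question and one option form the query
        if pmr.is_relevant(query, reference_text):
            return option
    return None
```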
  • the disclosed embodiment obtains the first target information marked by the hyperlink as the answer, obtains the first query corresponding to the first target information from the homepage article of the first target information, and obtains at least one first context information of the first target information from at least one cited article of the first target information, so that the first target information, the first query and each first context information can constitute a triple of machine reading comprehension style. Further, the machine learning model is pre-trained according to each triple, so that the pre-trained machine learning model can seamlessly and naturally handle a variety of natural language understanding tasks under the machine reading comprehension paradigm.
  • The data format used for model training in the pre-training stage is consistent with the data format used for model training in the fine-tuning stage, both of which include triples of answer, query and context information.
  • the goal of pre-training and the goal of fine-tuning are the same, so that the pre-training stage and the fine-tuning stage can be seamlessly connected.
  • Since the pre-training process and the fine-tuning process are very similar, after pre-training the machine learning model with a large amount of low-cost data, the pre-trained machine learning model can be calibrated with a small amount of target task data, so that the general knowledge learned in the pre-training stage is smoothly transferred to the fine-tuned machine learning model, and the accuracy of the fine-tuned machine learning model is guaranteed.
  • the present embodiment can construct a large amount of high-quality data in a machine reading comprehension format, such as triples, so that an end-to-end pre-trained machine learning model can be used.
  • Massive data can be constructed without manual labeling, which greatly reduces labor costs, thereby reducing the cost of pre-training and improving the accuracy of the pre-trained machine learning model.
  • Since the pre-training provided by the present embodiment is a pre-training scheme based on the machine reading comprehension paradigm, the pre-trained machine learning model can be applied to various languages and even multilingual learning. Therefore, the present scheme has wide applicability, strong versatility, and strong interpretability in sequence-level tasks.
  • the present embodiment unifies the pre-training and fine-tuning stages under the machine reading comprehension paradigm, thereby eliminating the differences in training objectives and data formats between pre-training and fine-tuning, so that the general knowledge learned in the pre-training phase can be well transferred to the fine-tuned machine learning model, which improves transferability and greatly improves the performance of the pre-trained and fine-tuned machine learning model when processing natural language understanding tasks.
  • FIG6 is a flow chart of a model training method provided by another embodiment of the present disclosure.
  • the machine learning model includes an encoder and an extractor, and the output of the encoder is the input of the extractor.
  • the machine learning model is a PMR as shown in FIG5 , and the PMR includes an encoder and an extractor, and the output of the encoder is the input of the extractor.
  • Pre-training the machine learning model according to the first target information, the first query, and the at least one first context information to obtain the pre-trained machine learning model includes the following steps:
  • For each first context information in the at least one first context information, use the first query and the first context information as inputs of the encoder, so that the encoder outputs a representation vector for each text unit in the first query and a representation vector for each text unit in the first context information.
  • a certain anchor marked by a hyperlink may correspond to multiple context information
  • the anchor, the query corresponding to the anchor, and the context information corresponding to the anchor can constitute a triple.
  • the query in a certain triple is the query (Query) shown in Figure 5
  • the context information in the triple is the context information (Context) shown in Figure 5.
  • the query and the context information each consist of at least one sentence, and each sentence includes at least one text unit.
  • a text unit can be, for example, a word, a character, a phrase, a short sentence, etc. Therefore, in this embodiment, the query shown in Figure 5 can be split into multiple text units, and the context information shown in Figure 5 can be split into multiple text units.
  • the query can be split into Q text units
  • the context information can be split into C text units
  • one text unit can be recorded as a Token.
  • a special word such as [CLS] can be added in front of the query
  • a special word such as [SEP] can be added between the query and the context information
  • a special word such as [SEP] can be added after the context information.
  • [CLS], the Q text units in the query, [SEP], the C text units in the context information, and [SEP] are taken together as the input of the encoder, giving M = Q + C + 3 input words in total.
  • the encoder can represent the M input words in a vector space, so that the encoder can output the representation vector corresponding to each of the M input words.
  • H_1, H_2, ..., H_{N-1}, H_N, H_{N+1}, ..., H_{M-1}, H_M shown in FIG5 are the representation vectors of the M input words in sequence.
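  • A minimal sketch of assembling the encoder input described above is given below; the [CLS]/[SEP] placement and the joint 1-based numbering follow this description, while splitting on whitespace is an illustrative simplification.

```python
def build_encoder_input(query, context):
    """Concatenate [CLS], the Q query units, [SEP], the C context units, [SEP].

    Returns the M = Q + C + 3 input units and the 1-based index range occupied by
    the context information, i.e. N+1 .. M-1 with N = Q + 2 (subtract 1 for Python indexing).
    """
    q_units = query.split()
    c_units = context.split()
    units = ["[CLS]"] + q_units + ["[SEP]"] + c_units + ["[SEP]"]
    N = len(q_units) + 2                     # 1-based index of the [SEP] after the query
    M = len(units)
    return units, (N + 1, M - 1)

units, (ctx_start, ctx_end) = build_encoder_input(
    "Who is Xiao Zhang's father?", "Xiao Zhang's father is Lao Zhang")
# The encoder then maps each of the M units to a representation vector H_1 .. H_M.
```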
  • S602 Calculate, by the extractor, the probability that each text segment in the first context information is an answer corresponding to the first query, wherein each text segment is composed of at least one continuous text unit in the first context information.
  • the extractor can calculate a probability value given the representation vectors of any two input words, and the probability value represents the probability that the text segment determined by the two input words is the answer corresponding to the query, and the text segment is at least one continuous input word extracted from the M input words, starting with the first input word of the two input words and ending with the second input word of the two input words.
  • For example, S_{1,3} represents the probability that the text segment composed of the input word corresponding to H_1, the input word corresponding to H_2, and the input word corresponding to H_3 is the answer corresponding to the query.
  • there are a total of M input words so a total of M*M probabilities can be obtained, thereby obtaining the probability matrix 51 shown in FIG5 .
  • the probability matrix 51 includes the probability that a text segment consisting of any number of continuous text units intercepted from the C text units is the answer corresponding to the query, that is, the probability matrix 51 includes the probability that each text segment in the context information is the answer corresponding to the query, and each text segment in the context information is respectively composed of at least one continuous text unit in the context information.
  • the extractor calculates the probability that each text segment in the first context information is the answer corresponding to the first query, including: calculating the probability that a text segment consisting of at least one text unit continuous from the i-th text unit to the j-th text unit in the first context information is the answer corresponding to the first query, where j is greater than or equal to i, and the probability is calculated based on the representation vector of the i-th text unit and the representation vector of the j-th text unit.
  • [CLS], Q text units in the query, [SEP], C text units in the context information, and [SEP] are uniformly ordered, that is, assume that [CLS] is the first text unit, and so on, the last [SEP] is the Mth text unit.
  • the index corresponding to [CLS] in the encoder input is 1
  • the index corresponding to the first text unit in the query in the encoder input is 2
  • the index corresponding to the first text unit in the context information in the encoder input is N+1
  • the index corresponding to the last [SEP] is M.
  • the probability matrix 51 shown in FIG5 can be represented in detail as the probability matrix 71 shown in FIG7.
  • the probability matrix 71 includes a probability matrix 72; assuming that i is greater than or equal to N+1, j is greater than or equal to i, and j is less than or equal to M-1, then any S_{i,j} in the probability matrix 72 represents the probability that the text segment consisting of at least one text unit continuous from the i-th text unit to the j-th text unit is the answer corresponding to the query.
  • S_{i,j} is calculated based on the representation vector of the i-th text unit and the representation vector of the j-th text unit.
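  • A minimal sketch of how the extractor could compute the probability matrix from the representation vectors is given below; the description only states that S_{i,j} is computed from the representation vectors of the i-th and j-th text units, so the concatenation-plus-linear scoring and the sigmoid used here are assumptions.

```python
import numpy as np

def span_score_matrix(H, weights, bias=0.0):
    """Compute an M x M matrix where entry (i, j) scores the span from unit i to unit j.

    H: array of shape (M, d) holding the representation vectors H_1 .. H_M.
    weights: array of shape (2 * d,) for a simple linear scorer over [H_i; H_j] (an assumed form).
    Scores are squashed to probabilities with a sigmoid.
    """
    M, _ = H.shape
    S = np.zeros((M, M))
    for i in range(M):
        for j in range(M):
            pair = np.concatenate([H[i], H[j]])       # representation of the candidate span (i, j)
            S[i, j] = 1.0 / (1.0 + np.exp(-(pair @ weights + bias)))
    return S

# Only entries with N+1 <= i <= j <= M-1 in the 1-based numbering correspond to text
# segments inside the context information; S[0, 0] plays the role of S_{1,1}.
```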
  • the first target information is answer 2 as shown in FIG. 4
  • the query shown in FIG. 5 is the query shown in FIG. 4
  • the context information shown in FIG. 5 is context information 2 as shown in FIG. 4 .
  • the position information of answer 2 in context information 2 can be determined in advance, so that the index corresponding to answer 2 in the encoder input can be determined according to the position information of answer 2 in context information 2.
  • the probability matrix 71 can be calculated according to the method described above, and the probability matrix 72 can be determined from the probability matrix 71.
  • a standard matrix can be generated according to the position information of answer 2 in context information 2.
  • the size of the standard matrix is the same as the size of probability matrix 72.
  • the standard matrix is, for example, probability matrix 81 as shown in FIG8.
  • S'_{M-1,M-1} in probability matrix 81 can be set to 1, and the other probability values in the probability matrix 81 are set to 0.
  • the PMR is pre-trained according to the difference between the probability matrix 72 and the probability matrix 81.
  • the loss function value is calculated according to the probability matrix 72 and the probability matrix 81, and further, the PMR is pre-trained according to the loss function value.
  • Optionally, before pre-training the machine learning model according to the probability and the position information of the first target information in the first context information, the method also includes: calculating the correlation between the first query and the first context information through the extractor, wherein the correlation is calculated based on the representation vector of the first query and the first context information as a whole. Pre-training the machine learning model according to the probability and the position information of the first target information in the first context information then includes: pre-training the machine learning model according to the correlation, the probability, and the position information of the first target information in the first context information.
  • H1 corresponds to the special word [CLS]
  • a loss function can be constructed according to S_{1,1} in the probability matrix 71, the probability matrix 72 in the probability matrix 71, S'_{1,1}, and the probability matrix 81 shown in FIG8.
  • the loss function includes the difference between S_{1,1} and S'_{1,1}, and the difference between the probability matrix 72 and the probability matrix 81.
  • the PMR is pre-trained according to the value of the loss function.
  • a standard matrix of the same size as the probability matrix 71 can also be constructed, such as the probability matrix 91 shown in FIG9, in which S'_{1,1} is equal to 1, S'_{M-1,M-1} is equal to 1, and the other elements are 0.
  • a loss function is constructed according to the difference between the standard matrix and the probability matrix 71, and the PMR is pre-trained according to the loss function value.
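  • A minimal sketch of the training signal described above is given below: a standard (label) matrix of the same size as the probability matrix is built from the answer position, with the answer cell set to 1, optionally the cell S'_{1,1} set to 1 when the query and the context information match, and all other cells set to 0; the element-wise binary cross-entropy used here as the difference measure is an assumption.

```python
import numpy as np

def standard_matrix(M, answer_span, relevant=True):
    """Build the M x M label matrix (0-based indices).

    answer_span: (i, j) 0-based indices of the answer inside the encoder input, or None.
    relevant: whether the query matches the context; if so the [CLS] cell is set to 1.
    """
    Y = np.zeros((M, M))
    if relevant:
        Y[0, 0] = 1.0                   # corresponds to S'_{1,1}
    if answer_span is not None:
        i, j = answer_span
        Y[i, j] = 1.0                   # the answer cell, e.g. S'_{M-1,M-1} in FIG8/FIG9
    return Y

def bce_loss(S, Y, eps=1e-9):
    """Element-wise binary cross-entropy between probability matrix S and label matrix Y."""
    S = np.clip(S, eps, 1.0 - eps)
    return float(np.mean(-(Y * np.log(S) + (1.0 - Y) * np.log(1.0 - S))))
```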
  • the query can be fixed, and there may be multiple context information. Then, when the query and each context information are used as an input of the PMR, the PMR can be pre-trained once. In the process of constantly changing the context information, the PMR can be pre-trained multiple times. In addition, when the answer is changed, the corresponding query and context information will also be changed accordingly, so that the PMR can be pre-trained multiple times. In other words, each triple can pre-train the PMR once. Since the present embodiment can construct a large amount of machine reading comprehension style data, namely triples, the PMR can be fully pre-trained.
  • the pre-trained PMR can also be fine-tuned by using S204 as described above.
  • the data used for model training is still triples, but the triples at this time are data for a natural language understanding task.
  • Although the source of the triples used in the fine-tuning process is different from the source of the triples used in the pre-training process, the data format is similar.
  • the principle of the fine-tuning process is consistent with the principle of the pre-training process; that is, the fine-tuning process can use the difference between the probability matrix 72 and the probability matrix 81 as described above to fine-tune the pre-trained PMR, or use the difference between S_{1,1} and S'_{1,1} as described above together with the difference between the probability matrix 72 and the probability matrix 81 to fine-tune the pre-trained PMR, or use the difference between the probability matrix 71 and the probability matrix 91 as described above to fine-tune the pre-trained PMR; the specific process will not be repeated here.
  • This embodiment uses a unified extractor to handle various natural language understanding tasks, and the extractor maintains the same training goal during pre-training and fine-tuning, thereby eliminating the difference in training goals between pre-training and fine-tuning. Since the pre-training process provided by this embodiment is based on the discriminative goal of machine reading comprehension, compared with the traditional generative pre-training goal, it can significantly improve the pre-training efficiency and reduce the hardware overhead required for pre-training.
  • this embodiment can accurately fine-tune the pre-trained machine learning model through a small amount of target task data for a certain natural language understanding task (such as the triples used for fine-tuning) without any adjustment, so that the fine-tuned machine learning model can accurately handle the natural language understanding task.
  • the disclosed embodiment converts four forms of natural language understanding tasks (e.g., named entity recognition tasks, extractive question-answering tasks, sentiment analysis tasks, and multiple-choice question-answering tasks) into machine reading comprehension paradigms, so that the pre-trained machine learning model, such as the pre-trained PMR, can be directly and seamlessly fine-tuned on the triples formed by these tasks.
  • the present embodiment provides a unified framework for solving downstream natural language understanding tasks, such as PMR, so that only one machine learning model needs to be maintained to solve various tasks, with high storage efficiency in actual scenarios, and strong mobility and versatility of the machine learning model.
  • the embodiments described above mainly describe how to pre-train and fine-tune a machine learning model such as PMR, wherein both pre-training and fine-tuning belong to the training phase.
  • After the training phase, the fine-tuned PMR (i.e., the trained machine learning model) can be used in the reasoning phase.
  • the training phase and the reasoning phase can be performed in the same device or in different devices.
  • both the training phase and the reasoning phase can be completed on the server 32 as shown in FIG3 .
  • the training phase is completed on the server 32, and the trained machine learning model is further transplanted to other devices, thereby implementing the reasoning phase on other devices.
  • the reasoning phase is introduced below in conjunction with FIG10 .
  • FIG10 is a flow chart of a natural language processing method provided by another embodiment of the present disclosure.
  • the specific steps of the method are as follows:
  • S1001 Acquire a target text. For example, the reasoning stage is executed on the server 32 as shown in FIG3.
  • the terminal 31 may send a target text to the server 32 .
  • S1002 Determine query information according to the natural language understanding task corresponding to the target text.
  • the user inputs the target text and query information at the terminal 31.
  • the target text is "someone will fly to city A”
  • the query information is "city name”.
  • the server 32 receives the target text and the query information
  • the query information input by the user can be used as the query information corresponding to the natural language understanding task.
  • the server 32 can search for the entity corresponding to the "city name", i.e., "city A”, in the target text according to the query information specified by the user, and feed back "city A" to the terminal 31, i.e., "city A” is the answer corresponding to the query information.
  • the user inputs the target text on the terminal 31, but does not input the query information.
  • If the server 32 receives the target text and the natural language understanding task is a named entity recognition task, query information for each entity type among all known entity types can be generated.
  • the known entity types include “city name”, “personal name”, “historical places”, etc.
  • the query information generated by the server 32 is "please find the entity related to the city name below", “please find the entity related to the person name below", “please find the entity related to the historical places below”.
  • the server 32 can sequentially query the entities corresponding to "city name”, "personal name”, “historical places”, and feedback the entities corresponding to "city name”, "personal name”, “historical places”, etc. to the terminal 31.
  • the server 32 can use the query information and the target text as inputs to the machine learning model that has been pre-trained and fine-tuned as described above.
  • the machine learning model that has been pre-trained and fine-tuned is the PMR shown in FIG5 .
  • the PMR processes the query information and the target text according to the logic shown in FIG5 , obtains a matrix similar to the probability matrix 51, obtains the maximum probability value from the matrix, and uses the text segment in the target text corresponding to the maximum probability value as the answer corresponding to the query information, and outputs the answer.
  • taking the query information and the target text as inputs of a machine learning model so that the machine learning model outputs an answer in the target text corresponding to the query information including: taking the query information and the target text as inputs of a machine learning model so that the machine learning model determines whether the query information and the target text are related; if the query information and the target text are related, outputting the answer in the target text corresponding to the query information through the machine learning model.
  • the query information described in this embodiment corresponds to the query shown in FIG5
  • the target text described in this embodiment corresponds to the context information shown in FIG5 .
  • After adding a special word [CLS] in front of the query information, adding a special word [SEP] between the query information and the target text, and adding the special word [SEP] after the target text, [CLS], the query information, [SEP], the target text, and [SEP] are taken together as inputs of the encoder.
  • a matrix similar to the probability matrix 51 is obtained.
  • the element of the first row and the first column is first extracted from the matrix.
  • the element is similar to S_{1,1} as shown in FIG5.
  • Based on this element, it is determined whether the query information and the target text are related. If they are related, it can be further determined whether a text segment needs to be output based on the probability corresponding to each text segment of the target text in the matrix. For example, if the probability is greater than or equal to a preset threshold, it can be determined that the text segment is the answer corresponding to the query information, and the answer is output.
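  • A minimal sketch of the inference-time decoding just described is given below: the relevance cell is checked first, and then every text segment whose probability reaches a preset threshold is output as an answer; the indices here are 0-based and the threshold value is an assumption.

```python
import numpy as np

def decode_answers(S, units, ctx_range, threshold=0.5):
    """Extract answers from the probability matrix produced for one (query, target text) pair.

    S: M x M probability matrix (0-based); S[0, 0] is the query/text relevance score.
    units: the M encoder input units; ctx_range: 0-based (start, end) index range of the target text.
    """
    if S[0, 0] < threshold:              # query and target text judged unrelated
        return []
    start, end = ctx_range
    answers = []
    for i in range(start, end + 1):
        for j in range(i, end + 1):
            if S[i, j] >= threshold:     # the span from unit i to unit j is an answer
                answers.append(" ".join(units[i:j + 1]))
    return answers
```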
  • This embodiment seamlessly connects the pre-training stage and the fine-tuning stage. After pre-training the machine learning model with a large amount of low-cost data, the pre-trained machine learning model can be calibrated with a small amount of target task data, so that the fine-tuned machine learning model can accurately handle various natural language understanding tasks.
  • FIG11 is a schematic diagram of the structure of a model training device provided in an embodiment of the present disclosure.
  • the model training device provided in an embodiment of the present disclosure can execute the processing flow provided in an embodiment of the model training method.
  • the model training device 110 includes:
  • a first acquisition module 111 is used to acquire first target information marked by a hyperlink
  • a second acquisition module 112 is configured to acquire a first query corresponding to the first target information from a homepage article of the first target information, and acquire at least one first context information of the first target information from at least one referenced article of the first target information;
  • a pre-training module 113 configured to pre-train a machine learning model according to the first target information, the first query, and the at least one first context information to obtain a pre-trained machine learning model;
  • the fine-tuning module 114 is used to determine the second target information from the sample text provided by the natural language understanding task, generate a second query corresponding to the second target information, and use the sample text, the second query and the second target information to train the pre-trained machine learning model.
  • the machine learning model includes an encoder and an extractor, and the output of the encoder is the input of the extractor;
  • the pre-training module 113 includes: an input unit 1131, a calculation unit 1132, and a pre-training unit 1133, wherein the input unit 1131 is used to take the first query and the first context information as the input of the encoder for each first context information in the at least one first context information, so that the encoder outputs the representation vector of each text unit in the first query and the representation vector of each text unit in the first context information;
  • the calculation unit 1132 is used to calculate the probability of each text segment in the first context information being the answer corresponding to the first query through the extractor, and each text segment is composed of at least one continuous text unit in the first context information;
  • the pre-training unit 1133 is used to pre-train the machine learning model according to the probability and the position information of the first target information in the first context information to obtain a pre-trained machine learning model.
  • When the calculation unit 1132 calculates the probability of each text segment in the first context information being the answer corresponding to the first query through the extractor, it is specifically configured to:
  • the extractor calculates the probability that a text segment consisting of at least one text unit continuous from the i-th text unit to the j-th text unit is the answer corresponding to the first query, where j is greater than or equal to i, i is greater than or equal to N+1, N+1 is the index corresponding to the first text unit in the first context information in the input of the encoder, and the probability is calculated based on the representation vector of the i-th text unit and the representation vector of the j-th text unit.
  • the calculation unit 1132 is further used to: before the pre-training unit 1133 pre-trains the machine learning model according to the probability and the position information of the first target information in the first context information, calculate the correlation between the first query and the first context information through the extractor, wherein the correlation is calculated based on the representation vector of the first query and the first context information as a whole.
  • the pre-training unit 1133 is used to pre-train the machine learning model according to the relevance, the probability and the position information of the first target information in the first context information.
  • When the second acquisition module 112 acquires the first query corresponding to the first target information from the homepage article of the first target information, it is specifically used to: take at least one first sentence in the homepage article of the first target information as the first query corresponding to the first target information.
  • When the second acquisition module 112 acquires at least one first context information of the first target information from at least one cited article of the first target information, it is specifically configured to: for each cited article in the at least one cited article of the first target information, determine a sentence including the first target information from the cited article; and take at least one sentence before and after the sentence in the cited article, together with the sentence, as a first context information.
  • the model training device of the embodiment shown in FIG11 can be used to execute the technical solution of the above-mentioned method embodiment. Its implementation principle and technical effects are similar and will not be described in detail here.
  • FIG12 is a schematic diagram of the structure of a natural language processing device provided in an embodiment of the present disclosure.
  • the natural language processing device provided in an embodiment of the present disclosure can execute the processing flow provided in an embodiment of the natural language processing method.
  • the natural language processing device 120 includes:
  • An acquisition module 121 is used to acquire a target text
  • a determination module 122 configured to determine query information according to a natural language understanding task corresponding to the target text
  • the input module 123 is used to use the query information and the target text as inputs of a machine learning model, so that the machine learning model outputs an answer in the target text corresponding to the query information.
  • the machine learning model is trained according to the model training method described above.
  • when the input module 123 uses the query information and the target text as inputs of the machine learning model so that the machine learning model outputs the answer in the target text corresponding to the query information, it is specifically configured to:
  • use the query information and the target text as inputs of the machine learning model, so that the machine learning model judges whether the query information and the target text are relevant; and
  • if the query information and the target text are relevant, output, through the machine learning model, the answer in the target text corresponding to the query information.
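As a hedged illustration of this two-step use of the model (first reading the relevance entry of the probability matrix, then extracting the best-scoring span from the context region), the sketch below shows one possible post-processing routine; the threshold value, the 0-based indexing, and the helper name are assumptions rather than part of the patent.

```python
from typing import Optional
import torch

def answer_from_probs(probs: torch.Tensor, tokens: list[str],
                      context_start: int,
                      threshold: float = 0.5) -> Optional[str]:
    """probs is the (M, M) span-probability matrix produced by the extractor.

    tokens holds the M encoder input tokens ([CLS], query, [SEP], context,
    [SEP]); context_start is the 0-based index of the first context token.
    """
    # Step 1: relevance check on the [CLS] entry.
    if probs[0, 0] < threshold:
        return None  # the query and the target text are judged unrelated

    # Step 2: consider only spans lying entirely inside the context
    # (context_start <= i <= j < M - 1, skipping the final [SEP]),
    # and return the highest-scoring one.
    m = probs.size(0)
    mask = torch.zeros_like(probs, dtype=torch.bool)
    for i in range(context_start, m - 1):
        mask[i, i:m - 1] = True
    masked = probs.masked_fill(~mask, float("-inf"))
    flat = int(masked.argmax())
    i, j = divmod(flat, m)
    return " ".join(tokens[i:j + 1])
```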
  • The natural language processing device of the embodiment shown in FIG. 12 can be used to execute the technical solution of the above method embodiments. Its implementation principle and technical effects are similar and will not be repeated here.
  • FIG. 13 is a schematic diagram of the structure of an electronic device embodiment provided by an embodiment of the present disclosure. As shown in FIG. 13, the electronic device includes a memory 131 and a processor 132.
  • The memory 131 is used to store programs. In addition to the above programs, the memory 131 may also be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phone book data, messages, pictures, videos, and the like.
  • The memory 131 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disk.
  • The processor 132 is coupled to the memory 131 and executes the program stored in the memory 131 to:
  • acquire first target information marked by a hyperlink;
  • acquire, from the homepage article of the first target information, a first query corresponding to the first target information, and acquire, from at least one cited article of the first target information, at least one first context information of the first target information;
  • pre-train a machine learning model according to the first target information, the first query and the at least one first context information to obtain a pre-trained machine learning model; and
  • determine second target information from sample text provided by a natural language understanding task, generate a second query corresponding to the second target information, and train the pre-trained machine learning model using the sample text, the second query and the second target information.
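As one concrete illustration of this fine-tuning step (here for a named entity recognition task), a sample text and its gold entity spans could be recast into the same query-plus-context format used during pre-training, roughly as in the hypothetical sketch below; the label set, the query wording, and the data structures are assumptions, not the patent's specification.

```python
from dataclasses import dataclass, field

@dataclass
class FineTuneExample:
    query: str                       # generated from the task label
    context: str                     # the sample text provided by the NLU task
    answer_spans: list[tuple[int, int]] = field(default_factory=list)

# Hypothetical label-to-query wording; the actual phrasing is task-specific.
LABEL_QUERIES = {
    "PERSON":   "Please find the entities related to persons in the following text.",
    "LOCATION": "Please find the entities related to locations in the following text.",
}

def ner_to_mrc(sample_text: str,
               gold_spans: dict[str, list[tuple[int, int]]]) -> list[FineTuneExample]:
    # One (query, context, answer) example per task label; labels without any
    # gold span yield negative examples with an empty answer list, so the data
    # format stays identical to that used during pre-training.
    return [FineTuneExample(query, sample_text, gold_spans.get(label, []))
            for label, query in LABEL_QUERIES.items()]
```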
  • Alternatively, the processor 132 may also be configured to:
  • acquire a target text;
  • determine query information according to the natural language understanding task corresponding to the target text; and
  • use the query information and the target text as inputs of a machine learning model, so that the machine learning model outputs the answer in the target text corresponding to the query information, where the machine learning model is trained according to the model training method described above.
  • Further, as shown in FIG. 13, the electronic device may also include other components such as a communication component 133, a power component 134, an audio component 135, and a display 136.
  • FIG. 13 schematically shows only some of the components, which does not mean that the electronic device includes only the components shown in FIG. 13.
  • The communication component 133 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 133 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 133 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • The power component 134 provides power to the various components of the electronic device. The power component 134 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device.
  • The audio component 135 is configured to output and/or input audio signals. For example, the audio component 135 includes a microphone (MIC), which is configured to receive an external audio signal when the electronic device is in an operation mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signal may be further stored in the memory 131 or transmitted via the communication component 133. In some embodiments, the audio component 135 further includes a speaker for outputting audio signals.
  • The display 136 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation.
  • In addition, an embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored, and the computer program is executed by a processor to implement the model training method or the natural language processing method described in the above embodiments.

Abstract

The present disclosure relates to a natural language processing method, a model training method, an apparatus, a device, and a storage medium. In the present disclosure, a machine learning model is pre-trained with each triple, so that the pre-trained machine learning model can seamlessly and naturally handle a wide variety of natural language understanding tasks under the machine reading comprehension paradigm. In addition, since the data format used for model training in the pre-training stage is consistent with the data format used for model training in the fine-tuning stage, the objective of pre-training is the same as the objective of fine-tuning, so that the pre-training stage and the fine-tuning stage can be connected seamlessly. After the model is pre-trained on a large amount of low-cost data, the pre-trained machine learning model can be calibrated with only a small amount of target task data, so that the general knowledge learned in the pre-training stage is smoothly transferred to the fine-tuned model, and the accuracy of the fine-tuned model is guaranteed.

Description

自然语言处理、模型训练方法、装置、设备及存储介质
本申请要求于2022年10月04日提交中国专利局、申请号为202211218353.X、申请名称为“自然语言处理、模型训练方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本公开涉及信息技术领域,尤其涉及一种自然语言处理、模型训练方法、装置、设备及存储介质。
背景技术
目前的自然语言理解任务可以由机器来执行,但是,需要具备能够对自然语言进行处理的机器学习模型。由于针对自然语言理解任务的目标任务数据较少,因此,在采用少量的目标任务数据对该机器学习模型进行训练之前,通常会对该机器学习模型进行预训练。
但是,目前预训练的目标和自然语言理解任务的目标不同,例如,目前预训练的目标是让机器学习模型恢复出被污染的文本,而自然语言理解任务的目标是解决具体的问题,例如,识别命名实体、完成抽取式问答、情感分析、完成多选式问答等。因此,若采用目前的预训练方法对该机器学习模型进行预训练,将会导致经过预训练后的机器学习模型无法用于处理自然语言理解任务,并且难以通过少量的目标任务数据校准预训练后的机器学习模型,从而导致微调后的机器学习模型依然不够精准。
发明内容
为了解决上述技术问题或者至少部分地解决上述技术问题,本公开提供了一种自然语言处理、模型训练方法、装置、设备及存储介质,以提高微调后的机器学习模型的准确性。
第一方面,本公开实施例提供一种自然语言处理方法,包括:
获取超链接所标记的第一目标信息;
从所述第一目标信息的主页文章中获取所述第一目标信息对应的第一查询,从所述第一目标信息的至少一个引用文章中获取所述第一目标信息的至少一个第一上下文信息;
根据所述第一目标信息、所述第一查询、所述至少一个第一上下文信息对机器学习模型进行预训练,得到预训练后的机器学习模型;
从自然语言理解任务提供的样本文本中确定出第二目标信息,生成所述第二目标信息对应的第二查询,并采用所述样本文本、所述第二查询和所述第二目标信息对所述预训练后的机器学习模型进行训练。
第二方面,本公开实施例提供一种模型训练方法,包括:
获取目标文本;
根据所述目标文本对应的自然语言理解任务,确定查询信息;
将所述查询信息和所述目标文本作为机器学习模型的输入,使得所述机器学习模型输出所述目标文本中与所述查询信息对应的答案,所述机器学习模型是根据如上所述的模型训练方法训练得到的。
第三方面,本公开实施例提供一种模型训练装置,包括:
第一获取模块,用于获取超链接所标记的第一目标信息;
第二获取模块,用于从所述第一目标信息的主页文章中获取所述第一目标信息对应的第一查询,从所述第一目标信息的至少一个引用文章中获取所述第一目标信息的至少一个第一上下文信息;
预训练模块,用于根据所述第一目标信息、所述第一查询、所述至少一个第一上下文信息对机器学习模型进行预训练,得到预训练后的机器学习模型;
微调模块,用于从自然语言理解任务提供的样本文本中确定出第二目标信息,生成所述第二目标信息对应的第二查询,并采用所述样本文本、所述第二查询和所述第二目标信息对所述预训练后的机器学习模型进行训练。
第四方面,本公开实施例提供一种自然语言处理装置,包括:
获取模块,用于获取目标文本;
确定模块,用于根据所述目标文本对应的自然语言理解任务,确定查询信息;
输入模块,用于将所述查询信息和所述目标文本作为机器学习模型的输入,使得所述机器学习模型输出所述目标文本中与所述查询信息对应的答案,所述机器学习模型是根据如上所述的模型训练方法训练得到的。
第五方面,本公开实施例提供一种电子设备,包括:
存储器;
处理器;以及
计算机程序;
其中,所述计算机程序存储在所述存储器中,并被配置为由所述处理器执行以实现如第一方面或第二方面所述的方法。
第六方面,本公开实施例提供一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行以实现第一方面或第二方面所述的方法。
本公开实施例提供的自然语言处理、模型训练方法、装置、设备及存储介质,通过获取超链接所标记的第一目标信息作为答案,并从所述第一目标信息的主页文章中获取所述第一目标信息对应的第一查询,从所述第一目标信息的至少一个引用文章中获取所述第一目标信息的至少一个第一上下文信息,使得所述第一目标信息、所述第一查询和每个第一上下文信息可以构成一个机器阅读理解风格的三元组。进一步,根据每个三元组对机器学习模型进行预训练,使得预训练后的机器学习模型能够无缝自然的在机器阅读理解范式下处理各式各样的自然语言理解任务。另外,由于预训练阶段中用于做模型训练的数据格式 和微调阶段中用于做模型训练的数据格式一致,都是包括答案、查询和上下文信息的三元组,使得预训练的目标和微调的目标相同,从而使得预训练阶段和微调阶段之间可以进行无缝的衔接。由于预训练过程和微调过程极为相似,因此,在采用大量低成本数据对机器学习模型进行预训练之后,通过少量的目标任务数据即可校准预训练后的机器学习模型,从而使得预训练阶段中学习到的通用知识顺利的迁移到微调后的机器学习模型中,并保证了微调后的机器学习模型的准确性。
附图说明
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本公开的实施例,并与说明书一起用于解释本公开的原理。
为了更清楚地说明本公开实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,对于本领域普通技术人员而言,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。
图1为本公开实施例提供的MLM、S2S、MRC在预训练和微调阶段差异的示意图;
图2为本公开实施例提供的模型训练方法流程图;
图3为本公开实施例提供的应用场景的示意图;
图4为本公开实施例提供的主页文章、引用文章的示意图;
图5为本公开另一实施例提供的PMR的示意图;
图6为本公开另一实施例提供的模型训练方法流程图;
图7为本公开另一实施例提供的概率矩阵的示意图;
图8为本公开另一实施例提供的概率矩阵的示意图;
图9为本公开另一实施例提供的概率矩阵的示意图;
图10为本公开另一实施例提供的模型训练方法流程图;
图11为本公开实施例提供的模型训练装置的结构示意图;
图12为本公开实施例提供的自然语言处理装置的结构示意图;
图13为本公开实施例提供的电子设备实施例的结构示意图。
具体实施方式
为了能够更清楚地理解本公开的上述目的、特征和优点,下面将对本公开的方案进行进一步描述。需要说明的是,在不冲突的情况下,本公开的实施例及实施例中的特征可以相互组合。
在下面的描述中阐述了很多具体细节以便于充分理解本公开,但本公开还可以采用其他不同于在此描述的方式来实施;显然,说明书中的实施例只是本公开的一部分实施例,而不是全部的实施例。
通常情况下,由于目标任务数据少,如果直接用少量的目标任务数据去训练机器学习模型,将导致训练后的机器学习模型的效果不好即不够精准,所以在采用少量的目标任务 数据对该机器学习模型进行训练之前,通常会对该机器学习模型进行预训练。例如,通过某种预训练方法在大量的低成本获得的数据中对机器学习模型进行预训练,得到预训练模型(Pre-trained Models),使得预训练模型可以学习到大量低成本数据中的共性,并获得通用知识。进一步,通过少量的目标任务数据对预训练模型进行微调,从而将通用知识迁移到微调后的机器学习模型中,并使得微调后的机器学习模型可以很好的处理目标任务,例如自然语言理解任务。其中,自然语言理解(Natural Language Understanding,NLU)是支持机器理解文本数据的思想、方法和任务的统称。
但是,目前预训练的目标和自然语言理解任务的目标不同,例如,目前预训练的目标是让机器学习模型恢复出被污染的文本,而自然语言理解任务的目标是解决具体的问题,例如,识别命名实体、完成抽取式问答、情感分析、完成多选式问答等。因此,若采用目前的预训练方法对该机器学习模型进行预训练,将会导致经过预训练后的机器学习模型无法用于处理自然语言理解任务,并且难以通过少量的目标任务数据校准预训练后的机器学习模型,从而导致微调后的机器学习模型依然不够精准。尽管可以对预训练后的机器学习模型进行一些调整,使得调整后的机器学习模型能够处理自然语言理解任务,但是,这些调整会导致训练目标以及数据格式的变化,造成预训练和微调之间极大的差异,从而影响通用知识往下游任务的迁移。并且少量的目标任务数据也不足以调整预训练后的机器学习模型,以便消除预训练和微调之间的差异。
例如,基于掩码语言范式的预训练(MLM-style Pre-training),其预训练过程分为两个步骤,步骤(1):自动将输入文本中的部分文本替换为特殊字符例如[MASK],将包含有特殊字符的输入文本送入编码器。步骤(2):基于被替换文本的上下文文本表示,恢复出该文本。掩码语言模型(Masked Language Model,MLM):是一类自然语言处理的模型范式,具体的,机器学习模型需要恢复输入中被污染的一些词。由于机器学习模型可以看到整个句子,因此机器学习模型可以基于被污染词的上下文来恢复该词。如图1所示,输入文本是“某某发明了硅技术”,其中,“发明了”被替换为[MASK]。该编码器可以是基于转换器(Transformer)的双向编码器(Bidirectional Encoder Representation from Transformer,BERT)或者是强力优化的BERT(A Robustly Optimized BERT,RoBERTa)。语言模型输出层(LM Head)基于被替换文本的上下文文本表示,恢复出“发明了”。该预训练方案可在大规模的文本语料之中进行预训练。进一步,对预训练后的机器学习模型进行微调时,需要额外加一个随机初始化的任务相关的模块来实现下游任务分类的目标。例如,针对命名实体识别(Named Entity Recognition,NER)任务,需要在预训练后的机器学习模型输出的每个词向量上添加一个多分类器例如图1所示的命名实体识别输出层(NER Layer),来判断该词属于哪一种实体类别(或者不属于实体)。该多分类器是随机初始化的并且是任务相关的,因此只能用该命名实体识别的数据进行微调。若该命名实体识别的数据较少,将难以获得较好的微调效果,即难以通过少量的目标任务数据(例如该命名实体识别的数据)校准预训练后的机器学习模型,从而导致微调后的机器学习模型依然不够 精准,并且容易过拟合。其中,微调后的机器学习模型是指通过少量的目标任务数据对预训练后的机器学习模型进行校准即再次训练或微调后得到的机器学习模型。再例如,针对抽取式问答(Extractive Question Answering,EQA)任务,需要添加一个抽取式问答输出层(EQA Layer)。其中,命名实体识别(Named Entity Recognition,NER)是指识别文本中具有特定意义的实体,主要包括人名、地名、机构名、专有名词等,例如从图1所示的“某某将飞往A城市”中识别出人名“某某”。抽取式问答(Extractive Question Answering,EQA)是指根据给定的问题,在相关文本中抽取出对应的答案,例如图1所示的问题是“谁是小张的父亲?”,相关文本是“小张的父亲是老张”,答案是“老张”以及“老张”在相关文本中的位置信息,例如图1所示的14,15表示“老张”位于第14个字和第15个字的位置上。其中,问题中的每个字和相关文本中的每个字是统一排序的。
再例如,基于序列-序列范式的预训练(S2S-style Pre-training),其预训练过程分为两步,步骤(1):在输入文本中污染一些文本段,例如将文本段替换为一些特殊字符,如[X],[Y],之后送入编码器。步骤(2):基于编码器输出的文本表征,通过解码器将由特殊字符污染的文本段分别恢复出来。序列-序列(Sequence-to-Sequence,S2S)是一类自然语言处理的模型范式,给定文本输入,机器学习模型需要输出与对应生成任务相关的文本。例如图1所示,被特殊字符[X]替换的文本段是“发明了硅”,T5包括编码器和解码器。该方法可以将各种类型的下游自然语言理解任务分别转化成文本生成任务,因此无须添加任务相关模块。但由于该方案的输入文本被污染,输出文本是恢复出的文本段,和真实的下游任务的自然语言输入、自然语言输出仍有数据形式上的差异,例如,当下游任务是命名实体识别任务时,自然语言输入是图1所示的“[spot]人物[spot]地点[text]某某将飞往A城市”,自然语言输出是“人物:某某,地点:A城市”。当下游任务是抽取式问答任务时,自然语言输入是图1所示的“谁是小张的父亲?小张的父亲是老张”,自然语言输出是“老张”。可见,针对序列-序列范式,预训练阶段的输入输出和微调阶段的输入输出有着很大的差异,导致序列-序列范式在低资源场景即目标任务数据较少的情况下仍然难以获得较好的微调效果,即难以通过少量的目标任务数据校准预训练后的机器学习模型,从而导致微调后的机器学习模型依然不够精准。
针对上述问题,本公开实施例提供了一种模型训练方法,该方法包括采用图1所示的机器阅读理解(Machine Reading Comprehension,MRC)范式的预训练(MRC-style Pre-training)对机器学习模型进行预训练,另外,该方法还包括对预训练后的机器学习模型的微调。该MRC范式具体是一类自然语言处理的模型范式,其输入包含查询(query)和相关上下文文本(context)两部分,输出是上下文文本中的一些答案的位置,使得该答案能够满足输入的查询。例如针对图1所示的MRC范式,查询是“它是半导体的一种化学成分,化学符号是Si”,上下文文本是“某某发明了硅技术”,答案是“硅”在上下文文本中的位置“24,24”,表示“硅”位于第24个字的位置上,其中,查询中的每个字和上下文文本中的每个字是统一排序的。另外,针对本实施例提供的机器阅读理解范式,机器 学习模型在微调阶段的输入输出和机器学习模型在预训练阶段的输入输出的数据格式是相同的,例如,在预训练阶段,机器学习模型的输入是查询和上下文文本,输出是答案。在微调阶段,若目标任务是NER任务,则机器学习模型的输入也是查询和上下文文本,其中,查询是如图1所示的“人物?”,上下文文本是图1所示的“某某将飞往A城市”,机器学习模型的输出是答案即“某某”在上下文文本中的位置“2,3”。若目标任务是EQA任务,则机器学习模型的输入也是查询和上下文文本,其中,查询是如图1所示的“谁是小张的父亲?”,上下文文本是图1所示的“小张的父亲是老张”,机器学习模型的输出是答案即“老张”在上下文文本中的位置“14,15”。
下面结合具体的实施例对该方法进行介绍。图2为本公开实施例提供的模型训练方法流程图。该方法可以由模型训练装置执行,该装置可以采用软件和/或硬件的方式实现,该装置可配置于电子设备中,例如服务器或终端,其中,终端具体包括手机、电脑或平板电脑等。另外,本实施例所述的模型训练方法可以适用于如图3所示的应用场景。如图3所示,该应用场景包括终端31和服务器32,其中,服务器32可以采用本实施例所述的模型训练方法对机器学习模型进行预训练和微调。进一步,服务器32可以根据微调之后的机器学习模型为终端31提供服务,例如,终端31可以向服务器32发送查询和上下文文本,服务器32可以将该查询和上下文文本输入给微调之后的机器学习模型,使得微调之后的机器学习模型可以输出答案,进一步,服务器32可以将该答案反馈给终端31。或者,终端31可以向服务器32发送上下文文本,服务器32根据自然语言理解任务的具体要求生成查询,并将该查询和上下文文本输入给微调之后的机器学习模型,使得微调之后的机器学习模型可以输出答案。再或者,服务器32还可以将微调之后的机器学习模型部署在终端31上,使得终端31可以通过该微调之后的机器学习模型执行自然语言理解任务。下面结合图3对该方法进行详细介绍,如图2所示,该方法具体步骤如下:
S201、获取超链接所标记的第一目标信息。
例如,为了构造机器阅读理解风格的数据,可以将超链接所标记的锚(Anchor)作为机器阅读理解的答案,另外,超链接所标记的锚可以记为第一目标信息。具体的,该超链接可以是网页中的超链接。该网页可以是维基百科中的网页。
S202、从所述第一目标信息的主页文章中获取所述第一目标信息对应的第一查询,从所述第一目标信息的至少一个引用文章中获取所述第一目标信息的至少一个第一上下文信息。
具体的,一个锚可以链接有两种类型的文章,一种类型的文章是主页文章,另一种类型的文章是引用文章。其中,主页文章(Home Article)用于对该锚进行详细的解释。例如图4所示,“硅”是被超链接所标记的锚,主页文章对硅进行了详细的解释。另外,一个锚可能会链接有一个或多个引用文章(Reference Article)。例如图4所示的“硅”链接有两篇引用文章,分别是引用文章41和引用文章42,其中,引用文章41是介绍半导体的文章,引用文章42是介绍集成电路的文章,这两篇引用文章可以是维基百科文章,并 且这两篇引用文章中都提到了“硅”,引用文章中出现的“硅”都通过超链接所标记。在本实施例中,可以从主页文章中获取锚对应的查询,该查询可以记为第一查询。另外,还可以从每个引用文章中获取锚对应的上下文信息,该上下文信息记为第一上下文信息。从一个引用文章中至少可以获取到一个上下文信息。
可选的,从所述第一目标信息的主页文章中获取所述第一目标信息对应的第一查询,包括:将所述第一目标信息的主页文章中的前至少一个句子作为所述第一目标信息对应的第一查询。
例如,在本实施例中,可以将主页文章中的前T个句子作为该锚对应的查询,T大于或等于1。例如图4所示,将主页文章中的前两个句子作为该锚对应的查询。
可选的,从所述第一目标信息的至少一个引用文章中获取所述第一目标信息的至少一个第一上下文信息,包括:针对所述第一目标信息的至少一个引用文章中的每个引用文章,从所述引用文章中确定出包括所述第一目标信息的句子;将所述引用文章中所述句子前后各至少一个句子和所述句子一起作为一个第一上下文信息。
例如,针对图4所示的两篇引用文章中的每篇引用文章,从该篇引用文章中首先确定出包含该锚即“硅”的句子,可以理解的是,在该篇引用文章中包含“硅”的句子可能不止一个,例如,图4所示的引用文章41中就有两个句子分别包含有“硅”,以其中一个包含“硅”的句子为例,将该句子前后各W各句子和该句子一起(即2W+1个句子)作为该锚的一个上下文信息,W大于或等于1。也就是说,对于一篇引用文章而言,若该篇引用文章中有N个句子分别包含了该锚,那么从该篇引用文章中可以获取到N个上下文信息。如图4所示,假设以每篇引用文章提取到一个上下文信息为例,具体的,从引用文章41中提取到上下文信息1,上下文信息1对应于答案1,从引用文章41中提取到上下文信息2,上下文信息2对应于答案2。具体的,(查询,上下文信息1,答案1)可以构成一个三元组,(查询,上下文信息2,答案2)可以构成另一个三元组。通过这种数据构造方式,可以构造出数以亿计的机器阅读理解风格的正例数据。另外,还可以将不相关的查询和上下文信息进行匹配从而构造出机器阅读理解风格的负例数据,例如,负例数据也是三元组形式的数据,该三元组也包括如上所述的查询、上下文信息、答案,只是查询和上下文信息不匹配,即上下文信息中不存在该查询对应的答案,相应的,该三元组中的答案可以直接赋值为空或0等特殊标记。
S203、根据所述第一目标信息、所述第一查询、所述至少一个第一上下文信息对机器学习模型进行预训练,得到预训练后的机器学习模型。
如图5所示,本实施例提供了一个统一的预训练阅读器(Pre-trained Machine Reader,PMR),该预训练阅读器可以是机器学习模型。进一步,通过如上所述的正例数据和负例数据对该预训练阅读器进行预训练,得到预训练后的PMR,预训练后的PMR可以记为预训练后的机器学习模型。例如以正例数据中的一个三元组为例,将该三元组中的查询和上下文信息作为PMR的输入,使得PMR根据查询和上下文信息输出一个答案。进一步,根据PMR 输出的答案和该三元组中标准的答案,对PMR进行训练,即对PMR的参数进行一次迭代更新。可以理解的是,根据每一个三元组可以对PMR的参数进行一次迭代更新,当迭代更新的次数达到预设次数,或者PMR的参数趋于稳定的时候,可以确定针对PMR的预训练完成。
S204、从自然语言理解任务提供的样本文本中确定出第二目标信息,生成所述第二目标信息对应的第二查询,并采用所述样本文本、所述第二查询和所述第二目标信息对所述预训练后的机器学习模型进行训练。
在本实施例中,自然语言理解任务可以分为几种类型,针对每一种类型的任务,该任务通常会提供有一个样本文本,进一步,从该样本文本中确定出答案,该答案可以作为第二目标信息。根据该答案可以生成对应的查询,该查询可以记为第二查询。从而将(样本文本,第二目标信息,第二查询)作为一个三元组,也就是说,此处的样本文本可以视为如上所述的上下文信息,该三元组可以是如上所述的目标任务数据,通过该三元组可以对预训练后的PMR进行训练,即微调、修正或校准。同理,将第二查询和样本文本作为预训练后的PMR的输入,使得预训练后的PMR输出一个答案,进一步,根据预训练后的PMR输出的答案和第二目标信息对预训练后的PMR进行微调。
具体的,若自然语言理解任务是有固定任务标签的词级别抽取任务,例如命名实体识别任务,则根据每一个任务标签可以生成一个查询,其中,每一种类型的实体可以对应有一个任务标签,实体的类型例如可以分为地名、人物、机构名、专有名词等。假设命名实体识别任务提供的样本文本是“某某将飞往A城市”,从该样本文本中预先标注出各种类型的实体,例如,“某某”是人物类型的实体,“A城市”是地名类型的实体等。人物类型和地名类型可以对应有不同的任务标签。针对人物类型对应的任务标签,生成查询“请找出下文中与人物相关的实体”。进一步,将该样本文本和该查询作为预训练后的PMR的输入,使得预训练后的PMR从该样本文本中检索与该查询对应的答案,假设预训练后的PMR输出一个答案,进一步,根据预训练后的PMR输出的答案和标准答案即“某某”对预训练后的PMR进行微调。再例如,针对地名类型对应的任务标签,生成查询“请找出下文中与地名相关的实体”。进一步,将该样本文本和该查询作为预训练后的PMR的输入,使得预训练后的PMR从该样本文本中检索与该查询对应的答案,假设预训练后的PMR输出一个答案,进一步,根据预训练后的PMR输出的答案和标准答案即“A城市”对预训练后的PMR进行微调。
若自然语言理解任务是自然语言问题的词级别抽取任务,如抽取式问答任务,则根据样本文本生成至少一个问题。例如,样本文本是“小明出生在B国家”,则问题可以是“谁出现在B国家”、“小明出生在哪里”等。针对“谁出现在B国家”,将该查询和该样本文本作为预训练后的PMR的输入,使得预训练后的PMR在该样本文本中检索与该查询对应的答案,进一步,根据预训练后的PMR输出的答案和标准答案即“小明”对预训练后的PMR进行微调。
若自然语言理解任务是有固定任务标签的序列级别分类任务,如情感分析(Sentiment  Analysis)任务,即给定一段文本,判断该文本的情感极性,则每一种情感可对应有一个任务标签,例如,“积极情感”和“消极情感”分别对应不同的任务标签,根据每个任务标签生成一个查询。例如,针对“积极情感”生成的查询是“下面文本表示的是积极情感”,假设样本文本是“小明今天很开心”,将该查询和该样本文本作为预训练后的PMR的输入,使得预训练后的PMR判断该查询和该样本文本是否相关,若相关,则说明情感分析的结果是“积极情感”。另外,针对“消极情感”生成的查询是“下面文本表示的是消极情感”,将该查询和该样本文本作为预训练后的PMR的输入,使得预训练后的PMR判断该查询和该样本文本是否相关,若不相关,则说明“消极情感”不能作为情感分析的结果。
若自然语言理解任务是基于自然语言问题在多个选项上的序列级别分类任务,如多选式问答(Multi-choice Question Answering,MCQA)任务,即根据给定的问题和相关参考信息,在多个选项中选择出正确的选项。例如,相关参考信息是阅读理解中给定的一段文章,问题是针对该文章的问题,多个选项是该问题对应的多个选项。那么在这种情况下,可以将该问题和一个选项一起作为一个查询,并将该查询和该相关参考信息作为预训练后的PMR的输入,使得预训练后的PMR判断该查询和该相关参考信息是否相关,若相关,则说明该问题的正确答案就是该查询中的这个选项,若不相关,则说明该问题的正确答案不是该查询中的这个选项,需要继续判断正确答案。
本公开实施例通过获取超链接所标记的第一目标信息作为答案,并从所述第一目标信息的主页文章中获取所述第一目标信息对应的第一查询,从所述第一目标信息的至少一个引用文章中获取所述第一目标信息的至少一个第一上下文信息,使得所述第一目标信息、所述第一查询和每个第一上下文信息可以构成一个机器阅读理解风格的三元组。进一步,根据每个三元组对机器学习模型进行预训练,使得预训练后的机器学习模型能够无缝自然的在机器阅读理解范式下处理各式各样的自然语言理解任务。另外,由于预训练阶段中用于做模型训练的数据格式和微调阶段中用于做模型训练的数据格式一致,都是包括答案、查询和上下文信息的三元组,使得预训练的目标和微调的目标相同,从而使得预训练阶段和微调阶段之间可以进行无缝的衔接。由于预训练过程和微调过程极为相似,因此,在采用大量低成本数据对机器学习模型进行预训练之后,通过少量的目标任务数据即可校准预训练后的机器学习模型,从而使得预训练阶段中学习到的通用知识顺利的迁移到微调后的机器学习模型中,并保证了微调后的机器学习模型的准确性。
另外,本实施例可以构造大量高质量的机器阅读理解格式的数据例如三元组,从而可以端到端的预训练机器学习模型,无需进行人工标注便能构造海量数据,极大的减少了人力成本,从而降低了预训练的成本,并且提高了预训练后的机器学习模型的准确度。另外,由于本实施例提供的预训练是一种基于机器阅读理解范式的预训练方案,预训练后的机器学习模型可以适用于各种语言,甚至多语言学习,因此本方案适用性广、通用性强,并且在序列级别任务上有很强的可解释性。此外,本实施例将预训练和微调阶段全部统一在机器阅读理解范式,从而消除了预训练和微调之间的训练目标和数据格式的差异,使得预训 练阶段学习到的通用知识可以很好的迁移到微调后的机器学习模型中,即提高了迁移性,同时使得经过预训练和微调之后的机器学习模型在处理自然语言理解任务时具有极大的性能提升。
图6为本公开另一实施例提供的模型训练方法流程图。在本实施例中,所述机器学习模型包括编码器和抽取器,所述编码器的输出是所述抽取器的输入。例如,该机器学习模型是如图5所示的PMR,PMR包括编码器和抽取器,编码器的输出是抽取器的输入。根据所述第一目标信息、所述第一查询、所述至少一个第一上下文信息对机器学习模型进行预训练,得到预训练后的机器学习模型,包括如下几个步骤:
S601、针对所述至少一个第一上下文信息中的每个第一上下文信息,将所述第一查询和所述第一上下文信息作为所述编码器的输入,使得所述编码器输出所述第一查询中每个文本单元的表示向量和所述第一上下文信息中每个文本单元的表示向量。
可以理解的是,针对超链接所标记的某个锚可能对应有多个上下文信息,该锚、该锚对应的查询、该锚对应的一个上下文信息可以构成一个三元组。假设某个三元组中的查询是图5所示的查询(Query),该三元组中的上下文信息是图5所示的上下文信息(Context)。可以理解的是,由于查询和上下文信息分别是至少一个句子,而每个句子包括至少一个文本单元,该文本单元例如可以是词语、单词、字、短语、短句、字符等。因此,在本实施例中可以将图5所示的查询拆分为多个文本单元,同时将图5所示的上下文信息拆分为多个文本单元。例如,该查询可以被拆分为Q个文本单元,该上下文信息可以被拆分为C个文本单元,一个文本单元可以记为一个Token。另外,在查询的前面可以增加一个特殊词例如[CLS],在查询和上下文信息之间可以增加一个特殊词例如[SEP],另外,在上下文信息的后面也可以增加一个特殊词例如[SEP]。进一步,将[CLS]、查询中的Q个文本单元、[SEP]、上下文信息中的C个文本单元、[SEP]一起作为编码器的输入。假设图5所示的[CLS]、查询中的Q个文本单元、[SEP]、上下文信息中的C个文本单元、[SEP]一共是M个输入词,则编码器可以将该M个输入词表征在一个向量空间中,从而使得编码器可以输出该M个输入词中每个输入词对应的表示向量,例如,图5所示的H1、H2、HN-1、HN、HN+1、HM-1、HM是该M个输入词依次的表示向量。
S602、通过所述抽取器计算所述第一上下文信息中每个文本段分别作为所述第一查询对应的答案的概率,所述每个文本段分别由所述第一上下文信息中连续的至少一个文本单元构成。
在本实施例中,抽取器可以在给定任意两个输入词的表示向量的情况下,计算出一个概率值,该概率值表示根据该两个输入词确定的文本段作为该查询对应的答案的概率,该文本段是从该M个输入词中截取出的以该两个输入词中的第一个输入词为开头、以该两个输入词中的第二个输入词为结尾的连续的至少一个输入词。例如,S1,3表示H1对应的输入词、H2对应的输入词、H3对应的输入词构成的文本段作为该查询对应的答案的概率。由 于如图5所示一共有M个输入词,那么一共可以得到M*M个概率,从而得到图5所示的概率矩阵51。可以理解的是,由于上下文信息中的C个文本单元包括在该M个输入词中,因此,概率矩阵51中包括从该C个文本单元中截取的任意个连续的文本单元构成的文本段作为该查询对应的答案的概率,也就是说,概率矩阵51中包括上下文信息中每个文本段分别作为该查询对应的答案的概率,上下文信息中的每个文本段分别由该上下文信息中连续的至少一个文本单元构成。
可选的,通过所述抽取器计算所述第一上下文信息中每个文本段分别作为所述第一查询对应的答案的概率,包括:通过所述抽取器计算所述第一上下文信息中从第i个文本单元到第j个文本单元连续的至少一个文本单元构成的文本段作为所述第一查询对应的答案的概率,j大于或等于i,所述概率是根据所述第i个文本单元的表示向量和所述第j个文本单元的表示向量计算得到的。
假设在图5和图7中,[CLS]、查询中的Q个文本单元、[SEP]、上下文信息中的C个文本单元、[SEP]统一排序,即假设[CLS]是第一个文本单元,以此类推,最后一个[SEP]是第M个文本单元。也就是说,[CLS]在编码器的输入中对应的索引是1,查询中的第一个文本单元在编码器的输入中对应的索引是2,上下文信息中的第一个文本单元在编码器的输入中对应的索引是N+1,以此类推,最后一个[SEP]对应的索引是M。例如图5所示的概率矩阵51可以详细表示为图7所示的概率矩阵71,其中,概率矩阵71中包括概率矩阵72,假设i大于或等于N+1,j大于或等于i,j小于或等于M-1,则概率矩阵72中的任一Si,j表示从第i个文本单元到第j个文本单元连续的至少一个文本单元构成的文本段作为该查询对应的答案的概率。Si,j是根据所述第i个文本单元的表示向量和所述第j个文本单元的表示向量计算得到的。
S603、根据所述概率、所述第一目标信息在所述第一上下文信息中的位置信息,对所述机器学习模型进行预训练,得到预训练后的机器学习模型。
例如,第一目标信息是如图4所示的答案2,图5所示的查询是图4所示的查询,图5所示的上下文信息是如图4所示的上下文信息2。在构造如图4所示的查询、上下文信息2、答案2构成的三元组时,可以预先确定出答案2在上下文信息2中的位置信息,从而可以根据答案2在上下文信息2中的位置信息,确定出答案2在编码器输入中对应的索引。进一步,根据如上所述的方法可以计算得到概率矩阵71,并从概率矩阵71中确定出概率矩阵72。可以理解的是,假设概率矩阵72中存在一个概率值最大的Si,j,那么针对该最大的Si,j,说明从第i个文本单元到第j个文本单元连续的至少一个文本单元构成的文本段可以作为答案被PMR输出。但是,在预训练阶段中,PMR输出的答案和标准的答案例如答案2之间可能存在一定的差异,因此,需要根据该差异对PMR进行预训练。在一种可能的实现方式中,可以根据答案2在上下文信息2中的位置信息生成一个标准矩阵,该标准矩阵的大小和概率矩阵72的大小相同,该标准矩阵例如是如图8所示的概率矩阵81。假设答案2是编码器输入中的第M-1个文本单元,那么可以将概率矩阵81中的S’M-1,M-1设置 为1,同时将概率矩阵81中其他的概率值设置为0。进一步,根据概率矩阵72和概率矩阵81之间的差异对PMR进行预训练。例如,根据概率矩阵72和概率矩阵81计算损失函数值,进一步,根据该损失函数值对PMR进行预训练。
可选的,根据所述概率、所述第一目标信息在所述第一上下文信息中的位置信息,对所述机器学习模型进行预训练之前,所述方法还包括:通过所述抽取器计算所述第一查询和所述第一上下文信息的相关度,所述相关度是根据所述第一查询和所述第一上下文信息整体的表示向量计算得到的;根据所述概率、所述第一目标信息在所述第一上下文信息中的位置信息,对所述机器学习模型进行预训练,包括:根据所述相关度、所述概率、所述第一目标信息在所述第一上下文信息中的位置信息,对所述机器学习模型进行预训练。
例如图5所示,虽然H1对应于特殊词[CLS],但是在本实施例中,H1可以是编码器输入中的查询和上下文信息整体的表示向量,也就是说,特殊词[CLS]对应的表示向量可以是查询和上下文信息整体的表示向量。因此,根据H1计算出的S1,1可以表示查询和上下文信息的相关度。因此,根据S1,1可以确定出该查询和该上下文信息是否相关。例如,若S1,1=1,则确定该查询和该上下文信息相关,即该上下文信息中存在该查询对应的答案。若S1,1=0,则确定该查询和该上下文信息不相关,即该上下文信息中不存在该查询对应的答案。可以理解的是,在构造如图4所示的查询、上下文信息2、答案2构成的三元组时,由于答案2在上下文信息2中,因此,该查询和该上下文信息2是相关的,此时可以用S’1,1=1来表示该查询和该上下文信息2相关。进一步,将该查询和该上下文信息2作为PMR的输入,得到概率矩阵71后,可以根据概率矩阵71中的S1,1、概率矩阵71中的概率矩阵72、S’1,1、如图8所示的概率矩阵81来构造损失函数,例如,该损失函数包括S1,1和S’1,1之间的差异、以及概率矩阵72和概率矩阵81之间的差异。进一步,根据该损失函数的值对PMR进行预训练。
另外,在有一种可行的实现方式中,针对查询、上下文信息2、答案2这个三元组,还可以构造一个大小与概率矩阵71的大小相同的标准矩阵,例如图9所示的概率矩阵91。在该标准矩阵中,S’1,1等于1,S’M-1,M-1等于1,其他元素均为0。进一步,根据该标准矩阵和该概率矩阵71之间的差异构造损失函数,并根据损失函数值对PMR进行预训练。
可以理解的是,针对同一个答案,查询可以是固定的,上下文信息可能有多个,那么当该查询和每一个上下文信息作为PMR的一次输入时,即可对PMR进行一次预训练。在不断更换上下文信息的过程中,可以对PMR进行多次预训练。另外,在更换答案的情况下,相应的查询、上下文信息也会随之更换,从而又可以对PMR进行多次预训练。也就是说,每一个三元组都可以对PMR进行一次预训练。由于本实施例可以构造出海量的机器阅读理解风格的数据即三元组,因此,可以对PMR进行充分的预训练。另外,可以理解的是,如图4所示的查询、上下文信息2、答案2构成的三元组只是一个正例数据,同理,根据负例数据对PMR进行预训练的过程类似于根据该正例数据对PMR进行预训练的过程,此处不再赘述。
另外,在对PMR进行预训练之后,还可以采用如上所述的S204对预训练后的PMR进行微调。在微调过程中,用于进行模型训练的数据还是三元组,只是此时的三元组是针对于某个自然语言理解任务的数据。虽然微调过程中使用的三元组的来源和预训练过程中使用的三元组的来源有所不同,但是,数据格式是相类似的。因此,微调过程的原理和预训练过程的原理是一致的,即微调过程可以采用类似于如上所述的概率矩阵72和概率矩阵81之间的差异对预训练后的PMR进行微调,或者采用类似于如上所述的S1,1和S’1,1之间的差异、以及概率矩阵72和概率矩阵81之间的差异对预训练后的PMR进行微调,再或者采用类似于如上所述的概率矩阵71和概率矩阵91之间的差异对预训练后的PMR进行微调,具体过程此处不再赘述。
本实施例采用了一个统一的抽取器去处理各式自然语言理解任务,并且该抽取器在预训练和微调过程中保持了同样的训练目标,从而消除了预训练和微调之间的训练目标差异。由于本实施例提供的预训练过程是基于机器阅读理解的判别式目标,相比与传统生成式的预训练目标,能显著的提高预训练效率和减少预训练所需的硬件开销。并且本实施例在对机器学习模型进行预训练之后,无需进行任何调整即可通过少量的针对某种自然语言理解任务的目标任务数据(例如微调所用的三元组)对预训练后的机器学习模型进行准确的微调,从而使得微调之后的机器学习模型能够准确的处理该自然语言理解任务。
另外,本公开实施例将四种形式的自然语言理解任务(例如,命名实体识别任务、抽取式问答任务、情感分析任务、多选式问答任务)分别转化为机器阅读理解范式,从而可以在这些任务形成的三元组上对预训练后的机器学习模型例如预训练后的PMR进行直接无缝的微调。此外,本实施例提供了一个解决下游自然语言理解任务的统一框架,例如PMR,因此,只需要维护这一个机器学习模型便可解决各式任务,在实际场景中存储效率高,机器学习模型迁移性和通用性强。
如上所述的实施例主要讲述如何对机器学习模型例如PMR进行预训练和微调,其中,预训练和微调均属于训练阶段。在训练阶段完成之后,即可使用微调后的PMR即训练完成的机器学习模型处理各种各样的自然语言理解任务,即进入机器学习模型的推理阶段即使用阶段。可以理解的是,训练阶段和推理阶段可以在同一个设备中执行,也可以分别在不同的设备中执行。例如,训练阶段和推理阶段均可以在如图3所示的服务器32上完成。或者,训练阶段在服务器32上完成,进一步将训练完成的机器学习模型移植到其他设备上,从而在其他设备上实现推理阶段。下面结合图10对推理阶段进行介绍。
图10为本公开另一实施例提供的自然语言处理方法流程图。在本实施例中,该方法具体步骤如下:
S1001、获取目标文本。
假设推理阶段在如图3所示的服务器32上执行。具体的,当终端31需要处理某个自然语言理解任务,例如,命名实体识别任务时,终端31可以向服务器32发送目标文本。
S1002、根据所述目标文本对应的自然语言理解任务,确定查询信息。
在一种可能的情况下,用户在终端31上输入目标文本的同时还输入了查询信息,例如,该目标文本是“某某将飞往A城市”,该查询信息是“城市名称”。在这种情况下,当服务器32接收到该目标文本和该查询信息时,可以将用户输入的该查询信息作为该自然语言理解任务对应的查询信息。从而使得服务器32可以根据用户指定的查询信息在该目标文本中查询出与“城市名称”对应的实体即“A城市”,并将“A城市”反馈给终端31,即“A城市”是该查询信息对应的答案。
在另一种可能的情况下,用户在终端31上输入了目标文本,但是没有输入查询信息,那么在这种情况下,当服务器32接收到该目标文本,并且该自然语言理解任务是命名实体识别任务时,可以生成针对所有已知实体类型中每种实体类型的查询信息。例如,已知实体类型包括“城市名称”、“人名”、“名胜古迹”等,针对“城市名称”、“人名”、“名胜古迹”,服务器32生成的查询信息依次为“请找出下文中与城市名称相关的实体”、“请找出下文中与人名相关的实体”、“请找出下文中与名胜古迹相关的实体”。进一步,服务器32可以在该目标文本中依次查询与“城市名称”、“人名”、“名胜古迹”分别对应的实体,并将与“城市名称”、“人名”、“名胜古迹”分别对应的实体反馈给终端31。
S1003、将所述查询信息和所述目标文本作为机器学习模型的输入,使得所述机器学习模型输出所述目标文本中与所述查询信息对应的答案,所述机器学习模型是根据如上所述的模型训练方法训练得到的。
具体的,服务器32可以将该查询信息和该目标文本作为经过如上所述的预训练和微调后的机器学习模型的输入。例如,经过预训练和微调后的机器学习模型是如图5所示的PMR。PMR按照如图5所示的逻辑对该查询信息和该目标文本进行处理,得到类似于概率矩阵51的一个矩阵,从该矩阵中获取最大的概率值,并将该目标文本中与该最大的概率值对应的文本段作为该查询信息对应的答案,并输出该答案。
可选的,将所述查询信息和所述目标文本作为机器学习模型的输入,使得所述机器学习模型输出所述目标文本中与所述查询信息对应的答案,包括:将所述查询信息和所述目标文本作为机器学习模型的输入,使得所述机器学习模型判断所述查询信息和所述目标文本是否相关;若所述查询信息和所述目标文本相关,则通过所述机器学习模型输出所述目标文本中与所述查询信息对应的答案。
例如,本实施例所述的查询信息对应于如图5所示的查询,本实施例所述的目标文本对应于如图5所示的上下文信息。在查询信息的前面添加特殊词[CLS],在查询信息和目标文本之间添加特殊词[SEP],在目标文本后面添加特殊词[SEP]之后,将[CLS]、查询信息、[SEP]、目标文本、[SEP]一起作为编码器的输入,经过编码器和抽取器的处理之后,得到类似于概率矩阵51的一个矩阵,从该矩阵中首先提取出第一行第一列的元素,该元素类似于如图5所示的S1,1。根据该元素可以判断该查询信息和该目标文本是否相关,在相关的情况下,可以进一步根据该目标文本中每个文本段在该矩阵中对应的概率,确定该文本段是否需要被输出,例如,若该概率大于或等于预设阈值,则可以确定该文本段是该查 询信息对应的答案,从而输出该答案。
本实施例通过将预训练阶段和微调阶段进行无缝的衔接,在采用大量低成本数据对机器学习模型进行预训练之后,通过少量的目标任务数据即可校准预训练后的机器学习模型,从而使得微调后的机器学习模型可以准确的处理各种各样的自然语言理解任务。
图11为本公开实施例提供的模型训练装置的结构示意图。本公开实施例提供的模型训练装置可以执行模型训练方法实施例提供的处理流程,如图11所示,模型训练装置110包括:
第一获取模块111,用于获取超链接所标记的第一目标信息;
第二获取模块112,用于从所述第一目标信息的主页文章中获取所述第一目标信息对应的第一查询,从所述第一目标信息的至少一个引用文章中获取所述第一目标信息的至少一个第一上下文信息;
预训练模块113,用于根据所述第一目标信息、所述第一查询、所述至少一个第一上下文信息对机器学习模型进行预训练,得到预训练后的机器学习模型;
微调模块114,用于从自然语言理解任务提供的样本文本中确定出第二目标信息,生成所述第二目标信息对应的第二查询,并采用所述样本文本、所述第二查询和所述第二目标信息对所述预训练后的机器学习模型进行训练。
可选的,所述机器学习模型包括编码器和抽取器,所述编码器的输出是所述抽取器的输入;预训练模块113包括:输入单元1131、计算单元1132、预训练单元1133,其中,输入单元1131用于针对所述至少一个第一上下文信息中的每个第一上下文信息,将所述第一查询和所述第一上下文信息作为所述编码器的输入,使得所述编码器输出所述第一查询中每个文本单元的表示向量和所述第一上下文信息中每个文本单元的表示向量;计算单元1132用于通过所述抽取器计算所述第一上下文信息中每个文本段分别作为所述第一查询对应的答案的概率,所述每个文本段分别由所述第一上下文信息中连续的至少一个文本单元构成;预训练单元1133用于根据所述概率、所述第一目标信息在所述第一上下文信息中的位置信息,对所述机器学习模型进行预训练,得到预训练后的机器学习模型。
可选的,计算单元1132通过所述抽取器计算所述第一上下文信息中每个文本段分别作为所述第一查询对应的答案的概率时,具体用于:
通过所述抽取器计算从第i个文本单元到第j个文本单元连续的至少一个文本单元构成的文本段作为所述第一查询对应的答案的概率,j大于或等于i,i大于或等于N+1,N+1是所述第一上下文信息中的第一个文本单元在所述编码器的输入中对应的索引,所述概率是根据所述第i个文本单元的表示向量和所述第j个文本单元的表示向量计算得到的。
可选的,计算单元1132还用于:在预训练单元1133根据所述概率、所述第一目标信息在所述第一上下文信息中的位置信息,对所述机器学习模型进行预训练之前,通过所述抽取器计算所述第一查询和所述第一上下文信息的相关度,所述相关度是根据所述第一查 询和所述第一上下文信息整体的表示向量计算得到的;相应的,预训练单元1133根据所述概率、所述第一目标信息在所述第一上下文信息中的位置信息,对所述机器学习模型进行预训练时,具体用于:根据所述相关度、所述概率、所述第一目标信息在所述第一上下文信息中的位置信息,对所述机器学习模型进行预训练。
可选的,第二获取模块112从所述第一目标信息的主页文章中获取所述第一目标信息对应的第一查询时,具体用于:将所述第一目标信息的主页文章中的前至少一个句子作为所述第一目标信息对应的第一查询。
可选的,第二获取模块112从所述第一目标信息的至少一个引用文章中获取所述第一目标信息的至少一个第一上下文信息时,具体用于:
针对所述第一目标信息的至少一个引用文章中的每个引用文章,从所述引用文章中确定出包括所述第一目标信息的句子;
将所述引用文章中所述句子前后各至少一个句子和所述句子一起作为一个第一上下文信息。
图11所示实施例的模型训练装置可用于执行上述方法实施例的技术方案,其实现原理和技术效果类似,此处不再赘述。
图12为本公开实施例提供的自然语言处理装置的结构示意图。本公开实施例提供的自然语言处理装置可以执行自然语言处理方法实施例提供的处理流程,如图12所示,自然语言处理装置120包括:
获取模块121,用于获取目标文本;
确定模块122,用于根据所述目标文本对应的自然语言理解任务,确定查询信息;
输入模块123,用于将所述查询信息和所述目标文本作为机器学习模型的输入,使得所述机器学习模型输出所述目标文本中与所述查询信息对应的答案,所述机器学习模型是根据如上所述的模型训练方法训练得到的。
可选的,输入模块123将所述查询信息和所述目标文本作为机器学习模型的输入,使得所述机器学习模型输出所述目标文本中与所述查询信息对应的答案时,具体用于:
将所述查询信息和所述目标文本作为机器学习模型的输入,使得所述机器学习模型判断所述查询信息和所述目标文本是否相关;
若所述查询信息和所述目标文本相关,则通过所述机器学习模型输出所述目标文本中与所述查询信息对应的答案。
图12所示实施例的自然语言处理装置可用于执行上述方法实施例的技术方案,其实现原理和技术效果类似,此处不再赘述。
以上描述了模型训练装置或自然语言处理装置的内部功能和结构,该装置可实现为一种电子设备。图13为本公开实施例提供的电子设备实施例的结构示意图。如图13所示, 该电子设备包括存储器131和处理器132。
存储器131用于存储程序。除上述程序之外,存储器131还可被配置为存储其它各种数据以支持在电子设备上的操作。这些数据的示例包括用于在电子设备上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。
存储器131可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。
处理器132与存储器131耦合,执行存储器131所存储的程序,以用于:
获取超链接所标记的第一目标信息;
从所述第一目标信息的主页文章中获取所述第一目标信息对应的第一查询,从所述第一目标信息的至少一个引用文章中获取所述第一目标信息的至少一个第一上下文信息;
根据所述第一目标信息、所述第一查询、所述至少一个第一上下文信息对机器学习模型进行预训练,得到预训练后的机器学习模型;
从自然语言理解任务提供的样本文本中确定出第二目标信息,生成所述第二目标信息对应的第二查询,并采用所述样本文本、所述第二查询和所述第二目标信息对所述预训练后的机器学习模型进行训练。
或者,处理器132还可用于:
获取目标文本;
根据所述目标文本对应的自然语言理解任务,确定查询信息;
将所述查询信息和所述目标文本作为机器学习模型的输入,使得所述机器学习模型输出所述目标文本中与所述查询信息对应的答案,所述机器学习模型是根据如上所述的模型训练方法训练得到的。
进一步,如图13所示,电子设备还可以包括:通信组件133、电源组件134、音频组件135、显示器136等其它组件。图13中仅示意性给出部分组件,并不意味着电子设备只包括图13所示组件。
通信组件133被配置为便于电子设备和其他设备之间有线或无线方式的通信。电子设备可以接入基于通信标准的无线网络,如WiFi,2G或3G,或它们的组合。在一个示例性实施例中,通信组件133经由广播信道接收来自外部广播管理系统的广播信号或广播相关信息。在一个示例性实施例中,所述通信组件133还包括近场通信(NFC)模块,以促进短程通信。例如,在NFC模块可基于射频识别(RFID)技术,红外数据协会(IrDA)技术,超宽带(UWB)技术,蓝牙(BT)技术和其他技术来实现。
电源组件134,为电子设备的各种组件提供电力。电源组件134可以包括电源管理系统,一个或多个电源,及其他与为电子设备生成、管理和分配电力相关联的组件。
音频组件135被配置为输出和/或输入音频信号。例如,音频组件135包括一个麦克 风(MIC),当电子设备处于操作模式,如呼叫模式、记录模式和语音识别模式时,麦克风被配置为接收外部音频信号。所接收的音频信号可以被进一步存储在存储器131或经由通信组件133发送。在一些实施例中,音频组件135还包括一个扬声器,用于输出音频信号。
显示器136包括屏幕,其屏幕可以包括液晶显示器(LCD)和触摸面板(TP)。如果屏幕包括触摸面板,屏幕可以被实现为触摸屏,以接收来自用户的输入信号。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。所述触摸传感器可以不仅感测触摸或滑动动作的边界,而且还检测与所述触摸或滑动操作相关的持续时间和压力。
另外,本公开实施例还提供一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行以实现上述实施例所述的模型训练方法或自然语言处理方法。
需要说明的是,在本文中,诸如“第一”和“第二”等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。
以上所述仅是本公开的具体实施方式,使本领域技术人员能够理解或实现本公开。对这些实施例的多种修改对本领域的技术人员来说将是显而易见的,本文中所定义的一般原理可以在不脱离本公开的精神或范围的情况下,在其它实施例中实现。因此,本公开将不会被限制于本文所述的这些实施例,而是要符合与本文所公开的原理和新颖特点相一致的最宽的范围。

Claims (12)

  1. 一种模型训练方法,其中,所述方法包括:
    获取超链接所标记的第一目标信息;
    从所述第一目标信息的主页文章中获取所述第一目标信息对应的第一查询,从所述第一目标信息的至少一个引用文章中获取所述第一目标信息的至少一个第一上下文信息;
    根据所述第一目标信息、所述第一查询、所述至少一个第一上下文信息对机器学习模型进行预训练,得到预训练后的机器学习模型;
    从自然语言理解任务提供的样本文本中确定出第二目标信息,生成所述第二目标信息对应的第二查询,并采用所述样本文本、所述第二查询和所述第二目标信息对所述预训练后的机器学习模型进行训练。
  2. 根据权利要求1所述的方法,其中,所述机器学习模型包括编码器和抽取器,所述编码器的输出是所述抽取器的输入;
    根据所述第一目标信息、所述第一查询、所述至少一个第一上下文信息对机器学习模型进行预训练,得到预训练后的机器学习模型,包括:
    针对所述至少一个第一上下文信息中的每个第一上下文信息,将所述第一查询和所述第一上下文信息作为所述编码器的输入,使得所述编码器输出所述第一查询中每个文本单元的表示向量和所述第一上下文信息中每个文本单元的表示向量;
    通过所述抽取器计算所述第一上下文信息中每个文本段分别作为所述第一查询对应的答案的概率,所述每个文本段分别由所述第一上下文信息中连续的至少一个文本单元构成;
    根据所述概率、所述第一目标信息在所述第一上下文信息中的位置信息,对所述机器学习模型进行预训练,得到预训练后的机器学习模型。
  3. 根据权利要求2所述的方法,其中,通过所述抽取器计算所述第一上下文信息中每个文本段分别作为所述第一查询对应的答案的概率,包括:
    通过所述抽取器计算从第i个文本单元到第j个文本单元连续的至少一个文本单元构成的文本段作为所述第一查询对应的答案的概率,j大于或等于i,i大于或等于N+1,N+1是所述第一上下文信息中的第一个文本单元在所述编码器的输入中对应的索引,所述概率是根据所述第i个文本单元的表示向量和所述第j个文本单元的表示向量计算得到的。
  4. 根据权利要求2所述的方法,其中,根据所述概率、所述第一目标信息在所述第一上下文信息中的位置信息,对所述机器学习模型进行预训练之前,所述方法还包括:
    通过所述抽取器计算所述第一查询和所述第一上下文信息的相关度,所述相关度是根据所述第一查询和所述第一上下文信息整体的表示向量计算得到的;
    根据所述概率、所述第一目标信息在所述第一上下文信息中的位置信息,对所述机器学习模型进行预训练,包括:
    根据所述相关度、所述概率、所述第一目标信息在所述第一上下文信息中的位置信息, 对所述机器学习模型进行预训练。
  5. 根据权利要求1所述的方法,其中,从所述第一目标信息的主页文章中获取所述第一目标信息对应的第一查询,包括:
    将所述第一目标信息的主页文章中的前至少一个句子作为所述第一目标信息对应的第一查询。
  6. 根据权利要求1所述的方法,其中,从所述第一目标信息的至少一个引用文章中获取所述第一目标信息的至少一个第一上下文信息,包括:
    针对所述第一目标信息的至少一个引用文章中的每个引用文章,从所述引用文章中确定出包括所述第一目标信息的句子;
    将所述引用文章中所述句子前后各至少一个句子和所述句子一起作为一个第一上下文信息。
  7. 一种自然语言处理方法,其中,所述方法包括:
    获取目标文本;
    根据所述目标文本对应的自然语言理解任务,确定查询信息;
    将所述查询信息和所述目标文本作为机器学习模型的输入,使得所述机器学习模型输出所述目标文本中与所述查询信息对应的答案,所述机器学习模型是根据如权利要求1-6中任一项所述的方法训练得到的。
  8. 根据权利要求7所述的方法,其中,将所述查询信息和所述目标文本作为机器学习模型的输入,使得所述机器学习模型输出所述目标文本中与所述查询信息对应的答案,包括:
    将所述查询信息和所述目标文本作为机器学习模型的输入,使得所述机器学习模型判断所述查询信息和所述目标文本是否相关;
    若所述查询信息和所述目标文本相关,则通过所述机器学习模型输出所述目标文本中与所述查询信息对应的答案。
  9. 一种模型训练装置,其中,包括:
    第一获取模块,用于获取超链接所标记的第一目标信息;
    第二获取模块,用于从所述第一目标信息的主页文章中获取所述第一目标信息对应的第一查询,从所述第一目标信息的至少一个引用文章中获取所述第一目标信息的至少一个第一上下文信息;
    预训练模块,用于根据所述第一目标信息、所述第一查询、所述至少一个第一上下文信息对机器学习模型进行预训练,得到预训练后的机器学习模型;
    微调模块,用于从自然语言理解任务提供的样本文本中确定出第二目标信息,生成所述第二目标信息对应的第二查询,并采用所述样本文本、所述第二查询和所述第二目标信息对所述预训练后的机器学习模型进行训练。
  10. 一种自然语言处理装置,其中,包括:
    获取模块,用于获取目标文本;
    确定模块,用于根据所述目标文本对应的自然语言理解任务,确定查询信息;
    输入模块,用于将所述查询信息和所述目标文本作为机器学习模型的输入,使得所述机器学习模型输出所述目标文本中与所述查询信息对应的答案,所述机器学习模型是根据如权利要求1-6中任一项所述的方法训练得到的。
  11. 一种电子设备,其中,包括:
    存储器;
    处理器;以及
    计算机程序;
    其中,所述计算机程序存储在所述存储器中,并被配置为由所述处理器执行以实现如权利要求1-8中任一项所述的方法。
  12. 一种计算机可读存储介质,其上存储有计算机程序,其中,所述计算机程序被处理器执行时实现如权利要求1-8中任一项所述的方法。
PCT/CN2023/121267 2022-10-04 2023-09-25 自然语言处理、模型训练方法、装置、设备及存储介质 WO2024074100A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211218353.XA CN115879440A (zh) 2022-10-04 2022-10-04 自然语言处理、模型训练方法、装置、设备及存储介质
CN202211218353.X 2022-10-04

Publications (1)

Publication Number Publication Date
WO2024074100A1 true WO2024074100A1 (zh) 2024-04-11

Family

ID=85770278

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/121267 WO2024074100A1 (zh) 2022-10-04 2023-09-25 自然语言处理、模型训练方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN115879440A (zh)
WO (1) WO2024074100A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115879440A (zh) * 2022-10-04 2023-03-31 阿里巴巴(中国)有限公司 自然语言处理、模型训练方法、装置、设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111581350A (zh) * 2020-04-30 2020-08-25 识因智能科技(北京)有限公司 一种基于预训练语言模型的多任务学习阅读理解方法
WO2020174826A1 (ja) * 2019-02-25 2020-09-03 日本電信電話株式会社 回答生成装置、回答学習装置、回答生成方法、及び回答生成プログラム
CN112507706A (zh) * 2020-12-21 2021-03-16 北京百度网讯科技有限公司 知识预训练模型的训练方法、装置和电子设备
CN114565104A (zh) * 2022-03-01 2022-05-31 腾讯科技(深圳)有限公司 语言模型的预训练方法、结果推荐方法及相关装置
CN115879440A (zh) * 2022-10-04 2023-03-31 阿里巴巴(中国)有限公司 自然语言处理、模型训练方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN115879440A (zh) 2023-03-31

Similar Documents

Publication Publication Date Title
US20240070392A1 (en) Computing numeric representations of words in a high-dimensional space
CN111177393B (zh) 一种知识图谱的构建方法、装置、电子设备及存储介质
US20240078386A1 (en) Methods and systems for language-agnostic machine learning in natural language processing using feature extraction
US20190043379A1 (en) Neural models for key phrase detection and question generation
WO2024074100A1 (zh) 自然语言处理、模型训练方法、装置、设备及存储介质
CN107330120A (zh) 询问应答方法、询问应答装置及计算机可读存储介质
CN106156310A (zh) 一种图片处理装置和方法
JP2015060074A (ja) 音声学習支援装置及び音声学習支援プログラム
KR20110083544A (ko) 성장형 개인 단어 데이터베이스 시스템을 이용한 언어 학습 장치 및 방법
CN106528759A (zh) 智能问答系统的信息处理方法及装置
WO2020073532A1 (zh) 客服机器人对话状态识别方法及装置、电子设备、计算机可读存储介质
CN111897915B (zh) 问答设备和答复信息确定方法
US11669679B2 (en) Text sequence generating method and apparatus, device and medium
JP2016061855A (ja) 音声学習装置および制御プログラム
CN108255962A (zh) 知识点关联方法、装置、存储介质和电子设备
US20230289514A1 (en) Speech recognition text processing method and apparatus, device, storage medium, and program product
CN113140138A (zh) 互动教学方法、装置、存储介质及电子设备
US20190303393A1 (en) Search method and electronic device using the method
CN112036174B (zh) 一种标点标注方法及装置
US11556708B2 (en) Method and apparatus for recommending word
CN109272983A (zh) 用于亲子教育的双语切换装置
KR20210050484A (ko) 정보 처리 방법, 장치 및 저장 매체
CN111241833A (zh) 一种文本数据的分词方法、装置及电子设备
CN117371448A (zh) 实体识别及其模型训练方法、装置、电子设备与存储介质
KR102450973B1 (ko) 단어 추천 방법 및 이를 위한 장치

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23874292

Country of ref document: EP

Kind code of ref document: A1