CN115688753A - Knowledge injection method and interaction system of Chinese pre-training language model - Google Patents

Knowledge injection method and interaction system of Chinese pre-training language model

Info

Publication number
CN115688753A
CN115688753A (Application CN202211214379.7A)
Authority
CN
China
Prior art keywords
training
knowledge
sentence
model
chinese
Prior art date
Legal status
Pending
Application number
CN202211214379.7A
Other languages
Chinese (zh)
Inventor
汪诚愚
张涛林
黄俊
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202211214379.7A priority Critical patent/CN115688753A/en
Publication of CN115688753A publication Critical patent/CN115688753A/en
Pending legal-status Critical Current

Abstract

A knowledge injection method and a corresponding interactive system for a Chinese pre-training language model are disclosed. The method comprises the following steps: labeling key semantic components in a pre-training sentence with special identifiers to construct a reconstructed pre-training sentence; masking the reconstructed pre-training sentence; and inputting the masked pre-training sentence into the PLM, and adjusting parameters of the neural network model based on a first loss value output by the PLM for the masked words. The invention realizes knowledge injection for the Chinese pre-training language model through internal linguistic knowledge labeling and external knowledge graph injection, so that the model can learn semantic and factual knowledge without changing its architecture. The pre-trained model greatly reduces the required parameter scale, can complete various downstream tasks without the support of external data, and is suitable for providing various real-time services for users in a cloud environment.

Description

Knowledge injection method and interactive system of Chinese pre-training language model
Technical Field
The disclosure relates to the field of deep learning, and in particular to a knowledge injection method for a Chinese pre-training language model and a Chinese interactive system equipped with a knowledge-injected pre-training model obtained by the method.
Background
As a foundational Natural Language Processing (NLP) model, the pre-trained language model (PLM, including BERT, RoBERTa, XLNet, etc.) has achieved excellent performance in downstream Natural Language Understanding (NLU) tasks and has strong versatility. Mainstream pre-trained language models are pre-trained on English, where the basic building blocks of sentences, i.e., words, are masked and predicted. Since a single English word is usually a complete semantic unit, a model that correctly predicts the masked word learns the semantics of that word in the sentence. However, for a Chinese pre-training language model, if the basic constituent units of a sentence, i.e., single Chinese characters, are likewise masked and predicted during pre-training, the model can make its predictions by locally examining the characters on either side of the masked character; that is, it only learns the composition of words and not the semantics of those words in the pre-training sentence. This leads to less than ideal performance of Chinese pre-training language models on downstream tasks.
Therefore, a pre-training model that can better learn Chinese knowledge is needed.
Disclosure of Invention
The technical problem to be solved by the present disclosure is to provide a knowledge injection method for a Chinese pre-training language model, which enables the pre-training language model to better learn the linguistic knowledge contained in a sentence by performing linguistic analysis on the pre-training sentence and marking its key semantic components. Further, positive and negative example triples may be constructed from an external knowledge graph for the entities contained in the pre-training sentences, so that the pre-training language model can better learn the factual knowledge contained in the knowledge graph through contrastive learning. The accuracy of various downstream tasks executed based on the Chinese pre-training language model is thus improved while its parameter scale is greatly reduced.
According to a first aspect of the present disclosure, there is provided a knowledge injection method for a Chinese pre-training language model, comprising: labeling key semantic components in a pre-training sentence with special identifiers to construct a reconstructed pre-training sentence; masking the reconstructed pre-training sentence; and inputting the masked pre-training sentence into the pre-training language model (PLM) and adjusting parameters of a neural network model in the PLM based on a first loss value output by the PLM for the masked words.
Optionally, the method further comprises: recalling positive and negative example triples corresponding to an entity contained in the pre-training sentence from a knowledge graph; inputting the words corresponding to the entity in the pre-training sentence, the positive example triples, and the negative example triples into an encoder of the PLM; and constructing a second loss value from the hidden representation of the entity's words, the representation of the positive example triple, and the representations of the negative example triples output by the encoder, to adjust parameters of the neural network model in the PLM based on contrastive learning.
Optionally, the positive example triple is a single-hop triple containing the entity, and the negative example triple is a multi-hop triple located multiple hops away from the entity in the knowledge graph.
Optionally, the number of hops from the entity of the multi-hop triple is not greater than a predetermined threshold.
Optionally, the second loss value characterizes the difference between a positive example similarity and a negative example similarity, where the positive example similarity characterizes the similarity between the hidden representation of the entity's words and the representation of the positive example triple, and the negative example similarity characterizes the similarity between the hidden representation of the entity's words and the representation of a negative example triple.
Optionally, masking the reconstructed pre-training sentence comprises: masking at least part of the key semantic components.
Optionally, labeling key semantic components in the pre-training sentence with special identifiers to construct the reconstructed pre-training sentence comprises: adding semantic dependency marks before and after the dependency grammar relation words in the pre-training sentence; and adding dependency syntax marks before and after the dependency syntax relation words in the pre-training sentence.
Optionally, masking the reconstructed pre-training sentence comprises: masking words or special identifiers in the reconstructed pre-training sentence according to a predetermined proportion, where within the predetermined proportion a first share is assigned to random masking, a second share to the dependency grammar relation words, and a third share to the dependency syntax relation words.
According to a second aspect of the present disclosure, there is provided a knowledge injection-based interactive system, comprising: a user input receiving unit for acquiring a domain-related Chinese query input by a user; a question matching unit comprising a knowledge-injected Chinese pre-training model obtained as described in the first aspect using a Chinese corpus and a Chinese knowledge graph, the model identifying relevant entities and semantics in the Chinese query and generating feedback therefrom; and a feedback providing unit for providing the generated feedback to the user. Multiple Chinese pre-training models with different parameter scales may be trained for generating the feedback at different accuracies and speeds in different interaction scenarios.
According to a third aspect of the present disclosure, there is provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as described in the first aspect above.
According to a fourth aspect of the present disclosure, there is provided a non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform the method as described in the first aspect above.
Therefore, based on the design of the input data and the pre-training tasks, knowledge injection for the Chinese pre-training language model is realized through internal linguistic knowledge labeling and external knowledge graph injection, so that the model learns the linguistic knowledge of the pre-training sentences without changing its architecture, and learns the factual knowledge of the entities contained in the pre-training sentences from the external knowledge graph. The pre-trained model can complete various downstream tasks with fewer parameters and without external data support, and is therefore suitable for providing various real-time services to users in a cloud environment.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the disclosure.
FIG. 1 shows an example of masking and predicting a single Chinese character and a single vocabulary in a Chinese pre-trained language model.
FIG. 2 shows a schematic flow diagram of a knowledge injection method for a Chinese pre-trained language model, according to one embodiment of the invention.
Fig. 3 shows an example of linguistic tagging and reconstruction of a pre-training sentence according to the invention.
FIG. 4 illustrates an example of a knowledge subgraph included in a knowledge-graph.
Fig. 5 shows a pre-training schematic of CKBERT according to one embodiment of the invention.
Fig. 6 shows an example of the PLM trained by the present invention for actual interaction.
FIG. 7 is a block diagram of a computing device that may be used to implement the knowledge injection method for the Chinese pre-training language model described above according to an embodiment of the invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
A pre-trained language model (PLM) is the generic term for a class of natural language processing models used to learn low-dimensional, dense, real-valued vector representations of text. Early pre-trained language models aimed at learning word embedding representations with shallow neural networks, and these word embeddings were then used in various downstream natural language processing tasks; more recent pre-trained language models learn context-dependent word embeddings, and the learned models are fine-tuned for downstream tasks. Pre-trained language models achieve excellent performance in downstream Natural Language Understanding (NLU) tasks and have strong versatility.
However, mainstream pre-trained language models are pre-trained on English, where the constituent elements of a sentence, i.e., single words, are masked and the model is required to predict the masked words. A single English word is usually a complete semantic unit; for example, the English word "Harbin" corresponds to the complete semantic unit "哈尔滨". The model therefore needs to use a wider range of information in the sentence to correctly predict the word; in other words, masking and predicting words allows an English pre-trained language model to naturally learn the semantics of the words in a sentence.
However, for a Chinese pre-training language model, if the basic constituent units of a sentence, i.e., single Chinese characters, are likewise masked and predicted during pre-training, the model can make its predictions by locally examining the characters on either side of the masked character; that is, it only learns the composition of words and not the semantics of those words in the pre-training sentence. This leads to less than ideal performance of Chinese pre-trained language models on downstream tasks.
FIG. 1 shows an example of masking and predicting a single Chinese character and a single vocabulary in a Chinese pre-trained language model.
As shown, the pre-training sentence is "Harbin is the capital of Heilongjiang Province and a famous international city of ice and snow culture". The left side of FIG. 1 shows a masking method, e.g., that of BERT, during pre-training. BERT masks only single characters, so it is trained to determine the character "尔" from the local co-occurrence of "哈" and "滨"; the model (shown as a Transformer) does not actually learn the knowledge related to "Harbin", i.e., it only learns the word "Harbin" but does not know what "Harbin" means. The right side shows an example where the pre-training data masks whole words. Since whole words are masked, for example the words "Harbin" and "ice and snow", the model (shown as a Transformer) must learn the representations of words and entities in order to model the relationship between "Harbin" and "Heilongjiang", learn that "Harbin" is the capital of "Heilongjiang" and that "Harbin" is a city of ice and snow, and only then can it accurately predict the masked words.
However, masking and predicting words as shown on the right side of FIG. 1 requires the model to learn the semantics in the sentence entirely by itself, and the masked words are often not the important words in the sentence, resulting in inefficient learning and a huge parameter size. The invention therefore provides an improved knowledge injection method for a Chinese pre-training language model, which reconstructs the pre-training samples by identifying and marking key semantic components in the pre-training sentences, so that the model can rapidly learn the linguistic knowledge in the sentences, thereby improving the model's performance on downstream natural language understanding tasks while greatly reducing its parameter scale.
FIG. 2 shows a schematic flow diagram of a knowledge injection method for a Chinese pre-trained language model, according to one embodiment of the invention.
In step S210, the key semantic components in the pre-training sentence are labeled with special identifiers to construct a reconstructed pre-training sentence. Here, a key semantic component of a sentence refers to a semantic component that is "key" from a linguistic point of view after linguistic analysis of the sentence. A key semantic component is usually a complete word, since linguistic meaning is carried at the word level.
In particular, the dependency grammar relation words and the dependency syntax relation words in the pre-training sentence may be identified. Both "dependency grammar relation word" and "dependency syntax relation word" correspond to concepts in linguistics.
Syntactic parsing is one of the key techniques in natural language processing; it is the process of analyzing an input text sentence to obtain its syntactic structure. Dependency parsing analyzes a sentence into a dependency syntax tree that describes the dependency relationships between words, i.e., it indicates the syntactic collocation relationships between words, which are semantically meaningful.
In dependency syntax theory, "dependency" refers to the relationship of dominance and subordination between words, which is not symmetric and has a direction. Specifically, the dominating component is called the agent (AGT). Dependency theory considers the verb of the predicate to be the center of a sentence; the other components are directly or indirectly linked to this verb, which is the object pointed to by the agent AGT. In natural language processing, a framework that describes language structure in terms of dependency relationships between words is called a dependency grammar, and the term "dependency grammar relation word" is therefore used herein to refer to the word pointed to by the agent in a sentence, typically the predicate. After the dependency grammar relation words are determined, the subject-predicate relations, attributive relations (attributive-head relations), coordinate relations, and the like in the sentence can be processed according to dependency syntax, and the "dependency syntax relation words" can be found.
After the "dependency grammar relation words" and "dependency syntax relation words" are found, the key components in the sentence may be tagged with special identifiers. Labeling the key semantic components in the pre-training sentence with special identifiers to construct a reconstructed pre-training sentence may then include: adding semantic dependency marks before and after the dependency grammar relation words (typically the predicates) in the pre-training sentence (e.g., adding [SDP] before the predicate word and [/SDP] after it); and adding dependency syntax marks before and after the dependency syntax relation words (e.g., the head words modified by attributives) in the pre-training sentence (e.g., adding [DEP] before the head word and [/DEP] after it).
Fig. 3 shows an example of linguistic tagging and reconstruction of a pre-training sentence according to the invention. The original pre-training sentence (given here in English gloss) is "Everyone knows that only with more practice can spoken pronunciation truly improve". Any existing or future tool may be used to perform dependency parsing on this sentence to find the "dependency grammar relation word" and the "dependency syntax relation word" it contains. In the sentence shown in FIG. 3, the object pointed to by the actor "everyone" is the predicate "knows", so "knows" is taken as the dependency grammar relation word and the identifiers [SDP] and [/SDP] are added before and after it, respectively. The head word modified by the attributive "truly" is the verb "improve", so "improve" is taken as the dependency syntax relation word and the identifiers [DEP] and [/DEP] are added before and after it, respectively. Further, a sentence start identifier [CLS] may be added before the beginning of the sentence, and a sentence end identifier [SEP] after its end, as shown. The original pre-training sentence is thus reconstructed as "[CLS] Everyone [SDP] knows [/SDP] that only with more practice can one truly [DEP] improve [/DEP] spoken pronunciation [SEP]".
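By way of illustration, the following Python sketch shows how such a reconstructed pre-training sentence may be assembled once the key semantic components have been located. The function name, the span format, and the simplified English tokens are assumptions for illustration; the dependency/semantic parser that produces the spans is not shown.

```python
def reconstruct_sentence(tokens, sdp_spans, dep_spans):
    """Wrap key semantic components with special identifiers.

    tokens:    word-level tokens of the pre-training sentence
    sdp_spans: (start, end) index pairs (end exclusive) of dependency grammar
               relation words (e.g. predicates), wrapped with [SDP] ... [/SDP]
    dep_spans: (start, end) index pairs of dependency syntax relation words
               (e.g. head words of attributives), wrapped with [DEP] ... [/DEP]
    """
    opening, closing = {}, {}
    for start, end in sdp_spans:
        opening.setdefault(start, []).append("[SDP]")
        closing.setdefault(end - 1, []).append("[/SDP]")
    for start, end in dep_spans:
        opening.setdefault(start, []).append("[DEP]")
        closing.setdefault(end - 1, []).append("[/DEP]")

    out = ["[CLS]"]                       # sentence start identifier
    for i, tok in enumerate(tokens):
        out.extend(opening.get(i, []))    # opening marks before the key word
        out.append(tok)
        out.extend(closing.get(i, []))    # closing marks after the key word
    out.append("[SEP]")                   # sentence end identifier
    return out

# Simplified English gloss of the FIG. 3 example
tokens = ["Everyone", "knows", "more", "practice", "truly", "improves",
          "spoken", "pronunciation"]
print(reconstruct_sentence(tokens, sdp_spans=[(1, 2)], dep_spans=[(5, 6)]))
# ['[CLS]', 'Everyone', '[SDP]', 'knows', '[/SDP]', 'more', 'practice',
#  'truly', '[DEP]', 'improves', '[/DEP]', 'spoken', 'pronunciation', '[SEP]']
```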
After reconstructing the pre-training sentence as above, the reconstructed pre-training sentence may be masked in step S220. In a preferred embodiment, masking the reconstructed pre-training sentence may include masking at least part of the key semantic components, for example, masking one or both of the key semantic words "knows" and "improve" in the reconstructed pre-training sentence shown in FIG. 3.
To balance prediction accuracy and learning rate, a predetermined proportion of the words or special identifiers in the pre-training sample is typically masked. In one embodiment of the invention, the predetermined proportion may be 15%. Within this 15%, one part may be reserved for randomly masked single Chinese characters and another part for the key semantic words. The model thus learns the marked linguistic knowledge while still learning other knowledge in the pre-training sentence through random masking. In one embodiment, masking the reconstructed pre-training sentence may include: masking words or special identifiers in the reconstructed pre-training sentence according to a predetermined proportion, where within the predetermined proportion (e.g., 15%) a first share (e.g., 40% of the 15%) is assigned to random masking, a second share (e.g., 30% of the 15%) to the dependency grammar relation words, and a third share (e.g., 30% of the 15%) to the dependency syntax relation words. In a preferred embodiment, the special identifiers ([DEP], [/DEP], [SDP], and [/SDP]) may also be treated as ordinary tokens to be masked, so that the model has to be aware of the boundaries of the predicted words rather than simply filling in the masks from the context.
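The following sketch illustrates one possible way to implement such a mask-selection policy; the function name, the exact share values, and the treatment of span boundaries are assumptions for illustration rather than the only possible implementation.

```python
import random

MASK = "[MASK]"
SPECIALS = {"[CLS]", "[SEP]"}

def select_masks(tokens, mask_ratio=0.15, random_share=0.4,
                 sdp_share=0.3, dep_share=0.3, seed=0):
    """Mask roughly mask_ratio of a reconstructed sentence, splitting the
    budget between random tokens, [SDP]...[/SDP] spans and [DEP]...[/DEP]
    spans (the linguistic tags themselves are also eligible for masking)."""
    rng = random.Random(seed)
    budget = max(1, round(mask_ratio * len(tokens)))

    def span_positions(open_tag, close_tag):
        pos, inside = [], False
        for i, tok in enumerate(tokens):
            if tok == open_tag:
                inside = True
            if inside:
                pos.append(i)            # tags are treated as ordinary tokens
            if tok == close_tag:
                inside = False
        return pos

    sdp_pos = span_positions("[SDP]", "[/SDP]")
    dep_pos = span_positions("[DEP]", "[/DEP]")
    plain_pos = [i for i, t in enumerate(tokens)
                 if t not in SPECIALS and i not in sdp_pos and i not in dep_pos]

    chosen = set()
    for pool, share in ((plain_pos, random_share),
                        (sdp_pos, sdp_share), (dep_pos, dep_share)):
        k = min(len(pool), round(budget * share))
        chosen.update(rng.sample(pool, k))

    return [MASK if i in chosen else tok for i, tok in enumerate(tokens)]
```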
Subsequently, in step S230, the masked pre-training sentence may be input into the pre-training language model PLM, and parameters of a neural network model in the PLM may be adjusted based on a first loss value output by the PLM for the masked word.
Here, the first loss value corresponds to a first loss function, which in the present invention is the loss function of the MLM task. Masked Language Modeling (MLM) is a self-supervised task in which characters or words in a sentence are masked and the masked characters or words are predicted from the rest of the sentence. Unlike conventional MLM, however, the pre-training task of the pre-training model of the present invention is LMLM, i.e., linguistics-aware MLM: through the linguistic reconstruction of the pre-training sentence, the masking of key semantic words and random words, and the prediction of the masked content described above, the model learns the linguistic knowledge contained in the sentence while performing the MLM task, which raises the semantic understanding level of the model and speeds up its convergence.
In an MLM-based training scheme, the training samples of the PLM are masked texts, i.e., sentences in which some words are replaced with a special token (e.g., [MASK]). For example, the pre-training sentence reconstructed by the linguistic processing of the present invention is "[CLS] Everyone [SDP] knows [/SDP] that only with more practice can one truly [DEP] improve [/DEP] spoken pronunciation [SEP]", and an example of the corresponding masked text is "[CLS] Everyone [SDP] [MASK] [/SDP] that only with [MASK] practice can one truly [DEP] [MASK] [/DEP] spoken pronunciation [SEP]" (i.e., the key semantic words "knows" and "improve" are masked, and the word "more" is randomly masked). The masked text is input into the PLM, and the PLM must predict that the masked contents are "knows", "more", and "improve", respectively. Such training samples may be referred to as masked training samples. The masked content is predicted from the unmasked context of the sentence (and, in the present invention, from the special identifier information), so by predicting the masked content the PLM learns to capture textual context information and the linguistic meaning of the masked key words. A PLM trained with the LMLM scheme therefore has the ability to understand the deep semantics of natural language and can be used for a series of downstream NLP tasks. Furthermore, by reconstructing sentences and selecting semantically key words, the model learns linguistic knowledge efficiently, so it can carry the same amount of information with a smaller parameter scale; the pre-trained model is smaller and better suited for subsequent deployment in practical application scenarios.
The pre-trained model represents the ideal values of all weights and biases learned (determined) from the training samples. These determined weights and biases enable high-accuracy inference on the input features during the deployment phase of the neural network, for example, correct prediction of masked Chinese characters based on context.
In self-supervised learning, a machine learning algorithm learns parameters by examining multiple samples and attempting to find a model that minimizes the loss, a process known as empirical risk minimization.
The loss is the penalty for a poor prediction. That is, the loss is a number indicating how inaccurate the model's prediction is on a single sample. If the model's prediction is perfectly accurate, the loss is zero; otherwise the loss is larger. The goal of training the model is to find a set of weights and biases whose average loss over all samples is small.
In training and tuning a neural network, a loss function must be defined in order to quantify how well the current weights and biases fit all of the network inputs. The goal of training the network can thus be translated into minimizing the loss function with respect to the weights and biases. Typically, a gradient descent algorithm (in multi-layer neural network training, the back-propagation algorithm) is used to carry out this minimization.
The back-propagation algorithm involves a repeated iteration of forward propagation and backward propagation. In forward propagation, neurons in adjacent layers are connected through weight matrices, so that the stimulus (feature values) is transmitted from one layer to the next through the activation function of each layer. In backward propagation, the error of the current layer is derived backwards from the error of the next layer. By iterating forward and backward propagation, the weights and biases are continuously adjusted so that the loss function gradually approaches its minimum, completing the training of the neural network. In the present invention, the loss for the LMLM task may be implemented, for example, by the first loss function L_LMLM described in detail below.
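As an illustration, a masked-language-model loss of this kind may be computed as a cross-entropy over the masked positions only. The following PyTorch sketch uses the common ignore_index convention for unmasked positions, which is an assumption rather than the patent's exact implementation.

```python
import torch
import torch.nn.functional as F

def lmlm_loss(logits, labels, ignore_index=-100):
    """logits: (batch, seq_len, vocab_size) output of the PLM prediction head.
    labels: (batch, seq_len) holding the original token ids at masked
            positions and ignore_index everywhere else."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (batch*seq_len, vocab_size)
        labels.reshape(-1),
        ignore_index=ignore_index,             # unmasked positions contribute nothing
    )
```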
The mainstream pre-training language model is based on public documents, general language knowledge is learned from unstructured documents, and the learning of a large amount of knowledge information, particularly the learning of structured Knowledge Graph (KG) information, is omitted. Here, unstructured and structured are intended to indicate the way in which linguistic knowledge is presented. In natural language processing, the presentation of language knowledge generally includes the following three forms: unstructured text, semi-structured tables, etc., and structured triples. In particular, triple knowledge is stored in artificially constructed large-scale knowledge-graph data, consisting of < head entity, relationship, tail entity >. Head-tail entities represent a particular thing that exists in the real world (e.g., hangzhou), and relationships express some semantic association between entities (e.g., place of birth).
In a PLM, a two-stage strategy (pre-training and fine-tuning) carries the knowledge learned during pre-training over to downstream tasks. Although a PLM stores a great deal of internal knowledge, it has difficulty understanding external background knowledge such as factual and common-sense knowledge, because the PLM learns general linguistic knowledge from unstructured documents and lacks systematic learning of structured knowledge. This lack of knowledge produces counter-factual content (for example, a GPT model may output an obviously wrong conclusion such as "the sun has two eyes"), and also greatly reduces the model's few-shot learning ability, its ability to transfer domain knowledge, its ability to generalize common knowledge, and so on.
Thus, in one embodiment, when an entity is included in a pre-trained sentence, the present invention may also improve the performance of chinese PLM by injecting external knowledge triples associated with the included entity in the sentence. PLM injected with external knowledge may be referred to as a knowledge-enhanced pre-trained model (KEPLM). In the prior art, the KEPLM can further improve the performance of downstream tasks by semantic understanding of key entity information in a text on the basis of pre-training model modeling. However, the biggest problem of the above method is that during downstream task training and reasoning, a large-scale available knowledge graph still needs to be constructed in advance for the pre-training model with enhanced knowledge, and meanwhile, the burden of computing resources is increased by additional network parameters, so that the method is complicated in the process of practical application and the effect is not stable enough.
Therefore, the invention aims for a knowledge-enhanced model that does not use knowledge graph information in the fine-tuning and inference stages and still achieves good performance on downstream tasks. To this end, the invention provides a method that combines knowledge encoding, knowledge injection, and the pre-training process on a shared encoder; the model is modified only at the data input level and the pre-training task level, not at the architecture level, so no additional parameters are needed, and the model performs well in downstream tasks without relying on an external knowledge graph (i.e., the external knowledge is injected into the model during the pre-training stage).
Specifically, the invention may use a knowledge graph to construct positive and negative example triples associated with the entities, and inject the factual knowledge implied in the external knowledge graph into the model based on contrastive learning.
A knowledge graph is a knowledge base in which data is integrated through the data model or topology of a graph structure. Knowledge graphs are commonly used to store entities that are related to each other. By processing and integrating the data of complex and intricate documents, a knowledge graph condenses them into simple, clear triples of <head entity, relation, tail entity>, and finally aggregates a large amount of knowledge, enabling fast retrieval of and reasoning over that knowledge. FIG. 4 illustrates an example of a knowledge subgraph included in a knowledge graph. A knowledge graph is a knowledge base formed by a series of triples, and a knowledge subgraph can be constructed by extracting a number of connected triples, as shown in FIG. 4. FIG. 4 shows an example of a knowledge subgraph centered around the entity "Margaret Mitchell". In FIG. 4, the circles represent the nodes of the knowledge subgraph and correspond to different entities. The lines with arrows represent the edges of the knowledge subgraph; each arrow points from a head entity to a tail entity, and the text on the edge indicates the relation between them.
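For illustration, a knowledge subgraph of this kind can be represented as a plain list of <head, relation, tail> triples plus an adjacency index. In the sketch below the entity names follow the FIG. 4 example, and several of the relation names are assumed for illustration.

```python
from collections import defaultdict

# Triples of the FIG. 4 subgraph; some relation names are assumptions.
triples = [
    ("Margaret Mitchell", "place of birth", "Atlanta"),
    ("Margaret Mitchell", "occupation", "novelist"),
    ("Margaret Mitchell", "award received", "Pulitzer Prize"),
    ("Margaret Mitchell", "representative work", "Gone with the Wind"),
    ("Gone with the Wind", "female protagonist", "Scarlett O'Hara"),
    ("Gone with the Wind", "era depicted", "American Civil War"),
    ("Scarlett O'Hara", "famous portrayal", "Vivien Leigh"),
    ("Atlanta", "famous person", "Martin Luther King"),
]

# head entity -> list of (relation, tail entity): the outgoing edges of each node
neighbors = defaultdict(list)
for head, relation, tail in triples:
    neighbors[head].append((relation, tail))

print(neighbors["Margaret Mitchell"])   # the single-hop edges of the target entity
```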
Consider, for example, the pre-training sentence "Margaret Mitchell is a bright pearl in the literary history of the twentieth century." Positive and negative example triples may then be constructed from the knowledge subgraph shown in FIG. 4 for the entity "Margaret Mitchell" identified in the sentence, for external knowledge injection.
To this end, in one embodiment, the knowledge injection method of the Chinese pre-training language model of the present invention may further include: recalling positive and negative example triples corresponding to an entity contained in the pre-training sentence from the knowledge graph; inputting the words corresponding to the entity in the pre-training sentence, the positive example triples, and the negative example triples into an encoder of the PLM; and constructing a second loss value from the hidden representation of the entity's words (i.e., the embedding vectors processed by the model), the representation of the positive example triple, and the representations of the negative example triples output by the encoder, to adjust parameters of the neural network model in the PLM based on contrastive learning.
Here, the positive example triple is a single-hop (one-hop) triple containing the entity, and the negative example triples are multi-hop triples located multiple hops away from the entity in the knowledge graph. Taking FIG. 4 as an example, when the entity "Margaret Mitchell" is contained in the pre-training sentence, a single-hop triple is a triple composed of "Margaret Mitchell", a node reached from the node "Margaret Mitchell" in a single hop (e.g., "Atlanta", "novelist", "Doctor of Literature", "Pulitzer Prize", and "Gone with the Wind" in the figure), and the relation corresponding to the edge followed by that hop.
A multi-hop triple involves a node reached from "Margaret Mitchell" over at least two edges. In the example of FIG. 4, <Gone with the Wind, female protagonist, Scarlett O'Hara>, <Gone with the Wind, era depicted, American Civil War>, <Scarlett O'Hara, famous portrayal, Vivien Leigh>, and <Atlanta, famous person, Martin Luther King> can all be regarded as triples obtained over multiple hops starting from "Margaret Mitchell". In different embodiments, the number of hops may be limited so as to construct negative example triples of different contrastive-learning "difficulty". In one embodiment, the number of hops of a multi-hop triple from the entity may be set to be no greater than a predetermined threshold δ. If the hop threshold δ is too large, the model can easily distinguish positive from negative triples because of the large semantic gap. For effective contrastive learning, a good negative example triple should be "difficult", so in the example of FIG. 4, δ may be set to 3. Thus, <Gone with the Wind, female protagonist, Scarlett O'Hara>, <Gone with the Wind, era depicted, American Civil War>, <Scarlett O'Hara, famous portrayal, Vivien Leigh>, and <Atlanta, famous person, Martin Luther King> in the knowledge subgraph can all be regarded as negative example triples.
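A minimal sketch of this sampling rule is given below: the single-hop triples of the target entity are positive candidates, and triples whose end node lies more than one hop but at most δ hops away are negative candidates. The function names and the breadth-first-search formulation are assumptions for illustration; `neighbors` is the head-to-edges mapping built in the previous sketch.

```python
import random
from collections import deque

def hop_distances(neighbors, source):
    """Breadth-first search over directed edges; returns entity -> hop count."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for _, tail in neighbors.get(node, []):
            if tail not in dist:
                dist[tail] = dist[node] + 1
                queue.append(tail)
    return dist

def sample_triples(neighbors, target, delta=3, num_neg=4, seed=0):
    rng = random.Random(seed)
    dist = hop_distances(neighbors, target)

    # Positive candidates: single-hop triples whose head is the target entity.
    positives = [(target, r, t) for r, t in neighbors.get(target, [])]

    # Negative candidates: triples whose end node is 1 < Hop(target, node) <= delta.
    negatives = [(h, r, t)
                 for h, edges in neighbors.items()
                 for r, t in edges
                 if 1 < dist.get(t, float("inf")) <= delta]

    positive = rng.choice(positives)
    negatives = rng.sample(negatives, min(num_neg, len(negatives)))
    return positive, negatives

# Example usage with the FIG. 4 subgraph from the previous sketch:
# pos, negs = sample_triples(neighbors, "Margaret Mitchell", delta=3)
```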
After the positive and negative example triples are obtained, each triple may be converted into a natural sentence and input into the model, and a second loss function may be used to carry out the contrastive learning. A triple may be directly concatenated into a sentence, or rewritten into a sentence that reads more naturally. For example, the triple <Gone with the Wind, female protagonist, Scarlett O'Hara> may be directly concatenated into "Gone with the Wind female protagonist Scarlett O'Hara", or simply rewritten into "The female protagonist of Gone with the Wind is Scarlett O'Hara". In one embodiment, the second loss value characterizes the difference between a positive example similarity and a negative example similarity, where the positive example similarity characterizes the similarity between the hidden representation of the entity's words and the representation of the positive example triple, and the negative example similarity characterizes the similarity between the hidden representation of the entity's words and the representation of a negative example triple. The second loss function may be implemented, for example, as the contrastive loss L_CMRM described in detail below.
Similar to the internal linguistic knowledge injection, injecting factual knowledge from the external knowledge graph lets the parameters of the model learn knowledge more effectively, so the model parameter scale required to reach the same prediction performance can be further reduced.
In one embodiment, the pre-training language model may be optimized jointly according to the linguistics-aware MLM task and the contrastive learning task. The total loss function may be the sum, or a weighted sum, of the first loss function L_LMLM and the second loss function L_CMRM.
Therefore, with the method of the invention, injecting internal linguistic knowledge and external knowledge graph knowledge allows the resulting Chinese pre-training language model to learn more semantic and factual knowledge, ensuring the performance of subsequent downstream tasks.
It should further be noted that the reconstruction of a sentence targets the key semantic words in the sentence, which, as shown above in FIG. 3, are generally verbs in the main or subordinate clauses, while the knowledge-graph-based processing targets the entities contained in the sentence. In other words, the injection of internal linguistic knowledge and the injection of external knowledge graph knowledge are generally directed at different words in the pre-training sentences, so knowledge can be learned maximally from a limited set of pre-training sentences, further reducing the parameters required by the model and improving the performance of subsequent downstream tasks.
A specific implementation of the knowledge injection scheme of the present invention will be described below in conjunction with fig. 5. Fig. 5 shows a pre-training schematic of CKBERT according to one embodiment of the invention. CKBERT (Chinese knowledge enhanced BERT) may be considered as an embodiment of the knowledge injection scheme of the Chinese pre-training language model according to the present invention.
The model implemented in FIG. 5 uses an existing Chinese PLM, such as the BERT model, and changes are made only at the data input level and the pre-training task level, without changing the model architecture, which makes it easy to scale the model to different parameter sizes. At the data input level, two kinds of knowledge are processed: external knowledge graph triples and sentence-level internal linguistic knowledge. For the linguistic knowledge, pre-training sentences may be processed using existing or future tools for semantic role labeling, dependency parsing, and the like, and the important components in the recognition results are labeled according to rules. For the external triple knowledge, positive and negative triple samples are constructed for each entity appearing in a sentence: the positive samples are sampled from the single-hop entities in the graph, and the negative samples from the multi-hop entities, but the negative sampling may only take place within a specified multi-hop range and must not stray too far in the graph.
In particular, CKBERT and BERT share the same model backbone. The model accepts a sequence of N WordPiece tokens (x_1, x_2, ..., x_N) as input and computes D-dimensional contextual representations {h_1, h_2, ..., h_N} ⊂ R^D by successively stacking Transformer encoder layers.
Here, no modifications are made to the architecture to ensure that CKBERT can be seamlessly integrated with better performance into any industrial application that BERT supports. In other embodiments, the model architecture may also be extended.
CKBERT includes two pre-training tasks:
Linguistics-aware masked language modeling (LMLM): LMLM extends masked language modeling (MLM) by introducing two kinds of key linguistic tokens derived from dependency syntax parsing and semantic role labeling. Special identifiers are inserted around each linguistic component in the token sequence. The goal of LMLM is to predict the randomly selected tokens and the linguistic tokens that are masked in the pre-training sentences.
Contrastive multi-hop relation modeling (CMRM): fine-grained subgraphs are sampled from a large-scale Chinese KG through multi-hop relations to supplement the understanding of the target entity's background knowledge. Specifically, positive triples are constructed for the matched target entity by retrieving single-hop entities in the corresponding subgraph, while negative triples are sampled from unrelated multi-hop entities along relation paths in the KG. The CMRM task is proposed to pull together the semantics of related entities and push apart those that are irrelevant. Aggregating such heterogeneous knowledge information further strengthens the context-aware representations of the PLM.
In BERT pre-training, 15% of all token positions are randomly masked for prediction. However, randomly masked tokens may be unimportant elements such as conjunctions and prepositions. For this reason, in the present invention the input pre-training sentence is reconstructed and more tokens are masked according to linguistic knowledge, so that CKBERT can better understand the semantics of the important tokens in the pre-training sentence. The linguistic input units may be masked using the following three steps:
Recognize linguistic tokens: the important elements in the pre-training sentence may be identified using existing tools, including dependency grammar parsing and semantic dependency parsing. The relations extracted here serve as an important source of linguistic knowledge. As shown in the lower right part of the figure, from the original pre-training sentence "Everyone knows that only with more practice can spoken pronunciation truly improve", the object pointed to by the actor "everyone", namely the predicate "knows", is extracted as the dependency grammar relation word, and the head word "improve" modified by the attributive "truly" is extracted as the dependency syntax relation word.
Reconstruct the input sentence: on the basis of the original input form, special identifiers are inserted around the word spans of the extracted key semantic components of the linguistic relations, providing clear boundary information for model pre-training. For example, [DEP] and [/DEP] are added around the dependency syntax relation words, and [SDP] and [/SDP] around the dependency grammar relation words.
Select mask tokens: 15% of the token positions in the reconstructed input sentence are selected and replaced with the special token [MASK]. Of these, 40% of the positions are assigned to randomly selected tokens, and the rest to the linguistic tokens. Here, the special identifiers ([DEP], [/DEP], [SDP], and [/SDP]) are also treated as ordinary tokens to be masked, so the model has to be aware of the boundaries of the predicted words rather than simply filling in the masks from the context. A reconstructed pre-training sentence as shown in the middle of the right side of the figure is thus obtained, and the gray portions are masked.
After the input sentence is processed as above, the tokens carrying the linguistic masks are fed into the model, which is formed by N stacked layers of multi-head self-attention and FFN units. For the LMLM task, let Ω = (m_1, m_2, m_3, ..., γ_{K-1}, γ_K) denote the masked indices in sentence X, where m_i is the index of a randomly masked token, γ_i is the index of a masked linguistic token, and K is the total number of masked tokens. Let X_Ω denote the set of masked tokens in X and X_{\setminus Ω} the set of observed (unmasked) tokens. One implementation of the LMLM loss is:

L_LMLM = -\sum_{\hat{x} \in X_{\Omega}} \log p(\hat{x} \mid X_{\setminus \Omega}; \theta)

where \hat{x} denotes a randomly selected token or a linguistic token, and θ denotes the set of parameters of the model.
In addition to the LMLM task, when an entity is included in the pre-training sentence, a relational triple may be further injected into CKBERT to make it understand the background factual knowledge of the entity. For entities in the pre-training sentence, positive-negative relationship triples are constructed as follows:
Positive example triples: the entities in the pre-training sentence are linked to target entities in the knowledge graph by entity linking. The relation triples of a target entity, i.e., its single-hop triples, are taken as candidate positive triples. A relation triple is then randomly selected from the candidates as the positive sample, denoted t_p.
Negative example triples: since the semantic similarity between the positive triple t_p and other relation triples decreases along the KG paths, L candidate negative triples {t_n^1, t_n^2, ..., t_n^L} may be constructed by multi-hop traversal starting from the target entity e_t. For example, in FIG. 5, starting from the target entity e_0, nodes are retrieved along the respective edges, yielding multi-hop relation paths that end at a node e_end, where Hop(·) denotes the shortest distance between e_0 and e_end in the knowledge graph G. A triple is regarded as a negative example triple if Hop(·) > 1 and not greater than a small threshold δ (δ = 3 in the example of the figure). Thus, e_0 in FIG. 5 has four negative example triples; one sample three-hop path is e_0 → e_2 → e_6 → e_9.
Taking the pre-training sentence "Margaret Mitchell is a bright pearl in the literary history of the twentieth century." as an example, the entity "Margaret Mitchell" contained in the sentence may be linked to the corresponding entity in the knowledge graph as the target entity e_0, and positive and negative example triples may be constructed from the knowledge subgraph shown in FIG. 4, e.g., the single-hop triple <Margaret Mitchell, representative work, Gone with the Wind> as the positive sample t_p (where the entity "Gone with the Wind" may be considered to correspond to entity e_2 in FIG. 5). e_0 may then correspond to the negative example triples <Gone with the Wind, female protagonist, Scarlett O'Hara (e_6)>, <Gone with the Wind, era depicted, American Civil War (e_7)>, <Scarlett O'Hara, famous portrayal, Vivien Leigh (e_9)>, and <Atlanta (e_5), famous person, Martin Luther King (e_8)>.
The CMRM task aims to pull in the relation triples closely related to the target entity and push away the irrelevant multi-hop relation triples, so as to enhance the target entity's external background knowledge from the knowledge graph. Specifically, after retrieving the positive sample t_p and the negative samples {t_n^1, ..., t_n^L} of the target entity e_t, a hidden representation of the target entity e_t may be obtained, for example, as follows:

h_{e_t} = \mathrm{LN}\big(\sigma(W_1 \cdot f_{sp}(\{h_i\}_{i \in e_t}))\big)

where h_{e_t} is the hidden representation of the target entity e_t; since an entity may span multiple tokens in the pre-training sentence, this representation is constructed from the hidden states {h_i} of the entity's tokens. f_{sp} is a self-attention pooling operator, σ(·) is the non-linear activation function GELU, \mathrm{LN}(·) is the LayerNorm function, and W_1 is a learnable weight matrix.
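A PyTorch sketch of such an entity pooling module is given below; the composition order LN(σ(W_1 · f_sp(·))) and the concrete form of the self-attention pooling are reconstructions from the description above and should be read as assumptions.

```python
import torch
import torch.nn as nn

class EntityPooler(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.att = nn.Linear(hidden_dim, 1)          # attention scores for f_sp
        self.w1 = nn.Linear(hidden_dim, hidden_dim)  # learnable weight matrix W_1
        self.act = nn.GELU()                         # non-linear activation σ
        self.norm = nn.LayerNorm(hidden_dim)         # LN

    def forward(self, entity_token_states):
        # entity_token_states: (num_entity_tokens, hidden_dim), the hidden
        # states of the tokens that make up the entity in the sentence
        weights = torch.softmax(self.att(entity_token_states), dim=0)   # (T, 1)
        pooled = (weights * entity_token_states).sum(dim=0)             # f_sp(...)
        return self.norm(self.act(self.w1(pooled)))                     # LN(σ(W_1·f_sp))
```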
Meanwhile, since a relation triple can be treated as a natural sentence by concatenating the tokens of the triple, each triple may be converted into a sentence whose representation is produced by the shared encoder θ (which may be the Transformer encoder of the CKBERT model). In this way, a representation h_{t_p} of the positive triple and representations h_{t_n^l} of the negative triples are also obtained. For the CMRM task, the similarity may be computed using InfoNCE as the loss function, for example as follows:

L_CMRM = -\log \frac{\exp(\cos(h_{e_t}, h_{t_p}) / \tau)}{\exp(\cos(h_{e_t}, h_{t_p}) / \tau) + \sum_{l=1}^{L} \exp(\cos(h_{e_t}, h_{t_n^l}) / \tau)}

where cos(·,·) is the cosine function used to compute the similarity between the entity representation and the relation representations, and τ is a predefined temperature hyper-parameter.
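The following PyTorch sketch computes an InfoNCE-style loss of this form over cosine similarities with temperature τ; the tensor shapes, the default temperature value, and the reduction to a cross-entropy with label 0 are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def cmrm_loss(entity_repr, pos_repr, neg_reprs, tau=0.05):
    """entity_repr: (D,) hidden representation of the target entity
    pos_repr:    (D,) representation of the positive triple sentence
    neg_reprs:   (L, D) representations of the L negative triple sentences
    tau:         predefined temperature hyper-parameter"""
    pos_sim = F.cosine_similarity(entity_repr, pos_repr, dim=0) / tau              # scalar
    neg_sim = F.cosine_similarity(entity_repr.unsqueeze(0), neg_reprs, dim=1) / tau  # (L,)
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim])                            # (1+L,)
    # -log( exp(pos) / (exp(pos) + sum exp(neg)) ) == cross-entropy with label 0
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```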
For model training and optimization, the total loss function for pre-training CKBERT may be given, according to the above two pre-training tasks, as:

L = L_LMLM + L_CMRM

i.e., the sum (or, in some embodiments, a weighted sum) of the two task losses.
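For completeness, the sketch below shows how the two objectives might be combined in one pre-training step over the shared encoder. The component names (encoder, mlm_head, entity_pooler), the batch fields, the mean-pooling of triple sentences, and the equal weighting are all assumptions, and lmlm_loss / cmrm_loss refer to the functions sketched earlier.

```python
def pretraining_step(encoder, mlm_head, entity_pooler, batch, optimizer, tau=0.05):
    # Shared encoder over the linguistically masked sentence: (B, T, D)
    hidden = encoder(batch["masked_input_ids"])
    # Linguistics-aware MLM loss (lmlm_loss as sketched earlier)
    loss_lmlm = lmlm_loss(mlm_head(hidden), batch["mlm_labels"])

    # Entity representation pooled from the entity's token positions
    # (batch item 0 only, to keep the sketch simple)
    entity_repr = entity_pooler(hidden[0, batch["entity_token_positions"]])
    # Positive / negative triples encoded as natural sentences by the same
    # encoder; mean pooling over tokens is an assumed sentence representation
    pos_repr = encoder(batch["pos_triple_ids"]).mean(dim=1)[0]       # (D,)
    neg_reprs = encoder(batch["neg_triple_ids"]).mean(dim=1)         # (L, D)
    loss_cmrm = cmrm_loss(entity_repr, pos_repr, neg_reprs, tau=tau)

    # Total pre-training loss: sum of the two task losses (a weighted sum is also possible)
    loss = loss_lmlm + loss_cmrm
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```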
the invention can also be realized as a Chinese interactive system based on knowledge injection. Fig. 6 shows an example of the PLM trained by the present invention for actual interaction. Specifically, the interactive system based on knowledge injection comprises: the user input receiving unit is used for acquiring a Chinese inquiry input by a user; a problem matching unit comprising a knowledge-injected Chinese pre-training model (e.g., CKBERT) using a Chinese corpus and Chinese knowledge map acquisition in the method described above, the model identifying relevant entities and semantics in the Chinese query and generating feedback therefrom; a feedback providing unit for providing the generated feedback to the user.
A series of CKBERT models may be pre-trained on a distributed GPU cluster. For example, multiple Chinese pre-training models of different parameters are trained for generating the feedback with different accuracy and speed in different interaction scenarios. Due to the fact that the injection of linguistic and factual knowledge is carried out in the pre-training stage, the quantity of parameters of the models with different parameters is greatly reduced compared with the quantity of parameters of other models.
FIG. 7 is a block diagram of a computing device that may be used to implement the knowledge injection method for the Chinese pre-training language model described above according to an embodiment of the invention.
Referring to fig. 7, computing device 700 includes memory 710 and processor 720.
Processor 720 may be a multi-core processor or may include multiple processors. In some embodiments, processor 720 may include a general-purpose host processor and one or more special purpose co-processors, such as a Graphics Processor (GPU), digital Signal Processor (DSP), or the like. In some embodiments, processor 720 may be implemented using custom circuitry, such as an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA).
The memory 710 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions required by the processor 720 or other modules of the computer. The persistent storage device may be a readable and writable storage device, i.e., a non-volatile storage device that does not lose the stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the permanent storage. In other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable volatile memory device, such as dynamic random access memory. The system memory may store some or all of the instructions and data that the processor requires at runtime. In addition, the memory 710 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) as well as magnetic and/or optical disks. In some embodiments, the memory 710 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (e.g., SD, mini SD, micro-SD, etc.), a magnetic floppy disk, and the like. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted wirelessly or by wire.
The memory 710 has stored thereon executable code that, when processed by the processor 720, causes the processor 720 to perform the above-described knowledge injection method of pre-training a language model.
The knowledge injection method of the Chinese pre-training language model according to the present invention, and the interactive system equipped with a knowledge-injected Chinese pre-training model obtained by this method, have been described in detail above with reference to the accompanying drawings.
Based on the design of the input data and the pre-training tasks, the invention realizes knowledge injection for the Chinese pre-training language model through internal linguistic knowledge labeling and external knowledge graph injection, so that the model can learn the linguistic knowledge of the pre-training sentences without changing its architecture, and can learn the factual knowledge of the entities contained in the pre-training sentences from the external knowledge graph. The pre-trained model can complete various downstream tasks without external data support while greatly reducing the parameter scale, and is therefore suitable for providing various real-time services to users in a cloud environment.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or improvements to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (12)

1. A knowledge injection method for a Chinese pre-training language model, comprising the following steps:
labeling key semantic components in a pre-training sentence with special identifiers to construct a reconstructed pre-training sentence;
masking the reconstructed pre-training sentence; and
inputting the masked pre-training sentence into the Chinese pre-training language model (PLM), and adjusting parameters of a neural network model in the PLM based on a first loss value constructed from the output of the PLM for the masked words.
2. The method of claim 1, further comprising:
recalling positive-example and negative-example triples corresponding to an entity contained in the pre-training sentence from a knowledge graph;
inputting the words corresponding to the entity in the pre-training sentence, the positive-example triples and the negative-example triples into an encoder of the PLM; and
constructing a second loss value over the hidden representation of the entity's words, the representations of the positive-example triples and the representations of the negative-example triples output by the encoder, so as to adjust the parameters of the neural network model in the PLM based on contrastive learning.
3. The method of claim 2, wherein the positive-example triples are single-hop triples containing the entity, and the negative-example triples are multi-hop triples in the knowledge graph located multiple hops away from the entity.
4. The method of claim 3, wherein the number of hops between a multi-hop triple and the entity is no greater than a predetermined threshold.
5. The method of claim 2, wherein the second loss value characterizes the difference between a positive-example similarity, which measures the similarity between the hidden representation of the entity's words and the representation of the positive-example triple, and a negative-example similarity, which measures the similarity between the hidden representation of the entity's words and the representation of the negative-example triple.
6. The method of claim 1, wherein masking the reconstructed pre-training sentence comprises:
performing mask processing on at least part of the key semantic components.
7. The method of claim 1, wherein labeling key semantic components in the pre-training sentence with special identifiers to construct the reconstructed pre-training sentence comprises:
adding semantic dependency relation marks before and after the semantic dependency relation words in the pre-training sentence; and
adding dependency syntax relation marks before and after the dependency syntax relation words in the pre-training sentence.
8. The method of claim 7, wherein masking the reconstructed pre-training sentence comprises:
masking words or special identifiers in the reconstructed pre-training sentence according to a predetermined proportion, wherein, within the predetermined proportion, a first proportion is assigned to random masking, a second proportion to the semantic dependency relation words, and a third proportion to the dependency syntax relation words.
9. A Chinese interactive system based on knowledge injection, comprising:
a user input receiving unit for acquiring a Chinese query input by a user;
a question matching unit comprising a knowledge-injected Chinese pre-training model obtained by the method of any one of claims 1-8 using a Chinese corpus and a Chinese knowledge graph, the model identifying relevant entities and semantics in the Chinese query and generating feedback therefrom; and
a feedback providing unit for providing the generated feedback to the user.
10. The system of claim 9, wherein a plurality of Chinese pre-training models with different parameter scales are trained using the method of any one of claims 1-8 for generating the feedback at different accuracies and speeds in different interaction scenarios.
11. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any one of claims 1-8.
12. A non-transitory machine-readable storage medium having stored thereon executable code that, when executed by a processor of an electronic device, causes the processor to perform the method of any one of claims 1-8.
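For claims 2 to 5, a minimal sketch of how the second loss value could be computed is given below, assuming PyTorch. The margin value and the tensor shapes are assumptions; a margin-based difference is only one possible reading of "difference between positive-example similarity and negative-example similarity" (an InfoNCE-style loss would be another).

```python
import torch
import torch.nn.functional as F


def second_loss(entity_hidden, pos_triple_repr, neg_triple_reprs, margin=0.5):
    """Contrastive-learning loss over encoder outputs: pull the hidden
    representation of the entity's words toward the representation of its
    single-hop (positive-example) triple and push it away from multi-hop
    (negative-example) triples, penalizing any negative whose similarity
    comes within `margin` of the positive similarity."""
    pos_sim = F.cosine_similarity(entity_hidden, pos_triple_repr, dim=-1)                 # scalar
    neg_sim = F.cosine_similarity(entity_hidden.unsqueeze(0), neg_triple_reprs, dim=-1)   # (num_neg,)
    return F.relu(margin - (pos_sim - neg_sim)).mean()


# Toy shapes: hidden size 8, one positive triple, three negative (multi-hop) triples.
h_entity = torch.randn(8)
h_pos = torch.randn(8)
h_negs = torch.randn(3, 8)
print(second_loss(h_entity, h_pos, h_negs).item())
```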
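As for the interactive system of claims 9 to 11, the wiring of the three units can be pictured with the following sketch. `ChineseInteractiveSystem`, the callable `question_matcher`, and every other name here are placeholders for illustration, not an API disclosed by the patent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChineseInteractiveSystem:
    # Question matching unit: a knowledge-injected Chinese pre-training model,
    # represented here as any callable mapping a query string to feedback text.
    question_matcher: Callable[[str], str]

    def receive_input(self, raw: str) -> str:
        """User input receiving unit: acquire the Chinese query entered by the user."""
        return raw.strip()

    def provide_feedback(self, feedback: str) -> str:
        """Feedback providing unit: return the generated feedback to the user."""
        return feedback

    def handle(self, raw_query: str) -> str:
        query = self.receive_input(raw_query)
        feedback = self.question_matcher(query)  # entities and semantics are identified inside the model
        return self.provide_feedback(feedback)


# Usage with a stand-in matcher (a real deployment would load one of the
# differently sized pre-trained models of claim 10 to fit the latency of the scenario):
system = ChineseInteractiveSystem(question_matcher=lambda q: f"已收到问题：{q}")
print(system.handle(" 什么是知识注入？ "))
```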
CN202211214379.7A 2022-09-30 2022-09-30 Knowledge injection method and interaction system of Chinese pre-training language model Pending CN115688753A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211214379.7A CN115688753A (en) 2022-09-30 2022-09-30 Knowledge injection method and interaction system of Chinese pre-training language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211214379.7A CN115688753A (en) 2022-09-30 2022-09-30 Knowledge injection method and interaction system of Chinese pre-training language model

Publications (1)

Publication Number Publication Date
CN115688753A (en) 2023-02-03

Family

ID=85065423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211214379.7A Pending CN115688753A (en) 2022-09-30 2022-09-30 Knowledge injection method and interaction system of Chinese pre-training language model

Country Status (1)

Country Link
CN (1) CN115688753A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116205217A (en) * 2023-05-05 2023-06-02 北京邮电大学 Small sample relation extraction method, system, electronic equipment and storage medium
CN116205217B (en) * 2023-05-05 2023-09-01 北京邮电大学 Small sample relation extraction method, system, electronic equipment and storage medium
CN116720786A (en) * 2023-08-01 2023-09-08 中国科学院工程热物理研究所 KG and PLM fusion assembly quality stability prediction method, system and medium
CN116720786B (en) * 2023-08-01 2023-10-03 中国科学院工程热物理研究所 KG and PLM fusion assembly quality stability prediction method, system and medium

Similar Documents

Publication Publication Date Title
EP3593262A1 (en) Automated tool for question generation
CN115688753A (en) Knowledge injection method and interaction system of Chinese pre-training language model
CN112699216A (en) End-to-end language model pre-training method, system, device and storage medium
CN113343683B (en) Chinese new word discovery method and device integrating self-encoder and countertraining
CN114936287A (en) Knowledge injection method for pre-training language model and corresponding interactive system
CN114911892A (en) Interaction layer neural network for search, retrieval and ranking
CN113515632B (en) Text classification method based on graph path knowledge extraction
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN107665356A (en) A kind of image labeling method
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN116611443A (en) Knowledge interaction graph guided event causal relationship identification system and method
Zhuang et al. An ensemble approach to conversation generation
Abadie et al. A Benchmark of Named Entity Recognition Approaches in Historical Documents: Application to 19th Century French Directories
Le Huy et al. Keyphrase extraction model: a new design and application on tourism information
CN116661852B (en) Code searching method based on program dependency graph
Khassanov et al. Enriching rare word representations in neural language models by embedding matrix augmentation
Kacupaj et al. Vogue: answer verbalization through multi-task learning
CN114372454A (en) Text information extraction method, model training method, device and storage medium
CN114579605B (en) Table question-answer data processing method, electronic equipment and computer storage medium
US20230153534A1 (en) Generating commonsense context for text using knowledge graphs
CN113807102B (en) Method, device, equipment and computer storage medium for establishing semantic representation model
Ek et al. Synthetic propaganda embeddings to train a linear projection
CN108595434B (en) Syntax dependence method based on conditional random field and rule adjustment
Liu et al. Mongolian word segmentation based on three character level seq2seq models
CN116227484B (en) Model training method, apparatus, device, storage medium and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination