CN113204961A - Language model construction method, device, equipment and medium for NLP task - Google Patents


Info

Publication number
CN113204961A
CN113204961A (application CN202110602682.3A)
Authority
CN
China
Prior art keywords
target
word vector
word
model
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110602682.3A
Other languages
Chinese (zh)
Other versions
CN113204961B (en)
Inventor
于凤英
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110602682.3A priority Critical patent/CN113204961B/en
Publication of CN113204961A publication Critical patent/CN113204961A/en
Application granted granted Critical
Publication of CN113204961B publication Critical patent/CN113204961B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of artificial intelligence, and discloses a language model construction method, device, equipment and medium for NLP tasks, wherein the method comprises the following steps: acquiring a first dictionary of a target Word vector generation model of a target field, wherein the target Word vector generation model is a model obtained based on Word2vec training; acquiring a second dictionary of an initial language model, wherein the initial language model is a Bert model obtained by adopting sample data in an unlimited field for training; performing intersection acquisition according to the first dictionary and the second dictionary to obtain target dictionary intersection data; performing fitting unconstrained linear transformation on the target dictionary intersection data by a least square method to obtain a simulation matrix vector; and constructing a language model according to the initial language model, the target word vector generation model and the simulation matrix vector to obtain a target language model corresponding to the target field. After this structural modification, NLP tasks in the target field can be processed, the hardware cost is reduced, and the time spent is reduced.

Description

Language model construction method, device, equipment and medium for NLP task
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a medium for constructing a language model for NLP tasks.
Background
A pre-trained language model usually needs to be applied to another domain to handle NLP (natural language processing) tasks. The conventional method is to perform unsupervised pre-training of the pre-trained language model with text of the target field so that NLP tasks in the target field can be processed. For example, for a text mining task in the biomedical field, BioBERT (a pre-trained language representation model for biomedical text mining) is initialized with the weights of a Bert model (language model) trained on the general field, and the weight-initialized BioBERT is then pre-trained with a biomedical corpus. This training method achieves good results, but it requires enormous hardware cost and a large amount of training time, thereby delaying the development of NLP tasks in emerging fields.
Disclosure of Invention
The language model construction method, device, equipment and medium for the NLP task provided by the present application aim to solve the technical problem that, in the prior art, processing NLP tasks in a target field by performing unsupervised pre-training of a pre-trained language model with target-field text requires enormous hardware cost and a large amount of training time.
In order to achieve the above object, the present application proposes a language model construction method for NLP task, the method comprising:
acquiring a first dictionary of a target Word vector generation model of a target field, wherein the target Word vector generation model is a model obtained based on Word2vec training;
acquiring a second dictionary of an initial language model, wherein the initial language model is a Bert model obtained by adopting sample data in an unlimited field for training;
performing intersection acquisition according to the first dictionary and the second dictionary to obtain target dictionary intersection data;
performing fitting unconstrained linear transformation on the target dictionary intersection data by a least square method to obtain a simulation matrix vector;
and constructing a language model according to the initial language model, the target word vector generation model and the simulation matrix vector to obtain a target language model corresponding to the target field.
Further, before the step of obtaining the first dictionary of the target word vector generation model in the target domain, the method further includes:
acquiring a training sample set of the target field;
and training a word vector generation initial model by adopting the training sample set, and taking the trained word vector generation initial model as the target word vector generation model.
Further, the step of performing intersection acquisition according to the first dictionary and the second dictionary to obtain the target dictionary intersection data includes:
performing intersection acquisition according to the first dictionary and the second dictionary to obtain dictionary intersection data to be denoised;
removing noise characters from the dictionary intersection data to be denoised to obtain the target dictionary intersection data, wherein the noise characters comprise: emoticons, punctuation, and null characters.
Further, the simulation matrix vector is expressed as W and is calculated by the following formula:

$$W = \arg\min_{\hat{W}} \sum_{x \in L_{W2v} \cap L_{LM}} \left\| \hat{W}\,\varepsilon_{w2v}(x) - \varepsilon_{LM}(x) \right\|_2^2$$

wherein W is the simulation matrix vector for aligning a first word vector and a second word vector, the first word vector being the word vector output when a target word is input into the target word vector generation model, the second word vector being the word vector output when the target word is input into the initial language model, and the target word being a word in the target dictionary intersection data; $\varepsilon_{w2v}(x)$ is the first word vector output when a word x in the target dictionary intersection data is input into the target word vector generation model; $\varepsilon_{LM}(x)$ is the second word vector output when the word x in the target dictionary intersection data is input into the initial language model; $\arg\min_{\hat{W}}$ denotes taking the $\hat{W}$ that makes the summation reach its minimum value; $L_{W2v} \cap L_{LM}$ is the target dictionary intersection data, $L_{W2v}$ being the first dictionary and $L_{LM}$ being the second dictionary; and $\left\|\cdot\right\|_2$ denotes taking the square of $\hat{W}\,\varepsilon_{w2v}(x) - \varepsilon_{LM}(x)$ and then the root, i.e. the Euclidean norm.
Further, the step of constructing a language model according to the initial language model, the target word vector generation model and the simulation matrix vector to obtain a target language model corresponding to the target field includes:
generating a vector generating unit according to the initial language model and the target word vector generating model to obtain a target word vector generating unit;
generating a word embedding unit according to the target word vector generating unit, the initial language model and the simulation matrix vector to obtain a target word embedding unit;
removing the structures before the encoder from the initial language model to obtain a target word vector processing unit;
and sequentially connecting the target word vector generating unit, the target word embedding unit and the target word vector processing unit to obtain the target language model corresponding to the target field.
Further, the step of generating a vector generation unit according to the initial language model and the target word vector generation model to obtain a target word vector generation unit includes:
taking the target word vector generation model as a first word vector generation subunit;
taking the word segmenter and the word vector generator of the initial language model as a second word vector generating subunit;
and the first word vector generating subunit and the second word vector generating subunit are arranged in parallel to obtain the target word vector generating unit.
Further, the step of generating a word embedding unit according to the target word vector generating unit, the initial language model and the simulation matrix vector to obtain a target word embedding unit includes:
a word vector source Bert judgment subunit is constructed according to the target word vector generation unit, wherein the word vector source Bert judgment subunit is used for judging whether a word vector generated by the second word vector generation subunit exists for each word in the target text data input into the target word vector generation unit, so as to obtain a word vector source Bert judgment result corresponding to each word in the target text data;
constructing a word vector alignment subunit according to the target word vector generation unit, the simulation matrix vector and the word vector source Bert judgment subunit, wherein the word vector alignment subunit is configured to, when the word vector source Bert judgment result indicates that no word vector source Bert exists, use all words corresponding to the word vector source Bert judgment result that no word vector source Bert exists as a word set to be aligned, obtain a word vector output by the first word vector generation subunit according to each word in the word set to be aligned, obtain a word vector set to be aligned, and multiply each word vector in the word vector set to be aligned with the simulation matrix vector to obtain an aligned word vector set;
constructing a word vector combination subunit according to the target word vector generation unit, the word vector source Bert judgment subunit and the word vector alignment subunit, where the word vector combination subunit is configured to, when the word vector source Bert judgment result indicates that a word vector source Bert exists, use all words corresponding to the word vector source Bert judgment result as word sets that do not need to be aligned, obtain word vectors output by the second word vector generation subunit according to each word in the word sets that do not need to be aligned, obtain a word vector set that does not need to be aligned, and splice the aligned word vector set and the word vector set that does not need to be aligned according to the word sequence of the target text data to obtain target word vector data;
taking the word embedding layer of the initial language model as a word embedding subunit;
and performing word embedding unit generation according to the word vector source Bert judgment subunit, the word vector alignment subunit, the word vector combination subunit and the word embedding subunit to obtain the target word embedding unit.
The application also provides a language model building device for NLP task, the device comprises:
the first dictionary determining module is used for obtaining a first dictionary of a target Word vector generating model of the target field, wherein the target Word vector generating model is a model obtained based on Word2vec training;
the second dictionary determining module is used for acquiring a second dictionary of the initial language model, wherein the initial language model is a Bert model obtained by adopting sample data in an unlimited field for training;
the target dictionary intersection data determining module is used for performing intersection acquisition according to the first dictionary and the second dictionary to obtain target dictionary intersection data;
the simulation matrix vector determining module is used for performing fitting unconstrained linear transformation on the intersection data of the target dictionary by adopting a least square method to obtain a simulation matrix vector;
and the target language model determining module is used for constructing a language model according to the initial language model, the target word vector generating model and the simulation matrix vector to obtain a target language model corresponding to the target field.
The present application further proposes a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of any of the above methods when executing the computer program.
The present application also proposes a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of any of the above.
According to the language model construction method, device, equipment and medium for the NLP task, a first dictionary of a target Word vector generation model of a target field is acquired, the target Word vector generation model being a model obtained based on Word2vec training; a second dictionary of an initial language model is acquired, the initial language model being a Bert model obtained by adopting sample data in an unlimited field for training; intersection acquisition is performed according to the first dictionary and the second dictionary to obtain target dictionary intersection data; fitting unconstrained linear transformation is performed on the target dictionary intersection data by a least square method to obtain a simulation matrix vector; and a language model is constructed according to the initial language model, the target word vector generation model and the simulation matrix vector to obtain a target language model corresponding to the target field. In this way, after the Bert model obtained by training with unlimited-field sample data is structurally modified with the Word2vec-based model trained on the target field, NLP tasks in the target field can be processed. Because training the Word2vec-based model on the target field requires less hardware cost and less time than unsupervised pre-training of the language model with target-field text, the hardware cost is reduced, the time spent is reduced, and the development of NLP tasks in emerging fields is facilitated.
Drawings
Fig. 1 is a schematic flowchart of a language model construction method for NLP task according to an embodiment of the present application;
FIG. 2 is a block diagram schematically illustrating the structure of a language model building apparatus for NLP task according to an embodiment of the present application;
fig. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In order to solve the technical problem that, in the prior art, processing NLP tasks in a target field by performing unsupervised pre-training of a pre-trained language model with target-field text requires enormous hardware cost and a large amount of training time, the present application provides a language model construction method for NLP tasks, which is applied to the technical field of artificial intelligence and, further, to natural language processing within artificial intelligence. According to the language model construction method for the NLP task, the Word2vec-based model trained on the target field is adopted to structurally modify the Bert model obtained by training with unlimited-field sample data, after which NLP tasks in the target field can be processed. Because training the Word2vec-based model on the target field requires less hardware cost and less time than unsupervised pre-training of the language model with target-field text, the hardware cost is reduced, the time spent is reduced, and the development of NLP tasks in emerging fields is facilitated.
Referring to fig. 1, in an embodiment of the present application, a method for constructing a language model for an NLP task is provided, where the method includes:
S1: acquiring a first dictionary of a target Word vector generation model of a target field, wherein the target Word vector generation model is a model obtained based on Word2vec training;
S2: acquiring a second dictionary of an initial language model, wherein the initial language model is a Bert model obtained by adopting sample data in an unlimited field for training;
S3: performing intersection acquisition according to the first dictionary and the second dictionary to obtain target dictionary intersection data;
S4: performing fitting unconstrained linear transformation on the target dictionary intersection data by a least square method to obtain a simulation matrix vector;
S5: constructing a language model according to the initial language model, the target word vector generation model and the simulation matrix vector to obtain a target language model corresponding to the target field.
In this embodiment, a first dictionary of a target Word vector generation model of a target field is acquired, the target Word vector generation model being a model obtained based on Word2vec training; a second dictionary of an initial language model is acquired, the initial language model being a Bert model obtained by adopting sample data in an unlimited field for training; intersection acquisition is performed according to the first dictionary and the second dictionary to obtain target dictionary intersection data; fitting unconstrained linear transformation is performed on the target dictionary intersection data by a least square method to obtain a simulation matrix vector; and a language model is constructed according to the initial language model, the target word vector generation model and the simulation matrix vector to obtain a target language model corresponding to the target field. In this way, after the Bert model obtained by training with unlimited-field sample data is structurally modified with the Word2vec-based model trained on the target field, NLP tasks in the target field can be processed. Because training the Word2vec-based model on the target field requires less hardware cost and less time than unsupervised pre-training of the language model with target-field text, the hardware cost is reduced, the time spent is reduced, and the development of NLP tasks in emerging fields is facilitated.
For S1, the first dictionary of the target word vector generation model of the target field may be obtained from a database, obtained from a third-party application system, or obtained as input from the user.
Target areas include, but are not limited to: biomedical field, financial field.
The first dictionary is the dictionary of the target word vector generation model.
The target Word vector generation model is obtained by training a Word2vec-based model with training samples of the target field. That is, the target word vector generation model may be used for word vector generation on text data of the target field. Word2vec is a group of related models used to generate word vectors.
For S2, the second dictionary of the initial language model may be obtained from a database, obtained from a third-party application system, or obtained as input from the user.
The second dictionary is the dictionary of the initial language model.
The initial language model is a Bert (Bidirectional Encoder Representations from Transformers) model obtained by training with sample data of an unlimited field; that is, the initial language model is a Bert model trained on the general field.
For S3, the same words are found from the first dictionary and the second dictionary, and all the found same words are used as target dictionary intersection data. That is, the words in the target dictionary intersection data are words that exist in both the first dictionary and the second dictionary.
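As an illustrative sketch (not the mandated implementation), S3 can be realized with a gensim Word2vec model and a HuggingFace Bert tokenizer; the model path and tokenizer name below are assumptions for the example:

```python
from gensim.models import Word2Vec
from transformers import BertTokenizer

# Load the target-field Word2vec model (source of the first dictionary)
# and a general-field Bert tokenizer (source of the second dictionary).
w2v_model = Word2Vec.load("target_domain_word2vec.model")  # illustrative path
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

first_dictionary = set(w2v_model.wv.key_to_index)    # words known to Word2vec
second_dictionary = set(bert_tokenizer.get_vocab())  # wordpieces known to Bert

# Target dictionary intersection data: words present in both dictionaries.
target_intersection = first_dictionary & second_dictionary
```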
For S4, fitting unconstrained linear transformation is performed on the target dictionary intersection data by a least square method, a parameter matrix is obtained by calculation, and the obtained parameter matrix is used as the simulation matrix vector.
It is understood that each element in the simulation matrix vector is a value between 0 and 1, inclusive of 0 and 1.
The simulation matrix vector is used to align the word vector obtained by inputting a word of the target dictionary intersection data into the target word vector generation model with the word vector obtained by inputting it into the initial language model. That is, multiplying the word vector obtained by inputting a word of the target dictionary intersection data into the target word vector generation model by the simulation matrix vector yields a result substantially consistent with the word vector obtained by inputting the word into the initial language model. Therefore, through the simulation matrix vector, the word vectors output by the target word vector generation model can simulate the word vectors output by the initial language model.
For S5, the structure before the encoder of the initial language model is adjusted according to the target word vector generation model and the simulation matrix vector, and the adjusted network structure is used as the target language model corresponding to the target field.
In an embodiment, before the step of obtaining the first dictionary of the target word vector generation model in the target domain, the method further includes:
S11: acquiring a training sample set of the target field;
S12: training a word vector generation initial model by adopting the training sample set, and taking the trained word vector generation initial model as the target word vector generation model.
In this embodiment, the word vector generation initial model is trained with the training sample set of the target field to obtain the target word vector generation model, which provides support for word vector generation in the target field.
For S11, the training sample set of the target field may be obtained from a database, obtained from a third-party application system, or obtained as input from the user.
The training sample set of the target field means that sample data in the training sample set comes from the target field.
For S12, the specific steps of training the word vector generation initial model by using the training sample set are not described herein again.
Word2vec is used as the word vector generation initial model.
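A minimal training sketch for S11 and S12, assuming gensim's Word2Vec implementation; the sample sentences and hyperparameters are purely illustrative:

```python
from gensim.models import Word2Vec

# Training sample set of the target field: an iterable of token lists
# (two toy sentences stand in for a real target-field corpus).
sentences = [["glucose", "metabolism", "pathway"],
             ["insulin", "receptor", "signaling"]]

# vector_size is illustrative; the simulation matrix W derived later maps
# the Word2vec space into the Bert space, so the dimensions may differ.
w2v_model = Word2Vec(sentences, vector_size=768, window=5,
                     min_count=1, sg=1, epochs=10)
w2v_model.save("target_domain_word2vec.model")  # path matches the sketch above
```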
In an embodiment, the step of performing intersection acquisition according to the first dictionary and the second dictionary to obtain the target dictionary intersection data includes:
S31: performing intersection acquisition according to the first dictionary and the second dictionary to obtain dictionary intersection data to be denoised;
S32: removing noise characters from the dictionary intersection data to be denoised to obtain the target dictionary intersection data, wherein the noise characters include: emoticons, punctuation, and null characters.
In this embodiment, the target dictionary intersection data is used only after noise characters are removed from the dictionary intersection data to be denoised, which reduces the influence of noise characters on the accuracy of determining the simulation matrix vector and thereby improves the accuracy of the target language model.
For S31, the same words are found from the first dictionary and the second dictionary, and all the found words are used as the dictionary intersection data to be denoised. It will be appreciated that each noise character is treated as a word in the dictionary intersection data to be denoised.
For S32, a preset noise character library is acquired; each word in the dictionary intersection data to be denoised is searched for in the preset noise character library, all words successfully found in the preset noise character library are deleted from the dictionary intersection data to be denoised, and the dictionary intersection data to be denoised after deletion is used as the target dictionary intersection data.
The noise characters include, but are not limited to: emoticons, punctuation, and null characters.
Because the Bert model already generates accurate word vectors for the noise characters in the preset noise character library, the word vectors generated by the Bert model are adopted directly for these characters and no Word2vec word vectors are needed. Therefore, the noise characters in the preset noise character library are deleted from the dictionary intersection data to be denoised, which avoids their influence on the accuracy of determining the simulation matrix vector.
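A hedged sketch of S31 and S32; the preset noise character library is not specified in this application, so the example approximates it with Unicode category checks (punctuation and symbol codepoints, plus empty strings), reusing the dictionaries from the earlier sketch:

```python
import unicodedata

def is_noise(word: str) -> bool:
    """Heuristic stand-in for the preset noise character library: flags
    null/whitespace-only strings and words made up entirely of punctuation
    (Unicode P*) or symbols (Unicode S*, which covers most emoticons)."""
    if not word or word.isspace():
        return True
    return all(unicodedata.category(ch)[0] in ("P", "S") for ch in word)

# Dictionary intersection data to be denoised, as in the earlier sketch.
intersection_to_denoise = first_dictionary & second_dictionary
# Target dictionary intersection data after noise characters are removed.
target_intersection = {w for w in intersection_to_denoise if not is_noise(w)}
```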
In one embodiment, the above-mentioned simulation matrix vector is represented as W and is calculated by the following formula:

$$W = \arg\min_{\hat{W}} \sum_{x \in L_{W2v} \cap L_{LM}} \left\| \hat{W}\,\varepsilon_{w2v}(x) - \varepsilon_{LM}(x) \right\|_2^2$$

wherein W is the simulation matrix vector for aligning a first word vector and a second word vector, the first word vector being the word vector output when a target word is input into the target word vector generation model, the second word vector being the word vector output when the target word is input into the initial language model, and the target word being a word in the target dictionary intersection data; $\varepsilon_{w2v}(x)$ is the first word vector output when a word x in the target dictionary intersection data is input into the target word vector generation model; $\varepsilon_{LM}(x)$ is the second word vector output when the word x in the target dictionary intersection data is input into the initial language model; $L_{W2v} \cap L_{LM}$ is the target dictionary intersection data, $L_{W2v}$ being the first dictionary and $L_{LM}$ being the second dictionary; and $\left\|\cdot\right\|_2$ denotes taking the square of $\hat{W}\,\varepsilon_{w2v}(x) - \varepsilon_{LM}(x)$ and then the root, i.e. the Euclidean norm.
In this embodiment, a least square method is adopted to perform fitting unconstrained linear transformation on the target dictionary intersection data, determining a simulation matrix vector that aligns the word vector output by inputting a word of the target dictionary intersection data into the target word vector generation model with the word vector output by inputting it into the initial language model. This provides support for subsequently adopting the Word2vec-based model trained on the target field to structurally modify the Bert model obtained by training with unlimited-field sample data, so that NLP tasks in the target field can be processed.
Wherein the $\hat{W}$ at which $\sum_{x \in L_{W2v} \cap L_{LM}} \left\| \hat{W}\,\varepsilon_{w2v}(x) - \varepsilon_{LM}(x) \right\|_2^2$ reaches its minimum value is the simulation matrix vector W.
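As a concrete illustration of the formula, W has a standard ordinary-least-squares solution. The sketch below assumes numpy plus the objects from the earlier sketches; `bert_word_embedding` is a hypothetical helper that reads a word's row from the initial language model's token embedding table:

```python
import numpy as np
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-chinese")  # initial language model

def bert_word_embedding(word: str) -> np.ndarray:
    # Hypothetical helper: epsilon_LM(word), looked up directly in Bert's
    # token embedding table (assumes the word is a single wordpiece, which
    # holds for words in the target dictionary intersection data).
    token_id = bert_tokenizer.convert_tokens_to_ids(word)
    return bert.embeddings.word_embeddings.weight[token_id].detach().cpu().numpy()

words = sorted(target_intersection)
X = np.stack([w2v_model.wv[w] for w in words])         # epsilon_w2v(x), (n, d_w2v)
Y = np.stack([bert_word_embedding(w) for w in words])  # epsilon_LM(x),  (n, d_LM)

# Least squares: solve X @ B ~= Y for B, then W = B.T so that
# W @ epsilon_w2v(x) approximates epsilon_LM(x).
B, *_ = np.linalg.lstsq(X, Y, rcond=None)
W = B.T  # simulation matrix vector, shape (d_LM, d_w2v)
```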
In an embodiment, the step of constructing a language model according to the initial language model, the target word vector generation model, and the simulation matrix vector to obtain a target language model corresponding to the target field includes:
S51: generating a vector generating unit according to the initial language model and the target word vector generating model to obtain a target word vector generating unit;
S52: generating a word embedding unit according to the target word vector generating unit, the initial language model and the simulation matrix vector to obtain a target word embedding unit;
S53: removing the structures before the encoder from the initial language model to obtain a target word vector processing unit;
S54: sequentially connecting the target word vector generating unit, the target word embedding unit and the target word vector processing unit to obtain the target language model corresponding to the target field.
In this embodiment, the initial language model is adjusted according to the target Word vector generation model and the simulation matrix vector to obtain the target language model corresponding to the target field, so that the Bert model obtained by training with unlimited-field sample data is structurally modified with the Word2vec-based model trained on the target field, after which NLP tasks in the target field can be processed. Because training the Word2vec-based model on the target field requires less hardware cost and less time than unsupervised pre-training of the language model with target-field text, the hardware cost and the time spent are reduced.
For S51, the word segmenter and the word vector generator of the initial language model, together with the target word vector generation model, are used as word vector generation subunits connected in parallel to obtain the target word vector generation unit.
Optionally, a union word segmenter is used as a first word segmenter, the word segmenter of the initial language model is used as a second word segmenter, the word vector generator of the target word vector generation model is used as a first word vector generator, and the word vector generator of the initial language model is used as a second word vector generator; the first word segmenter and the first word vector generator are connected in sequence to obtain a Word2vec word vector generating subunit, the second word segmenter and the second word vector generator are connected in sequence to obtain a Bert word vector generating subunit, and the Word2vec word vector generating subunit and the Bert word vector generating subunit are arranged in parallel to obtain the target word vector generation unit. The union word segmenter is a word segmenter capable of segmenting text data of both the general field and the target field.
For S52, the word embedding layer of the initial language model is redefined according to the target word vector generating unit and the simulation matrix vector to obtain the target word embedding unit.
For S53, the structures before the encoder of the initial language model are removed, and the remaining structures in the initial language model are used as the target word vector processing unit.
For S54, the output end of the target word vector generating unit is connected to the input end of the target word embedding unit, the output end of the target word embedding unit is connected to the input end of the target word vector processing unit, and the target word vector generating unit, the target word embedding unit, and the target word vector processing unit that have completed the connection are used as the target language model corresponding to the target domain. That is, the target language model is a model obtained by adjusting the previous structure of the encoder of the initial language model.
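One possible realization of S53 and S54, sketched under the assumption that the HuggingFace `inputs_embeds` argument stands in for removing the structure before the encoder (Bert's position and segment embeddings are then still applied internally, approximating the word embedding subunit); `build_target_word_vectors` is a hypothetical function implementing the units described above:

```python
import torch

def run_target_language_model(text: str) -> torch.Tensor:
    # Hypothetical: the target word vector generating unit and target word
    # embedding unit described above, returning target word vector data of
    # shape (1, seq_len, hidden_size).
    word_vectors = build_target_word_vectors(text)
    # Passing inputs_embeds bypasses Bert's own token-embedding lookup, so
    # the remaining Bert stack (the target word vector processing unit)
    # consumes the externally constructed word vectors.
    outputs = bert(inputs_embeds=word_vectors)
    return outputs.last_hidden_state
```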
In an embodiment, the step of generating the vector generating unit according to the initial language model and the target word vector generating model to obtain the target word vector generating unit includes:
S511: taking the target word vector generation model as a first word vector generation subunit;
S512: taking the word segmenter and the word vector generator of the initial language model as a second word vector generating subunit;
S513: arranging the first word vector generating subunit and the second word vector generating subunit in parallel to obtain the target word vector generating unit.
In this embodiment, the word segmenter and word vector generator of the initial language model, together with the target word vector generation model, are used as parallel word vector generation subunits, providing support for subsequently adopting the Word2vec-based model trained on the target field to structurally modify the Bert model obtained by training with unlimited-field sample data so that NLP tasks in the target field can be processed.
For S511, the target word vector generation model is directly used as a first word vector generation subunit.
The first word vector generation subunit may perform word segmentation and word vector generation on the input text data.
For S512, the word segmenter of the initial language model and the word vector generator of the initial language model are connected in sequence, and the connected word segmenter and word vector generator serve as the second word vector generation subunit.
The word segmentation device of the initial language model is used for segmenting words of input text data, and the word vector generator of the initial language model is used for generating word vectors of the input text data.
For S513, the first word vector generating subunit and the second word vector generating subunit are arranged in parallel, that is, the target text data input to the target word vector generating unit is input to the first word vector generating subunit and the second word vector generating subunit at the same time, the first word vector generating subunit outputs data to the target word embedding unit, and the second word vector generating subunit outputs data to the target word embedding unit.
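A minimal sketch of the two parallel subunits for S511 through S513, reusing the objects from the earlier sketches (`bert_word_embedding` remains a hypothetical helper); each subunit returns one vector per word, or None where its dictionary lacks the word:

```python
def first_subunit_vectors(words):
    """First word vector generation subunit: Word2vec lookup."""
    return [w2v_model.wv[w] if w in w2v_model.wv else None for w in words]

def second_subunit_vectors(words):
    """Second word vector generation subunit: Bert-side lookup,
    None when the word is not in the initial language model's dictionary."""
    vocab = bert_tokenizer.get_vocab()
    return [bert_word_embedding(w) if w in vocab else None for w in words]
```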
In an embodiment, the step of generating a word embedding unit according to the target word vector generating unit, the initial language model, and the simulation matrix vector to obtain a target word embedding unit includes:
S521: constructing a word vector source Bert judgment subunit according to the target word vector generation unit, wherein the word vector source Bert judgment subunit is used for judging, for each word in the target text data input into the target word vector generation unit, whether a word vector generated by the second word vector generation subunit exists, so as to obtain a word vector source Bert judgment result corresponding to each word in the target text data;
S522: constructing a word vector alignment subunit according to the target word vector generation unit, the simulation matrix vector and the word vector source Bert judgment subunit, wherein the word vector alignment subunit is configured to, when the word vector source Bert judgment result indicates that no word vector source Bert exists, use all words corresponding to the word vector source Bert judgment result that no word vector source Bert exists as a word set to be aligned, obtain the word vectors output by the first word vector generation subunit for each word in the word set to be aligned to obtain a word vector set to be aligned, and multiply each word vector in the word vector set to be aligned by the simulation matrix vector to obtain an aligned word vector set;
S523: constructing a word vector combination subunit according to the target word vector generation unit, the word vector source Bert judgment subunit and the word vector alignment subunit, wherein the word vector combination subunit is configured to, when the word vector source Bert judgment result indicates that a word vector source Bert exists, use all words corresponding to that judgment result as a word set that does not need to be aligned, obtain the word vectors output by the second word vector generation subunit for each word in the word set that does not need to be aligned to obtain a word vector set that does not need to be aligned, and splice the aligned word vector set and the word vector set that does not need to be aligned according to the word order of the target text data to obtain target word vector data;
S524: taking the word embedding layer of the initial language model as a word embedding subunit;
S525: performing word embedding unit generation according to the word vector source Bert judgment subunit, the word vector alignment subunit, the word vector combination subunit and the word embedding subunit to obtain the target word embedding unit.
In this embodiment, the word embedding layer of the initial language model is redefined according to the target word vector generating unit and the simulation matrix vector, providing support for subsequently adopting the Word2vec-based model trained on the target field to structurally modify the Bert model obtained by training with unlimited-field sample data so that NLP tasks in the target field can be processed.
And for S521, constructing a word vector source Bert judgment subunit, and connecting the input end of the word vector source Bert judgment subunit with the output end of the second word vector generation subunit.
The working principle of the word vector source Bert judgment subunit is as follows: and judging whether the word vector generated by the second word vector generation subunit exists or not according to each word in the target text data input into the target word vector generation unit to obtain a word vector source Bert judgment result corresponding to each word in the target text data.
That is to say, for each word in the target text data input to the target word vector generation unit, determining whether the second word vector generation subunit successfully generates a word vector, determining that the word vector source Bert determination result corresponding to the word for which the second word vector generation subunit unsuccessfully generates a word vector is absent, and determining that the word vector source Bert determination result corresponding to the word for which the second word vector generation subunit successfully generates a word vector is present.
For S522, a word vector alignment subunit is constructed, and the input end of the word vector alignment subunit is connected to the output end of the first word vector generation subunit and the output end of the word vector source Bert judgment subunit, respectively.
The working principle of the word vector alignment subunit is as follows: and when the judgment result of the word vector source Bert is that the word vector source Bert does not exist, taking all words corresponding to the word vector source Bert as a word set to be aligned, respectively acquiring the word vectors output by the first word vector generation subunit according to each word in the word set to be aligned to obtain a word vector set to be aligned, and respectively multiplying each word vector in the word vector set to be aligned with the simulation matrix vector to obtain an aligned word vector set.
That is to say, when the word vector source Bert judgment result indicates that no word vector source Bert exists, it means that the corresponding word cannot be correctly identified by the second word vector generation subunit to generate a word vector, and therefore all words whose word vector source Bert judgment result is that no word vector source Bert exists are taken as the word set to be aligned; the word vectors output by the first word vector generation subunit are respectively acquired for each word in the word set to be aligned to obtain the word vector set to be aligned, thereby obtaining the word vectors generated by Word2vec; and each word vector in the word vector set to be aligned is multiplied by the simulation matrix vector, each multiplied result is taken as an aligned word vector, and all aligned word vectors are taken as the aligned word vector set, thereby realizing that a word vector generated by Word2vec is multiplied by the simulation matrix vector to simulate the word vector generated by the second word vector generation subunit.
For S523, a word vector combination subunit is constructed, and an input end of the word vector combination subunit is respectively connected to an output end of the target word vector generation unit, an output end of the word vector source Bert judgment subunit, and an output end of the word vector alignment subunit.
The working principle of the word vector combination subunit is as follows: when the judgment result of the word vector source Bert is that the word vector source Bert exists, taking all words corresponding to the word vector source Bert as word sets which do not need to be aligned according to the judgment result of the word vector source Bert, respectively obtaining word vectors output by the second word vector generation subunit according to each word in the word sets which do not need to be aligned, obtaining word vector sets which do not need to be aligned, and splicing the aligned word vector sets and the word vector sets which do not need to be aligned according to the character sequence of the target text data to obtain target word vector data.
That is to say, when the word vector source Bert exists, it means that the word corresponding to the word vector source Bert exists in the word vector source Bert determination result, which is the word vector source Bert, can be correctly identified by the second word vector generation subunit to generate the word vector, and the word corresponding to the word vector source Bert does not need to be aligned in the word vector source Bert determination result, so that all the words corresponding to the word vector source Bert in the word vector source Bert determination result are used as a word set that does not need to be aligned; respectively acquiring word vectors output by the second word vector generation subunit according to each word in the word set which does not need to be aligned to obtain a word vector set which does not need to be aligned, thereby obtaining word vectors generated by Bert; and splicing the aligned word vector set and the word vector set which does not need to be aligned according to the character sequence of the target text data, and taking the spliced data as target word vector data.
For S524, directly using the word embedding layer of the initial language model as a word embedding subunit, and connecting an input end of the word embedding subunit with an output end of the word vector combination subunit.
For S525, the word vector source Bert determining subunit, the word vector aligning subunit, the word vector combining subunit, and the word embedding subunit, which are connected, are used as word embedding units, and the obtained word embedding unit is used as the target word embedding unit.
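Putting the subunits together, a hedged sketch of the target word embedding unit's working principle: the Bert-generated vector is used where it exists (no alignment needed), otherwise the Word2vec vector is aligned with the simulation matrix vector W; the function names follow the sketches above and are illustrative:

```python
import numpy as np

def target_word_embedding_unit(words):
    """Judge each word's vector source, align where needed, and splice the
    results back together in the original word order."""
    bert_vecs = second_subunit_vectors(words)  # word vector source Bert judgment
    w2v_vecs = first_subunit_vectors(words)
    combined = []
    for bert_vec, w2v_vec in zip(bert_vecs, w2v_vecs):
        if bert_vec is not None:
            combined.append(bert_vec)       # word vector that needs no alignment
        else:
            combined.append(W @ w2v_vec)    # align Word2vec vector into Bert space
    # Assumes every word is covered by at least one subunit.
    return np.stack(combined)               # target word vector data
```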
Referring to fig. 2, the present application also proposes a language model construction apparatus for NLP task, the apparatus comprising:
the first dictionary determining module 100 is configured to obtain a first dictionary of a target Word vector generation model in a target field, where the target Word vector generation model is a model obtained based on Word2vec training;
the second dictionary determining module 200 is configured to obtain a second dictionary of an initial language model, where the initial language model is a Bert model trained by sample data in an unlimited field;
the target dictionary intersection data determining module 300 is configured to perform intersection acquisition according to the first dictionary and the second dictionary to obtain target dictionary intersection data;
the simulation matrix vector determining module 400 is configured to perform fitting unconstrained linear transformation on the intersection data of the target dictionary by using a least square method to obtain a simulation matrix vector;
and the target language model determining module 500 is configured to perform language model construction according to the initial language model, the target word vector generation model and the simulation matrix vector to obtain a target language model corresponding to the target field.
In this embodiment, a first dictionary of a target Word vector generation model of a target field is acquired, the target Word vector generation model being a model obtained based on Word2vec training; a second dictionary of an initial language model is acquired, the initial language model being a Bert model obtained by adopting sample data in an unlimited field for training; intersection acquisition is performed according to the first dictionary and the second dictionary to obtain target dictionary intersection data; fitting unconstrained linear transformation is performed on the target dictionary intersection data by a least square method to obtain a simulation matrix vector; and a language model is constructed according to the initial language model, the target word vector generation model and the simulation matrix vector to obtain a target language model corresponding to the target field. In this way, after the Bert model obtained by training with unlimited-field sample data is structurally modified with the Word2vec-based model trained on the target field, NLP tasks in the target field can be processed. Because training the Word2vec-based model on the target field requires less hardware cost and less time than unsupervised pre-training of the language model with target-field text, the hardware cost is reduced, the time spent is reduced, and the development of NLP tasks in emerging fields is facilitated.
Referring to fig. 3, a computer device, which may be a server and whose internal structure may be as shown in fig. 3, is also provided in the embodiment of the present application. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data involved in the language model construction method for NLP tasks. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a language model construction method for NLP tasks. The language model construction method for the NLP task comprises the following steps: acquiring a first dictionary of a target Word vector generation model of a target field, wherein the target Word vector generation model is a model obtained based on Word2vec training; acquiring a second dictionary of an initial language model, wherein the initial language model is a Bert model obtained by adopting sample data in an unlimited field for training; performing intersection acquisition according to the first dictionary and the second dictionary to obtain target dictionary intersection data; performing fitting unconstrained linear transformation on the target dictionary intersection data by a least square method to obtain a simulation matrix vector; and constructing a language model according to the initial language model, the target word vector generation model and the simulation matrix vector to obtain a target language model corresponding to the target field.
In this embodiment, a first dictionary of a target Word vector generation model of a target field is acquired, the target Word vector generation model being a model obtained based on Word2vec training; a second dictionary of an initial language model is acquired, the initial language model being a Bert model obtained by adopting sample data in an unlimited field for training; intersection acquisition is performed according to the first dictionary and the second dictionary to obtain target dictionary intersection data; fitting unconstrained linear transformation is performed on the target dictionary intersection data by a least square method to obtain a simulation matrix vector; and a language model is constructed according to the initial language model, the target word vector generation model and the simulation matrix vector to obtain a target language model corresponding to the target field. In this way, after the Bert model obtained by training with unlimited-field sample data is structurally modified with the Word2vec-based model trained on the target field, NLP tasks in the target field can be processed. Because training the Word2vec-based model on the target field requires less hardware cost and less time than unsupervised pre-training of the language model with target-field text, the hardware cost is reduced, the time spent is reduced, and the development of NLP tasks in emerging fields is facilitated.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements a language model construction method for NLP task, including the steps of: acquiring a first dictionary of a target Word vector generation model of a target field, wherein the target Word vector generation model is a model obtained based on Word2vec training; acquiring a second dictionary of an initial language model, wherein the initial language model is a Bert model obtained by adopting sample data in an unlimited field for training; performing intersection acquisition according to the first dictionary and the second dictionary to obtain intersection data of the target dictionaries; fitting unconstrained linear transformation is carried out on the intersection data of the target dictionaries by adopting a least square method to obtain a simulation matrix vector; and constructing a language model according to the initial language model, the target word vector generation model and the simulation matrix vector to obtain a target language model corresponding to the target field.
The executed language model construction method for the NLP task acquires a first dictionary of a target Word vector generation model of a target field, the target Word vector generation model being a model obtained based on Word2vec training; acquires a second dictionary of an initial language model, the initial language model being a Bert model obtained by adopting sample data in an unlimited field for training; performs intersection acquisition according to the first dictionary and the second dictionary to obtain target dictionary intersection data; performs fitting unconstrained linear transformation on the target dictionary intersection data by a least square method to obtain a simulation matrix vector; and constructs a language model according to the initial language model, the target word vector generation model and the simulation matrix vector to obtain a target language model corresponding to the target field. In this way, after the Bert model obtained by training with unlimited-field sample data is structurally modified with the Word2vec-based model trained on the target field, NLP tasks in the target field can be processed. Because training the Word2vec-based model on the target field requires less hardware cost and less time than unsupervised pre-training of the language model with target-field text, the hardware cost is reduced, the time spent is reduced, and the development of NLP tasks in emerging fields is facilitated.
It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above can be implemented by instructing relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium; when executed, the program can include the processes of the method embodiments described above. Any reference to memory, storage, a database, or another medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, apparatus, article, or method. Without further limitation, an element introduced by the phrase "comprising a/an ..." does not exclude the presence of other identical elements in the process, apparatus, article, or method that includes the element.
The above description is only a preferred embodiment of the present application and is not intended to limit the scope of the present application; all equivalent structural or process modifications made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of the present application.

Claims (10)

1. A method of language model construction for NLP tasks, the method comprising:
acquiring a first dictionary of a target Word vector generation model of a target field, wherein the target Word vector generation model is a model obtained through Word2vec-based training;
acquiring a second dictionary of an initial language model, wherein the initial language model is a Bert model trained on sample data from an unrestricted field;
performing intersection acquisition on the first dictionary and the second dictionary to obtain target dictionary intersection data;
performing a fitted unconstrained linear transformation on the target dictionary intersection data by a least square method to obtain a simulation matrix vector;
and constructing a language model according to the initial language model, the target word vector generation model and the simulation matrix vector to obtain a target language model corresponding to the target field.
2. The method of claim 1, wherein, before the step of acquiring the first dictionary of the target word vector generation model of the target field, the method further comprises:
acquiring a training sample set of the target field;
and training an initial word vector generation model with the training sample set, and taking the trained initial word vector generation model as the target word vector generation model.
3. The method of claim 1, wherein the step of performing intersection acquisition on the first dictionary and the second dictionary to obtain the target dictionary intersection data comprises:
performing intersection acquisition on the first dictionary and the second dictionary to obtain dictionary intersection data to be denoised;
and removing noise characters from the dictionary intersection data to be denoised to obtain the target dictionary intersection data, wherein the noise characters comprise emoticons, punctuation, and null characters.
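A minimal sketch of this denoising step, assuming the dictionary intersection is a set of token strings; the punctuation set and the emoji range check are illustrative heuristics, not the claimed character lists.

```python
import string

# ASCII punctuation plus common CJK punctuation; illustrative only.
NOISE_PUNCTUATION = set(string.punctuation) | set("，。！？；：、「」【】（）《》")

def denoise_intersection(tokens):
    """Drop emoticon, punctuation, and null-character tokens from dictionary
    intersection data, returning the target dictionary intersection data."""
    cleaned = set()
    for tok in tokens:
        if not tok or tok.isspace():                           # null / whitespace-only tokens
            continue
        if all(ch in NOISE_PUNCTUATION for ch in tok):         # pure punctuation tokens
            continue
        if any(0x1F300 <= ord(ch) <= 0x1FAFF for ch in tok):   # rough emoji/emoticon block check
            continue
        cleaned.add(tok)
    return cleaned
```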
4. The method of claim 1, wherein the simulation matrix vector is denoted by W and is calculated by the following formula:

W = \arg\min_{\hat{W}} \sum_{x \in L_{LM} \cap L_{W2v}} \left\| \hat{W}\,\varepsilon_{w2v}(x) - \varepsilon_{LM}(x) \right\|^2

wherein W is the simulation matrix vector used to align a first word vector with a second word vector, the first word vector being the word vector output by the target word vector generation model for a target word, the second word vector being the word vector output by the initial language model for the target word, and the target word being a word in the target dictionary intersection data; \varepsilon_{w2v}(x) is the first word vector output by the target word vector generation model for a word x in the target dictionary intersection data, and \varepsilon_{LM}(x) is the second word vector output by the initial language model for the word x; \arg\min_{\hat{W}} denotes the matrix that minimizes the summed expression; L_{LM} \cap L_{W2v} is the target dictionary intersection data, with L_{W2v} being the first dictionary and L_{LM} being the second dictionary; and \left\| \hat{W}\,\varepsilon_{w2v}(x) - \varepsilon_{LM}(x) \right\| denotes taking the square of \hat{W}\varepsilon_{w2v}(x) - \varepsilon_{LM}(x) and then the square root, i.e., its Euclidean norm.
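Numerically, this unconstrained least-square fit can be computed directly; the sketch below uses random illustrative data and the row-wise convention, so the matrix it solves for is the transpose of the W written above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_w2v, d_lm = 500, 100, 768              # illustrative dimensions
E_w2v = rng.normal(size=(n, d_w2v))         # rows play the role of eps_w2v(x)
E_lm = rng.normal(size=(n, d_lm))           # rows play the role of eps_LM(x)

# Solve min_W ||E_w2v @ W - E_lm||^2, the row-wise form of the claim-4 objective.
W, residuals, rank, _ = np.linalg.lstsq(E_w2v, E_lm, rcond=None)

# The same solution via the normal equations: W = (X^T X)^{-1} X^T Y.
W_normal = np.linalg.solve(E_w2v.T @ E_w2v, E_w2v.T @ E_lm)
assert np.allclose(W, W_normal)
```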
5. The method of claim 1, wherein the step of constructing a language model according to the initial language model, the target word vector generation model, and the simulation matrix vector to obtain the target language model corresponding to the target field comprises:
constructing a target word vector generation unit from the initial language model and the target word vector generation model;
constructing a target word embedding unit from the target word vector generation unit, the initial language model, and the simulation matrix vector;
removing the structure preceding the encoder from the initial language model to obtain a target word vector processing unit;
and connecting the target word vector generation unit, the target word embedding unit, and the target word vector processing unit in sequence to obtain the target language model corresponding to the target field.
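A minimal PyTorch-style sketch of the sequential connection described above, assuming the target word vector generation unit and target word embedding unit exist as callables with the simple interfaces shown, and that `bert` follows the Hugging Face `BertModel` layout in which `bert.encoder` is the encoder stack; all interfaces here are illustrative assumptions.

```python
import torch.nn as nn

class TargetLanguageModel(nn.Module):
    """Sequential connection of the three units in claim 5 (illustrative interfaces)."""

    def __init__(self, vector_unit, embedding_unit, bert):
        super().__init__()
        self.vector_unit = vector_unit        # claim 6: parallel Word2vec / Bert subunits
        self.embedding_unit = embedding_unit  # claim 7: alignment, combination, word embedding
        self.encoder = bert.encoder           # Bert with the structure before the encoder removed

    def forward(self, words):
        word_vectors, from_bert = self.vector_unit(words)      # vectors plus source flags
        hidden = self.embedding_unit(word_vectors, from_bert)  # target word vector data
        return self.encoder(hidden).last_hidden_state
```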
6. The method of claim 5, wherein the step of constructing the target word vector generation unit from the initial language model and the target word vector generation model comprises:
taking the target word vector generation model as a first word vector generation subunit;
taking the word segmenter and the word vector generator of the initial language model as a second word vector generation subunit;
and arranging the first word vector generation subunit and the second word vector generation subunit in parallel to obtain the target word vector generation unit.
7. The method of claim 6, wherein the step of constructing the target word embedding unit from the target word vector generation unit, the initial language model, and the simulation matrix vector comprises:
constructing a word vector source Bert judgment subunit from the target word vector generation unit, wherein the word vector source Bert judgment subunit is configured to judge, for each word in the target text data input into the target word vector generation unit, whether a word vector generated by the second word vector generation subunit exists, so as to obtain a word vector source Bert judgment result for each word in the target text data;
constructing a word vector alignment subunit from the target word vector generation unit, the simulation matrix vector, and the word vector source Bert judgment subunit, wherein the word vector alignment subunit is configured to take all words whose word vector source Bert judgment result indicates that no Bert-sourced word vector exists as a word set to be aligned, obtain the word vector output by the first word vector generation subunit for each word in the word set to be aligned so as to obtain a word vector set to be aligned, and multiply each word vector in the word vector set to be aligned by the simulation matrix vector to obtain an aligned word vector set;
constructing a word vector combination subunit from the target word vector generation unit, the word vector source Bert judgment subunit, and the word vector alignment subunit, wherein the word vector combination subunit is configured to take all words whose word vector source Bert judgment result indicates that a Bert-sourced word vector exists as a word set that does not need to be aligned, obtain the word vector output by the second word vector generation subunit for each word in that set so as to obtain a word vector set that does not need to be aligned, and splice the aligned word vector set and the word vector set that does not need to be aligned according to the word order of the target text data to obtain target word vector data;
taking the word embedding layer of the initial language model as a word embedding subunit;
and constructing the target word embedding unit from the word vector source Bert judgment subunit, the word vector alignment subunit, the word vector combination subunit, and the word embedding subunit.
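The per-word routing performed by these subunits can be sketched as follows, assuming simple lookup callables for both models; every name in this sketch is an illustrative assumption.

```python
import numpy as np

def embed_words(words, bert_vocab, bert_lookup, w2v_lookup, W):
    """Word vector source Bert judgment plus alignment: words found in the Bert
    vocabulary keep their Bert word vector; the rest have their Word2vec vector
    multiplied by the simulation matrix, and all vectors are spliced back in the
    original word order of the target text data."""
    vectors = []
    for word in words:
        if word in bert_vocab:                   # a Bert-sourced word vector exists
            vectors.append(bert_lookup(word))
        else:                                    # no Bert source: align the Word2vec vector
            vectors.append(w2v_lookup(word) @ W)
    return np.stack(vectors)
```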
8. A language model construction apparatus for an NLP task, the apparatus comprising:
a first dictionary determining module, used for acquiring a first dictionary of a target Word vector generation model of a target field, wherein the target Word vector generation model is a model obtained through Word2vec-based training;
a second dictionary determining module, used for acquiring a second dictionary of an initial language model, wherein the initial language model is a Bert model trained on sample data from an unrestricted field;
a target dictionary intersection data determining module, used for performing intersection acquisition on the first dictionary and the second dictionary to obtain target dictionary intersection data;
a simulation matrix vector determining module, used for performing a fitted unconstrained linear transformation on the target dictionary intersection data by a least square method to obtain a simulation matrix vector;
and a target language model determining module, used for constructing a language model according to the initial language model, the target word vector generation model, and the simulation matrix vector to obtain a target language model corresponding to the target field.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202110602682.3A 2021-05-31 2021-05-31 Language model construction method, device, equipment and medium for NLP task Active CN113204961B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110602682.3A CN113204961B (en) 2021-05-31 2021-05-31 Language model construction method, device, equipment and medium for NLP task

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110602682.3A CN113204961B (en) 2021-05-31 2021-05-31 Language model construction method, device, equipment and medium for NLP task

Publications (2)

Publication Number Publication Date
CN113204961A 2021-08-03
CN113204961B CN113204961B (en) 2023-12-19

Family

ID=77024355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110602682.3A Active CN113204961B (en) 2021-05-31 2021-05-31 Language model construction method, device, equipment and medium for NLP task

Country Status (1)

Country Link
CN (1) CN113204961B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887216A (en) * 2021-10-20 2022-01-04 美的集团(上海)有限公司 Word vector increment method, electronic device and computer storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460838A (en) * 2020-04-23 2020-07-28 腾讯科技(深圳)有限公司 Pre-training method and device of intelligent translation model and storage medium
CN111737996A (en) * 2020-05-29 2020-10-02 北京百度网讯科技有限公司 Method, device and equipment for obtaining word vector based on language model and storage medium
CN111737994A (en) * 2020-05-29 2020-10-02 北京百度网讯科技有限公司 Method, device and equipment for obtaining word vector based on language model and storage medium
US20200334416A1 (en) * 2019-04-16 2020-10-22 Covera Health Computer-implemented natural language understanding of medical reports
CN112365003A (en) * 2020-11-16 2021-02-12 浙江百应科技有限公司 Method for adjusting NLP model capacity based on big data
CN112528037A (en) * 2020-12-04 2021-03-19 北京百度网讯科技有限公司 Edge relation prediction method, device, equipment and storage medium based on knowledge graph
CN112541343A (en) * 2020-12-03 2021-03-23 昆明理工大学 Semi-supervised counterstudy cross-language abstract generation method based on word alignment
CN112749557A (en) * 2020-08-06 2021-05-04 腾讯科技(深圳)有限公司 Text processing model construction method and text processing method


Also Published As

Publication number Publication date
CN113204961B (en) 2023-12-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant