CN116028821A

CN116028821A - Pre-training model training method integrating domain knowledge and data processing method

Info

Publication number: CN116028821A
Application number: CN202310314738.4A
Authority: CN
Inventors: 黄海峰; 熊子奇; 孙丽娟; 曹扬; 李响; 蔡惠民; 谢真强
Original assignee: CETC Big Data Research Institute Co Ltd
Current assignee: CETC Big Data Research Institute Co Ltd
Priority date: 2023-03-29
Filing date: 2023-03-29
Publication date: 2023-04-28
Anticipated expiration: 2043-03-29
Also published as: CN116028821B

Abstract

The invention provides a training method and a data processing method for a pre-training model fusing domain knowledge, wherein after judging that a model processing request and a target domain sample set are received, a server invokes a first domain sample set corresponding to each first pre-training model in a database; obtaining sample set similarity coefficients of a plurality of first field sample sets and a target field sample set, wherein the first field sample set with the highest similarity coefficient or the second highest similarity coefficient is used as a second field sample set; determining a target training sample different from the second training sample, generating a differential sample set based on the determined target training sample; and taking the first pre-training model corresponding to the second field sample set as a second pre-training model, controlling the second pre-training model to perform word segmentation on the difference training sentences to obtain at least one training word, and correspondingly storing the corresponding relation between the slot positions and the training sentences and the corresponding slot position templates to obtain a final model.

Description

Pre-training model training method integrating domain knowledge and data processing method

Technical Field

The invention relates to the technical field of data processing, in particular to a training method and a data processing method for a pre-training model fusing knowledge in the field.

Background

A pre-trained model is a model that is trained on a large amount of data and stored. The model is generally understood as a model created by a former person to solve similar problems, when a new problem is encountered, the model is not required to be trained from scratch, and the model can be directly used for starting with the model, so that the new problem can be solved by simple learning.

In an actual application scene, there may be pre-training models in multiple scenes, taking a natural language processing field as an example, different pre-training models are available in the elevator interaction field and the intelligent home interaction field, for example, one field needs to be developed as a car-to-machine interaction field, and training can be performed on the basis of the elevator interaction field and the intelligent home interaction field at this time, namely, the current interaction model in the elevator interaction field or the intelligent home interaction field is used as the pre-training model, and the pre-training model is continuously trained to obtain a new training model corresponding to the required field.

In the prior art, the most suitable pre-training model cannot be rapidly determined to perform subsequent data processing according to the deployment scene requirement of a user, so that the calculation effect of the corresponding model after deployment is poor. Therefore, a technical scheme is needed to be able to integrate domain knowledge, and perform corresponding selection and retraining in a plurality of pre-training models, so that the corresponding models have a better calculation effect after deployment.

Disclosure of Invention

The embodiment of the invention provides a training method and a data processing method for a pre-training model, which are used for fusing domain knowledge, and can be used for carrying out corresponding selection and retraining in a plurality of pre-training models, so that the training can be fast and efficiently carried out, a final model with comprehensive functions can be obtained, and the calculation effect of the corresponding model after deployment is better.

In a first aspect of the embodiment of the present invention, a training method for a pre-training model fusing domain knowledge is provided, including:

after judging that a model processing request and a target field sample set sent by a request end are received, a server invokes a first field sample set corresponding to each first pre-training model in a database, wherein the target field of the target field sample set is the current application field of the request end, the first field sample set is pre-stored sample data of the first field, the first field is a plurality of preset interactive application fields, each first field comprises a first pre-training model corresponding to the first field sample set, and samples included in the target field sample set and the first field sample set are corpus samples extracted from the corresponding fields;

traversing each target training sample in the target field sample set in sequence, comparing the target field sample with first training samples in the first field sample set, determining target training samples identical to or corresponding to the first training samples, counting first numbers of identical or corresponding target training samples in each first field sample set and second numbers of different or non-corresponding target training samples, calculating the similarity based on the first numbers and the second numbers to respectively obtain sample set similarity coefficients of a plurality of first field sample sets and the target field sample sets, comparing all sample set similarity coefficients, and taking the first field sample set with the highest similarity coefficient or the second highest similarity coefficient as the second field sample set;

Traversing each target training sample in the target field sample set in turn, comparing the target field sample with a second training sample in a second field sample set, determining a target training sample different from the second training sample, generating a difference sample set based on the determined target training sample, wherein each difference training sample in the difference sample set at least comprises a difference training sentence;

and taking the first pre-training model corresponding to the second field sample set as a second pre-training model, controlling the second pre-training model to perform word segmentation processing on the difference training sentences to obtain at least one training word, constructing a slot template corresponding to the difference training sentences according to the training word, and correspondingly storing the corresponding relation between the slot and the training sentences and the corresponding slot template to obtain a final model.

Optionally, in one possible implementation manner of the first aspect, the traversing each target training sample in the target field sample set sequentially, comparing the target field sample with a first training sample in a first field sample set, determining a first number of target training samples that are the same as or correspond to the first training sample, counting a second number of target training samples that are the same as or correspond to each first field sample set, and calculating the similarity based on the first number and the second number, to obtain sample set similarity coefficients of a plurality of first field sample sets and the target field sample set, and comparing all sample set similarity coefficients, where the first field sample set with the highest similarity coefficient or the second highest similarity coefficient is used as the second field sample set, including:

Sorting all the first field sample sets in a descending order according to the sample set similarity coefficient, and taking the first field sample set with the highest sample set similarity coefficient as a second field sample set;

and if the difference of the similarity coefficients between the first field sample set with the highest similarity coefficient of the sample set and the first field sample set with the second highest similarity coefficient of the sample set is smaller than the preset difference value, displaying the first field sample set with the highest and second highest similarity coefficient.

Optionally, in one possible implementation manner of the first aspect, counting a first number of identical or corresponding target training samples in each first domain sample set and a second number of different or non-corresponding target training samples, calculating the similarity based on the first number and the second number, to obtain sample set similarity coefficients of the plurality of first domain sample sets and the target domain sample set, respectively, including:

calculating according to the first quantity and the total quantity of the target training samples in the target field sample set to obtain the same evaluation sub-coefficient of the first field sample set and the target field sample set;

calculating according to the second quantity and the total quantity of the target training samples in the target field sample set to obtain different evaluation sub-coefficients of the first field sample set and the target field sample set;

Respectively carrying out weighting treatment on the same evaluation sub-coefficient and the different evaluation sub-coefficients to obtain a sample set similarity coefficient of the first field sample set and the target field sample set, calculating the sample set similarity coefficient by the following formula,

，/>

，/>

wherein X is _Sim For the sample set similarity coefficient of the first domain sample set and the target domain sample set,

for the same evaluator coefficient->

For different evaluation sub-coefficients, S _ide First number of training samples for the same or corresponding target, +.>

For the total number of target training samples within the target domain sample set,/>

for the first calculation weight, S _dif For a second number of different or non-corresponding target training samples +.>

For the second calculation weight ∈>

To calculate a constant;

wherein the preset difference is 0.05.

Optionally, in one possible implementation manner of the first aspect, the method further includes:

if the user is judged to take the second highest first field sample set as the second field sample set, the original highest first field sample set is not taken as the second field sample set;

then the first number of the next highest first domain sample set is taken as the first to-be-compared number, the second number of the next highest first domain sample set is taken as the second to-be-compared number, and the first number of the highest first domain sample set is taken as the third to-be-compared number, and the second number of the highest first domain sample set is taken as the fourth to-be-compared number;

And if the first quantity to be compared, the second quantity to be compared, the third quantity to be compared and the fourth quantity to be compared meet the preset conditions, training the first calculation weight or the second calculation weight to obtain a trained third calculation weight or fourth calculation weight.

Optionally, in one possible implementation manner of the first aspect, if the first to-be-compared number, the second to-be-compared number, the third to-be-compared number, and the fourth to-be-compared number meet a preset condition, training the first calculation weight or the second calculation weight to obtain a trained third calculation weight or fourth calculation weight, including:

if the first to-be-compared number is larger than the third to-be-compared number and the second to-be-compared number is larger than the fourth to-be-compared number, judging that a preset condition is met;

performing increasing training on the first calculation weight to obtain a third calculation weight after increasing training;

if the first to-be-compared number is smaller than the third to-be-compared number and the second to-be-compared number is smaller than the fourth to-be-compared number, judging that a preset condition is met;

and performing increasing training on the second calculated weight to obtain a fourth calculated weight after increasing training.

Optionally, in a possible implementation manner of the first aspect, the training to increase the first computation weight to obtain a third computation weight after the training includes:

calculating according to the difference between the similarity coefficients and the sample set similarity coefficient of the highest first field sample set to obtain an increased training proportion, and obtaining a third calculated weight after the increased training according to the first calculated weight and the increased training proportion;

and performing augmentation training on the second calculation weight to obtain a fourth calculation weight after augmentation training, wherein the method comprises the following steps:

calculating according to the difference between the similarity coefficients and the sample set similarity coefficient of the highest first field sample set to obtain an increased training proportion, and obtaining a fourth calculation weight after the increased training according to the second calculation weight and the increased training proportion;

the third calculation weight after the increase training or the fourth calculation weight after the increase training is calculated by the following formula,

wherein k is _ide ³ To increase the third calculated weight after training, X ^fir _Sim For the highest similarity coefficient, X _Sim ^sec K is the next highest similarity coefficient _dif ⁴ To increase trainingThe fourth calculation weight to be obtained later is that,

to increase the training ratio. / >

Optionally, in one possible implementation manner of the first aspect, the taking the first pre-training model corresponding to the second domain sample set as the second pre-training model, controlling the second pre-training model to perform word segmentation processing on the differential training sentence to obtain at least one training word, constructing a slot template corresponding to the differential training sentence according to the training word, and correspondingly storing a corresponding relation between a slot and the training sentence and the corresponding slot template to obtain a final model, where the step includes:

extracting all difference training samples in the difference sample set, wherein each difference training sample at least comprises one difference training statement, and the difference training statement has preset instruction information and/or preset feedback statement corresponding to the difference training statement;

controlling a second pre-training model to perform word segmentation processing on the difference training sentences to obtain at least one training word, and constructing a slot template corresponding to the difference training sentences according to the training word, wherein the slot template at least comprises a first slot;

numbering all first slots in the slot templates, determining the corresponding relation between each slot number and the training words, and storing the corresponding relation and the corresponding slot templates correspondingly to obtain a final model.

Optionally, in one possible implementation manner of the first aspect, the controlling the second pre-training model to perform word segmentation processing on the differential training sentence to obtain at least one training word, and constructing a slot template corresponding to the differential training sentence according to the training word, where the slot template includes at least one first slot, and includes:

according to the position relation of each training word, a first slot position corresponding to each training word is established in a slot position template;

determining synonym words corresponding to each training word according to a preset word library, wherein the word library is provided with corresponding relations between the training words and the synonym words;

and counting each training word and all corresponding synonym words to generate a word set.

Optionally, in one possible implementation manner of the first aspect, the numbering all the first slots in the slot templates, determining a correspondence between each slot number and a training word, and storing the correspondence and the corresponding slot template correspondingly to obtain a final model, where the obtaining includes:

numbering all the first slots in the slot template in ascending order according to the sequence to obtain the slot number corresponding to each first slot;

And determining the corresponding relation between the slot number and the word set according to the corresponding relation between the slot number and the training words, and correspondingly storing the corresponding relation and the corresponding slot template to obtain a final model.

In a second aspect of the embodiment of the present invention, a data processing method is provided, where a final model obtained by training in the first aspect of the embodiment of the present invention is configured, and the method further includes:

receiving a user control sentence, performing word segmentation processing on the control sentence to obtain at least one control word, and numbering all the control words in ascending order according to the time sequence of the control word;

determining a word set corresponding to a first slot of the minimum number of the slot templates, and taking the corresponding slot template as the slot template to be screened if judging that the training words in the word set correspond to the control words;

comparing the word sets of other first slots of all the slot templates to be screened with control words of other numbers;

if all the first slots in the slot templates to be screened are judged to be completely corresponding to the control words, the corresponding slot templates to be screened are used as output slot templates, and preset instruction information and/or preset feedback sentences corresponding to the output slot templates are output.

Optionally, in one possible implementation manner of the second aspect, if it is determined that all the first slots in the slot templates to be screened completely correspond to the control word, the corresponding slot templates to be screened are used as output slot templates, and the outputting of the preset instruction information and/or the preset feedback statement corresponding to the output slot templates includes:

if the maximum number of the control word is judged to be not corresponding to the maximum slot number of the first slot, deleting the non-corresponding slot template;

if the maximum number of the control word corresponds to the maximum slot number of the first slots, comparing the control word with the word set of each first slot according to the number of the control word, and if the control word with the same number corresponds to the corresponding word set, judging that all the first slots in the slot template to be screened correspond to the control word completely.

In a third aspect of embodiments of the present invention, there is provided a storage medium having stored therein a computer program for implementing the method of the first aspect and the various possible designs of the first aspect when the computer program is executed by a processor.

The training method and the data processing method for the pre-training model fusing domain knowledge can compare the target domain sample set with the first domain sample set of the first pre-training model, further determine a second pre-training model meeting the current training requirement in a plurality of first pre-training models, obtain a difference sample set according to the difference between the target domain sample set and the second domain sample set, and train the second pre-training model again by combining the difference sample set to obtain a final model. The method has the advantages that the method can train the final model belonging to the corresponding unique knowledge field on the basis of the prior trained model, has the advantage of high training efficiency, trains the second pre-training model again according to the difference sample set, enables the final model to be more comprehensive, and can meet the interaction scene of the corresponding knowledge field.

When the similarity coefficient of the sample set is calculated, the first quantity of the target training samples which are the same or corresponding to the first field sample set and the target field sample set and the second quantity of the target training samples which are different or not corresponding to the first field sample set are comprehensively considered, the similarity relation between the first field sample set and the model corresponding to the target field sample set can be reflected through the first quantity, and the volume and useless data volume of the corresponding model can be reflected through the second quantity. The method and the device can comprehensively consider a plurality of dimensions when calculating the similarity coefficient, so that the calculated similarity coefficient is more fit with a corresponding application scene.

According to the invention, the second field sample set is adjusted by combining with the user, and when the user is judged to actively adjust the second field sample set, the first calculation weight and the second calculation weight for calculating the similarity coefficient of the sample set are continuously trained, so that the trained third calculation weight and fourth calculation weight are more in line with the current calculation and application scene.

Drawings

FIG. 1 is a schematic view of an application scenario of the technical scheme provided by the invention;

FIG. 2 is a flow chart of a first embodiment of a pre-training model training method incorporating domain knowledge;

Fig. 3 is a flow chart of a data processing method.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein.

It should be understood that, in various embodiments of the present invention, the sequence number of each process does not mean that the execution sequence of each process should be determined by its functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

It should be understood that in the present invention, "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements that are expressly listed or inherent to such process, method, article, or apparatus.

It should be understood that in the present invention, "plurality" means two or more. "and/or" is merely an association relationship describing an association object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship. "comprising A, B and C", "comprising A, B, C" means that all three of A, B, C comprise, "comprising A, B or C" means that one of the three comprises A, B, C, and "comprising A, B and/or C" means that any 1 or any 2 or 3 of the three comprises A, B, C.

It should be understood that in the present invention, "B corresponding to a", "a corresponding to B", or "B corresponding to a" means that B is associated with a, from which B can be determined. Determining B from a does not mean determining B from a alone, but may also determine B from a and/or other information. The matching of A and B is that the similarity of A and B is larger than or equal to a preset threshold value.

As used herein, "if" may be interpreted as "at … …" or "at … …" or "in response to a determination" or "in response to detection" depending on the context.

The technical scheme of the invention is described in detail below by specific examples. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.

As shown in fig. 1, an application scenario schematic diagram of the technical scheme provided by the invention includes a request end, a server and a database, when the request end has a model requirement, the request end sends a corresponding model processing request and a target field sample set to the server, the target field sample set at this time can be a preliminary target field sample set preconfigured by a worker, and the server screens the pre-training model based on the target field sample set and a first field sample set corresponding to each pre-training model in the database, so as to obtain a pre-training model requiring subsequent training of knowledge depth of the corresponding field, and trains the corresponding pre-training model, thereby improving training efficiency of the model.

The invention provides a training method of a pre-training model fusing domain knowledge, which is shown in fig. 2 and comprises the following steps:

Step S110, after judging that a model processing request and a target field sample set sent by a request end are received, a server invokes a first field sample set corresponding to each first pre-training model in a database, wherein the target field of the target field sample set is the current application field of the request end, the first field sample set is pre-stored sample data of the first field, the first field is a plurality of preset interactive application fields, each first field comprises a first pre-training model corresponding to the first field sample set, and samples included in the target field sample set and the first field sample set are corpus samples extracted by corresponding fields. After receiving the model processing request and the target domain sample set, the method and the device can obtain all first domain sample sets corresponding to the first pre-training models.

The model processing request may be a natural language model processing request, an image processing request, or the like, and the target domain sample set may be preset by a staff member, and taking a natural language model as an example, different application scenarios may have different target domain sample sets. For example, the natural language model has corresponding application in the multi-subdivision fields of intelligent home interaction, elevator interaction and the like, and the intelligent home interaction and the elevator interaction are respectively models which are trained previously, namely, an intelligent home interaction model and an elevator interaction model exist, and the intelligent home interaction model and the elevator interaction model can be regarded as a first pre-training model. The request end needs to build a vehicle-computer interaction model, and the corresponding model processing request can be the vehicle-computer interaction model which needs to build natural language processing, and the target field sample set has corpus samples in the vehicle-computer interaction process.

Corpus samples in the target domain sample set, target training samples such as: "open door", "close door", "open window", "close window", "open air conditioner", "close air conditioner", "raise temperature", "lower temperature", etc.

A first pre-training model, such as a smart home interaction model, corpus samples within a first domain sample set of the smart home interaction model, a first training sample, such as: "turn on television", "turn off television", "turn on air conditioner", "turn off air conditioner", "raise temperature", "lower temperature", etc.

A first pre-training model, e.g. an elevator interaction model, corpus samples within a first domain sample set of the elevator interaction model, a first training sample, e.g.: "open door", "close door", "to floor 16", "to floor 18", etc.

Step S120, traversing each target training sample in the target field sample set in turn, comparing the target field sample with the first training samples in the first field sample set, determining the target training samples identical to or corresponding to the first training samples, counting the first number of the target training samples identical to or corresponding to each first field sample set, and calculating the similarity based on the first number and the second number of the target training samples different from or not corresponding to each first field sample set, respectively obtaining sample set similarity coefficients of a plurality of first field sample sets and the target field sample sets, comparing all sample set similarity coefficients, and taking the first field sample set with the highest similarity coefficient or the second field sample set with the highest similarity coefficient as the second field sample set. In an actual application scenario, different models may have partially identical training samples, and in particular, in the field of natural language processing, languages corresponding to intelligent control implemented between different application scenarios may be identical and corresponding. It can be understood that if more training samples are identical between two models, the similarity between the two models is possibly larger, so that new training samples can be added for training on the basis of another model to obtain a model which is more suitable for a corresponding scene at present, and therefore the invention needs to calculate the sample set similarity coefficient of the target field sample set and the target field sample set, and further determine a second field sample set with higher similarity coefficient with the sample set.

In one possible implementation manner, the step S120 includes:

step S1201, traversing each target training sample in the target field sample set in turn, comparing the target field sample with the first training sample in the first field sample set, and determining a target training sample identical to or corresponding to the first training sample. The invention can compare the target training sample with each first training sample in sequence, and the comparison process can be to directly compare the sentences corresponding to the target training sample with the sentences corresponding to the first training samples.

Step S1202, counting a first number of the same or corresponding target training samples in each first field sample set and a second number of different or non-corresponding target training samples, and calculating the similarity based on the first number and the second number to obtain sample set similarity coefficients of the plurality of first field sample sets and the target field sample set respectively. The greater the first number of identical or corresponding target training samples, the greater the similarity between the expected final model to be trained and the corresponding first pre-trained model is demonstrated. The more the second number of different or non-corresponding target training samples, the greater the distinction between the desired final model and the corresponding first pre-trained model, which proves to be needed to be trained.

According to the method, when the similarity coefficient of the sample set is determined, the first number of the same or corresponding target training samples and the second number of different or non-corresponding target training samples are comprehensively considered, the situation that some first pre-training models are the same as the expected final model but have large model body can be avoided, and further the first pre-training models with large model body can be removed in the method, so that the first pre-training model with the highest similarity coefficient is probably not the same or the first number of the corresponding target training samples is the largest, and further the moderate body quantity of the first pre-training models can be ensured not to be excessively large when the first pre-training model which is the most similar to the final model is found.

In one possible implementation manner, the calculating the first number of the same or corresponding target training samples and the second number of different or non-corresponding target training samples in each first field sample set, calculating the similarity based on the first number and the second number, and respectively obtaining sample set similarity coefficients of a plurality of first field sample sets and the target field sample sets includes:

And calculating according to the first quantity and the total quantity of the target training samples in the target field sample set to obtain the same evaluation sub-coefficient of the first field sample set and the target field sample set. The total number of target training samples in the target field sample set can reflect the volume of the corresponding first pre-training model, and if the same evaluation sub-coefficient is larger, the first pre-training model is more similar to the expected final model under the volume of the corresponding model data volume.

And calculating according to the second quantity and the total quantity of the target training samples in the target field sample set to obtain different evaluation sub-coefficients of the first field sample set and the target field sample set. The larger the different evaluation sub-coefficients, the more different the first pre-trained model is proved to be from the desired final model at the volume of the corresponding model data volume.

，/>

，/>

For the same evaluator coefficient->

For the total number of target training samples within the target field sample set, +.>

For the second calculation weight ∈>

To calculate the constant. The invention can be realized by->

Calculating to obtain the same evaluation sub-coefficient by/>

Different evaluation sub-coefficients are calculated, and the same evaluation sub-coefficient is +.>

Subtracting different evaluator coefficients->

Obtaining an evaluation sub-coefficient difference value, and performing ∈10 according to the evaluation sub-coefficient difference value and a preset calculation constant>

And obtaining a final sample set similarity coefficient.

In an actual computing scenario, it may occur that

At this time, the present invention classifies the similarity coefficient obtained under the scene as 0.

The first calculation weight

Second calculation weight->

During initialization, the staff sets, first calculation weight +.>

Preferably greater than the second calculation weight +.>

. In different application scenarios, the computation power of the configured hardware device is different, which in turn results in different data volume requirements for the model in different scenarios. For example, in an application scenario with high computational power, it is possible to allow the volume and data volume of the final model to be large, the first computational weight at this time +. >

Is greater than the second calculation weight +.>

And first calculation weight +.>

And a second calculation weight->

The difference in (c) may be relatively small. For example, in an application scenario with weak calculation power, the volume and data volume of the final model are required to be small, and the first calculation weight at this time is +.>

Is greater than the second calculation weight +.>

And first calculation weight +.>

And a second calculation weight->

Is relatively large. First calculation weight->

Always greater than the second calculation weight +.>

But the quantity relation between the two can be relatively adjusted, and the volume requirements of different deployment scenes on different models can be met by the mode.

Step S1203, sorting all the first domain sample sets in descending order according to the sample set similarity coefficient, and taking the first domain sample set with the highest sample set similarity coefficient as the second domain sample set. According to the technical scheme provided by the invention, all the first field sample sets are ordered in a descending order according to the similarity coefficients of the sample sets, so that the first field sample set with the highest similarity coefficient is arranged at the front part, the first field sample set with the lowest similarity coefficient is arranged at the last part, the first field sample set with the highest similarity coefficient of the sample sets is preferentially used as the second field sample set, and the second field sample set at the moment can be regarded as being obtained by calculation.

If all the sample set similarity coefficients are smaller than the threshold similarity, judging that a first pre-training model corresponding to the expected final model does not exist at the moment, and reminding the user accordingly at the moment.

Step S1204, if it is determined that the difference between the similarity coefficient between the first domain sample set with the highest similarity coefficient of the sample sets and the first domain sample set with the next highest similarity coefficient of the sample sets is smaller than the preset difference, displaying the first domain sample set with the highest similarity coefficient and the next highest similarity coefficient, where the preset difference is 0.05. At this time, the similarity of the first field sample set with the highest similarity coefficient and the next highest similarity coefficient is relatively close, and at this time, the first field sample set with the highest similarity coefficient and the next highest similarity coefficient can be displayed.

In the above scenario, there are several possibilities:

scene 1: the first number of first domain sample sets with highest similarity coefficients is greater than the first number of first domain sample sets with second highest similarity coefficients, and the second number of first domain sample sets with highest similarity coefficients is less than the second number of first domain sample sets with second highest similarity coefficients;

scene 2: the first number of first domain sample sets with highest similarity coefficients is greater than the first number of first domain sample sets with second highest similarity coefficients, and the second number of first domain sample sets with highest similarity coefficients is greater than the second number of first domain sample sets with second highest similarity coefficients;

Scene 3: the first number of first domain sample sets having the highest similarity coefficient is less than the first number of first domain sample sets having the next highest similarity coefficient, and the second number of first domain sample sets having the highest similarity coefficient is less than the second number of first domain sample sets having the next highest similarity coefficient.

In the scene 1, the first pre-training model corresponding to the highest similarity coefficient is smaller than the second highest first pre-training model in the body size dimension, and the first pre-training model corresponding to the highest similarity coefficient is larger than the second highest first pre-training model in the same training sample dimension. The highest first pre-training model is superior to the next highest first pre-training model in terms of body mass, sample similarity.

In the scene 2, the first pre-training model corresponding to the highest similarity coefficient is larger than the second highest first pre-training model in the body size dimension, and the first pre-training model corresponding to the highest similarity coefficient is larger than the second highest first pre-training model in the same training sample dimension. The highest first pretrained model is now physically inferior to the next highest first pretrained model, and the highest first pretrained model is superior to the next highest first pretrained model in sample similarity.

In the scene 3, the first pre-training model corresponding to the highest similarity coefficient is smaller than the second highest first pre-training model in the body size dimension, and the first pre-training model corresponding to the highest similarity coefficient is smaller than the second highest first pre-training model in the same training sample dimension. The highest first pretrained model is now superior in body weight to the next highest first pretrained model, and the highest first pretrained model is inferior in sample similarity to the next highest first pretrained model.

It can be seen that the first pre-training models corresponding to the highest similarity coefficients in the scene 2 and the scene 3 have different advantages and disadvantages, respectively, and the invention outputs two similar first pre-training models which may have different advantages and disadvantages, further determines a final second pre-training model according to the selection of the user, and combines the selection of the user to continuously train the formula for calculating the similarity coefficients.

In one possible implementation manner, the technical scheme provided by the invention further comprises:

if the user is judged to take the second highest first field sample set as the second field sample set, the original highest first field sample set is not taken as the second field sample set. At this time, the user may consider that the second domain sample set selected by the present invention does not meet the current scene requirement, so it may take the second highest first domain sample set as the second domain sample set.

The first number of next highest first-domain sample sets is taken as the first number to be compared, the second number of next highest first-domain sample sets is taken as the second number to be compared, and the first number of highest first-domain sample sets is taken as the third number to be compared, and the second number of highest first-domain sample sets is taken as the fourth number to be compared. At this time, the present invention analyzes the second highest first domain sample set and the highest first domain sample set, and obtains a first to-be-compared number, a second to-be-compared number, a third to-be-compared number, and a fourth to-be-compared number to be analyzed.

It will be understood that, when the situation 2 is the case, the preset condition may be met, that is, the first pre-training model corresponding to the second highest first domain sample set is superior to the first pre-training model with the highest volume, and the second highest first pre-training model is inferior to the first pre-training model with the highest volume in terms of sample similarity, but the deployment scenario of the corresponding model requires the final model with the lower volume, so that the user may use the first domain sample set with the lower volume of data as the second domain sample set.

In the case of the scene 3, the condition that the preset condition is met may occur, that is, the first pre-training model corresponding to the second highest first field sample set is inferior to the highest first pre-training model in the volume, the second highest first pre-training model is superior to the highest first pre-training model in the sample similarity, at this time, the deployment scene of the corresponding model may meet the requirement that the model with larger volume operates and the final model with more samples is required, so at this time, the user may use the first field sample set with more samples as the second field sample set.

In one possible implementation manner of the present invention, if the first to-be-compared number, the second to-be-compared number, the third to-be-compared number, and the fourth to-be-compared number satisfy a preset condition, training the first calculation weight or the second calculation weight to obtain a trained third calculation weight or fourth calculation weight, including:

and if the first to-be-compared number is larger than the third to-be-compared number and the second to-be-compared number is larger than the fourth to-be-compared number, judging that a preset condition is met. At this time, that is, the case of scene 3, the first number of the first-domain sample sets with the second highest similarity coefficient is greater than the first number of the first-domain sample sets with the highest similarity coefficient, and the second number of the first-domain sample sets with the second highest similarity coefficient is greater than the second number of the first-domain sample sets with the highest similarity coefficient.

And performing increasing training on the first calculated weight to obtain a third calculated weight after increasing training. At this time, the first calculation weight needs to be trained in an increasing manner, so that the third calculation weight is larger than the second calculation weight, and further the same evaluation sub-coefficient calculated later is relatively larger, and further the sample set similarity coefficient is relatively larger in the scene 3.

And if the first to-be-compared number is smaller than the third to-be-compared number and the second to-be-compared number is smaller than the fourth to-be-compared number, judging that a preset condition is met. At this time, that is, the case of scene 2, the first number of the first domain sample sets with the second highest similarity coefficient is smaller than the first number of the first domain sample sets with the highest similarity coefficient, and the second number of the first domain sample sets with the second highest similarity coefficient is smaller than the second number of the first domain sample sets with the highest similarity coefficient.

And performing increasing training on the second calculated weight to obtain a fourth calculated weight after increasing training. At this time, the second calculation weight needs to be trained in an increasing manner, so that the fourth calculation weight is larger relative to the first calculation weight, and further, the different subsequently calculated evaluation sub-coefficients are relatively larger, and further, the sample set similarity coefficient is relatively larger under the scene 2.

In one possible implementation manner, the method performs an increasing training on the first calculation weight to obtain a third calculation weight after the increasing training, and includes:

and calculating according to the difference of the similarity coefficients and the sample set similarity coefficient of the highest first field sample set to obtain an increased training proportion, and obtaining a third calculated weight after the increased training according to the first calculated weight and the increased training proportion. According to the method, the difference between the sample set similarity coefficient of the highest first-field sample set and the sample set similarity coefficient of the next highest first-field sample set is calculated, if the difference between the sample set similarity coefficients is larger, the training proportion is relatively larger, and the third training weight is obtained by combining the training proportion on the basis of the original first training weight.

and calculating according to the difference of the similarity coefficients and the similarity coefficient of the sample set of the highest first field sample set to obtain an increased training proportion, and obtaining a fourth calculation weight after the increased training according to the second calculation weight and the increased training proportion. According to the method, the calculation is carried out according to the difference between the sample set similarity coefficient of the highest first-field sample set and the sample set similarity coefficient of the next highest first-field sample set, if the difference between the sample set similarity coefficients is larger, the training proportion is relatively larger, and the fourth calculation weight after training is obtained by combining the training proportion on the basis of the original second calculation weight.

wherein k is _ide ³ To increase the third calculated weight after training, X ^fir _Sim For the highest similarity coefficient, X _Sim ^sec K is the next highest similarity coefficient _dif ⁴ To increase the fourth computational weight after training,

to increase the training ratio. By passing through

The difference of similarity coefficients can be obtained according to +.>

An increased training ratio can be obtained, and the invention will be based on +.>

The value to be added for obtaining the first calculation weight is according to +.>

And obtaining a final third calculation weight. The invention will be based on->

The value to be added for obtaining the second calculation weight is according to +.>

And obtaining the final fourth calculation weight. Through the training in the mode, the third calculation weight or the fourth calculation weight obtained by the method is more in accordance with the calculation scene of the method, so that the similarity coefficient of the sample set calculated later is relatively more accurate, and the sample set similarity coefficient accords with the corresponding calculation scene.

Step S130, traversing each target training sample in the target field sample set in turn, comparing the target field sample with a second training sample in the second field sample set, determining a target training sample different from the second training sample, and generating a difference sample set based on the determined target training sample, wherein each difference training sample in the difference sample set at least comprises a difference training statement. After the second field sample set is obtained, the target field sample set is compared with the second field sample set, and then the corresponding difference sample set is determined.

It will be appreciated that the differential training samples in the differential sample set are samples that the final model needs to be trained, but that have not been trained in the second pre-training model. Such as "turn on a double flash" in car-machine interaction, etc.

In one possible implementation manner, the step S130 includes:

and traversing each target training sample in the target field sample set in sequence, comparing the target field sample with a second training sample in a second field sample set, and determining a target training sample different from the second training sample. According to the invention, a comparison is carried out between the target field sample and the second training sample, so that the target training sample different from the second training sample is determined, the condition of sample omission is avoided, the repeated samples in the target training sample and the second training sample are not subjected to secondary training during retraining, the efficiency of obtaining a final model is improved, and the functional integrity of the final model is ensured.

A differential sample set is generated based on the determined target training samples. After the target training sample is traversed, a difference sample set is generated.

And step 140, taking the first pre-training model corresponding to the second field sample set as a second pre-training model, controlling the second pre-training model to perform word segmentation processing on the differential training sentences to obtain at least one training word, constructing a slot template corresponding to the differential training sentences according to the training word, and correspondingly storing the corresponding relation between the slot and the training sentences and the corresponding slot template to obtain a final model. According to the method, the second pre-training model is trained again according to the difference sample set and the corresponding domain knowledge, so that the trained final model can meet the processing requirements of the specific knowledge domain. The terms "turn on double flashing" and "double flashing" are unique to the field of vehicle-machine interaction, so that the corresponding final model needs to be trained based on differential training samples such as "turn on double flashing".

In one possible implementation manner, the step S140 includes:

extracting all the difference training samples in the difference sample set, wherein each difference training sample at least comprises one difference training statement, and the difference training statement is provided with preset instruction information and/or preset feedback statement corresponding to the difference training statement. According to the invention, all the difference training samples in the difference sample set are firstly extracted, and as the final model trained by the invention is a natural language processing model, the difference training samples comprise corresponding difference training sentences which possibly have preset instruction information and/or preset feedback sentences corresponding to the difference training sentences. For example, the preset instruction information corresponding to the "turn on double flashing lights" can be to control the indicator lights to strobe according to the preset frequency. For example, when the dual flash is turned on, interaction with a person is required, and the preset feedback statement at this time may be "you confirm how to turn on the dual flash", and after receiving a confirmation statement input by the person, corresponding preset instruction information is output. Therefore, different differential training sentences may have different, corresponding preset instruction information and/or preset feedback sentences.

Controlling a second pre-training model to perform word segmentation processing on the difference training sentences to obtain at least one training word, and constructing a slot template corresponding to the difference training sentences according to the training word, wherein the slot template at least comprises a first slot. In general, the pre-training model used as natural language processing has a semantic analysis function, so that the invention can perform word segmentation processing on the difference training sentences according to the second pre-training model to obtain corresponding training words. And, constructing a slot template corresponding to the difference training sentence according to the training words, wherein different training words may correspond to different slot templates, but one slot template may at least comprise one first slot.

Numbering all first slots in the slot templates, determining the corresponding relation between each slot number and the training words, and storing the corresponding relation and the corresponding slot templates correspondingly to obtain a final model. The present invention will number all the first slots in each slot template, such as slot 1, slot 2, slot 3, etc. The invention can determine the corresponding relation between each slot number and the training words, the training words in all slot templates can have sequential relation, and the final model can comprise the corresponding relation between each slot number and the training words in the slot templates, so that the follow-up voice recognition is more accurate.

According to the technical scheme provided by the invention, in one possible implementation mode, the second pre-training model is controlled to perform word segmentation processing on the difference training sentences to obtain at least one training word, a slot template corresponding to the difference training sentences is constructed according to the training word, the slot template at least comprises a first slot, and the method comprises the following steps:

and performing word segmentation processing on the difference training sentences to obtain a plurality of first words, and extracting the first words which are nouns or verbs as training words. The invention firstly performs word segmentation processing on the difference training sentences, for example, the word segmentation can be performed by opening a double flash lamp, and the invention takes nouns or the first words of verbs as training words, and the word segmentation process is performed by opening the double flash lamp, which is the verb, and the word segmentation process is performed by opening the double flash lamp, which is the noun.

And according to the position relation of each training word, establishing a first slot corresponding to each training word in the slot template. According to the invention, the first slot positions are established according to the position relation of training words, two first slot positions are arranged in the slot position template corresponding to the double-flash lamp opening, the first slot positions correspond to the opening, and the second slot positions correspond to the double-flash lamp.

And determining the synonym corresponding to each training word according to a preset word library, wherein the word library is provided with the corresponding relation between the training word and the synonym. In an actual interaction scene, a plurality of words may exist in a manner of expressing one meaning by a user, for example, the method of turning on a double-flash lamp can also be used for turning on a double-flash lamp, and the method of turning on and turning on can be regarded as synonymous words, so that the synonymous words corresponding to each training word are required to be determined according to a preset word library, and further sentences with the same meaning and different words can be identified.

And counting each training word and all corresponding synonym words to generate a word set. According to the invention, all training words and synonymous words are counted to obtain a final word set, and each first slot position may correspond to one word set.

In one possible implementation manner, the method for numbering all the first slots in the slot templates, determining the corresponding relation between each slot number and the training word, and storing the corresponding relation and the corresponding slot templates correspondingly to obtain a final model includes:

And numbering all the first slots in the slot templates in an ascending order according to the sequence to obtain the slot numbers corresponding to each first slot. The invention carries out ascending serial numbers aiming at all first slots in the slot template, for example, the slot number corresponding to the opening is slot 1, and the slot number corresponding to the double flashing lamp is slot 2.

And determining the corresponding relation between the slot number and the word set according to the corresponding relation between the slot number and the training words, and correspondingly storing the corresponding relation and the corresponding slot template to obtain a final model. The invention can combine the corresponding relation between the slot number and the training words to store, and obtain the final model. The invention takes the second pre-training model as the basic model to train, and obtains the corresponding final model, so that the final model has the corresponding domain knowledge and is easy to train.

The invention also provides a data processing method, which configures the final model trained by the technical scheme, as shown in fig. 3, and further comprises:

step S210, receiving a user control sentence, performing word segmentation processing on the control sentence to obtain at least one control word, and numbering all the control words in ascending order according to the time sequence of the control word. For example, when the user inputs a control sentence of "turn on double flashing lamps", the corresponding control words may include "turn on" and "double flashing lamps", and the invention numbers all the control words in ascending order according to the time sequence of the control words, at this time, the number of "turn on" may be 1, and the number of "double flashing lamps" may be 2.

Step S220, determining a word set corresponding to a first slot of the minimum number of the slot templates, and taking the corresponding slot templates as the slot templates to be screened if the training words in the word set are judged to correspond to the control words. When the invention is compared, the word set corresponding to the first slot position with the minimum number of the slot position templates is firstly obtained, for example, at least two slot position templates exist at the moment, the first training samples corresponding to the two slot position templates are respectively "double flashing lights are turned on" and "air conditioner is turned on", the word set corresponding to the first slot position with the minimum number of the two slot position templates can comprise [ open and open ], and the sentences corresponding to the slot position templates to be screened at the moment can be the "double flashing lights are turned on" and the air conditioner is turned on ".

Step S230, comparing the word sets of other first slots of the slot templates to be screened with the control words of other numbers. Through step S220, all the slot templates are screened for the first time, so that the number of the slot templates is greatly reduced, and at the moment, the invention can accurately screen, namely, word sets of other first slots of all the slot templates to be screened are compared with control words of other numbers.

And step 240, if all the first slots in the slot templates to be screened are judged to be completely corresponding to the control words, the corresponding slot templates to be screened are used as output slot templates, and the preset instruction information and/or the preset feedback statement corresponding to the output slot templates are output. At this time, a completely corresponding slot position template can be determined according to all control words, and preset instruction information and/or preset feedback sentences corresponding to the slot position template are output, for example, a car machine controls double-flashing lamps to perform double-flashing indication.

In one possible implementation manner, the step S240 includes:

if the maximum number of the control word is judged to be not corresponding to the maximum slot number of the first slot, deleting the non-corresponding slot template. The method and the device can rapidly determine the non-corresponding slot templates, delete the corresponding slot templates, realize the purpose of preliminarily locking the slot templates to be screened, and have the advantage of high determination efficiency.

If the maximum number of the control word corresponds to the maximum slot number of the first slots, comparing the control word with the word set of each first slot according to the number of the control word, and if the control word with the same number corresponds to the corresponding word set, judging that all the first slots in the slot template to be screened correspond to the control word completely. According to the invention, through the steps, the slot position templates are accurately screened, word sets in the first slot positions of the second numbers are compared with the control words of the second numbers, and if the word sets are completely corresponding to the control words of the second numbers, all the first slot positions in the slot position templates are completely corresponding to all the control words. At this time, corresponding preset instruction information and/or preset feedback sentences are output according to the slot position template.

The present invention also provides a storage medium having stored therein a computer program for implementing the methods provided by the various embodiments described above when executed by a processor.

The storage medium may be a computer storage medium or a communication medium. Communication media includes any medium that facilitates transfer of a computer program from one place to another. Computer storage media can be any available media that can be accessed by a general purpose or special purpose computer. For example, a storage medium is coupled to the processor such the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application specific integrated circuit (Application Specific Integrated Circuits, ASIC for short). In addition, the ASIC may reside in a user device. The processor and the storage medium may reside as discrete components in a communication device. The storage medium may be read-only memory (ROM), random-access memory (RAM), CD-ROMs, magnetic tape, floppy disk, optical data storage device, etc.

The present invention also provides a program product comprising execution instructions stored in a storage medium. The at least one processor of the device may read the execution instructions from the storage medium, the execution instructions being executed by the at least one processor to cause the device to implement the methods provided by the various embodiments described above.

In the above embodiments of the terminal or the server, it should be understood that the processor may be a central processing unit (english: central Processing Unit, abbreviated as CPU), or may be other general purpose processors, digital signal processors (english: digital Signal Processor, abbreviated as DSP), application specific integrated circuits (english: application Specific Integrated Circuit, abbreviated as ASIC), or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in a processor for execution.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims

1. A training method of a pre-training model fusing domain knowledge is characterized by comprising the following steps:

2. The method for training a pre-training model with fusion of domain knowledge according to claim 1, wherein,

the step of traversing each target training sample in the target field sample set in turn, comparing the target field sample with first training samples in the first field sample set, determining the target training samples identical to or corresponding to the first training samples, counting the first number of the target training samples identical to or corresponding to each first field sample set and the second number of different or non-corresponding target training samples, calculating the similarity based on the first number and the second number, respectively obtaining sample set similarity coefficients of a plurality of first field sample sets and the target field sample sets, comparing all sample set similarity coefficients, and taking the first field sample set with the highest similarity coefficient or the second highest similarity coefficient as the second field sample set, and comprises the following steps:

3. The method for training a pre-training model with fusion of domain knowledge according to claim 2, wherein,

counting the first number of the same or corresponding target training samples in each first field sample set and the second number of different or non-corresponding target training samples, calculating the similarity based on the first number and the second number to respectively obtain sample set similarity coefficients of a plurality of first field sample sets and the target field sample sets, wherein the method comprises the following steps:

，/>

，/>

for the same evaluator coefficient->

For the second calculation weight ∈>

To calculate a constant;

wherein the preset difference is 0.05.

4. A method of training a pre-training model incorporating domain knowledge as claimed in claim 3, further comprising:

5. The method for training a pre-training model with fusion of domain knowledge according to claim 4, wherein,

if the first to-be-compared number, the second to-be-compared number, the third to-be-compared number and the fourth to-be-compared number meet the preset conditions, training the first calculation weight or the second calculation weight to obtain a trained third calculation weight or fourth calculation weight, wherein the training comprises the following steps:

6. The method for training a pre-training model with fusion of domain knowledge according to claim 5, wherein,

and performing augmentation training on the first calculation weight to obtain a third calculation weight after augmentation training, wherein the method comprises the following steps:

to increase the training ratio.

7. The method for training a pre-training model with fusion of domain knowledge according to claim 6, wherein,

the step of using the first pre-training model corresponding to the second field sample set as a second pre-training model, controlling the second pre-training model to perform word segmentation processing on the differential training sentences to obtain at least one training word, constructing a slot template corresponding to the differential training sentences according to the training word, and correspondingly storing the corresponding relation between the slot and the training sentences and the corresponding slot template to obtain a final model, wherein the step of:

8. The method for training a pre-training model with fusion of domain knowledge according to claim 7,

the second pre-training model is controlled to perform word segmentation processing on the difference training sentences to obtain at least one training word, a slot template corresponding to the difference training sentences is constructed according to the training word, and the slot template at least comprises a first slot and comprises:

9. The method for training a pre-training model with fusion of domain knowledge according to claim 8, wherein,

numbering all first slots in the slot templates, determining the corresponding relation between each slot number and the training words, and storing the corresponding relation and the corresponding slot templates correspondingly to obtain a final model, wherein the method comprises the following steps:

10. A data processing method for configuring the final model trained in claim 9, further comprising:

11. The method for processing data according to claim 10, wherein,

if it is determined that all the first slots in the slot templates to be screened completely correspond to the control words, the corresponding slot templates to be screened are used as output slot templates, and preset instruction information and/or preset feedback sentences corresponding to the output slot templates are output, including: