CN116028821A - Pre-training model training method integrating domain knowledge and data processing method - Google Patents

Pre-training model training method integrating domain knowledge and data processing method

Info

Publication number
CN116028821A
CN116028821A CN202310314738.4A CN202310314738A CN116028821A
Authority
CN
China
Prior art keywords
training
sample set
slot
target
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310314738.4A
Other languages
Chinese (zh)
Other versions
CN116028821B (en)
Inventor
黄海峰
熊子奇
孙丽娟
曹扬
李响
蔡惠民
谢真强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC Big Data Research Institute Co Ltd
Original Assignee
CETC Big Data Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC Big Data Research Institute Co Ltd filed Critical CETC Big Data Research Institute Co Ltd
Priority to CN202310314738.4A priority Critical patent/CN116028821B/en
Publication of CN116028821A publication Critical patent/CN116028821A/en
Application granted granted Critical
Publication of CN116028821B publication Critical patent/CN116028821B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a training method and a data processing method for a pre-training model fusing domain knowledge. After determining that a model processing request and a target field sample set have been received, a server retrieves from a database the first field sample set corresponding to each first pre-training model; sample set similarity coefficients between the plurality of first field sample sets and the target field sample set are obtained, and the first field sample set with the highest or second-highest similarity coefficient is taken as a second field sample set; target training samples different from the second training samples are determined, and a difference sample set is generated based on the determined target training samples; the first pre-training model corresponding to the second field sample set is taken as a second pre-training model, the second pre-training model is controlled to segment the difference training sentences into at least one training word each, and the correspondence between slots and training sentences is stored together with the corresponding slot templates to obtain a final model.

Description

Pre-training model training method integrating domain knowledge and data processing method
Technical Field
The invention relates to the technical field of data processing, and in particular to a training method and a data processing method for a pre-training model fusing domain knowledge.
Background
A pre-trained model is a model that has been trained on a large amount of data and stored. It can be understood as a model that others have created to solve a similar problem: when a new problem is encountered, a model does not need to be trained from scratch; the pre-trained model can be used directly as a starting point, so that the new problem can be solved with only a small amount of additional learning.
In practical application scenarios, pre-training models may exist for multiple scenes. Taking the natural language processing field as an example, the elevator interaction field and the intelligent home interaction field each have their own pre-training model. If a new field such as vehicle-machine interaction needs to be developed, training can be performed on the basis of the elevator interaction field or the intelligent home interaction field, that is, the existing interaction model of the elevator interaction field or the intelligent home interaction field is used as the pre-training model, and that pre-training model is trained further to obtain a new model corresponding to the required field.
In the prior art, the most suitable pre-training model for subsequent data processing cannot be determined quickly according to the requirements of the user's deployment scenario, so the corresponding model performs poorly after deployment. A technical scheme is therefore needed that can integrate domain knowledge and perform the corresponding selection and retraining among a plurality of pre-training models, so that the corresponding model performs better after deployment.
Disclosure of Invention
The embodiment of the invention provides a training method and a data processing method for a pre-training model fusing domain knowledge, which perform the corresponding selection and retraining among a plurality of pre-training models, so that training is fast and efficient, a final model with comprehensive functions is obtained, and the corresponding model performs better after deployment.
In a first aspect of the embodiment of the present invention, a training method for a pre-training model fusing domain knowledge is provided, including:
after judging that a model processing request and a target field sample set sent by a request end are received, a server invokes a first field sample set corresponding to each first pre-training model in a database, wherein the target field of the target field sample set is the current application field of the request end, the first field sample set is pre-stored sample data of the first field, the first field is a plurality of preset interactive application fields, each first field comprises a first pre-training model corresponding to the first field sample set, and samples included in the target field sample set and the first field sample set are corpus samples extracted from the corresponding fields;
traversing each target training sample in the target field sample set in sequence, comparing the target field sample with first training samples in the first field sample set, determining target training samples identical to or corresponding to the first training samples, counting first numbers of identical or corresponding target training samples in each first field sample set and second numbers of different or non-corresponding target training samples, calculating the similarity based on the first numbers and the second numbers to respectively obtain sample set similarity coefficients of a plurality of first field sample sets and the target field sample sets, comparing all sample set similarity coefficients, and taking the first field sample set with the highest similarity coefficient or the second highest similarity coefficient as the second field sample set;
Traversing each target training sample in the target field sample set in turn, comparing the target field sample with a second training sample in a second field sample set, determining a target training sample different from the second training sample, generating a difference sample set based on the determined target training sample, wherein each difference training sample in the difference sample set at least comprises a difference training sentence;
and taking the first pre-training model corresponding to the second field sample set as a second pre-training model, controlling the second pre-training model to perform word segmentation processing on the difference training sentences to obtain at least one training word, constructing a slot template corresponding to the difference training sentences according to the training word, and correspondingly storing the corresponding relation between the slot and the training sentences and the corresponding slot template to obtain a final model.
Optionally, in one possible implementation manner of the first aspect, the traversing each target training sample in the target field sample set in sequence, comparing it with the first training samples in the first field sample sets, determining the target training samples that are the same as or correspond to the first training samples, counting in each first field sample set a first number of the same or corresponding target training samples and a second number of different or non-corresponding target training samples, calculating the similarity based on the first number and the second number to obtain the sample set similarity coefficients of the plurality of first field sample sets with the target field sample set, comparing all sample set similarity coefficients, and taking the first field sample set with the highest or second-highest similarity coefficient as the second field sample set, includes:
Sorting all the first field sample sets in a descending order according to the sample set similarity coefficient, and taking the first field sample set with the highest sample set similarity coefficient as a second field sample set;
and if the difference of the similarity coefficients between the first field sample set with the highest similarity coefficient of the sample set and the first field sample set with the second highest similarity coefficient of the sample set is smaller than the preset difference value, displaying the first field sample set with the highest and second highest similarity coefficient.
Optionally, in one possible implementation manner of the first aspect, counting a first number of identical or corresponding target training samples in each first domain sample set and a second number of different or non-corresponding target training samples, calculating the similarity based on the first number and the second number, to obtain sample set similarity coefficients of the plurality of first domain sample sets and the target domain sample set, respectively, including:
calculating according to the first quantity and the total quantity of the target training samples in the target field sample set to obtain the same evaluation sub-coefficient of the first field sample set and the target field sample set;
calculating according to the second quantity and the total quantity of the target training samples in the target field sample set to obtain different evaluation sub-coefficients of the first field sample set and the target field sample set;
Respectively performing weighting processing on the same evaluation sub-coefficient and the different evaluation sub-coefficient to obtain the sample set similarity coefficient of the first field sample set and the target field sample set, the sample set similarity coefficient being calculated by the following formula,

$$P_{ide} = k_{ide}\cdot\frac{S_{ide}}{S_{sum}},\qquad P_{dif} = k_{dif}\cdot\frac{S_{dif}}{S_{sum}},\qquad X_{Sim} = \varepsilon\cdot\left(P_{ide} - P_{dif}\right)$$

wherein X_Sim is the sample set similarity coefficient of the first field sample set and the target field sample set, P_ide is the same evaluation sub-coefficient, P_dif is the different evaluation sub-coefficient, S_ide is the first number of the same or corresponding target training samples, S_sum is the total number of target training samples in the target field sample set, k_ide is the first calculation weight, S_dif is the second number of different or non-corresponding target training samples, k_dif is the second calculation weight, and ε is a calculation constant;

wherein the preset difference is 0.05.
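As an illustrative sketch only, the following Python function shows one way the sample set similarity coefficient could be computed from the first number, the second number and the total number of target training samples; the function name, the default weight values and the exact combination of the two evaluation sub-coefficients are assumptions consistent with the reconstruction above:

```python
def sample_set_similarity(s_ide: int, s_dif: int, s_sum: int,
                          k_ide: float = 0.7, k_dif: float = 0.3,
                          epsilon: float = 1.0) -> float:
    """Sample set similarity coefficient of one first field sample set.

    s_ide -- first number: target training samples that are the same as or
             correspond to a first training sample
    s_dif -- second number: target training samples that are different or
             non-corresponding
    s_sum -- total number of target training samples in the target field sample set
    """
    p_ide = k_ide * s_ide / s_sum   # same evaluation sub-coefficient
    p_dif = k_dif * s_dif / s_sum   # different evaluation sub-coefficient
    x_sim = epsilon * (p_ide - p_dif)
    return max(x_sim, 0.0)          # a negative result is classified as 0
```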
Optionally, in one possible implementation manner of the first aspect, the method further includes:
if the user is judged to take the second highest first field sample set as the second field sample set, the original highest first field sample set is not taken as the second field sample set;
then the first number of the next highest first domain sample set is taken as the first to-be-compared number, the second number of the next highest first domain sample set is taken as the second to-be-compared number, and the first number of the highest first domain sample set is taken as the third to-be-compared number, and the second number of the highest first domain sample set is taken as the fourth to-be-compared number;
And if the first quantity to be compared, the second quantity to be compared, the third quantity to be compared and the fourth quantity to be compared meet the preset conditions, training the first calculation weight or the second calculation weight to obtain a trained third calculation weight or fourth calculation weight.
Optionally, in one possible implementation manner of the first aspect, if the first to-be-compared number, the second to-be-compared number, the third to-be-compared number, and the fourth to-be-compared number meet a preset condition, training the first calculation weight or the second calculation weight to obtain a trained third calculation weight or fourth calculation weight, including:
if the first to-be-compared number is larger than the third to-be-compared number and the second to-be-compared number is larger than the fourth to-be-compared number, judging that a preset condition is met;
performing increasing training on the first calculation weight to obtain a third calculation weight after increasing training;
if the first to-be-compared number is smaller than the third to-be-compared number and the second to-be-compared number is smaller than the fourth to-be-compared number, judging that a preset condition is met;
and performing increasing training on the second calculated weight to obtain a fourth calculated weight after increasing training.
Optionally, in a possible implementation manner of the first aspect, the performing increase training on the first calculation weight to obtain a third calculation weight after the increase training includes:
calculating according to the difference between the similarity coefficients and the sample set similarity coefficient of the highest first field sample set to obtain an increased training proportion, and obtaining a third calculated weight after the increased training according to the first calculated weight and the increased training proportion;
and performing augmentation training on the second calculation weight to obtain a fourth calculation weight after augmentation training, wherein the method comprises the following steps:
calculating according to the difference between the similarity coefficients and the sample set similarity coefficient of the highest first field sample set to obtain an increased training proportion, and obtaining a fourth calculation weight after the increased training according to the second calculation weight and the increased training proportion;
the third calculation weight after the increase training or the fourth calculation weight after the increase training is calculated by the following formula,

$$r = \frac{X^{fir}_{Sim} - X^{sec}_{Sim}}{X^{fir}_{Sim}},\qquad k^{3}_{ide} = k_{ide}\cdot(1 + r),\qquad k^{4}_{dif} = k_{dif}\cdot(1 + r)$$

wherein k_ide^3 is the third calculation weight after the increase training, X_Sim^fir is the highest similarity coefficient, X_Sim^sec is the next-highest similarity coefficient, k_dif^4 is the fourth calculation weight after the increase training, and r is the increase training ratio.
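A minimal sketch, assuming the ratio form reconstructed above (the difference of the two similarity coefficients divided by the highest one, with the weight then scaled by one plus that ratio); the function name is illustrative only:

```python
def increase_training(weight: float, x_sim_highest: float, x_sim_second: float) -> float:
    """Increase-train a calculation weight using the increase training ratio."""
    ratio = (x_sim_highest - x_sim_second) / x_sim_highest   # increase training ratio
    return weight * (1.0 + ratio)

# k_ide_3 = increase_training(k_ide, x_fir_sim, x_sim_sec)   # third calculation weight
# k_dif_4 = increase_training(k_dif, x_fir_sim, x_sim_sec)   # fourth calculation weight
```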
Optionally, in one possible implementation manner of the first aspect, the taking the first pre-training model corresponding to the second domain sample set as the second pre-training model, controlling the second pre-training model to perform word segmentation processing on the differential training sentence to obtain at least one training word, constructing a slot template corresponding to the differential training sentence according to the training word, and correspondingly storing a corresponding relation between a slot and the training sentence and the corresponding slot template to obtain a final model, where the step includes:
extracting all difference training samples in the difference sample set, wherein each difference training sample at least comprises one difference training statement, and the difference training statement has preset instruction information and/or preset feedback statement corresponding to the difference training statement;
controlling a second pre-training model to perform word segmentation processing on the difference training sentences to obtain at least one training word, and constructing a slot template corresponding to the difference training sentences according to the training word, wherein the slot template at least comprises a first slot;
numbering all first slots in the slot templates, determining the corresponding relation between each slot number and the training words, and storing the corresponding relation and the corresponding slot templates correspondingly to obtain a final model.
Optionally, in one possible implementation manner of the first aspect, the controlling the second pre-training model to perform word segmentation processing on the differential training sentence to obtain at least one training word, and constructing a slot template corresponding to the differential training sentence according to the training word, where the slot template includes at least one first slot, and includes:
according to the position relation of each training word, a first slot position corresponding to each training word is established in a slot position template;
determining synonym words corresponding to each training word according to a preset word library, wherein the word library is provided with corresponding relations between the training words and the synonym words;
and counting each training word and all corresponding synonym words to generate a word set.
Optionally, in one possible implementation manner of the first aspect, the numbering all the first slots in the slot templates, determining a correspondence between each slot number and a training word, and storing the correspondence and the corresponding slot template correspondingly to obtain a final model, where the obtaining includes:
numbering all the first slots in the slot template in ascending order according to the sequence to obtain the slot number corresponding to each first slot;
And determining the corresponding relation between the slot number and the word set according to the corresponding relation between the slot number and the training words, and correspondingly storing the corresponding relation and the corresponding slot template to obtain a final model.
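The slot templates described above can be pictured with a small sketch; the data structures (Slot, SlotTemplate), the synonym library format, and the treatment of the second pre-training model's word segmentation as an arbitrary callable are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class Slot:
    number: int              # slot number, ascending by word position
    word_set: List[str]      # training word plus its synonym words

@dataclass
class SlotTemplate:
    slots: List[Slot]
    instruction: Optional[str] = None   # preset instruction information
    feedback: Optional[str] = None      # preset feedback sentence

def build_slot_template(sentence: str,
                        segment: Callable[[str], List[str]],
                        synonym_library: Dict[str, List[str]],
                        instruction: Optional[str] = None,
                        feedback: Optional[str] = None) -> SlotTemplate:
    """Build the slot template for one difference training sentence."""
    training_words = segment(sentence)                       # word segmentation
    slots = []
    for number, word in enumerate(training_words, start=1):  # number slots in ascending order
        word_set = [word] + synonym_library.get(word, [])    # training word and its synonyms
        slots.append(Slot(number=number, word_set=word_set))
    return SlotTemplate(slots=slots, instruction=instruction, feedback=feedback)
```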
In a second aspect of the embodiment of the present invention, a data processing method is provided, where a final model obtained by training in the first aspect of the embodiment of the present invention is configured, and the method further includes:
receiving a user control sentence, performing word segmentation processing on the control sentence to obtain at least one control word, and numbering all the control words in ascending order according to the time sequence of the control word;
determining a word set corresponding to a first slot of the minimum number of the slot templates, and taking the corresponding slot template as the slot template to be screened if judging that the training words in the word set correspond to the control words;
comparing the word sets of other first slots of all the slot templates to be screened with control words of other numbers;
if all the first slots in the slot templates to be screened are judged to be completely corresponding to the control words, the corresponding slot templates to be screened are used as output slot templates, and preset instruction information and/or preset feedback sentences corresponding to the output slot templates are output.
Optionally, in one possible implementation manner of the second aspect, if it is determined that all the first slots in the slot templates to be screened completely correspond to the control word, the corresponding slot templates to be screened are used as output slot templates, and the outputting of the preset instruction information and/or the preset feedback statement corresponding to the output slot templates includes:
if the maximum number of the control word is judged to be not corresponding to the maximum slot number of the first slot, deleting the non-corresponding slot template;
if the maximum number of the control word corresponds to the maximum slot number of the first slots, comparing the control word with the word set of each first slot according to the number of the control word, and if the control word with the same number corresponds to the corresponding word set, judging that all the first slots in the slot template to be screened correspond to the control word completely.
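For the matching procedure of the second aspect, a rough sketch follows; the template dictionary layout is hypothetical, and "corresponds" is simplified to membership of a control word in the slot's word set:

```python
from typing import Callable, Dict, List, Optional, Tuple

def match_control_sentence(control_sentence: str,
                           segment: Callable[[str], List[str]],
                           templates: List[Dict]) -> Optional[Tuple]:
    """Match a user control sentence against the stored slot templates.

    Each template is assumed to be of the form
        {"word_sets": [words_for_slot_1, words_for_slot_2, ...],
         "instruction": ..., "feedback": ...}
    with the word sets ordered by slot number.
    """
    control_words = segment(control_sentence)        # control words in time order
    for template in templates:
        word_sets = template["word_sets"]
        if len(word_sets) != len(control_words):     # maximum slot number must match
            continue                                  # the maximum control word number
        if all(word in word_sets[i] for i, word in enumerate(control_words)):
            # all first slots correspond to the control words: output slot template
            return template.get("instruction"), template.get("feedback")
    return None
```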
In a third aspect of embodiments of the present invention, there is provided a storage medium having stored therein a computer program for implementing the method of the first aspect and the various possible designs of the first aspect when the computer program is executed by a processor.
The training method and the data processing method for the pre-training model fusing domain knowledge provided by the invention compare the target field sample set with the first field sample set of each first pre-training model, determine among the plurality of first pre-training models a second pre-training model that meets the current training requirement, obtain a difference sample set from the difference between the target field sample set and the second field sample set, and train the second pre-training model again with the difference sample set to obtain the final model. In this way, a final model belonging to the corresponding, unique knowledge field can be trained on the basis of a previously trained model, which gives high training efficiency; retraining the second pre-training model on the difference sample set makes the final model more comprehensive, so that it can meet the interaction scenarios of the corresponding knowledge field.
When the similarity coefficient of the sample set is calculated, the first quantity of the target training samples which are the same or corresponding to the first field sample set and the target field sample set and the second quantity of the target training samples which are different or not corresponding to the first field sample set are comprehensively considered, the similarity relation between the first field sample set and the model corresponding to the target field sample set can be reflected through the first quantity, and the volume and useless data volume of the corresponding model can be reflected through the second quantity. The method and the device can comprehensively consider a plurality of dimensions when calculating the similarity coefficient, so that the calculated similarity coefficient is more fit with a corresponding application scene.
According to the invention, the second field sample set is adjusted by combining with the user, and when the user is judged to actively adjust the second field sample set, the first calculation weight and the second calculation weight for calculating the similarity coefficient of the sample set are continuously trained, so that the trained third calculation weight and fourth calculation weight are more in line with the current calculation and application scene.
Drawings
FIG. 1 is a schematic view of an application scenario of the technical scheme provided by the invention;
FIG. 2 is a flow chart of a first embodiment of a pre-training model training method incorporating domain knowledge;
Fig. 3 is a flow chart of a data processing method.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein.
It should be understood that, in various embodiments of the present invention, the sequence number of each process does not mean that the execution sequence of each process should be determined by its functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
It should be understood that in the present invention, "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements that are expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present invention, "plurality" means two or more. "and/or" is merely an association relationship describing an association object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship. "comprising A, B and C", "comprising A, B, C" means that all three of A, B, C comprise, "comprising A, B or C" means that one of the three comprises A, B, C, and "comprising A, B and/or C" means that any 1 or any 2 or 3 of the three comprises A, B, C.
It should be understood that in the present invention, "B corresponding to a", "a corresponding to B", or "B corresponding to a" means that B is associated with a, from which B can be determined. Determining B from a does not mean determining B from a alone, but may also determine B from a and/or other information. The matching of A and B is that the similarity of A and B is larger than or equal to a preset threshold value.
As used herein, "if" may be interpreted as "at … …" or "at … …" or "in response to a determination" or "in response to detection" depending on the context.
The technical scheme of the invention is described in detail below by specific examples. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
As shown in fig. 1, the application scenario of the technical scheme provided by the invention includes a request end, a server and a database. When the request end has a model requirement, it sends a corresponding model processing request and a target field sample set to the server; the target field sample set at this point may be a preliminary target field sample set preconfigured by a worker. The server screens the pre-training models based on the target field sample set and the first field sample set corresponding to each pre-training model in the database, obtains the pre-training model that requires subsequent training with knowledge of the corresponding field, and trains that pre-training model, thereby improving the training efficiency of the model.
The invention provides a training method of a pre-training model fusing domain knowledge, which is shown in fig. 2 and comprises the following steps:
Step S110, after judging that a model processing request and a target field sample set sent by a request end are received, a server invokes a first field sample set corresponding to each first pre-training model in a database, wherein the target field of the target field sample set is the current application field of the request end, the first field sample set is pre-stored sample data of the first field, the first field is a plurality of preset interactive application fields, each first field comprises a first pre-training model corresponding to the first field sample set, and samples included in the target field sample set and the first field sample set are corpus samples extracted by corresponding fields. After receiving the model processing request and the target domain sample set, the method and the device can obtain all first domain sample sets corresponding to the first pre-training models.
The model processing request may be a natural language model processing request, an image processing request, or the like, and the target field sample set may be preset by a staff member. Taking a natural language model as an example, different application scenarios may have different target field sample sets. For example, the natural language model has corresponding applications in multiple subdivided fields such as intelligent home interaction and elevator interaction; intelligent home interaction and elevator interaction each have a previously trained model, that is, an intelligent home interaction model and an elevator interaction model exist, and both can be regarded as first pre-training models. If the request end needs to build a vehicle-machine interaction model, the corresponding model processing request may be a request to build a vehicle-machine interaction model for natural language processing, and the target field sample set contains corpus samples from the vehicle-machine interaction process.
Corpus samples in the target domain sample set, target training samples such as: "open door", "close door", "open window", "close window", "open air conditioner", "close air conditioner", "raise temperature", "lower temperature", etc.
A first pre-training model, such as a smart home interaction model, corpus samples within a first domain sample set of the smart home interaction model, a first training sample, such as: "turn on television", "turn off television", "turn on air conditioner", "turn off air conditioner", "raise temperature", "lower temperature", etc.
A first pre-training model, e.g. an elevator interaction model, corpus samples within a first domain sample set of the elevator interaction model, a first training sample, e.g.: "open door", "close door", "to floor 16", "to floor 18", etc.
Step S120, traversing each target training sample in the target field sample set in turn, comparing it with the first training samples in the first field sample sets, determining the target training samples that are the same as or correspond to the first training samples, counting in each first field sample set a first number of the same or corresponding target training samples and a second number of different or non-corresponding target training samples, calculating the similarity based on the first number and the second number to obtain the sample set similarity coefficients of the plurality of first field sample sets with the target field sample set, comparing all sample set similarity coefficients, and taking the first field sample set with the highest or second-highest similarity coefficient as the second field sample set. In an actual application scenario, different models may have partially identical training samples; in particular, in the field of natural language processing, the language used for intelligent control may be identical or corresponding across different application scenarios. It can be understood that if more training samples are identical between two models, the similarity between the two models is likely to be larger, so new training samples can be added and trained on the basis of the other model to obtain a model better suited to the current scene. The invention therefore calculates the sample set similarity coefficient between each first field sample set and the target field sample set, and then determines a second field sample set with a higher similarity coefficient to the target field sample set.
In one possible implementation manner, the step S120 includes:
step S1201, traversing each target training sample in the target field sample set in turn, comparing the target field sample with the first training sample in the first field sample set, and determining a target training sample identical to or corresponding to the first training sample. The invention can compare the target training sample with each first training sample in sequence, and the comparison process can be to directly compare the sentences corresponding to the target training sample with the sentences corresponding to the first training samples.
Step S1202, counting a first number of the same or corresponding target training samples in each first field sample set and a second number of different or non-corresponding target training samples, and calculating the similarity based on the first number and the second number to obtain sample set similarity coefficients of the plurality of first field sample sets and the target field sample set respectively. The greater the first number of identical or corresponding target training samples, the greater the similarity between the expected final model to be trained and the corresponding first pre-trained model is demonstrated. The more the second number of different or non-corresponding target training samples, the greater the distinction between the desired final model and the corresponding first pre-trained model, which proves to be needed to be trained.
According to the method, when determining the sample set similarity coefficient, the first number of the same or corresponding target training samples and the second number of different or non-corresponding target training samples are considered together. This avoids the situation in which a first pre-training model closely matches the expected final model but has an excessively large model volume: first pre-training models with a large volume are penalized, so the first pre-training model with the highest similarity coefficient is not necessarily the one with the largest first number of same or corresponding target training samples. In this way, when the first pre-training model most similar to the final model is found, its volume is also guaranteed not to be excessively large.
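As a toy illustration of counting the first and second numbers, using the example corpora given above and simplifying "same or corresponding" to exact string equality:

```python
# Example corpora taken from the description above; "same or corresponding"
# is simplified here to exact string equality.
target_samples = {"open door", "close door", "open window", "close window",
                  "open air conditioner", "close air conditioner",
                  "raise temperature", "lower temperature"}
elevator_samples = {"open door", "close door", "to floor 16", "to floor 18"}

s_ide = len(target_samples & elevator_samples)   # first number: same/corresponding samples -> 2
s_dif = len(target_samples - elevator_samples)   # second number: different samples -> 6
s_sum = len(target_samples)                      # total target training samples -> 8
```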
In one possible implementation manner, the calculating the first number of the same or corresponding target training samples and the second number of different or non-corresponding target training samples in each first field sample set, calculating the similarity based on the first number and the second number, and respectively obtaining sample set similarity coefficients of a plurality of first field sample sets and the target field sample sets includes:
And calculating according to the first quantity and the total quantity of the target training samples in the target field sample set to obtain the same evaluation sub-coefficient of the first field sample set and the target field sample set. The total number of target training samples in the target field sample set can reflect the volume of the corresponding first pre-training model, and if the same evaluation sub-coefficient is larger, the first pre-training model is more similar to the expected final model under the volume of the corresponding model data volume.
And calculating according to the second quantity and the total quantity of the target training samples in the target field sample set to obtain different evaluation sub-coefficients of the first field sample set and the target field sample set. The larger the different evaluation sub-coefficients, the more different the first pre-trained model is proved to be from the desired final model at the volume of the corresponding model data volume.
Respectively performing weighting processing on the same evaluation sub-coefficient and the different evaluation sub-coefficient to obtain the sample set similarity coefficient of the first field sample set and the target field sample set, the sample set similarity coefficient being calculated by the following formula,

$$P_{ide} = k_{ide}\cdot\frac{S_{ide}}{S_{sum}},\qquad P_{dif} = k_{dif}\cdot\frac{S_{dif}}{S_{sum}},\qquad X_{Sim} = \varepsilon\cdot\left(P_{ide} - P_{dif}\right)$$

wherein X_Sim is the sample set similarity coefficient of the first field sample set and the target field sample set, P_ide is the same evaluation sub-coefficient, P_dif is the different evaluation sub-coefficient, S_ide is the first number of the same or corresponding target training samples, S_sum is the total number of target training samples in the target field sample set, k_ide is the first calculation weight, S_dif is the second number of different or non-corresponding target training samples, k_dif is the second calculation weight, and ε is a calculation constant. The invention calculates the same evaluation sub-coefficient as k_ide·S_ide/S_sum and the different evaluation sub-coefficient as k_dif·S_dif/S_sum, subtracts the different evaluation sub-coefficient P_dif from the same evaluation sub-coefficient P_ide to obtain an evaluation sub-coefficient difference, and combines this difference with the preset calculation constant ε to obtain the final sample set similarity coefficient.

In an actual computing scenario, it may occur that X_Sim < 0; in this case, the invention sets the similarity coefficient obtained in that scenario to 0.
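As a purely illustrative worked example (the weights, the calculation constant and the sample matching are hypothetical): for the elevator interaction corpus above with S_ide = 2, S_dif = 6 and S_sum = 8, and assumed values k_ide = 0.7, k_dif = 0.3 and ε = 1, the same evaluation sub-coefficient is 0.7 × 2/8 = 0.175, the different evaluation sub-coefficient is 0.3 × 6/8 = 0.225, and X_Sim = 0.175 - 0.225 = -0.05 < 0, so the similarity coefficient for that sample set would be set to 0.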
The first calculation weight k_ide and the second calculation weight k_dif are set by the staff during initialization, and the first calculation weight k_ide is preferably greater than the second calculation weight k_dif. In different application scenarios, the computing power of the configured hardware devices differs, which in turn leads to different requirements on the data volume of the model in different scenarios. For example, in an application scenario with high computing power, the volume and data volume of the final model may be allowed to be large; in that case the first calculation weight k_ide is greater than the second calculation weight k_dif, and the difference between the two may be relatively small. In an application scenario with weak computing power, the volume and data volume of the final model are required to be small; in that case the first calculation weight k_ide is greater than the second calculation weight k_dif, and the difference between the two is relatively large. The first calculation weight k_ide is always greater than the second calculation weight k_dif, but the quantitative relation between the two can be adjusted, and in this way the volume requirements that different deployment scenes place on the model can be met.
Step S1203, sorting all the first domain sample sets in descending order according to the sample set similarity coefficient, and taking the first domain sample set with the highest sample set similarity coefficient as the second domain sample set. According to the technical scheme provided by the invention, all the first field sample sets are ordered in a descending order according to the similarity coefficients of the sample sets, so that the first field sample set with the highest similarity coefficient is arranged at the front part, the first field sample set with the lowest similarity coefficient is arranged at the last part, the first field sample set with the highest similarity coefficient of the sample sets is preferentially used as the second field sample set, and the second field sample set at the moment can be regarded as being obtained by calculation.
If all the sample set similarity coefficients are smaller than the threshold similarity, judging that a first pre-training model corresponding to the expected final model does not exist at the moment, and reminding the user accordingly at the moment.
Step S1204, if it is determined that the difference between the similarity coefficient between the first domain sample set with the highest similarity coefficient of the sample sets and the first domain sample set with the next highest similarity coefficient of the sample sets is smaller than the preset difference, displaying the first domain sample set with the highest similarity coefficient and the next highest similarity coefficient, where the preset difference is 0.05. At this time, the similarity of the first field sample set with the highest similarity coefficient and the next highest similarity coefficient is relatively close, and at this time, the first field sample set with the highest similarity coefficient and the next highest similarity coefficient can be displayed.
In the above scenario, there are several possibilities:
scene 1: the first number of first domain sample sets with highest similarity coefficients is greater than the first number of first domain sample sets with second highest similarity coefficients, and the second number of first domain sample sets with highest similarity coefficients is less than the second number of first domain sample sets with second highest similarity coefficients;
scene 2: the first number of first domain sample sets with highest similarity coefficients is greater than the first number of first domain sample sets with second highest similarity coefficients, and the second number of first domain sample sets with highest similarity coefficients is greater than the second number of first domain sample sets with second highest similarity coefficients;
Scene 3: the first number of first domain sample sets having the highest similarity coefficient is less than the first number of first domain sample sets having the next highest similarity coefficient, and the second number of first domain sample sets having the highest similarity coefficient is less than the second number of first domain sample sets having the next highest similarity coefficient.
In scene 1, the first pre-training model corresponding to the highest similarity coefficient is smaller than the second-highest first pre-training model in the volume dimension, and larger than the second-highest first pre-training model in the dimension of identical training samples. The highest first pre-training model is therefore superior to the next-highest first pre-training model in both volume and sample similarity.
In scene 2, the first pre-training model corresponding to the highest similarity coefficient is larger than the second-highest first pre-training model in the volume dimension, and also larger in the dimension of identical training samples. The highest first pre-training model is therefore inferior to the next-highest first pre-training model in volume, but superior in sample similarity.
In scene 3, the first pre-training model corresponding to the highest similarity coefficient is smaller than the second-highest first pre-training model in the volume dimension, and also smaller in the dimension of identical training samples. The highest first pre-training model is therefore superior to the next-highest first pre-training model in volume, but inferior in sample similarity.
It can be seen that the first pre-training models corresponding to the highest similarity coefficients in the scene 2 and the scene 3 have different advantages and disadvantages, respectively, and the invention outputs two similar first pre-training models which may have different advantages and disadvantages, further determines a final second pre-training model according to the selection of the user, and combines the selection of the user to continuously train the formula for calculating the similarity coefficients.
In one possible implementation manner, the technical scheme provided by the invention further comprises:
if the user is judged to take the second highest first field sample set as the second field sample set, the original highest first field sample set is not taken as the second field sample set. At this time, the user may consider that the second domain sample set selected by the present invention does not meet the current scene requirement, so it may take the second highest first domain sample set as the second domain sample set.
The first number of next highest first-domain sample sets is taken as the first number to be compared, the second number of next highest first-domain sample sets is taken as the second number to be compared, and the first number of highest first-domain sample sets is taken as the third number to be compared, and the second number of highest first-domain sample sets is taken as the fourth number to be compared. At this time, the present invention analyzes the second highest first domain sample set and the highest first domain sample set, and obtains a first to-be-compared number, a second to-be-compared number, a third to-be-compared number, and a fourth to-be-compared number to be analyzed.
And if the first quantity to be compared, the second quantity to be compared, the third quantity to be compared and the fourth quantity to be compared meet the preset conditions, training the first calculation weight or the second calculation weight to obtain a trained third calculation weight or fourth calculation weight.
It will be understood that the preset condition may be met in the case of scene 2, that is, the first pre-training model corresponding to the second-highest first field sample set is superior in volume to the highest first pre-training model, but inferior to it in sample similarity; if the deployment scenario of the corresponding model requires a final model with a smaller volume, the user may therefore take the first field sample set with the smaller data volume as the second field sample set.
In the case of scene 3, the preset condition may also be met, that is, the first pre-training model corresponding to the second-highest first field sample set is inferior in volume to the highest first pre-training model, but superior to it in sample similarity; if the deployment scenario of the corresponding model can run a model with a larger volume and requires a final model covering more samples, the user may take the first field sample set with more matching samples as the second field sample set.
In one possible implementation manner of the present invention, if the first to-be-compared number, the second to-be-compared number, the third to-be-compared number, and the fourth to-be-compared number satisfy a preset condition, training the first calculation weight or the second calculation weight to obtain a trained third calculation weight or fourth calculation weight, including:
and if the first to-be-compared number is larger than the third to-be-compared number and the second to-be-compared number is larger than the fourth to-be-compared number, judging that a preset condition is met. At this time, that is, the case of scene 3, the first number of the first-domain sample sets with the second highest similarity coefficient is greater than the first number of the first-domain sample sets with the highest similarity coefficient, and the second number of the first-domain sample sets with the second highest similarity coefficient is greater than the second number of the first-domain sample sets with the highest similarity coefficient.
And performing increasing training on the first calculated weight to obtain a third calculated weight after increasing training. At this time, the first calculation weight needs to be trained in an increasing manner, so that the third calculation weight is larger than the second calculation weight, and further the same evaluation sub-coefficient calculated later is relatively larger, and further the sample set similarity coefficient is relatively larger in the scene 3.
And if the first to-be-compared number is smaller than the third to-be-compared number and the second to-be-compared number is smaller than the fourth to-be-compared number, judging that a preset condition is met. At this time, that is, the case of scene 2, the first number of the first domain sample sets with the second highest similarity coefficient is smaller than the first number of the first domain sample sets with the highest similarity coefficient, and the second number of the first domain sample sets with the second highest similarity coefficient is smaller than the second number of the first domain sample sets with the highest similarity coefficient.
And performing increasing training on the second calculated weight to obtain a fourth calculated weight after increasing training. At this time, the second calculation weight needs to be trained in an increasing manner, so that the fourth calculation weight is larger relative to the first calculation weight, and further, the different subsequently calculated evaluation sub-coefficients are relatively larger, and further, the sample set similarity coefficient is relatively larger under the scene 2.
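A minimal sketch of the decision logic described above; the function and argument names are illustrative only:

```python
from typing import Optional

def weight_to_increase(n1: int, n2: int, n3: int, n4: int) -> Optional[str]:
    """Decide which calculation weight to increase-train.

    n1, n2 -- first and second numbers of the next-highest first field sample set
              (first and second to-be-compared numbers)
    n3, n4 -- first and second numbers of the highest first field sample set
              (third and fourth to-be-compared numbers)
    """
    if n1 > n3 and n2 > n4:      # scene 3: preset condition met
        return "first"           # increase-train the first calculation weight
    if n1 < n3 and n2 < n4:      # scene 2: preset condition met
        return "second"          # increase-train the second calculation weight
    return None                  # preset condition not met
```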
In one possible implementation manner, the method performs an increasing training on the first calculation weight to obtain a third calculation weight after the increasing training, and includes:
and calculating according to the difference of the similarity coefficients and the sample set similarity coefficient of the highest first field sample set to obtain an increased training proportion, and obtaining a third calculated weight after the increased training according to the first calculated weight and the increased training proportion. According to the method, the difference between the sample set similarity coefficient of the highest first-field sample set and the sample set similarity coefficient of the next highest first-field sample set is calculated, if the difference between the sample set similarity coefficients is larger, the training proportion is relatively larger, and the third training weight is obtained by combining the training proportion on the basis of the original first training weight.
And performing augmentation training on the second calculation weight to obtain a fourth calculation weight after augmentation training, wherein the method comprises the following steps:
and calculating according to the difference of the similarity coefficients and the similarity coefficient of the sample set of the highest first field sample set to obtain an increased training proportion, and obtaining a fourth calculation weight after the increased training according to the second calculation weight and the increased training proportion. According to the method, the calculation is carried out according to the difference between the sample set similarity coefficient of the highest first-field sample set and the sample set similarity coefficient of the next highest first-field sample set, if the difference between the sample set similarity coefficients is larger, the training proportion is relatively larger, and the fourth calculation weight after training is obtained by combining the training proportion on the basis of the original second calculation weight.
The third calculation weight after the increase training or the fourth calculation weight after the increase training is calculated by the following formula,

$$r = \frac{X^{fir}_{Sim} - X^{sec}_{Sim}}{X^{fir}_{Sim}},\qquad k^{3}_{ide} = k_{ide}\cdot(1 + r),\qquad k^{4}_{dif} = k_{dif}\cdot(1 + r)$$

wherein k_ide^3 is the third calculation weight after the increase training, X_Sim^fir is the highest similarity coefficient, X_Sim^sec is the next-highest similarity coefficient, k_dif^4 is the fourth calculation weight after the increase training, and r is the increase training ratio. The difference of the similarity coefficients, X_Sim^fir - X_Sim^sec, together with the highest similarity coefficient X_Sim^fir yields the increase training ratio r; the value to be added to the first calculation weight is then obtained from k_ide and r, giving the final third calculation weight, and the value to be added to the second calculation weight is obtained from k_dif and r, giving the final fourth calculation weight. Through training in this way, the resulting third calculation weight or fourth calculation weight better fits the calculation scenario of the method, so that the sample set similarity coefficients calculated subsequently are relatively more accurate and fit the corresponding calculation scene.
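Purely for illustration (the numbers are hypothetical and the ratio form follows the reconstruction above): if the highest similarity coefficient is 0.20 and the next-highest is 0.18, the increase training ratio is (0.20 - 0.18) / 0.20 = 0.1, so a first calculation weight of 0.7 would become the third calculation weight 0.7 × (1 + 0.1) = 0.77.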
Step S130, traversing each target training sample in the target field sample set in turn, comparing the target field sample with a second training sample in the second field sample set, determining a target training sample different from the second training sample, and generating a difference sample set based on the determined target training sample, wherein each difference training sample in the difference sample set at least comprises a difference training statement. After the second field sample set is obtained, the target field sample set is compared with the second field sample set, and then the corresponding difference sample set is determined.
It will be appreciated that the differential training samples in the differential sample set are samples that the final model needs to be trained, but that have not been trained in the second pre-training model. Such as "turn on a double flash" in car-machine interaction, etc.
In one possible implementation manner, the step S130 includes:
and traversing each target training sample in the target field sample set in sequence, comparing the target field sample with a second training sample in a second field sample set, and determining a target training sample different from the second training sample. According to the invention, a comparison is carried out between the target field sample and the second training sample, so that the target training sample different from the second training sample is determined, the condition of sample omission is avoided, the repeated samples in the target training sample and the second training sample are not subjected to secondary training during retraining, the efficiency of obtaining a final model is improved, and the functional integrity of the final model is ensured.
A difference sample set is generated based on the determined target training samples; once all target training samples have been traversed, the difference sample set is complete.
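A minimal sketch of this difference-set generation, assuming for illustration that each sample is a plain sentence string (the patent's samples carry more fields):

```python
# Minimal sketch of difference sample set generation.

def build_difference_sample_set(target_samples: list[str], second_samples: list[str]) -> list[str]:
    """Traverse each target training sample in order and keep only those that do
    not appear among the second training samples, so repeated samples are not
    trained a second time."""
    second_set = set(second_samples)
    return [sample for sample in target_samples if sample not in second_set]

target = ["turn on the double flash lights", "turn on the air conditioner"]
second = ["turn on the air conditioner"]
print(build_difference_sample_set(target, second))
# ['turn on the double flash lights']
```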
Step S140, taking the first pre-training model corresponding to the second field sample set as the second pre-training model, controlling the second pre-training model to perform word segmentation on the difference training sentences to obtain at least one training word, constructing a slot template corresponding to each difference training sentence according to the training words, and storing the correspondence between the slots and the training sentences together with the corresponding slot templates to obtain the final model. The invention retrains the second pre-training model on the difference sample set and the corresponding domain knowledge, so that the trained final model can meet the processing requirements of the specific knowledge field. For example, expressions such as "turn on the double flash lights" and "double flash lights" are specific to the field of vehicle-machine interaction, so the corresponding final model needs to be trained on difference training samples such as "turn on the double flash lights".
In one possible implementation manner, the step S140 includes:
extracting all difference training samples in the difference sample set, wherein each difference training sample comprises at least one difference training sentence, and the difference training sentence has preset instruction information and/or a preset feedback sentence corresponding to it. The invention first extracts all difference training samples in the difference sample set. Since the final model trained by the invention is a natural language processing model, each difference training sample comprises a corresponding difference training sentence, which may have corresponding preset instruction information and/or a preset feedback sentence. For example, the preset instruction information corresponding to "turn on the double flash lights" may be to control the indicator lights to flash at a preset frequency. As another example, turning on the double flash lights may require interaction with the user; the preset feedback sentence may then be "please confirm whether to turn on the double flash lights", and after a confirmation sentence input by the user is received, the corresponding preset instruction information is output. Different difference training sentences may therefore have different corresponding preset instruction information and/or preset feedback sentences.
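As a minimal illustration of such a sample (all names and example values below are hypothetical, not the patent's own format), a difference training sample carrying its sentence, optional preset instruction information, and optional preset feedback sentence could look like this:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DifferenceTrainingSample:
    # The difference training sentence, e.g. "turn on the double flash lights".
    sentence: str
    # Preset instruction information, e.g. a command the car machine executes.
    instruction: Optional[str] = None
    # Preset feedback sentence used when the model must ask the user to confirm.
    feedback: Optional[str] = None

# Hypothetical example from the vehicle-machine interaction field.
sample = DifferenceTrainingSample(
    sentence="turn on the double flash lights",
    instruction="strobe_indicator_lights(frequency='preset')",
    feedback="please confirm whether to turn on the double flash lights",
)
```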
Controlling the second pre-training model to perform word segmentation on the difference training sentences to obtain at least one training word, and constructing a slot template corresponding to each difference training sentence according to the training words, wherein the slot template comprises at least one first slot. In general, a pre-training model used for natural language processing has a semantic analysis function, so the invention can use the second pre-training model to segment the difference training sentences into the corresponding training words. A slot template corresponding to each difference training sentence is then constructed from the training words; different training words may correspond to different slot templates, but each slot template comprises at least one first slot.
Numbering all first slots in the slot templates, determining the correspondence between each slot number and the training words, and storing the correspondence together with the corresponding slot templates to obtain the final model. The invention numbers all first slots in each slot template, such as slot 1, slot 2, slot 3, and so on, and determines the correspondence between each slot number and the training words. The training words in a slot template have a sequential relation, and the final model contains the correspondence between each slot number and the training words in each slot template, which makes subsequent speech recognition more accurate.
According to the technical scheme provided by the invention, in one possible implementation, controlling the second pre-training model to perform word segmentation on the difference training sentences to obtain at least one training word and constructing, from the training words, a slot template corresponding to the difference training sentence, the slot template comprising at least one first slot, includes the following steps:
and performing word segmentation on the difference training sentence to obtain a plurality of first words, and extracting the first words that are nouns or verbs as training words. The invention first segments the difference training sentence; for example, "turn on the double flash lights" can be segmented into "turn on" and "double flash lights". The first words that are nouns or verbs are kept as training words: "turn on" is a verb and "double flash lights" is a noun, so both are retained.
And establishing, in the slot template, a first slot corresponding to each training word according to the positional relation of the training words. The invention establishes the first slots according to the positions of the training words; the slot template corresponding to "turn on the double flash lights" therefore has two first slots, the first corresponding to "turn on" and the second corresponding to "double flash lights".
And determining the synonym corresponding to each training word according to a preset word library, the word library holding the correspondence between training words and synonyms. In an actual interaction scene, a user may express one meaning with several different words; for example, "turn on the double flash lights" can also be phrased as "switch on the double flash lights", and "turn on" and "switch on" can be regarded as synonyms. The synonyms corresponding to each training word therefore need to be determined from the preset word library, so that sentences with the same meaning but different wording can be recognized.
And counting each training word and all of its corresponding synonyms to generate a word set. The invention collects all training words and their synonyms into a final word set, and each first slot may correspond to one word set.
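A minimal sketch of this slot-template construction, assuming a stand-in segmenter/part-of-speech tagger and a hand-written synonym library (the helper names and example entries are assumptions, not the patent's API):

```python
# Illustrative slot-template construction. The tokenizer, POS tags and synonym
# library below are assumptions used only to make the example runnable.

SYNONYM_LIBRARY = {
    "turn on": {"switch on", "open"},
    "double flash lights": {"hazard lights"},
}

def segment_with_pos(sentence: str) -> list[tuple[str, str]]:
    # Stand-in for the second pre-training model's word segmentation + tagging.
    # A real implementation would call the model; here the example is hard-coded.
    if sentence == "turn on the double flash lights":
        return [("turn on", "v"), ("the", "x"), ("double flash lights", "n")]
    return [(sentence, "n")]

def build_slot_template(sentence: str) -> list[dict]:
    """Keep noun/verb words as training words, create one first slot per word
    (numbered in sentence order) and attach the word set of word + synonyms."""
    training_words = [w for w, pos in segment_with_pos(sentence) if pos in ("n", "v")]
    template = []
    for position, word in enumerate(training_words, start=1):
        word_set = {word} | SYNONYM_LIBRARY.get(word, set())
        template.append({"slot": position, "word_set": word_set})
    return template

print(build_slot_template("turn on the double flash lights"))
# slot 1 -> {'turn on', 'switch on', 'open'}; slot 2 -> {'double flash lights', 'hazard lights'}
```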
In one possible implementation manner, the method for numbering all the first slots in the slot templates, determining the corresponding relation between each slot number and the training word, and storing the corresponding relation and the corresponding slot templates correspondingly to obtain a final model includes:
And numbering all first slots in the slot template in ascending order according to their sequence to obtain the slot number corresponding to each first slot. The invention numbers all first slots in the slot template in ascending order; for example, the slot corresponding to "turn on" is slot 1 and the slot corresponding to "double flash lights" is slot 2.
And determining the correspondence between the slot numbers and the word sets according to the correspondence between the slot numbers and the training words, and storing this correspondence together with the corresponding slot template to obtain the final model. The invention stores the correspondence between slot numbers and training words together with the slot templates to obtain the final model. Because the second pre-training model is used as the base model for this training, the resulting final model carries the corresponding domain knowledge and is easy to train.
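An illustrative view of what the stored correspondences could look like once assembled (the layout and field names are assumptions, not the patent's storage format): for each difference training sentence, the slot-number-to-word-set mapping plus its preset instruction information and preset feedback sentence.

```python
# Hypothetical "final model" lookup structure: slot number -> word set per
# template, plus the preset instruction information / preset feedback sentence.

final_model = {
    "turn on the double flash lights": {
        "slots": {
            1: {"turn on", "switch on", "open"},           # slot 1: verb word set
            2: {"double flash lights", "hazard lights"},   # slot 2: noun word set
        },
        "instruction": "strobe_indicator_lights(frequency='preset')",
        "feedback": "please confirm whether to turn on the double flash lights",
    },
}
```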
The invention also provides a data processing method that uses the final model trained by the above technical scheme; as shown in fig. 3, the method comprises:
step S210, receiving a user control sentence, performing word segmentation processing on the control sentence to obtain at least one control word, and numbering all the control words in ascending order according to the time sequence of the control word. For example, when the user inputs a control sentence of "turn on double flashing lamps", the corresponding control words may include "turn on" and "double flashing lamps", and the invention numbers all the control words in ascending order according to the time sequence of the control words, at this time, the number of "turn on" may be 1, and the number of "double flashing lamps" may be 2.
Step S220, determining the word set corresponding to the first slot with the smallest number in each slot template, and, if the training words in that word set are judged to correspond to a control word, taking the corresponding slot template as a slot template to be screened. During the comparison, the invention first obtains the word set corresponding to the smallest-numbered first slot of each slot template. For example, suppose there are at least two slot templates whose corresponding training samples are "turn on the double flash lights" and "turn on the air conditioner"; the word sets corresponding to the smallest-numbered first slots of both templates may include ["turn on", "switch on"], and the sentences corresponding to the slot templates to be screened at this point may be "turn on the double flash lights" and "turn on the air conditioner".
Step S230, comparing the word sets of the other first slots of the slot templates to be screened with the control words of the other numbers. Step S220 performs a first screening of all slot templates, which greatly reduces their number; the invention then screens precisely by comparing the word sets of the other first slots of every slot template to be screened with the control words of the other numbers.
Step S240, if all first slots in a slot template to be screened are judged to correspond completely to the control words, taking that slot template to be screened as the output slot template and outputting the preset instruction information and/or the preset feedback sentence corresponding to it. At this point the completely corresponding slot template can be determined from all the control words, and the preset instruction information and/or preset feedback sentence corresponding to that template is output; for example, the car machine controls the double flash lights to flash.
In one possible implementation manner, the step S240 includes:
if the maximum number of the control words is judged not to correspond to the maximum slot number of the first slots, deleting the non-corresponding slot template. The method can thereby quickly identify and delete slot templates that cannot match, achieving a preliminary lock on the slot templates to be screened with high determination efficiency.
If the maximum number of the control words corresponds to the maximum slot number of the first slots, comparing each control word with the word set of the first slot having the same number; if every control word corresponds to the word set of the like-numbered first slot, it is judged that all first slots in the slot template to be screened correspond completely to the control words. Through these steps the invention screens the slot templates precisely: the word set of each numbered first slot is compared with the control word of the same number, and if they all correspond, all first slots in the slot template correspond completely to all control words. The corresponding preset instruction information and/or preset feedback sentence is then output according to that slot template.
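A minimal sketch of this matching step, reusing the same illustrative data layout as the earlier sketches (all names and the stored structure are assumptions, not the patent's format):

```python
# Illustrative matching of numbered control words against slot templates.

def match_control_sentence(control_words: list[str], final_model: dict):
    """control_words are numbered 1..N by their order in the user's sentence.
    A template matches only if its largest slot number equals N and every
    numbered control word falls in the like-numbered slot's word set."""
    numbered = dict(enumerate(control_words, start=1))
    for sentence, template in final_model.items():
        slots = template["slots"]
        if max(slots) != max(numbered):          # quick screen on the largest number
            continue
        if all(numbered[i] in slots[i] for i in slots):
            return template["instruction"], template["feedback"]
    return None, None

final_model = {
    "turn on the double flash lights": {
        "slots": {1: {"turn on", "switch on"}, 2: {"double flash lights", "hazard lights"}},
        "instruction": "strobe_indicator_lights(frequency='preset')",
        "feedback": "please confirm whether to turn on the double flash lights",
    },
}
print(match_control_sentence(["switch on", "double flash lights"], final_model))
```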
The present invention also provides a storage medium having stored therein a computer program for implementing the methods provided by the various embodiments described above when executed by a processor.
The storage medium may be a computer storage medium or a communication medium. Communication media includes any medium that facilitates transfer of a computer program from one place to another. Computer storage media can be any available media that can be accessed by a general purpose or special purpose computer. For example, a storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short). In addition, the ASIC may reside in a user device. The processor and the storage medium may also reside as discrete components in a communication device. The storage medium may be read-only memory (ROM), random-access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
The present invention also provides a program product comprising execution instructions stored in a storage medium. The at least one processor of the device may read the execution instructions from the storage medium, the execution instructions being executed by the at least one processor to cause the device to implement the methods provided by the various embodiments described above.
In the above embodiments of the terminal or the server, it should be understood that the processor may be a central processing unit (English: Central Processing Unit, abbreviated as CPU), or may be another general purpose processor, a digital signal processor (English: Digital Signal Processor, abbreviated as DSP), an application specific integrated circuit (English: Application Specific Integrated Circuit, abbreviated as ASIC), or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in a processor for execution.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (11)

1. A training method of a pre-training model fusing domain knowledge is characterized by comprising the following steps:
after judging that a model processing request and a target field sample set sent by a request end are received, a server invokes a first field sample set corresponding to each first pre-training model in a database, wherein the target field of the target field sample set is the current application field of the request end, the first field sample set is pre-stored sample data of the first field, the first field is a plurality of preset interactive application fields, each first field comprises a first pre-training model corresponding to the first field sample set, and samples included in the target field sample set and the first field sample set are corpus samples extracted from the corresponding fields;
traversing each target training sample in the target field sample set in sequence, comparing the target field sample with first training samples in the first field sample set, determining target training samples identical to or corresponding to the first training samples, counting first numbers of identical or corresponding target training samples in each first field sample set and second numbers of different or non-corresponding target training samples, calculating the similarity based on the first numbers and the second numbers to respectively obtain sample set similarity coefficients of a plurality of first field sample sets and the target field sample sets, comparing all sample set similarity coefficients, and taking the first field sample set with the highest similarity coefficient or the second highest similarity coefficient as the second field sample set;
Traversing each target training sample in the target field sample set in turn, comparing the target field sample with a second training sample in a second field sample set, determining a target training sample different from the second training sample, generating a difference sample set based on the determined target training sample, wherein each difference training sample in the difference sample set at least comprises a difference training sentence;
and taking the first pre-training model corresponding to the second field sample set as a second pre-training model, controlling the second pre-training model to perform word segmentation processing on the difference training sentences to obtain at least one training word, constructing a slot template corresponding to the difference training sentences according to the training word, and correspondingly storing the corresponding relation between the slot and the training sentences and the corresponding slot template to obtain a final model.
2. The method for training a pre-training model with fusion of domain knowledge according to claim 1, wherein,
the step of traversing each target training sample in the target field sample set in turn, comparing the target field sample with first training samples in the first field sample set, determining the target training samples identical to or corresponding to the first training samples, counting the first number of the target training samples identical to or corresponding to each first field sample set and the second number of different or non-corresponding target training samples, calculating the similarity based on the first number and the second number, respectively obtaining sample set similarity coefficients of a plurality of first field sample sets and the target field sample sets, comparing all sample set similarity coefficients, and taking the first field sample set with the highest similarity coefficient or the second highest similarity coefficient as the second field sample set, and comprises the following steps:
Sorting all the first field sample sets in a descending order according to the sample set similarity coefficient, and taking the first field sample set with the highest sample set similarity coefficient as a second field sample set;
and if the difference of the similarity coefficients between the first field sample set with the highest similarity coefficient of the sample set and the first field sample set with the second highest similarity coefficient of the sample set is smaller than the preset difference value, displaying the first field sample set with the highest and second highest similarity coefficient.
3. The method for training a pre-training model with fusion of domain knowledge according to claim 2, wherein,
counting the first number of the same or corresponding target training samples in each first field sample set and the second number of different or non-corresponding target training samples, calculating the similarity based on the first number and the second number to respectively obtain sample set similarity coefficients of a plurality of first field sample sets and the target field sample sets, wherein the method comprises the following steps:
calculating according to the first quantity and the total quantity of the target training samples in the target field sample set to obtain the same evaluation sub-coefficient of the first field sample set and the target field sample set;
Calculating according to the second quantity and the total quantity of the target training samples in the target field sample set to obtain different evaluation sub-coefficients of the first field sample set and the target field sample set;
respectively carrying out weighting treatment on the same evaluation sub-coefficient and the different evaluation sub-coefficient to obtain a sample set similarity coefficient of the first field sample set and the target field sample set, the sample set similarity coefficient being calculated by the following formulas,

E_{ide} = \frac{S_{ide}}{S_{tot}}

E_{dif} = \frac{S_{dif}}{S_{tot}}

X_{Sim} = k_{ide} \cdot E_{ide} - k_{dif} \cdot E_{dif} + c

wherein X_{Sim} is the sample set similarity coefficient of the first field sample set and the target field sample set, E_{ide} is the same evaluation sub-coefficient, E_{dif} is the different evaluation sub-coefficient, S_{ide} is the first number of identical or corresponding target training samples, S_{tot} is the total number of target training samples within the target field sample set, k_{ide} is the first calculation weight, S_{dif} is the second number of different or non-corresponding target training samples, k_{dif} is the second calculation weight, and c is a calculation constant;
wherein the preset difference is 0.05.
4. A method of training a pre-training model incorporating domain knowledge as claimed in claim 3, further comprising:
if the user is judged to take the second highest first field sample set as the second field sample set, the original highest first field sample set is not taken as the second field sample set;
Then the first number of the next highest first domain sample set is taken as the first to-be-compared number, the second number of the next highest first domain sample set is taken as the second to-be-compared number, and the first number of the highest first domain sample set is taken as the third to-be-compared number, and the second number of the highest first domain sample set is taken as the fourth to-be-compared number;
and if the first quantity to be compared, the second quantity to be compared, the third quantity to be compared and the fourth quantity to be compared meet the preset conditions, training the first calculation weight or the second calculation weight to obtain a trained third calculation weight or fourth calculation weight.
5. The method for training a pre-training model with fusion of domain knowledge according to claim 4, wherein,
if the first to-be-compared number, the second to-be-compared number, the third to-be-compared number and the fourth to-be-compared number meet the preset conditions, training the first calculation weight or the second calculation weight to obtain a trained third calculation weight or fourth calculation weight, wherein the training comprises the following steps:
if the first to-be-compared number is larger than the third to-be-compared number and the second to-be-compared number is larger than the fourth to-be-compared number, judging that a preset condition is met;
Performing increasing training on the first calculation weight to obtain a third calculation weight after increasing training;
if the first to-be-compared number is smaller than the third to-be-compared number and the second to-be-compared number is smaller than the fourth to-be-compared number, judging that a preset condition is met;
and performing increasing training on the second calculated weight to obtain a fourth calculated weight after increasing training.
6. The method for training a pre-training model with fusion of domain knowledge according to claim 5, wherein,
and performing augmentation training on the first calculation weight to obtain a third calculation weight after augmentation training, wherein the method comprises the following steps:
calculating according to the difference between the similarity coefficients and the sample set similarity coefficient of the highest first field sample set to obtain an increased training proportion, and obtaining a third calculated weight after the increased training according to the first calculated weight and the increased training proportion;
and performing augmentation training on the second calculation weight to obtain a fourth calculation weight after augmentation training, wherein the method comprises the following steps:
calculating according to the difference between the similarity coefficients and the sample set similarity coefficient of the highest first field sample set to obtain an increased training proportion, and obtaining a fourth calculation weight after the increased training according to the second calculation weight and the increased training proportion;
the third calculation weight after the increase training or the fourth calculation weight after the increase training is calculated by the following formulas,

k_{ide3} = k_{ide}\left(1 + \frac{X_{firSim} - X_{Simsec}}{X_{firSim}}\right)

k_{dif4} = k_{dif}\left(1 + \frac{X_{firSim} - X_{Simsec}}{X_{firSim}}\right)

wherein k_{ide3} is the third calculation weight after the increase training, k_{ide} is the first calculation weight, X_{firSim} is the highest similarity coefficient, X_{Simsec} is the next highest similarity coefficient, k_{dif4} is the fourth calculation weight after the increase training, k_{dif} is the second calculation weight, and (X_{firSim} - X_{Simsec})/X_{firSim} is the increased training proportion.
7. The method for training a pre-training model with fusion of domain knowledge according to claim 6, wherein,
the step of using the first pre-training model corresponding to the second field sample set as a second pre-training model, controlling the second pre-training model to perform word segmentation processing on the differential training sentences to obtain at least one training word, constructing a slot template corresponding to the differential training sentences according to the training word, and correspondingly storing the corresponding relation between the slot and the training sentences and the corresponding slot template to obtain a final model, wherein the step of:
extracting all difference training samples in the difference sample set, wherein each difference training sample at least comprises one difference training statement, and the difference training statement has preset instruction information and/or preset feedback statement corresponding to the difference training statement;
controlling a second pre-training model to perform word segmentation processing on the difference training sentences to obtain at least one training word, and constructing a slot template corresponding to the difference training sentences according to the training word, wherein the slot template at least comprises a first slot;
Numbering all first slots in the slot templates, determining the corresponding relation between each slot number and the training words, and storing the corresponding relation and the corresponding slot templates correspondingly to obtain a final model.
8. The method for training a pre-training model with fusion of domain knowledge according to claim 7,
the second pre-training model is controlled to perform word segmentation processing on the difference training sentences to obtain at least one training word, a slot template corresponding to the difference training sentences is constructed according to the training word, and the slot template at least comprises a first slot and comprises:
according to the position relation of each training word, a first slot position corresponding to each training word is established in a slot position template;
determining synonym words corresponding to each training word according to a preset word library, wherein the word library is provided with corresponding relations between the training words and the synonym words;
and counting each training word and all corresponding synonym words to generate a word set.
9. The method for training a pre-training model with fusion of domain knowledge according to claim 8, wherein,
numbering all first slots in the slot templates, determining the corresponding relation between each slot number and the training words, and storing the corresponding relation and the corresponding slot templates correspondingly to obtain a final model, wherein the method comprises the following steps:
Numbering all the first slots in the slot template in ascending order according to the sequence to obtain the slot number corresponding to each first slot;
and determining the corresponding relation between the slot number and the word set according to the corresponding relation between the slot number and the training words, and correspondingly storing the corresponding relation and the corresponding slot template to obtain a final model.
10. A data processing method for configuring the final model trained in claim 9, further comprising:
receiving a user control sentence, performing word segmentation processing on the control sentence to obtain at least one control word, and numbering all the control words in ascending order according to the time sequence of the control word;
determining a word set corresponding to a first slot of the minimum number of the slot templates, and taking the corresponding slot template as the slot template to be screened if judging that the training words in the word set correspond to the control words;
comparing the word sets of other first slots of all the slot templates to be screened with control words of other numbers;
if all the first slots in the slot templates to be screened are judged to be completely corresponding to the control words, the corresponding slot templates to be screened are used as output slot templates, and preset instruction information and/or preset feedback sentences corresponding to the output slot templates are output.
11. The method for processing data according to claim 10, wherein,
if it is determined that all the first slots in the slot templates to be screened completely correspond to the control words, the corresponding slot templates to be screened are used as output slot templates, and preset instruction information and/or preset feedback sentences corresponding to the output slot templates are output, including:
if the maximum number of the control word is judged to be not corresponding to the maximum slot number of the first slot, deleting the non-corresponding slot template;
if the maximum number of the control word corresponds to the maximum slot number of the first slots, comparing the control word with the word set of each first slot according to the number of the control word, and if the control word with the same number corresponds to the corresponding word set, judging that all the first slots in the slot template to be screened correspond to the control word completely.
CN202310314738.4A 2023-03-29 2023-03-29 Pre-training model training method integrating domain knowledge and data processing method Active CN116028821B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310314738.4A CN116028821B (en) 2023-03-29 2023-03-29 Pre-training model training method integrating domain knowledge and data processing method

Publications (2)

Publication Number Publication Date
CN116028821A true CN116028821A (en) 2023-04-28
CN116028821B CN116028821B (en) 2023-06-13

Family

ID=86089587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310314738.4A Active CN116028821B (en) 2023-03-29 2023-03-29 Pre-training model training method integrating domain knowledge and data processing method

Country Status (1)

Country Link
CN (1) CN116028821B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236409A (en) * 2023-11-16 2023-12-15 中电科大数据研究院有限公司 Small model training method, device and system based on large model and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180177461A1 (en) * 2016-12-22 2018-06-28 The Johns Hopkins University Machine learning approach to beamforming
CN111860670A (en) * 2020-07-28 2020-10-30 平安科技(深圳)有限公司 Domain adaptive model training method, image detection method, device, equipment and medium
CN112395879A (en) * 2020-11-10 2021-02-23 华中科技大学 Scientific and technological text named entity recognition method
CN114036306A (en) * 2022-01-07 2022-02-11 四川大学 Model training method and device, computer equipment and computer readable storage medium
CN114064906A (en) * 2022-01-17 2022-02-18 深圳佑驾创新科技有限公司 Emotion classification network training method and emotion classification method
CN114565104A (en) * 2022-03-01 2022-05-31 腾讯科技(深圳)有限公司 Language model pre-training method, result recommendation method and related device
CN114677575A (en) * 2020-12-24 2022-06-28 华为技术有限公司 Scene migration method and device and electronic equipment
CN114708857A (en) * 2020-12-31 2022-07-05 中兴通讯股份有限公司 Speech recognition model training method, speech recognition method and corresponding device
CN115063604A (en) * 2022-08-08 2022-09-16 中科视语(北京)科技有限公司 Feature extraction model training and target re-identification method and device
US20230073550A1 (en) * 2021-12-28 2023-03-09 Beijing Baidu Netcom Science Technology Co., Ltd. Method for extracting text information, electronic device and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
PENG GAO et al.: "Multi-source fast transfer learning algorithm based on support vector machine", Applied Intelligence, pages 1-15 *
WOUTER M. KOUW et al.: "An introduction to domain adaptation and transfer learning", arXiv, pages 1-42 *
LIU Dapeng et al.: "Deep transfer active learning method combining source-domain diversity and target-domain uncertainty", Pattern Recognition and Artificial Intelligence, vol. 34, no. 10, pages 898-908 *
GOU Yan: "Emitter identification based on machine learning", China Master's Theses Full-text Database, Information Science and Technology, no. 7, pages 136-763 *
CHEN Guanyu: "Research on cash-loan fraud identification based on transfer learning", China Master's Theses Full-text Database, Information Science and Technology, no. 2, pages 140-620 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236409A (en) * 2023-11-16 2023-12-15 中电科大数据研究院有限公司 Small model training method, device and system based on large model and storage medium
CN117236409B (en) * 2023-11-16 2024-02-27 中电科大数据研究院有限公司 Small model training method, device and system based on large model and storage medium

Also Published As

Publication number Publication date
CN116028821B (en) 2023-06-13


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant