CN114357974B - Similar sample corpus generation method and device, electronic equipment and storage medium
- Publication number: CN114357974B (application CN202111622743.9A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Abstract
The application relates to the field of data processing, and in particular to a method, an apparatus, an electronic device and a storage medium for generating similar sample corpora, which address the problems that the generation process of similar sample corpora is complex and that effective similar sample corpora are difficult to generate. The method comprises: obtaining a first seed sentence in a target field and second seed sentences in other fields; inputting the first seed sentence into each pre-training model to which noise disturbance has been added to obtain first fusion results; obtaining second fusion results determined from the second seed sentences; generating groups of similar positive sample corpora from the first fusion results; and generating groups of similar negative sample corpora from the first fusion results and the second fusion results. This simplifies the generation process of similar sample corpora, improves their generation efficiency, and makes it possible to generate effective similar sample corpora.
Description
Technical Field
The present application relates to the field of data processing, and in particular, to a method and an apparatus for generating similar sample corpora, an electronic device, and a storage medium.
Background
With the widespread application of machine learning technology, corresponding text similarity models can be constructed for texts in different vertical fields to handle text similarity tasks, duplicate-finding tasks and retrieval tasks. To train such a text similarity model, a similar sample corpus usually needs to be constructed in a targeted manner, where the similar sample corpus includes similar positive sample corpora and similar negative sample corpora.
At present, when similar sample corpora are generated for text similarity models in different fields, they are produced by directly applying manually formulated generation rules to the original similar sample corpora, with operations such as content deletion, content replacement and position exchange.
However, in the existing corpus generation method, the quality of the generated corpora depends directly on how reasonably the generation rules are formulated, so a high degree of manual intervention is required; the generation process is therefore complex and difficult to implement, and the number of similar sample corpora that can be generated is very limited. In addition, since content replacement itself relies on a text similarity model, effective similar sample corpora cannot be generated by content replacement as long as the text similarity model cannot yet be trained on effective similar sample corpora.
Therefore, a new method for generating similar sample corpora is needed to solve the above-mentioned problems.
Disclosure of Invention
The embodiment of the application provides a method and a device for generating similar sample corpora, electronic equipment and a storage medium, and aims to solve the problems that the generation process of the similar sample corpora is complex and effective similar sample corpora is difficult to generate in the prior art.
The embodiment of the application provides the following specific technical scheme:
in a first aspect, a method for generating similar sample corpora is provided, which is applied to a similar sample corpus generation process in a target field, and includes:
acquiring a first seed sentence of a target field and acquiring second seed sentences in other fields except the target field, wherein the seed sentences comprise entity nouns in the field to which the seed sentences belong;
constructing pre-training models comprising a plurality of layers of coding networks, inputting the first seed statement into each pre-training model added with noise disturbance, and obtaining each first fusion result determined according to the output vector belonging to the coding network of a preset first class level in each pre-training model added with noise disturbance;
determining a target pre-training model in each pre-training model, respectively inputting each second seed statement into the target pre-training model, and respectively obtaining a second fusion result determined according to an output vector of a coding network belonging to a preset second class level in the target pre-training model;
and generating each group of similar positive sample corpora according to each first fusion result, and generating each group of similar negative sample corpora according to each first fusion result and each second fusion result.
Optionally, the obtaining the first seed statement of the target field and obtaining each second seed statement in other fields except the target field includes:
acquiring a first candidate text of a target field, and acquiring second candidate texts in other fields except the target field;
processing the first candidate text and the second candidate text into a specified coding format, and respectively performing noise reduction processing and illegal character cleaning processing on the first candidate text and the second candidate text in the specified coding format;
and splitting the processed first candidate text according to the designated characters to obtain a first seed sentence, and splitting the processed second candidate text according to the designated characters to obtain each second seed sentence.
Optionally, the obtaining a first candidate text of a target field and obtaining second candidate texts in other fields except the target field includes:
acquiring a trained text field classification model, wherein the text field classification model is obtained by training based on text samples of each field;
and respectively inputting the obtained candidate texts into the text field classification model, obtaining classification results corresponding to the candidate texts, taking the candidate text belonging to the target field as a first candidate text, and taking the candidate text not belonging to the target field as a second candidate text.
Optionally, the constructing each pre-training model including a multi-layer coding network includes:
acquiring a reference model comprising a plurality of layers of coding networks, and determining the attention head number of each layer of coding network in the reference model and the inactivation probability of neurons in each layer of coding network;
and constructing pre-training models comprising a plurality of layers of coding networks by adjusting the attention head number of the coding networks in the reference model and the inactivation probability of the neurons.
Optionally, when noise disturbance is added to each pre-training model, any one or a combination of the following operations is respectively performed for each pre-training model:
based on each configured first disturbance factor, processing input data of each layer of coding network respectively;
respectively processing the model parameters of each layer of coding network based on each configured second disturbance factor;
processing the gradient parameter obtained by calculation during reverse propagation based on the configured third disturbance factor;
processing input data of each layer of coding network respectively by adopting each preset first noise function;
respectively processing the model parameters of each layer of coding network by adopting each preset second noise function;
and processing the gradient parameters obtained by calculation during reverse propagation by adopting a preset third noise function.
Optionally, the obtaining of each first fusion result determined according to the output vector of the coding network belonging to the preset first class hierarchy in each pre-training model added with noise disturbance includes:
for each pre-training model with noise disturbance added, respectively executing the following operations:
determining at least one target level coding network belonging to a preset first class level in a pre-training model added with noise disturbance, and obtaining an output vector of each target level coding network;
and carrying out weighted summation on elements at the same positions in the output vectors to obtain a corresponding first fusion result.
Optionally, the generating each group of similar positive sample corpora according to each first fusion result, and generating each group of similar negative sample corpora according to each first fusion result and each second fusion result include:
determining a target first fusion result in each first fusion result, and combining the target first fusion result with each other first fusion result except the target first fusion result in each first fusion result respectively to obtain each group of similar positive sample corpora;
and combining the target first fusion result with each second fusion result respectively to obtain each group of similar negative sample corpora.
In a second aspect, a device for generating similar sample corpora is provided, which is applied to a similar sample corpus generation process in a target field, and includes:
the acquiring unit is used for acquiring a first seed sentence of a target field and acquiring second seed sentences in other fields except the target field, wherein the seed sentences comprise entity nouns in the field to which the seed sentences belong;
the construction unit is used for constructing pre-training models comprising multiple layers of coding networks, inputting the first seed statement into each pre-training model added with noise disturbance, and obtaining each first fusion result determined according to the output vector belonging to the coding network of a preset first class level in each pre-training model added with noise disturbance;
the determining unit is used for determining a target pre-training model in each pre-training model, inputting each second seed statement into the target pre-training model respectively, and obtaining a second fusion result determined according to an output vector of a coding network belonging to a preset second class level in the target pre-training model;
and the generating unit is used for generating each group of similar positive sample corpora according to each first fusion result and generating each group of similar negative sample corpora according to each first fusion result and each second fusion result.
Optionally, when the first seed sentence in the target field is obtained and each second seed sentence in other fields except the target field is obtained, the obtaining unit is configured to:
acquiring a first candidate text of a target field, and acquiring second candidate texts in other fields except the target field;
processing the first candidate text and the second candidate text into a specified coding format, and respectively performing noise reduction processing and illegal character cleaning processing on the first candidate text and the second candidate text in the specified coding format;
and splitting the processed first candidate text according to the designated characters to obtain a first seed sentence, and splitting the processed second candidate text according to the designated characters to obtain each second seed sentence.
Optionally, when the first candidate text in the target field is obtained, and the second candidate texts in other fields except the target field are obtained, the obtaining unit is configured to:
acquiring a trained text field classification model, wherein the text field classification model is obtained by training based on text samples of each field;
and respectively inputting the obtained candidate texts into the text field classification model, obtaining classification results corresponding to the candidate texts, taking the candidate text belonging to the target field as a first candidate text, and taking the candidate text not belonging to the target field as a second candidate text.
Optionally, when constructing each pre-training model including a multi-layer coding network, the construction unit is configured to:
acquiring a reference model comprising a plurality of layers of coding networks, and determining the attention head number of each layer of coding network in the reference model and the inactivation probability of neurons in each layer of coding network;
and constructing pre-training models comprising a plurality of layers of coding networks by adjusting the attention head number of the coding networks in the reference model and the inactivation probability of the neurons.
Optionally, when noise disturbance is added to each pre-training model, the constructing unit respectively performs any one or a combination of the following operations for each pre-training model:
respectively processing input data of each layer of coding network based on each configured first disturbance factor;
respectively processing the model parameters of each layer of coding network based on each configured second disturbance factor;
processing the gradient parameters obtained by calculation during reverse propagation based on the configured third disturbance factor;
processing input data of each layer of coding network respectively by adopting each preset first noise function;
respectively processing the model parameters of each layer of coding network by adopting each preset second noise function;
and processing the gradient parameters obtained by calculation during reverse propagation by adopting a preset third noise function.
Optionally, when obtaining each first fusion result determined according to the output vector of the coding network belonging to the preset first class hierarchy in each pre-training model added with noise disturbance, the constructing unit is configured to:
for each pre-training model with the noise disturbance added, respectively executing the following operations:
determining at least one target level coding network belonging to a preset first level in a pre-training model added with noise disturbance, and obtaining an output vector of each target level coding network;
and carrying out weighted summation on elements at the same positions in the output vectors to obtain a corresponding first fusion result.
Optionally, when generating each group of similar positive sample corpora according to each first fusion result, and generating each group of similar negative sample corpora according to each first fusion result and each second fusion result, the generating unit is configured to:
determining a target first fusion result in each first fusion result, and combining the target first fusion result with each other first fusion result except the target first fusion result in each first fusion result to obtain each group of similar positive sample corpora;
and combining the target first fusion result with each second fusion result respectively to obtain each group of similar negative sample corpora.
In a third aspect, an electronic device is provided, comprising:
a memory for storing executable instructions;
a processor configured to read and execute executable instructions stored in the memory to implement the method of any of the first aspect.
In a fourth aspect, a storage medium is proposed, in which instructions that, when executed by an electronic device, enable the electronic device to perform the method according to any of the first aspect.
The beneficial effects of this application are as follows:
the method comprises the steps of obtaining a first seed sentence in a target field, obtaining second seed sentences in other fields except the target field, wherein the seed sentences contain entity nouns in the field to which the first seed sentences belong, constructing pre-training models comprising multilayer coding networks, inputting the first seed sentences into the pre-training models to which noise disturbance is added, obtaining output vectors belonging to coding networks of a preset first class level in the pre-training models to which the noise disturbance is added, determining first fusion results, determining target pre-training models in the pre-training models, and inputting the second seed sentences into the target pre-training models, and respectively obtaining second fusion results determined according to output vectors of the coding networks belonging to a preset second class hierarchy in the target pre-training model, generating each group of similar positive sample corpora according to each first fusion result, and generating each group of similar negative sample corpora according to each first fusion result and each second fusion result.
Thus, when similar sample corpora of the target field are generated, the first seed sentence of the target field is input into each pre-training model to which noise disturbance has been added, so that different degrees of noise are fused into the first fusion results generated for that sentence; the generated similar positive sample corpora therefore remain similar to one another while still differing from one another. Meanwhile, when similar negative sample corpora are generated, at least one target pre-training model determined from the pre-training models is used together with the second seed sentences from other fields, so that the generated similar negative sample corpora have clear semantic differences. This not only simplifies the generation process of similar sample corpora and improves their generation efficiency, but also makes it possible to generate effective similar sample corpora.
Drawings
FIG. 1 is a schematic diagram illustrating a process of generating similar sample corpora according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a logic structure of a similar sample corpus generation apparatus according to an embodiment of the present application;
fig. 3 is a schematic diagram of a hardware component structure of an electronic device to which an embodiment of the present application is applied.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the technical solutions of the present application. All other embodiments obtained by a person skilled in the art without any inventive step based on the embodiments described in the present application are within the scope of the protection of the present application.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein.
In the related art, in order to handle tasks such as text similarity matching, duplicate finding or retrieval for texts in different fields, processing is usually carried out with a text similarity model obtained by machine learning; it is therefore very important to obtain similar sample corpora that can be used to train the text similarity model.
In the related art, when similar sample corpora are generated in a targeted manner, operations such as content deletion and content position adjustment are performed directly on the original similar sample corpora according to manually formulated generation rules, or part of the content of the original corpora is replaced with general-purpose synonyms.
However, since texts in different fields have great differences, similar sample corpora obtained by directly replacing the common synonyms may not be semantically similar in the corresponding fields, and thus effective training samples cannot be obtained.
To address the problems in the prior art that the generation process of similar sample corpora is complex and that effective similar sample corpora are difficult to generate, the application provides a method, an apparatus, an electronic device and a storage medium for generating similar sample corpora in a targeted manner. In the technical solution provided by the application, a first seed sentence in a target field and second seed sentences in other fields are obtained, the seed sentences containing entity nouns of the fields to which they belong; pre-training models each comprising a multi-layer coding network are constructed; the first seed sentence is input into each pre-training model to which noise disturbance has been added, and each first fusion result is determined from the output vectors of the coding networks belonging to a preset first-class hierarchy in those models; a target pre-training model is then determined among the pre-training models, each second seed sentence is input into the target pre-training model, and each second fusion result is determined from the output vectors of the coding networks belonging to a preset second-class hierarchy in the target pre-training model; finally, groups of similar positive sample corpora are generated from the first fusion results, and groups of similar negative sample corpora are generated from the first fusion results and the second fusion results.
Thus, when similar sample corpora of the target field are generated, inputting the first seed sentence of the target field into each noise-disturbed pre-training model fuses different degrees of noise into the corresponding first fusion results, so the generated similar positive sample corpora are both similar to and distinct from one another; and generating the similar negative sample corpora with at least one target pre-training model and the second seed sentences of other fields gives those corpora clear semantic differences. This simplifies the generation process of similar sample corpora, improves their generation efficiency, and makes it possible to generate effective similar sample corpora.
Preferred embodiments of the present application will be described in further detail below with reference to the accompanying drawings:
referring to fig. 1, which is a schematic diagram of a generating process of similar sample corpora in the embodiment of the present application, the following describes the generating process of similar sample corpora in the embodiment of the present application with reference to fig. 1:
it should be noted that, in this embodiment of the application, the processing device for generating similar sample corpora may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. And the electronic equipment can also be desktop computers, mobile phones, mobile computers, tablet computers and the like.
Considering the textual differences between vertical domains, before a deep learning model, for example a text similarity model or another model handling tasks related to text similarity, is used to process text in a target domain, similar positive sample corpora and similar negative sample corpora that meet the training requirements need to be generated for that target domain in a targeted manner. In this application, a similar positive sample corpus is composed of sample corpora with similar semantics, a similar negative sample corpus is composed of sample corpora with dissimilar semantics, and the vertical domains correspond to different service scenarios, such as the education domain, the science domain, the Internet of Vehicles domain and the medical domain.
Step 101: the processing device obtains a first seed sentence of a target field and obtains second seed sentences in other fields except the target field.
In the embodiment of the application, seed sentences need to be constructed before the similar sample corpora are generated. The seed sentences specifically include a first seed sentence in the target field and second seed sentences in other fields, and each seed sentence contains entity nouns of the field to which it belongs, so that the field of a seed sentence can be determined from the sentence itself.
For example, assuming that the target domain is a medical domain, the corresponding seed sentence may include terms such as "XX disease", "XX diagnosis and treatment method", "XX blood index", and the like.
When the processing device constructs the corresponding seed sentences from candidate texts in the target field and in other fields, the obtained candidate texts need to be preprocessed so that they can be handled in a recognizable form: the encoding format of the candidate texts is unified, the texts are denoised, illegal characters are removed, and the texts are split into sentences at the specified characters.
Specifically, the processing device obtains a first candidate text in the target field and second candidate texts in other fields, converts the first and second candidate texts into a specified encoding format, performs noise reduction and illegal-character cleaning on the converted texts, and then splits the processed first candidate text at the specified characters to obtain the first seed sentence and splits the processed second candidate texts at the specified characters to obtain the second seed sentences.
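As a concrete illustration, the following is a minimal preprocessing sketch in Python; the recognizable character set, the sentence delimiters and the helper names are assumptions for illustration, not values specified by the application.

```python
# Sketch of the preprocessing steps: unify encoding, clean illegal characters,
# split into seed sentences. Character set and delimiters are assumed values.
import re

LEGAL_CHARS = re.compile(r"[^\u4e00-\u9fa5，。！？；：、0-9A-Za-z]")  # assumed recognizable character set
SENTENCE_DELIMS = re.compile(r"[。！？]")                              # assumed sentence-ending characters

def clean_text(raw_bytes: bytes, source_encoding: str) -> str:
    """Unify the encoding format, then strip characters outside the recognizable set."""
    text = raw_bytes.decode(source_encoding, errors="ignore")  # e.g. GBK / GB2312 / UTF-8 into one format
    return LEGAL_CHARS.sub("", text)                           # "illegal character" cleaning

def split_sentences(text: str) -> list[str]:
    """Split the cleaned candidate text into seed sentences at the specified characters."""
    return [s for s in SENTENCE_DELIMS.split(text) if s.strip()]

# Usage (hypothetical file): seeds = split_sentences(clean_text(open("medical.txt", "rb").read(), "gbk"))
```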
In the embodiment of the application, when the processing device acquires a first candidate text in a target field and acquires a second candidate text in other fields, the processing device may flexibly acquire candidate texts from various text sources, specifically, the processing device may acquire candidate texts from periodicals, papers, and other published or published texts, or the processing device may crawl candidate texts from various related websites by using a web crawler.
In the embodiment of the present application, when determining a first candidate text in a target field and determining a second candidate text in other fields except the target field, a processing device may adopt the following two ways:
the method comprises the steps of generating a first candidate sample and a second candidate sample based on a text with strong domain distinguishability.
Specifically, the processing device may directly acquire academic journals published in the target field, or texts strongly related to the target field, as the first candidate text, and acquire published academic journals or strongly related texts from other, determined fields as the second candidate texts.
For example, assuming that the target field is the medical field, doctors' prescriptions, medical examination reports and medicine-related journals may be acquired to generate the first candidate text, while a development report in the Internet of Vehicles field or an interaction protocol in the communication field may be used to generate the second candidate texts.
Mode 2: classify the text content with a text-domain classification model, take the texts classified into the target field as first candidate texts, and take the texts classified into other fields as second candidate texts.
Specifically, the processing device may obtain a trained text-domain classification model, trained on text samples from each field, input each obtained candidate text into the model to obtain its classification result, and then take candidate texts belonging to the target field as first candidate texts and candidate texts not belonging to the target field as second candidate texts.
In the embodiment of the present application, the text-domain classification model may specifically be a text classification model constructed based on a BERT model, which comprises a BERT input layer, a BERT encoding layer, a BERT output layer and a fully connected classification layer.
BERT input layer: obtains embedded representations of the input corpus. The embedded representation is determined from three kinds of vectorized content: character vectorization (Token Embeddings), character position vectorization (Position Embeddings), and the segment encoding of the character (Segment Embeddings).
BERT encoding layer: according to actual processing requirements, an encoding network of 12 layers can be set, with each layer using multi-head attention; the neuron deactivation probability (dropout) of the multi-layer encoding network is set according to the processing requirements, and normalization of the encodings is performed.
For example, the number of attention heads per layer is set to 12, and the neuron deactivation probability (dropout) is set to 0.1.
BERT output layer: used to output the final encoding result of a sample as required, or to output the encoding result of each position of the sample layer by layer.
Fully connected classification layer: the total number of classification categories can be set; it receives the output of the BERT output layer and outputs the final classification result through a linear fully connected structure. In the embodiment of the present application, the multi-class task may be configured according to actual processing needs, for example with 20 classification categories.
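The following is a minimal sketch of such a BERT-based text-domain classification model, assuming the Hugging Face transformers library; the class name and the remaining configuration values are illustrative assumptions, while the 12-layer encoder, the dropout of 0.1 and the 20 categories follow the example above.

```python
# Sketch: BERT encoder + linear classification head over domain categories.
import torch
import torch.nn as nn
from transformers import BertModel, BertConfig

class DomainClassifier(nn.Module):
    def __init__(self, num_domains: int = 20):
        super().__init__()
        config = BertConfig(
            num_hidden_layers=12,                 # 12-layer encoding network
            num_attention_heads=12,               # multi-head attention per layer
            hidden_dropout_prob=0.1,              # neuron deactivation probability
            attention_probs_dropout_prob=0.1,
        )
        self.bert = BertModel(config)                                  # input + encoding + output layers
        self.classifier = nn.Linear(config.hidden_size, num_domains)   # fully connected classification layer

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        pooled = out.pooler_output            # final encoding of the sample ([CLS] position)
        return self.classifier(pooled)        # logits over the domain categories
```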
In the embodiment of the application, after the BERT-based text classification model is constructed according to actual processing requirements, text classification is completed with the trained model; alternatively, an existing text classification model with another structure may be used directly.
Under the condition of constructing the text classification model based on the BERT model, the processing equipment can adopt sample texts in the target field and other various fields in a targeted mode to carry out targeted training on the text classification model.
Specifically, texts obtained from various fields, including the target field, are used as sample texts, and the field corresponding to each sample text is used as its label to generate training samples. The constructed text classification model is then iteratively trained on these samples over multiple rounds until a preset convergence condition is met; for example, the convergence condition may be that the number of consecutive rounds in which the loss value stays below a set value reaches a set threshold, where the set value and the set threshold are chosen according to actual processing requirements and are not limited by this application.
In one training round, the following operations may specifically be performed: a sample text is input into the text classification model to obtain a field classification result, the loss value of the model is calculated with a cross-entropy loss function from the difference between the field classification result and the corresponding label, and the model parameters are adjusted by back-propagating the loss value.
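A minimal sketch of one such training step is given below, reusing the hypothetical DomainClassifier from the previous sketch; the tokenizer checkpoint, learning rate and sequence length are assumptions.

```python
# Sketch of one training step: forward pass, cross-entropy loss, back-propagation.
import torch
import torch.nn as nn
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")   # assumed tokenizer checkpoint
model = DomainClassifier(num_domains=20)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

def train_step(sample_text: str, domain_label: int) -> float:
    batch = tokenizer(sample_text, return_tensors="pt", truncation=True, max_length=128)
    logits = model(**batch)                              # field classification result
    loss = loss_fn(logits, torch.tensor([domain_label])) # difference between result and label
    optimizer.zero_grad()
    loss.backward()                                      # back-propagate the loss value
    optimizer.step()                                     # adjust the model parameters
    return loss.item()
```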
The processing equipment can realize classification of text attribution fields based on the trained text classification model.
For example, the processing device may classify the obtained text content by using the trained text classification model, and then determine a text whose classification result is the target field as a first candidate text, and determine a text whose classification result is the other field as a second candidate text.
Therefore, by means of the text classification model, the field of a text does not need to be considered when the text is obtained; moreover, since fields are nowadays commonly cross-fused, text content often contains material from several fields, so the model can effectively distinguish the field corresponding to each text and provide a basis for subsequently obtaining effective seed sentences.
Further, in the embodiment of the present application, since text content acquired from different sources may use different encoding formats, the processing device needs to convert the first and second candidate texts into an encoding format it can process; at the same time, since the acquired texts may include content that interferes with text processing, such as description information about the text format and layout, the first and second candidate texts also need to be denoised.
In addition, because the processing capability of the processing device is limited, it usually selects a recognizable character set in advance; characters outside that set are treated as illegal characters and need to be cleaned from the obtained first and second candidate texts so that they do not interfere with normal text processing. After the noise reduction and cleaning operations are completed, the first seed sentence and the second seed sentences are split from the processed first and second candidate texts at the specified characters.
For example, assume that the first candidate text acquired by the processing device includes text 1, text 2 and text 3, where text 1 is encoded in GBK, text 2 in UTF-8 and text 3 in GB2312. If only GBK-encoded text can be processed, the processing device needs to convert text 2 from UTF-8 to GBK and text 3 from GB2312 to GBK.
For another example, if the recognizable characters selected by the processing device are Chinese characters, the processing device cannot recognize non-Chinese characters, so the non-Chinese characters in the candidate texts need to be cleaned away to avoid garbled content in subsequent processing.
It should be noted that performing format conversion, noise reduction, and illegal character cleaning on a text is a conventional technique in the art, and this application will not be described in detail here.
In addition, in the embodiment of the present application, when the seed sentences are split from the first and second candidate texts at the specified characters, a large number of first seed sentences and second seed sentences are generally obtained. For convenience of describing the generation of similar sample corpora, the following description uses only one first seed sentence in the target field as an example and explains how groups of similar positive sample corpora and groups of similar negative sample corpora are generated from it. In actual processing, the corresponding groups of similar positive and negative sample corpora can be generated for each first seed sentence according to actual processing requirements; the implementation principle is the same as for a single first seed sentence and is not repeated.
Therefore, by means of the mode of unifying the coding format, reducing noise of the text content and removing illegal characters in the text content, the obtained text can be arranged into a content-compliant form according to actual processing requirements, the generation efficiency of the sample corpus is improved in an auxiliary mode, and the effectiveness of the generated sample corpus is ensured to a certain extent.
Step 102: the processing equipment constructs pre-training models comprising multiple layers of coding networks, inputs first seed sentences into the pre-training models added with noise disturbance, and obtains first fusion results determined according to output vectors of the coding networks belonging to a preset first class level in the pre-training models added with the noise disturbance.
In the embodiment of the application, when the processing device constructs multiple pre-training models each comprising a multi-layer coding network, it can obtain a reference model comprising a multi-layer coding network, determine the number of attention heads of each coding layer in the reference model and the neuron deactivation probability in each coding layer, and then construct the pre-training models by adjusting the number of attention heads and the neuron deactivation probability of the coding networks in the reference model.
Specifically, the processing device may select a BERT model with a specified structure as the reference model and obtain each pre-training model by adjusting the structure and parameters of the BERT model, where the adjusted structure refers to the number of attention heads (head) in the encoding layers of the BERT model, the adjusted parameter refers to the value of the attention deactivation probability (attention_dropout), and the number of pre-training models obtained is set according to actual processing needs, which is not specifically limited in this application.
Referring to Table 1, which illustrates a reference model in the embodiment of the application and other models obtained by adjusting it: BERT has 12 encoding layers, each with 6 attention heads and an attention deactivation probability of 0.1, i.e. the model identified by model number M1 in Table 1. Based on reference model M1, only the attention deactivation probability may be adjusted to obtain model 1, only the number of attention heads may be adjusted to obtain model 2, or both the number of attention heads and the attention deactivation probability may be adjusted to obtain model 3.
TABLE 1
Model | Model number | Heads | attention_dropout
Reference model | M1 | 6 | 0.1
Model 1 | M2 | 6 | 0.15
Model 2 | M3 | 12 | 0.1
Model 3 | M4 | 12 | 0.15
It should be noted that, in some possible embodiments of the present application, the pre-training models obtained by the processing device may include the selected reference model, for example the models corresponding to model numbers M1-M4 in Table 1, while in other possible embodiments the pre-training models may exclude the reference model, for example only the models corresponding to model numbers M2-M4 in Table 1.
Therefore, the constructed pre-training models with different structures and parameters make it possible to generate groups of similar positive sample corpora for a single first seed sentence in different ways, and provide a processing basis for doing so.
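As an illustration, the following sketch builds the model variants of Table 1 by adjusting the number of attention heads and the attention dropout of a BERT configuration; the helper name and the use of the transformers library are assumptions.

```python
# Sketch: construct the pre-training model variants of Table 1 from one reference configuration.
from transformers import BertConfig, BertModel

def build_variant(num_heads: int, attention_dropout: float) -> BertModel:
    config = BertConfig(
        num_hidden_layers=12,                          # 12 encoding layers, as in the reference model
        num_attention_heads=num_heads,                 # adjusted structure
        attention_probs_dropout_prob=attention_dropout, # adjusted parameter
        output_hidden_states=True,                     # needed later to read per-layer output vectors
    )
    return BertModel(config)

# M1 (reference), M2, M3, M4 from Table 1
pretrain_models = {
    "M1": build_variant(6, 0.10),
    "M2": build_variant(6, 0.15),
    "M3": build_variant(12, 0.10),
    "M4": build_variant(12, 0.15),
}
```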
In the embodiment of the application, in order to increase the differences between the first fusion results generated from the same first seed sentence, and to avoid model overfitting when the similar positive sample corpora generated from these first fusion results are later used for model training, the processing device needs to add noise disturbance to each obtained pre-training model.
When the processing device adds noise disturbance to a pre-training model, it may specifically use linear transformation, adding the disturbance by means of configured perturbation factors; noise functions may also be used, alone or in combination, according to actual processing requirements.
Specifically, the processing device may, at different processing stages of a pre-training model, add noise disturbance to the processing results of those stages by linear transformation with the configured perturbation factors; in addition, it may apply noise functions at different processing stages of the pre-training model to add noise disturbance to the corresponding processing results in a targeted manner.
Based on the above proposed method for adding noise disturbance, at least one or a combination of the following processing means may be used to actually add noise disturbance.
Processing means 1: the processing device processes the input data of each coding layer based on the configured first perturbation factors.
In this embodiment of the application, the processing device may add disturbance to input data of each coding layer in the pre-training model.
Specifically, the processing device may obtain input data of a layer of coding network to which noise is added by using the following formula:
En=μ*E
wherein, E is input data in a vector form, mu is a first disturbance factor, the value interval corresponding to mu is [0.85, 1], and En is input data added with noise.
It should be noted that, in the implementation of the present application, the same first perturbation factor may be used for processing the input data of each coding layer, or different first perturbation factors may be used for processing the input data of each coding layer according to actual processing requirements, which is not limited in this application.
Processing means 2: the processing device processes the model parameters of each coding layer based on the configured second perturbation factors.
In this embodiment of the application, the processing device may add disturbance to the model parameters of each layer of the coding network in the pre-training model.
Specifically, the processing device may obtain the model parameter added with the noise by using the following formula:
Wn=φ*W
wherein Wn is the model parameter in the layer of coding network after the disturbance is added, phi is the second disturbance factor, the value interval corresponding to phi is [0.70, 1], and W is the model parameter in the layer of coding network.
It should be noted that, in the implementation of the present application, the same second perturbation factor may be used for processing the model parameter of each coding layer, or different second perturbation factors may be used for processing the model parameter of each coding layer according to the actual processing requirement, which is not limited in this application.
Processing means 3: the processing device processes the gradient parameters calculated during back-propagation based on the configured third perturbation factor.
In this embodiment of the present application, the processing device may add disturbance to the gradient parameters calculated during back-propagation by using the following formula:
Gn=θ*G
wherein Gn is the gradient parameter after the disturbance is added, θ is the third perturbation factor with value interval [0.9, 1], and G is the gradient parameter calculated during back-propagation.
It should be noted that the calculation process of the gradient parameters is a conventional technique in the art, and the present application is not specifically described herein.
In this way, by means of the perturbation factors configured in processing means 1-3, noise disturbance can be added to a pre-training model, so that first fusion results that differ from one another are obtained during the subsequent pre-training of the model, and other models trained on the similar sample corpora constructed from these first fusion results are prevented from overfitting.
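A minimal sketch of processing means 1-3 is given below, assuming PyTorch; applying it to a real multi-layer coding network would require per-layer hooks, and the function names and default factor values are illustrative assumptions within the value intervals above.

```python
# Sketch: scale encoder inputs, model parameters and gradients by perturbation factors mu, phi, theta.
import torch

def perturb_input(E: torch.Tensor, mu: float = 0.9) -> torch.Tensor:
    assert 0.85 <= mu <= 1.0            # value interval given for the first perturbation factor
    return mu * E                       # En = mu * E

def perturb_weights(layer: torch.nn.Module, phi: float = 0.8) -> None:
    assert 0.70 <= phi <= 1.0           # value interval given for the second perturbation factor
    with torch.no_grad():
        for p in layer.parameters():
            p.mul_(phi)                 # Wn = phi * W

def perturb_gradients(layer: torch.nn.Module, theta: float = 0.95) -> None:
    assert 0.9 <= theta <= 1.0          # value interval given for the third perturbation factor
    for p in layer.parameters():
        if p.grad is not None:
            p.grad.mul_(theta)          # Gn = theta * G, applied after loss.backward()
```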
Processing means 4: the processing device processes the input data of each coding layer with the preset first noise functions.
Specifically, the processing device may use a Gaussian noise function as the first noise function and process the input data of a coding layer with the following formula:
En=E+θ*N(a,b)
the method comprises the steps that En is an input vector of a layer of coding network added with noise disturbance, E is the input vector of the layer of coding network, N (a, b) is a Gaussian noise function, a is the mean value of E, b is the variance of E, and theta is a coefficient, and the value is set according to actual processing requirements, such as the value is 0.05.
It should be noted that, in the implementation of the present application, a single first seed sentence is used as the input to each pre-training model in order to describe the generation of the groups of positive and negative sample corpora. In actual processing, during a batch of iterative training of a pre-training model, multiple first seed sentences are usually input at the same time, so the pre-training model obtains the input vectors of a coding layer for all of these sentences simultaneously; the element-wise means and variances at each position can then be calculated and used as the parameters of the Gaussian noise function.
Meanwhile, the same first noise function may be adopted for processing the input data of each coding layer, or different first noise functions determined based on different coefficients θ may be respectively adopted for processing the input data of each coding layer according to actual processing requirements, which is not specifically limited in the present application.
Processing means 5: the processing device processes the model parameters of each coding layer with the preset second noise functions.
Specifically, the processing device may use a Gaussian noise function as the second noise function and process the model parameters of a coding layer with the following formula:
Wn=W+β*N(c,d)
where Wn is a model parameter of the coding layer after noise disturbance is added, W is the model parameter of the coding layer, β is a coefficient whose value is set according to actual processing needs, e.g. 0.05, N(c, d) is the second noise function, c is the mean of W, and d is the variance of W.
It should be noted that, in the implementation of the present application, when noise disturbance is added in this manner, the mean and variance of each model parameter may be calculated from the values that the same model parameter took in the corresponding coding layer during the previous training round of each batch, and these means and variances are then used as the parameters of the Gaussian noise function.
Meanwhile, the same second noise function may be adopted for processing the model parameters of each coding layer, or different second noise functions determined based on different coefficients β may be adopted for processing the model parameters of each coding layer according to actual processing requirements, which is not specifically limited in the present application.
Processing means 6: the processing device processes the gradient parameters calculated during back-propagation with a preset third noise function.
Specifically, the processing device may use a Gaussian noise function as the third noise function and add disturbance to the gradient parameters calculated during back-propagation with the following formula:
Gn=G+γ*N(e,f)
where Gn is the gradient parameter after the disturbance is added, G is the gradient parameter calculated during back-propagation, N(e, f) is the third noise function, e is the mean of G, f is the variance of G, and γ is a coefficient whose value is set according to actual processing requirements, e.g. 0.05.
Thus, with the processing in means 4-6, the processing device can call a noise function during the training of each pre-training model to selectively apply dynamic interference to the input data, the model parameters or the gradient parameters, realizing noise superposition and increasing the differences between the first fusion results obtained subsequently.
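A minimal sketch of the Gaussian-noise variants in processing means 4-6 is given below, assuming PyTorch; the helper name and call sites are assumptions, and the coefficient 0.05 follows the example values above.

```python
# Sketch: add Gaussian noise, parameterised by the mean and variance of the tensor
# being perturbed, to inputs, model parameters or gradients.
import torch

def add_gaussian_noise(x: torch.Tensor, coeff: float = 0.05) -> torch.Tensor:
    noise = x.mean() + x.std() * torch.randn_like(x)   # sample from N(mean(x), var(x))
    return x + coeff * noise                            # e.g. En = E + theta * N(a, b)

# Assumed call sites during training:
#   hidden = add_gaussian_noise(hidden)                       # perturb a coding layer's input
#   with torch.no_grad(): w.copy_(add_gaussian_noise(w))      # perturb a model parameter
#   p.grad = add_gaussian_noise(p.grad)                       # perturb a gradient after backward()
```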
In this embodiment of the application, after determining each pre-training model to which noise disturbance is added, the processing device may input an obtained first seed statement in the target field into each pre-training model to which noise disturbance is added, and obtain each first fusion result determined according to an output vector belonging to a coding network of a preset first class level in each pre-training model to which noise disturbance is added.
Specifically, to obtain each first fusion result, the processing device performs the following operations for each pre-training model to which noise disturbance has been added: determine at least one target-level coding network belonging to the preset first-class hierarchy in the model, obtain the output vector of each target-level coding network, and perform a weighted summation over the elements at the same positions in these output vectors to obtain the corresponding first fusion result.
It should be noted that, in this embodiment of the application, the processing device may preset a first class of hierarchy according to an actual processing requirement, where the first class of hierarchy includes numbers of designated coding networks of each layer, and in addition, the number of layers of the coding networks constrained in the first class of hierarchy and a specific number are determined according to the actual processing requirement, and this application is not limited specifically herein.
In addition, in this embodiment of the application, the processing device sets corresponding weight parameters in advance for output vectors of each layer of coding networks in the first class of hierarchy, and a sum of the weight parameters corresponding to each layer of coding networks in the first class of hierarchy is 1, where the set weight parameters are set according to actual needs, for example, the set weight parameters may be set to be the same or different, and this application does not make specific limitations.
For example, assume that the preset first-class hierarchy consists of the 3rd, 6th, 9th and 12th layers. Taking one noise-disturbed pre-training model as an example, after the processing device inputs the first seed sentence into the model, it obtains the output vectors of the 3rd, 6th, 9th and 12th layers; assuming that a weight of 0.25 is configured for each of these output vectors according to actual processing requirements, the output vectors are weighted and summed with these weights to obtain a first fusion result.
Therefore, in the process of unsupervised pre-training of the noise-disturbed pre-training models with the first seed sentence, the processing device obtains from each noise-disturbed pre-training model a first fusion result for the same first seed sentence, and it is ensured that these first fusion results differ from one another.
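A minimal sketch of computing such a fusion result is given below, assuming the transformers library and using a public BERT checkpoint as a stand-in for a noise-disturbed pre-training model; the layer indices and the 0.25 weights follow the example above, while the model and tokenizer handles are assumptions.

```python
# Sketch: weighted element-wise sum of the output vectors of coding layers 3, 6, 9, 12.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)  # stand-in model

FUSION_LAYERS = [3, 6, 9, 12]          # preset first-class hierarchy from the example
LAYER_WEIGHTS = [0.25, 0.25, 0.25, 0.25]

def fusion_result(seed_sentence: str) -> torch.Tensor:
    batch = tokenizer(seed_sentence, return_tensors="pt")
    hidden_states = model(**batch).hidden_states        # tuple: embeddings + one entry per layer
    fused = sum(w * hidden_states[l] for w, l in zip(LAYER_WEIGHTS, FUSION_LAYERS))
    return fused.squeeze(0)                              # weighted sum at each element position
```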
Step 103: the processing equipment determines a target pre-training model in each pre-training model, inputs each second seed statement into the target pre-training model respectively, and obtains a second fusion result determined according to an output vector of a coding network belonging to a preset second class level in the target pre-training model.
Specifically, the processing device determines, among the pre-training models, a target pre-training model for generating the second fusion results; any pre-training model may be selected as the target pre-training model, and the number of target pre-training models determined may be one or more according to actual processing needs.
In the embodiment of the application, considering that the similar negative sample corpora are subsequently generated by combining a first fusion result with the second fusion results, after the noise-disturbed pre-training model corresponding to that first fusion result has been determined, the pre-training model on which it is based can be used as the target pre-training model. The generated groups of similar negative sample corpora are then based on the same model structure, so that when other models are subsequently trained they can concentrate on learning the semantic differences between the similar negative sample corpora.
Further, the processing device respectively inputs the obtained second seed sentences into a target pre-training model, and respectively obtains a second fusion result determined according to an output vector of a coding network belonging to a preset second class hierarchy in the target pre-training model.
Specifically, after the processing device inputs a second seed sentence into the target pre-training model, the following operations are respectively performed: and determining at least one target level coding network belonging to a preset second class level in the target pre-training model, obtaining output vectors of all the target level coding networks, and performing weighted summation on elements at the same positions in all the output vectors to obtain a corresponding second fusion result.
It should be noted that, in this embodiment of the application, the processing device may preset the second class of hierarchy according to actual processing requirements, where the second class of hierarchy records the layer numbers of the designated coding networks; both how many coding network layers it covers and which specific layers they are depend on actual processing requirements, and are not specifically limited in this application.
For example, assume that the preset second class of hierarchy specifically covers the 3rd, 6th, 9th, and 12th layers. The processing device inputs a second seed sentence into the target pre-training model and obtains the output vectors of the 3rd, 6th, 9th, and 12th layers of the target pre-training model. Assuming that, according to actual processing requirements, a weight parameter of 0.25 is configured for the output vector of each of these layers, the output vectors are weighted and summed according to the configured weight parameters to obtain a second fusion result.
In this way, the first fusion results and the second fusion results used to construct the similar negative sample corpora can be generated by pre-training models of the same structure.
Step 104: the processing device generates each group of similar positive sample corpora according to the first fusion results, and generates each group of similar negative sample corpora according to the first fusion results and the second fusion results.
In this embodiment of the application, the processing device determines a target first fusion result among the first fusion results, combines the target first fusion result with each of the other first fusion results to obtain the groups of similar positive sample corpora, and combines the target first fusion result with each second fusion result to obtain the groups of similar negative sample corpora.
For example, assume that there are 4 pre-training models, M1-M4. According to a first seed sentence Si, the noise-perturbed M1 generates a first fusion result Vi1, the noise-perturbed M2 generates Vi2, the noise-perturbed M3 generates Vi3, and the noise-perturbed M4 generates Vi4. One pre-training model can then be selected from M1-M4; assuming that M1 is selected, so that Vi1 is the target first fusion result, the constructed groups of similar positive sample corpora are: {Vi1, Vi2}, {Vi1, Vi3}, and {Vi1, Vi4}.
For another example, taking the pre-training model M1 corresponding to the target first fusion result as the target pre-training model, and assuming that the second fusion results generated by the target pre-training model for the second seed sentences Sj1-Sj5 are Nj1-Nj5, the generated groups of similar negative sample corpora are: {Vi1, Nj1}, {Vi1, Nj2}, {Vi1, Nj3}, {Vi1, Nj4}, and {Vi1, Nj5}.
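A minimal sketch of the pairing logic in these two examples is shown below; the dictionary keys M1-M4 and the choice of M1 as the target are taken directly from the examples, while the function name is only illustrative.

```python
def build_sample_pairs(first_fusion_results, second_fusion_results, target_key="M1"):
    """Combine the target first fusion result with the remaining first fusion results
    (similar positive pairs) and with the second fusion results (similar negative pairs)."""
    target = first_fusion_results[target_key]
    positive_pairs = [(target, v) for key, v in first_fusion_results.items() if key != target_key]
    negative_pairs = [(target, n) for n in second_fusion_results]
    return positive_pairs, negative_pairs

# Usage with the vectors from the examples above:
# positives, negatives = build_sample_pairs({"M1": Vi1, "M2": Vi2, "M3": Vi3, "M4": Vi4},
#                                           [Nj1, Nj2, Nj3, Nj4, Nj5])
```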
In this way, similar positive sample corpora are generated by combining the fusion results that the noise-perturbed pre-training models produce for one first seed sentence, while similar negative sample corpora are generated by combining, through the same pre-training model, the fusion results produced for seed sentences of different fields, so that the two corpora in each generated group of similar negative sample corpora have an obvious semantic difference. This simplifies the generation process of the similar sample corpora, improves the generation efficiency of the similar sample corpora, and makes it possible to generate effective similar sample corpora.
In summary, in the case that the pre-training model is a BERT model, according to the technical scheme of the present application, after BERT models with different structures are built, each perturbed BERT model generates, during unsupervised pre-training based on the first seed sentence, a first fusion result by weighting the output vectors of the designated layers, and these first fusion results can be used to generate the groups of similar positive sample corpora.
Based on the same inventive concept, referring to fig. 2, which is a schematic diagram of a logic structure of a similar sample corpus generating device in the embodiment of the present application, the similar sample corpus generating device 200 includes an obtaining unit 201, a constructing unit 202, a determining unit 203, and a generating unit 204, wherein,
an obtaining unit 201, configured to obtain a first seed sentence in a target field, and obtain second seed sentences in other fields except the target field, where the seed sentences include entity nouns in the field to which the seed sentences belong;
a constructing unit 202, configured to construct pre-training models each including multiple layers of coding networks, and input the first seed statement into each pre-training model to which noise disturbance is added, to obtain each first fusion result determined according to an output vector belonging to a coding network of a preset first class level in each pre-training model to which noise disturbance is added;
the determining unit 203 is configured to determine a target pre-training model in each pre-training model, and input each second seed statement into the target pre-training model respectively to obtain a second fusion result determined according to an output vector of a coding network belonging to a preset second class level in the target pre-training model;
the generating unit 204 is configured to generate each group of similar positive sample corpora according to each first fusion result, and generate each group of similar negative sample corpora according to each first fusion result and each second fusion result.
Optionally, when the first seed sentence in the target field is obtained, and each second seed sentence in other fields except the target field is obtained, the obtaining unit 201 is configured to:
acquiring a first candidate text of a target field, and acquiring second candidate texts in other fields except the target field;
processing the first candidate text and the second candidate text into a specified coding format, and respectively performing noise reduction processing and illegal character cleaning processing on the first candidate text and the second candidate text in the specified coding format;
and splitting the processed first candidate text according to the designated characters to obtain a first seed sentence, and splitting the processed second candidate text according to the designated characters to obtain each second seed sentence; a minimal preprocessing sketch is given below.
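The sketch below illustrates this preprocessing under stated assumptions: the specified coding format is taken to be UTF-8 obtained by decoding from a source encoding, the noise-reduction and illegal-character rules are simple regular expressions, and the designated splitting characters are common sentence-ending punctuation; none of these concrete choices is mandated by this application.

```python
import re

def candidate_text_to_seed_sentences(raw_bytes, source_encoding="gbk"):
    """Convert a candidate text into the specified format, clean it, and split it into seed sentences."""
    text = raw_bytes.decode(source_encoding, errors="ignore")  # unify into one coding format
    text = re.sub(r"\s+", " ", text)                           # simple noise reduction
    text = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9，。！？,.!?\s]", "", text)  # clean illegal characters
    # Split on the designated characters to obtain seed sentences.
    return [s.strip() for s in re.split(r"[。！？.!?]", text) if s.strip()]
```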
Optionally, when the first candidate text in the target field is obtained and the second candidate texts in other fields except the target field are obtained, the obtaining unit 201 is configured to:
acquiring a trained text field classification model, wherein the text field classification model is obtained by training based on text samples of each field;
and respectively inputting the obtained candidate texts into the text field classification model, obtaining the classification result corresponding to each candidate text, taking a candidate text belonging to the target field as a first candidate text, and taking a candidate text not belonging to the target field as a second candidate text; a routing sketch is given below.
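The routing of candidate texts might look like the sketch below; `domain_classifier` stands in for the trained text field classification model, and its `predict` interface is an assumption rather than an API defined by this application.

```python
def split_candidates_by_domain(candidate_texts, domain_classifier, target_domain):
    """Return (first candidate texts, second candidate texts) for a given target field."""
    first_candidates, second_candidates = [], []
    for text in candidate_texts:
        predicted_domain = domain_classifier.predict([text])[0]  # classification result
        if predicted_domain == target_domain:
            first_candidates.append(text)    # candidate text belonging to the target field
        else:
            second_candidates.append(text)   # candidate text belonging to another field
    return first_candidates, second_candidates
```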
Optionally, when constructing each pre-training model including a multi-layer coding network, the constructing unit 202 is configured to:
acquiring a reference model comprising a plurality of layers of coding networks, and determining the attention head number of each layer of coding network in the reference model and the inactivation probability of neurons in each layer of coding network;
and constructing pre-training models each comprising a multi-layer coding network by adjusting the attention head number of the coding networks in the reference model and the inactivation probability of the neurons; a minimal construction sketch follows below.
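A minimal construction sketch is given below, again assuming BERT-style coding networks from the Hugging Face library; the particular head numbers and inactivation (dropout) probabilities are illustrative values chosen so that each head number still divides the hidden size.

```python
from transformers import BertConfig, BertModel

# Reference model configuration containing the multi-layer coding network.
reference_config = BertConfig.from_pretrained("bert-base-chinese")

# Each variant adjusts the attention head number and the neuron inactivation probability.
variants = [
    {"num_attention_heads": 12, "hidden_dropout_prob": 0.1},
    {"num_attention_heads": 8,  "hidden_dropout_prob": 0.2},
    {"num_attention_heads": 16, "hidden_dropout_prob": 0.1},
    {"num_attention_heads": 12, "hidden_dropout_prob": 0.3},
]

pretraining_models = []
for variant in variants:
    config_dict = reference_config.to_dict()
    config_dict.update(variant)
    # Each model is randomly initialized and would subsequently be pre-trained.
    pretraining_models.append(BertModel(BertConfig(**config_dict)))
```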
Optionally, when noise disturbance is added to each pre-training model, the constructing unit 202 performs, for each pre-training model, any one or a combination of the following operations (a sketch of these options follows the list):
respectively processing input data of each layer of coding network based on each configured first disturbance factor;
respectively processing the model parameters of each layer of coding network based on each configured second disturbance factor;
processing the gradient parameter obtained by calculation during reverse propagation based on the configured third disturbance factor;
respectively processing input data of each layer of coding network by adopting each preset first noise function;
respectively processing the model parameters of each layer of coding network by adopting each preset second noise function;
and processing the gradient parameters obtained by calculation during reverse propagation by adopting a preset third noise function.
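A PyTorch sketch of these perturbation options is shown below; the multiplicative factors and the Gaussian noise function are illustrative choices, and a real configuration could apply any one of them or a combination, as stated above.

```python
import torch

def perturb_layer_inputs(layer_inputs, first_factor=0.01):
    # Process the input data of a coding network layer with a configured disturbance factor.
    return layer_inputs * (1.0 + first_factor)

def perturb_model_parameters(model, second_factor=0.01):
    # Process the model parameters of every coding network layer with a configured disturbance factor.
    with torch.no_grad():
        for param in model.parameters():
            param.mul_(1.0 + second_factor)

def perturb_gradients(model, third_factor=0.01, noise_std=1e-3):
    # Process the gradients computed during back-propagation: scale by a configured factor
    # and add a preset noise function (Gaussian noise here).
    for param in model.parameters():
        if param.grad is not None:
            param.grad.mul_(1.0 + third_factor)
            param.grad.add_(torch.randn_like(param.grad) * noise_std)
```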
Optionally, when obtaining each first fusion result determined according to the output vector of the coding network belonging to the preset first class hierarchy in each pre-training model added with noise disturbance, the constructing unit 202 is configured to:
for each pre-training model with the noise disturbance added, respectively executing the following operations:
determining at least one target level coding network belonging to a preset first class level in a pre-training model added with noise disturbance, and obtaining an output vector of each target level coding network;
and carrying out weighted summation on elements at the same positions in the output vectors to obtain a corresponding first fusion result.
Optionally, when generating each group of similar positive sample corpora according to each first fusion result, and generating each group of similar negative sample corpora according to each first fusion result and each second fusion result, the generating unit 204 is configured to:
determining a target first fusion result in each first fusion result, and combining the target first fusion result with each other first fusion result except the target first fusion result in each first fusion result to obtain each group of similar positive sample corpora;
and combining the target first fusion result with each second fusion result respectively to obtain each group of similar negative sample corpora.
Based on the same inventive concept as the above method embodiment, an embodiment of the present application further provides an electronic device. Referring to fig. 3, which is a schematic diagram of the hardware structure of an electronic device to which the embodiment of the present application is applied, the electronic device 300 may at least include a processor 301 and a memory 302. The memory 302 stores program code, and when the program code is executed by the processor 301, the processor 301 performs any of the steps of the above method for generating similar sample corpora.
In some possible implementations, a computing device according to the present application may include at least one processor, and at least one memory. Wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps of generating similar sample corpora according to various exemplary embodiments of the present application described above in the present specification. For example, a processor may perform the steps as shown in fig. 1.
Based on the same inventive concept, an embodiment of the present application further provides a computer-readable storage medium; when instructions in the storage medium are executed by an electronic device, the electronic device is enabled to perform the above method for generating similar sample corpora.
In summary, the present application provides a method, an apparatus, an electronic device, and a storage medium for generating similar sample corpora. In the technical solution provided by the present application, a first seed sentence in a target field is obtained, and second seed sentences in fields other than the target field are obtained, where the seed sentences include entity nouns of the fields to which they belong. Pre-training models each comprising a multi-layer coding network are then constructed, and the first seed sentence is input into each pre-training model to which noise disturbance is added, to obtain the first fusion results determined according to the output vectors of the coding networks belonging to a preset first class of hierarchy in the noise-perturbed pre-training models. A target pre-training model is then determined among the pre-training models, and each second seed sentence is input into the target pre-training model to obtain the second fusion results determined according to the output vectors of the coding networks belonging to a preset second class of hierarchy in the target pre-training model. Finally, each group of similar positive sample corpora is generated according to the first fusion results, and each group of similar negative sample corpora is generated according to the first fusion results and the second fusion results.
Thus, when generating the similar sample corpora of the target field, the first seed sentence of the target field is input into each noise-perturbed pre-training model, so that the first fusion results generated for the first seed sentence incorporate various noises to different degrees, which guarantees the similarity between the similar sample corpora. Meanwhile, when the similar negative sample corpora are generated, at least one target pre-training model determined from the pre-training models generates the corresponding similar negative sample corpora based on the second seed sentences of different fields, so that the two corpora in each generated group of similar negative sample corpora have an obvious semantic difference. The generation process of the similar sample corpora is thereby simplified, the generation efficiency of the similar sample corpora is improved, and effective similar sample corpora can be generated.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to encompass such modifications and variations.
Claims (12)
1. A method for generating similar sample corpora, applied to a similar sample corpus generating process in a target field, the method comprising the following steps:
acquiring a first seed sentence of a target field and acquiring second seed sentences in other fields except the target field, wherein the seed sentences comprise entity nouns in the field to which the seed sentences belong;
constructing pre-training models comprising a plurality of layers of coding networks, inputting the first seed statement into each pre-training model added with noise disturbance, and obtaining each first fusion result determined according to the output vector belonging to the coding network of a preset first class level in each pre-training model added with noise disturbance;
determining a target pre-training model in each pre-training model, respectively inputting each second seed statement into the target pre-training model, and respectively obtaining a second fusion result determined according to an output vector of a coding network belonging to a preset second class level in the target pre-training model;
generating each group of similar positive sample corpora according to each first fusion result, and generating each group of similar negative sample corpora according to each first fusion result and each second fusion result;
the obtaining of each first fusion result determined according to the output vector belonging to the coding network of the preset first class hierarchy in each pre-training model added with noise disturbance includes:
for each pre-training model with the noise disturbance added, respectively executing the following operations:
determining at least one target level coding network belonging to a preset first level in a pre-training model added with noise disturbance, and obtaining an output vector of each target level coding network; carrying out weighted summation on elements at the same positions in each output vector to obtain a corresponding first fusion result;
the generating each group of similar positive sample corpora according to the first fusion results and generating each group of similar negative sample corpora according to the first fusion results and the second fusion results includes:
determining a target first fusion result in each first fusion result, and combining the target first fusion result with each other first fusion result except the target first fusion result in each first fusion result to obtain each group of similar positive sample corpora; and combining the target first fusion result with each second fusion result respectively to obtain each group of similar negative sample corpora.
2. The method of claim 1, wherein the obtaining a first seed sentence in a target domain and obtaining respective second seed sentences in other domains than the target domain comprises:
acquiring a first candidate text of a target field, and acquiring second candidate texts in other fields except the target field;
processing the first candidate text and the second candidate text into a specified coding format, and respectively performing noise reduction processing and illegal character cleaning processing on the first candidate text and the second candidate text in the specified coding format;
and splitting the processed first candidate text according to the designated characters to obtain a first seed sentence, and splitting the processed second candidate text according to the designated characters to obtain each second seed sentence.
3. The method of claim 2, wherein the obtaining a first candidate text in a target domain and obtaining a second candidate text in other domains than the target domain comprises:
acquiring a trained text field classification model, wherein the text field classification model is obtained by training based on text samples of each field;
and respectively inputting the obtained candidate texts into the text field classification model, obtaining classification results corresponding to the candidate texts, taking the candidate text belonging to the target field as a first candidate text, and taking the candidate text not belonging to the target field as a second candidate text.
4. The method of claim 1, wherein constructing pre-trained models each comprising a multi-layered coding network comprises:
acquiring a reference model containing a plurality of layers of coding networks, and determining the attention head number of each layer of coding network in the reference model and the inactivation probability of neurons in each layer of coding network;
and constructing pre-training models comprising a plurality of layers of coding networks by adjusting the attention head number of the coding networks in the reference model and the inactivation probability of the neurons.
5. The method of claim 1, wherein any one or a combination of the following operations are performed separately for each pre-trained model as noise perturbations are added in each pre-trained model:
respectively processing input data of each layer of coding network based on each configured first disturbance factor;
respectively processing the model parameters of each layer of coding network based on each configured second disturbance factor;
processing the gradient parameter obtained by calculation during reverse propagation based on the configured third disturbance factor;
processing input data of each layer of coding network respectively by adopting each preset first noise function;
respectively processing the model parameters of each layer of coding network by adopting each preset second noise function;
and processing the gradient parameters obtained by calculation during reverse propagation by adopting a preset third noise function.
6. A similar sample corpus generating device, applied to a similar sample corpus generating process in a target field, the device comprising:
the acquiring unit is used for acquiring a first seed sentence of a target field and acquiring second seed sentences in other fields except the target field, wherein the seed sentences comprise entity nouns in the field to which the seed sentences belong;
the construction unit is used for constructing pre-training models comprising multiple layers of coding networks, inputting the first seed statement into each pre-training model added with noise disturbance, and obtaining each first fusion result determined according to the output vector belonging to the coding network of a preset first class level in each pre-training model added with noise disturbance;
the determining unit is used for determining a target pre-training model in each pre-training model, inputting each second seed statement into the target pre-training model respectively, and obtaining a second fusion result determined according to an output vector of a coding network belonging to a preset second class level in the target pre-training model;
the generating unit is used for generating each group of similar positive sample corpora according to each first fusion result and generating each group of similar negative sample corpora according to each first fusion result and each second fusion result;
wherein, when obtaining each first fusion result determined according to the output vector belonging to the coding network of the preset first class hierarchy in each pre-training model added with noise disturbance, the constructing unit is configured to:
for each pre-training model with the noise disturbance added, respectively executing the following operations:
determining at least one target level coding network belonging to a preset first level in a pre-training model added with noise disturbance, and obtaining an output vector of each target level coding network; carrying out weighted summation on elements at the same positions in each output vector to obtain a corresponding first fusion result;
when generating each group of similar positive sample corpora according to the first fusion results and generating each group of similar negative sample corpora according to the first fusion results and the second fusion results, the generating unit is configured to: determining a target first fusion result in each first fusion result, and combining the target first fusion result with each other first fusion result except the target first fusion result in each first fusion result respectively to obtain each group of similar positive sample corpora; and combining the target first fusion result with each second fusion result respectively to obtain each group of similar negative sample corpora.
7. The apparatus according to claim 6, wherein when the first seed sentence of the target domain is obtained, and the second seed sentences in other domains except the target domain are obtained, the obtaining unit is configured to:
acquiring a first candidate text of a target field, and acquiring second candidate texts in other fields except the target field;
processing the first candidate text and the second candidate text into a specified coding format, and respectively performing noise reduction processing and illegal character cleaning processing on the first candidate text and the second candidate text in the specified coding format;
and splitting the processed first candidate text according to the designated characters to obtain a first seed sentence, and splitting the processed second candidate text according to the designated characters to obtain each second seed sentence.
8. The apparatus of claim 7, wherein when obtaining a first candidate text in a target domain and obtaining a second candidate text in other domains except the target domain, the obtaining unit is configured to:
acquiring a trained text field classification model, wherein the text field classification model is obtained by training based on text samples of each field;
and respectively inputting the obtained candidate texts into the text field classification model, obtaining classification results corresponding to the candidate texts, taking the candidate text belonging to the target field as a first candidate text, and taking the candidate text not belonging to the target field as a second candidate text.
9. The apparatus of claim 6, wherein the build unit, in building each pre-trained model comprising a multi-layered coding network, is to:
acquiring a reference model comprising a plurality of layers of coding networks, and determining the attention head number of each layer of coding network in the reference model and the inactivation probability of neurons in each layer of coding network;
and constructing pre-training models comprising a plurality of layers of coding networks by adjusting the attention head number of the coding networks in the reference model and the inactivation probability of the neurons.
10. The apparatus of claim 6, wherein the construction unit performs any one or a combination of the following operations for each pre-trained model when adding noise disturbance in each pre-trained model, respectively:
respectively processing input data of each layer of coding network based on each configured first disturbance factor;
respectively processing the model parameters of each layer of coding network based on each configured second disturbance factor;
processing the gradient parameter obtained by calculation during reverse propagation based on the configured third disturbance factor;
processing input data of each layer of coding network respectively by adopting each preset first noise function;
respectively processing the model parameters of each layer of coding network by adopting each preset second noise function;
and processing the gradient parameters obtained by calculation during reverse propagation by adopting a preset third noise function.
11. An electronic device, comprising:
a memory for storing executable instructions;
a processor for reading and executing executable instructions stored in the memory to implement the method of any one of claims 1 to 5.
12. A storage medium, wherein instructions in the storage medium, when executed by an electronic device, enable the electronic device to perform the method of any of claims 1-5.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111622743.9A (CN114357974B) | 2021-12-28 | 2021-12-28 | Similar sample corpus generation method and device, electronic equipment and storage medium |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN114357974A | 2022-04-15 |
| CN114357974B | 2022-09-23 |
Family ID: 81104265

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111622743.9A (Active) | Similar sample corpus generation method and device, electronic equipment and storage medium | 2021-12-28 | 2021-12-28 |

Country Status (1)

| Country | Link |
|---|---|
| CN (1) | CN114357974B |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |