CN115408498A

CN115408498A - Dynamic data identification method based on natural language

Info

Publication number: CN115408498A
Application number: CN202211359030.2A
Authority: CN
Inventors: 杨介; 崔昆俞; 赵鸿; 伍之洲
Original assignee: Zhongfu Safety Technology Co Ltd
Current assignee: Zhongfu Safety Technology Co Ltd
Priority date: 2022-11-02
Filing date: 2022-11-02
Publication date: 2022-11-29
Anticipated expiration: 2042-11-02
Also published as: CN115408498B

Abstract

The application discloses a dynamic data identification method based on natural language. It relates mainly to the technical field of dynamic data identification and addresses the limited generality and high uncertainty of existing model performance. The method comprises: determining semantic tag data corresponding to the sample data; generating an experiment set; splitting the experiment set into a training data set and a verification data set; importing sample data in the training data set into a preset encoder and splicing the extracted data into sample splicing data; importing semantic tag data in the training data set into a preset generator and splicing the extracted data into label splicing data; determining the distance cost value between the sample splicing data and the label splicing data to obtain a trained preset discriminator; obtaining a trained preset encoder and preset generator; obtaining verification sample splicing data; obtaining verification label splicing data; and completing the matching of the data. The method improves the degree of fit between the model and the data and thereby improves accuracy.

Description

Dynamic data identification method based on natural language

Technical Field

The application relates to the technical field of dynamic data identification, in particular to a dynamic data identification method based on natural language.

Background

Text classification in the field of natural language processing is widely applied across fields and industries. Its application scenarios are broad, and the resources available during development and implementation (service platforms, hardware, technical frameworks, data and the like) are varied.

The existing approach to dynamic data identification adopts a general architecture: the [CLS] layer of BERT or a similar model is taken as the input of a classifier, which is then trained on the classification task. Under normal conditions this meets certain industrial requirements, is simple to implement, and has a short development cycle.

However, as data asset security management becomes increasingly standardized, the general architecture runs into a performance bottleneck in real service scenarios, and the resulting model performance has limited generality and high uncertainty.

Disclosure of Invention

In view of the above-mentioned shortcomings in the prior art, the present invention provides a dynamic data identification method based on natural language, so as to solve the above-mentioned technical problems.

The application provides a dynamic data identification method based on natural language, which comprises the following steps:

acquiring a sample set, and determining semantic tag data corresponding to each sample data in the sample set;

generating an experiment set based on the sample data, the semantic tag data and the mapping relation between the sample data and the semantic tag data; splitting the experiment set into a training data set and a verification data set;

importing sample data in the training data set into a preset encoder; acquiring a plurality of sample sub-splicing data from a hidden layer of the preset encoder based on a preset sample splicing data acquisition position, and splicing the sample sub-splicing data into sample splicing data;

importing semantic tag data and preset reference dimension data in the training data set into a preset generator; acquiring a plurality of label sub-splicing data from a hidden layer of the preset generator based on a preset label splicing data acquisition position, and splicing the label sub-splicing data into label splicing data;

determining a distance cost value between the sample splicing data and the label splicing data based on a preset inter-distribution distance equation; importing the distance cost value, the preset learning rate, the preset smoothing constant and the initial discriminator weight value into a preset optimizer to complete weight updating of the preset discriminator, so as to obtain a trained preset discriminator;

importing the preset learning rate, the preset smoothing constant, the initial encoder weight value and the trained preset discriminator into the preset optimizer to complete updating of the preset encoder weight value; importing the preset learning rate, the preset smoothing constant, the initial encoder weight value and the trained preset discriminator into the preset optimizer to complete updating of the preset generator weight value; so as to obtain a trained preset encoder and a trained preset generator;

obtaining verification sample splicing data corresponding to sample data in the verification data set based on the trained preset encoder; obtaining verification label splicing data based on the trained preset generator or the label splicing data; and matching the verification sample splicing data with the verification label splicing data based on the trained preset discriminator or a preset matching degree calculation formula.

Further, the hidden layer of the preset encoder and the hidden layer of the preset generator both comprise a text semantic coding network layer and a label semantic coding network layer.

Further, determining semantic tag data corresponding to each sample data in the sample set specifically includes: acquiring a semantic tag set through a preset semantic tag interface; wherein the semantic tag set comprises semantic tag data; or, semantic tag data corresponding to each sample data is obtained through a preset keyword/subject term extraction algorithm; or analyzing the part of speech of the sample data through a preset sample part of speech analysis algorithm to obtain preset attribute words corresponding to the sample data, and splicing the preset attribute words into semantic tag data; or when the preset associated data set corresponding to the sample set is obtained, extracting the keywords/subject terms corresponding to the preset associated data set through a keyword/subject term extraction algorithm and a preset sample part-of-speech analysis algorithm to obtain semantic tag data.

Further, acquiring a sample set specifically includes: acquiring real service data, substitute open-source service data, or artificial sample data as the sample set through a preset sample uploading process.

Further, before determining a distance cost value between the sample splicing data and the label splicing data based on a preset inter-distribution distance equation, the method further includes: replacing the joint distribution in the Wasserstein-distance method with the encoder, replacing the marginal distribution with the generator, and replacing the sampling with the sample splicing data and the label splicing data; and obtaining a preset distance cost value calculation formula whose terms are D(), the output result of the preset discriminator, the sample splicing data, and the label splicing data.

Further, importing the distance cost value, the preset learning rate, the preset smoothing constant and the initial discriminator weight value into the preset optimizer to complete weight updating of the preset discriminator specifically comprises: updating the weight value of the preset discriminator through a preset discriminator weight update formula whose terms are the weight value generated in the updating process, the distance cost value, the preset learning rate, and the preset smoothing constant; and, when the weight value is larger than c or smaller than -c, clipping the preset discriminator weight value through a preset clipping formula, wherein c is the clipping threshold.

Further, importing the preset learning rate, the preset smoothing constant, the initial encoder weight value and the trained preset discriminator into the preset optimizer to complete updating of the preset encoder weight value specifically comprises: updating the weight value of the preset encoder through a preset encoder weight update formula whose terms are the preset encoder weight value generated in the updating process, the sample splicing data, the preset learning rate, and the preset smoothing constant.

Further, importing the preset learning rate, the preset smoothing constant, the initial encoder weight value and the trained preset discriminator into the preset optimizer to complete updating of the preset generator weight value specifically comprises: updating the weight value of the preset generator through a preset generator weight update formula whose terms are the preset generator weight value generated in the updating process, the label splicing data, the preset learning rate, and the preset smoothing constant.

Further, after obtaining the trained preset encoder and preset generator, the method further comprises: modifying the semantic tag data through a preset semantic tag modification interface.

As can be appreciated by those skilled in the art, the present invention has at least the following beneficial effects:

(1) By presetting the sample splicing data acquisition position and the label splicing data acquisition position, sample splicing data and label splicing data spliced from hidden layers of different depths of the relevant structures are obtained. Because the hidden layers corresponding to the acquisition positions are preset, a practitioner can select suitable hidden-layer input data or output data and splice them flexibly according to the network structure characteristics of the components concerned (the preset encoder, the preset generator and the like).

(2) When valid real service scene data cannot be obtained for model training, for reasons such as data asset privacy and security, relevant open-source data and semantic tag data relevant to the requirements can be obtained through a preset interface. This achieves a simulated training effect without real service scene data and avoids risks such as data privacy disclosure, while improving on the traditional scheme to a certain extent.

(3) Finally, the range and definition of the semantic tag data can be modified through the preset semantic tag modification interface, so that the classification function can be dynamically adjusted to a certain degree, more diversified requirements are met, and better support is provided for user customization.

Drawings

Some embodiments of the disclosure are described below with reference to the accompanying drawings, in which:

Fig. 1 is a flowchart of a dynamic data identification method based on natural language according to an embodiment of the present application.

Detailed Description

It should be understood by those skilled in the art that the embodiments described below are only preferred embodiments of the present disclosure. This does not mean that the present disclosure can be implemented only through the preferred embodiments, which merely explain the technical principles of the present disclosure and are not intended to limit its scope. All other embodiments that one of ordinary skill in the art can derive from the preferred embodiments without inventive effort, and that fall within the scope of the disclosure, are intended to be encompassed by the present disclosure.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.

The technical solutions proposed in the embodiments of the present application are described in detail below with reference to the accompanying drawings.

Fig. 1 is a flowchart of the dynamic data identification method based on natural language provided by an embodiment of the present application. As shown in Fig. 1, the method mainly includes the following steps:

Step 110, acquiring a sample set, and determining semantic tag data corresponding to each sample data in the sample set.

It should be noted that the sample set may be real service data. Where real service data is lacking, or real service scene data cannot participate directly in training for privacy and security reasons, the sample set may instead be substitute open-source service data rich in similar semantics. The substitute open-source service data may be supplemented by semantically similar replacements or by other open-source data sets. Where no suitable substitute open-source service data can be found, approximately 50 samples (artificial sample data) can be created manually for each tag as the sample set. The specific contents of the real service data, the substitute open-source service data and the artificial sample data can be determined by those skilled in the art according to actual conditions.

The sample set may be obtained through a preset sample uploading process.

The semantic tag data corresponding to each sample data in the sample set may specifically be determined as follows: (1) Acquiring a semantic tag set uploaded by an operator through a preset semantic tag interface, wherein the semantic tag set comprises the semantic tag data. (2) Importing the sample set into a preset keyword/subject term extraction algorithm to obtain the semantic tag data corresponding to each sample data. The preset keyword/subject term extraction algorithm can be any algorithm with a keyword/subject term extraction function, such as the TF-IDF algorithm. (3) Analyzing the sentence backbone through syntactic analysis and forming the tags directly from the backbone segments; tags obtained in this way are more generalized and suit individual requirements. Alternatively, where the original data are unlabeled, data related to the samples can be retrieved from the company's other business data, and the semantic tag data obtained by association through business logic and algorithms.
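As an illustration of option (2), the following is a minimal TF-IDF sketch using only the Python standard library. The tokenizer, the function names `tokenize` and `tfidf_tags`, the `top_k` parameter and the toy samples are all assumptions made for this example; the application does not prescribe a particular implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (toy tokenizer)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def tfidf_tags(samples, top_k=2):
    """Return the top_k highest-scoring TF-IDF terms for each sample."""
    docs = [tokenize(s) for s in samples]
    n_docs = len(docs)
    # Document frequency: number of samples that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    tags = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by score (descending), then alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        tags.append(ranked[:top_k])
    return tags


# Hypothetical sample data, for illustration only.
samples = [
    "invoice payment overdue invoice",
    "password reset request",
    "payment request approved",
]
tags = tfidf_tags(samples)
```

Terms that occur often in one sample but rarely across the set score highest, which is why they are plausible candidates for semantic tags.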

Thus, the acquisition of the sample set, the acquisition of the semantic tag data and the mapping of the sample data and the semantic tag data are completed.

Step 120, generating an experiment set based on the sample data, the semantic tag data and the mapping relation between the sample data and the semantic tag data; and splitting the experiment set into a training data set and a verification data set.
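Step 120 can be sketched as follows. The 80/20 ratio, the fixed seed and the names `build_experiment_set` and `split_experiment_set` are assumptions for this example; the application does not specify a split ratio.

```python
import random


def build_experiment_set(samples, tag_map):
    """Pair each sample with its semantic tag data via the mapping relation."""
    return [(s, tag_map[s]) for s in samples if s in tag_map]


def split_experiment_set(experiment_set, train_ratio=0.8, seed=0):
    """Shuffle deterministically, then split into training and verification sets."""
    items = list(experiment_set)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]


# Hypothetical samples and tags, for illustration only.
samples = [f"sample {i}" for i in range(10)]
tag_map = {s: f"tag {i % 3}" for i, s in enumerate(samples)}
experiment_set = build_experiment_set(samples, tag_map)
train_set, verify_set = split_experiment_set(experiment_set)
```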

Step 130, importing sample data in the training data set into a preset encoder; and acquiring a plurality of sample sub-splicing data from a hidden layer of a preset encoder based on a preset sample splicing data acquisition position, and splicing the sample sub-splicing data into sample splicing data.

It should be noted that the hidden layer of the preset encoder includes at least a text semantic coding network and a label semantic coding network. The preset sample splicing data acquisition position defines from which hidden layers of the text semantic coding network and the label semantic coding network the hidden-layer input data or output data are extracted as sample sub-splicing data. All the obtained sample sub-splicing data are then spliced into the sample splicing data. For example, the output of the text semantic coding network at step 1 and the output of the label semantic coding network at step 1 are spliced to form the sample splicing data.
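The splicing idea can be sketched with a toy feed-forward encoder. Everything here is hypothetical: the `dense` and `encode_and_splice` helpers, the tanh layers, the fixed weights and the chosen positions stand in for whatever network and acquisition positions a practitioner actually presets.

```python
import math


def dense(vec, weights):
    """One tanh layer: weights is a list of rows, one row per output unit."""
    return [math.tanh(sum(w * x for w, x in zip(row, vec))) for row in weights]


def encode_and_splice(vec, layers, positions):
    """Run vec through the layers and concatenate outputs at the given positions.

    positions plays the role of the preset sample splicing data acquisition
    position: the indices of the hidden layers whose outputs are collected.
    """
    spliced = []
    hidden = vec
    for index, weights in enumerate(layers):
        hidden = dense(hidden, weights)
        if index in positions:
            spliced.extend(hidden)  # one piece of sample sub-splicing data
    return spliced


# Hypothetical 3-layer encoder with fixed toy weights (2 -> 3 -> 2 -> 2 units).
layers = [
    [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]],
    [[0.2, 0.7, -0.1], [0.6, -0.3, 0.5]],
    [[1.0, -1.0], [0.5, 0.5]],
]
sample_splicing_data = encode_and_splice([1.0, 2.0], layers, positions={0, 2})
```

Changing `positions` changes which depths contribute to the spliced vector without touching the network itself, which is the flexibility the text attributes to presetting the acquisition position.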

Step 140, importing semantic label data and preset reference dimension data in the training data set into a preset generator; and acquiring a plurality of label sub-splicing data from the hidden layer of the preset generator based on the preset label splicing data acquisition position, and splicing the label sub-splicing data into label splicing data.

It should be noted that the hidden layer of the preset generator includes at least a text semantic coding network and a label semantic coding network. The preset label splicing data acquisition position defines from which hidden layers of the text semantic coding network and the label semantic coding network the hidden-layer input data or output data are extracted as label sub-splicing data. All the obtained label sub-splicing data are then spliced into the label splicing data. For example, the output of the label semantic generation network at step 1 and the output of the text semantic generation network at step 1 are spliced to form the label splicing data. In addition, the preset reference dimension data may be a null value or supervisory information related to the tag, and its specific content may be determined by those skilled in the art.

In addition, since one sample data may correspond to several semantic tags, the semantic tag data may contain one or more tags. The tags may be separated by any feasible method, for example with the separator "|".
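A minimal sketch of the "|" convention follows; the helper names `split_tags` and `join_tags` and the whitespace trimming are assumptions added for this example.

```python
def split_tags(tag_field, separator="|"):
    """Split a semantic tag field into individual, whitespace-trimmed tags."""
    return [t.strip() for t in tag_field.split(separator) if t.strip()]


def join_tags(tags, separator="|"):
    """Inverse of split_tags: store several tags in one field."""
    return separator.join(tags)


# Hypothetical tag field, for illustration only.
tags = split_tags("finance | contract|  personal data ")
```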

Based on step 130 and step 140, those skilled in the art can understand that, by presetting the sample splicing data acquisition position and the label splicing data acquisition position, sample splicing data and label splicing data spliced from hidden layers of different depths of the relevant structures are obtained. Because the hidden layers corresponding to the acquisition positions are preset, a practitioner can select suitable hidden-layer input data or output data and splice them flexibly according to the network structure characteristics of the components concerned (the preset encoder, the preset generator and the like).

Thus, the acquisition of the sample splicing data and the label splicing data is completed.

Step 150, determining a distance cost value between the sample splicing data and the label splicing data based on a preset inter-distribution distance equation; and importing the distance cost value, the preset learning rate, the preset smoothing constant and the initial discriminator weight value into a preset optimizer, and finishing weight updating of the preset discriminator to obtain the trained preset discriminator.

It should be noted that the preset inter-distribution distance equation is any feasible measurement formula capable of calculating the difference between the two output distributions.

The preset inter-distribution distance equation may specifically be obtained as follows: the Wasserstein-distance formula is adopted as the objective equation, the joint distribution and the marginal distribution correspond to the encoder and the generator, and the samples (X, Y) correspond to the sample splicing data and the label splicing data.

The preset distance cost value is obtained from a formula whose terms are D(), the output result of the preset discriminator, the sample splicing data, and the label splicing data.

In addition, the above method based on the Wasserstein distance may be replaced by any theory capable of calculating the difference between two output distributions, such as the Kullback-Leibler divergence or the Jensen-Shannon divergence.
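For orientation only, and not as the application's own formula, the critic objective used in standard Wasserstein GAN training has the form below. Writing $e^{(i)}$ for a piece of sample splicing data, $g^{(i)}$ for a piece of label splicing data and $m$ for the batch size is a notational assumption of this sketch, and the sign convention in the application may differ.

```latex
L \;=\; \frac{1}{m}\sum_{i=1}^{m} D\!\left(e^{(i)}\right) \;-\; \frac{1}{m}\sum_{i=1}^{m} D\!\left(g^{(i)}\right)
```

Under this reading, the discriminator is trained to make the two averages differ as much as possible, so that $L$ approximates a distance between the two distributions.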

Importing the distance cost value, the preset learning rate, the preset smoothing constant and the initial discriminator weight value into the preset optimizer to complete weight updating of the preset discriminator may specifically be: updating the weight value of the preset discriminator through a preset discriminator weight update formula whose terms are the weight value generated in the updating process, the distance cost value, the preset learning rate, and the preset smoothing constant; and clipping the weight value of the preset discriminator, where c is the clipping threshold.
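The combination of a learning rate, a smoothing constant and a clipping threshold resembles an RMSProp step followed by weight clipping, as used in Wasserstein GAN training. The sketch below assumes that reading; the function names, the default values (`lr=5e-5`, `alpha=0.99`, `c=0.01`) and the toy gradients are illustrative assumptions, not values taken from the application.

```python
import math


def rmsprop_step(weights, grads, avg_sq, lr=5e-5, alpha=0.99, eps=1e-8):
    """One RMSProp update; alpha is the smoothing constant, lr the learning rate."""
    new_weights, new_avg_sq = [], []
    for w, g, v in zip(weights, grads, avg_sq):
        v = alpha * v + (1.0 - alpha) * g * g  # running average of squared gradients
        new_avg_sq.append(v)
        new_weights.append(w - lr * g / (math.sqrt(v) + eps))
    return new_weights, new_avg_sq


def clip_weights(weights, c=0.01):
    """Clamp every weight into [-c, c]."""
    return [max(-c, min(c, w)) for w in weights]


# Toy discriminator weights and gradients of the distance cost value.
weights = [0.5, -0.3, 0.004]
avg_sq = [0.0, 0.0, 0.0]
grads = [1.0, -2.0, 0.0]
weights, avg_sq = rmsprop_step(weights, grads, avg_sq)
weights = clip_weights(weights)
```

Weights that leave the interval after the update are pulled back to its boundary, while weights already inside it are left unchanged.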

Thus, the training of the preset discriminator is completed.

Step 160, importing a preset learning rate, a preset smoothing constant, an initial encoder weight value and a trained preset discriminator into a preset optimizer to complete updating of the preset encoder weight value; importing a preset learning rate, a preset smoothing constant, an initial encoder weight value and a trained preset discriminator into a preset optimizer to finish updating of a preset generator weight value; to obtain a trained preset encoder and preset generator.

Importing the preset learning rate, the preset smoothing constant, the initial encoder weight value and the trained preset discriminator into the preset optimizer to complete updating of the preset encoder weight value may specifically be: updating the weight value of the preset encoder through a preset encoder weight update formula whose terms are the preset encoder weight value generated in the updating process, the sample splicing data, the preset learning rate, and the preset smoothing constant.

Importing the preset learning rate, the preset smoothing constant, the initial encoder weight value and the trained preset discriminator into the preset optimizer to complete updating of the preset generator weight value may specifically be: updating the weight value of the preset generator through a preset generator weight update formula whose terms are the preset generator weight value generated in the updating process, the label splicing data, the preset learning rate, and the preset smoothing constant.

Thus, the training of the preset encoder and the preset generator is completed.

In addition, after the trained preset encoder and preset generator are obtained, the semantic tag data can be modified through a preset semantic tag modification interface. The classification function can thus be adjusted dynamically to a certain degree, meeting more diverse requirements and better supporting user customization.

Step 170, obtaining verification sample splicing data corresponding to sample data in the verification data set based on the trained preset encoder; obtaining verification label splicing data based on the trained preset generator or label splicing data; and matching the verification sample splicing data with the verification label splicing data based on a trained preset discriminator or a preset matching degree calculation formula.

It should be noted that obtaining the verification label splicing data based on the label splicing data means directly using the label splicing data as the verification label splicing data. The preset matching degree calculation formula may be any existing formula capable of calculating the matching degree between the verification sample splicing data and the verification label splicing data.

The verification label splicing data may also be generated online by the generator. When a new label appears and the output data needs to be modified, the model does not need to be retrained, and semantic tag data with a higher degree of customization can be provided online.
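As one possible instance of a preset matching degree calculation formula, the sketch below uses cosine similarity. The choice of cosine similarity, the helper names and the toy vectors are assumptions for this example; the application leaves the formula open.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_label(sample_vec, label_vecs):
    """Return the (label, score) pair whose splicing data best matches the sample."""
    scored = {name: cosine_similarity(sample_vec, vec) for name, vec in label_vecs.items()}
    name = max(scored, key=scored.get)
    return name, scored[name]


# Hypothetical verification label splicing data, keyed by semantic tag.
label_vecs = {
    "finance": [0.9, 0.1, 0.0],
    "personnel": [0.0, 0.8, 0.6],
}
label, score = best_label([0.7, 0.2, 0.1], label_vecs)
```

Adding a new entry to `label_vecs` is enough to make a new label matchable in this sketch, which mirrors the point that new labels need not trigger retraining.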

So far, the technical solutions of the present disclosure have been described in connection with the foregoing embodiments, but it is easily understood by those skilled in the art that the scope of the present disclosure is not limited to only these specific embodiments. The technical solutions in the above embodiments can be split and combined, and equivalent changes or substitutions can be made on related technical features by those skilled in the art without departing from the technical principles of the present disclosure, and any changes, equivalents, improvements, and the like made within the technical concept and/or technical principles of the present disclosure will fall within the protection scope of the present disclosure.

Claims

1. A dynamic data identification method based on natural language is characterized in that the method comprises the following steps:

acquiring a sample set, and determining semantic tag data corresponding to each sample data in the sample set;

generating an experiment set based on the sample data, the semantic tag data and the mapping relation between the sample data and the semantic tag data; splitting the experiment set into a training data set and a verification data set;

importing sample data in a training data set into a preset encoder; acquiring a plurality of sample sub-splicing data from a hidden layer of a preset encoder based on a preset sample splicing data acquisition position, and splicing the sample sub-splicing data into sample splicing data;

importing semantic label data and preset reference dimension data in a training data set into a preset generator; acquiring a plurality of label sub-splicing data from a hidden layer of a preset generator based on a preset label splicing data acquisition position, and splicing the label sub-splicing data into label splicing data;

determining a distance cost value between the sample splicing data and the label splicing data based on a preset inter-distribution distance equation; importing the distance cost value, the preset learning rate, the preset smoothing constant and the initial discriminator weight value into a preset optimizer to complete weight updating of the preset discriminator so as to obtain a trained preset discriminator;

importing the preset learning rate, the preset smoothing constant, the initial encoder weight value and the trained preset discriminator into the preset optimizer to finish updating of the preset encoder weight value; importing the preset learning rate, the preset smoothing constant, the initial encoder weight value and the trained preset discriminator into the preset optimizer to finish updating of the weight value of the preset generator; so as to obtain a trained preset encoder and a preset generator;

obtaining verification sample splicing data corresponding to sample data in a verification data set based on a trained preset encoder; obtaining verification label splicing data based on the trained preset generator or label splicing data; and matching the verification sample splicing data with the verification label splicing data based on a trained preset discriminator or a preset matching degree calculation formula.

2. The dynamic data identification method based on natural language according to claim 1, wherein

the hidden layer of the preset encoder and the hidden layer of the preset generator both comprise a text semantic coding network layer and a label semantic coding network layer.

3. The method according to claim 1, wherein determining semantic tag data corresponding to each sample data in the sample set specifically comprises:

acquiring a semantic tag set through a preset semantic tag interface, wherein the semantic tag set comprises semantic tag data; or,

obtaining semantic tag data corresponding to each sample data through a preset keyword/subject term extraction algorithm;

or analyzing the part of speech of the sample data through a preset sample part of speech analysis algorithm to obtain preset attribute words corresponding to the sample data, and splicing the preset attribute words into semantic tag data;

or when a preset associated data set corresponding to the sample set is obtained, extracting the keywords/subject terms corresponding to the preset associated data set through a keyword/subject term extraction algorithm and a preset sample part-of-speech analysis algorithm to obtain semantic tag data.

4. The dynamic data identification method based on natural language according to claim 1, wherein the obtaining of the sample set specifically includes:

acquiring real service data, substitute open-source service data, or artificial sample data as the sample set through a preset sample uploading process.

5. The dynamic data identification method based on natural language according to claim 1, wherein before determining the distance cost value between the sample splicing data and the label splicing data based on a preset inter-distribution distance equation, the method further comprises:

replacing the joint distribution in the Wasserstein-distance method with the encoder, replacing the marginal distribution with the generator, and replacing the sampling with the sample splicing data and the label splicing data;

obtaining a preset distance cost value calculation formula whose terms are D(), the output result of the preset discriminator, the sample splicing data, and the label splicing data.

6. The dynamic data identification method based on natural language according to claim 1, wherein importing the distance cost value, the preset learning rate, the preset smoothing constant and the initial discriminator weight value into the preset optimizer to complete weight updating of the preset discriminator specifically includes:

updating the weight value of the preset discriminator through a preset discriminator weight update formula whose terms are the weight value generated in the updating process, the distance cost value, the preset learning rate, and the preset smoothing constant;

and clipping the weight value of the preset discriminator, wherein c is the clipping threshold.

7. The dynamic data identification method based on natural language according to claim 1, wherein importing the preset learning rate, the preset smoothing constant, the initial encoder weight value and the trained preset discriminator into the preset optimizer to complete updating of the preset encoder weight value specifically includes:

updating the weight value of the preset encoder through a preset encoder weight update formula whose terms are the preset encoder weight value generated in the updating process, the sample splicing data, the preset learning rate, and the preset smoothing constant.

8. The dynamic data identification method based on natural language according to claim 1, wherein importing the preset learning rate, the preset smoothing constant, the initial encoder weight value and the trained preset discriminator into the preset optimizer to complete updating of the preset generator weight value specifically includes:

updating the weight value of the preset generator through a preset generator weight update formula whose terms are the preset generator weight value generated in the updating process, the label splicing data, the preset learning rate, and the preset smoothing constant.

9. The dynamic data identification method based on natural language according to claim 1, wherein after obtaining the trained preset encoder and preset generator, the method further comprises:

modifying the semantic tag data through a preset semantic tag modification interface.