CN116226638A

CN116226638A - Model training method, data benchmarking method, device and computer storage medium

Info

Publication number: CN116226638A
Application number: CN202310134675.4A
Authority: CN
Inventors: 操涛涛; 王龙; 陈立力
Original assignee: Zhejiang Dahua Technology Co Ltd
Current assignee: Zhejiang Dahua Technology Co Ltd
Priority date: 2023-02-07
Filing date: 2023-02-07
Publication date: 2023-06-06

Abstract

The application provides a data benchmarking-based model training method, a data benchmarking method, a data benchmarking device and a computer storage medium. The model training method comprises the following steps: obtaining a sample to be trained; extracting a first vector feature corresponding to the structured data of the sample to be trained by using a structuring module of the data benchmarking model; extracting a second vector feature corresponding to the text data of the sample to be trained by using a text data module of the data benchmarking model; fusing the first vector feature and the second vector feature by using a feature fusion module of the data benchmarking model to obtain a fusion feature; inputting the fusion feature into a classifier of the data benchmarking model to obtain the prediction category of the sample to be trained; and training the data benchmarking model based on the prediction category and the label category of the sample to be trained. In this way, the data benchmarking device predicts from multiple dimensions of the data through the structuring module and the text data module, which helps ensure the accuracy and comprehensiveness of data benchmarking.

Description

Model training method, data benchmarking method, device and computer storage medium

Technical Field

The present disclosure relates to the field of computer data processing, and in particular to a data benchmarking-based model training method, a data benchmarking method, a data benchmarking device, and a computer storage medium.

Background

With the maturation and development of technologies such as big data and artificial intelligence, big data construction has gradually become the trend and the focus of informatization work. The big data technology can ensure the comprehensiveness and accuracy of data under different application scenes, accelerate the data sharing among different application modules, and promote the informatization level development of the application scenes.

However, some problems in big data construction remain to be solved. For example, data sharing among application modules is still insufficient and business processes do not connect smoothly. Specifically, current application scenarios hold massive data resources, but the tables used to store data differ in structure from one application module to another and the same data item may carry different names, so barriers arise between data and cause the information island phenomenon. The data in different application modules therefore needs to be benchmarked, that is, aligned to a unified standard, so that it can be managed under that standard.

In application scenarios, data is generally benchmarked either manually or automatically by an algorithm. Manual benchmarking relies mainly on the personal experience of the person doing the benchmarking, so it is inefficient and prone to low accuracy. Automatic benchmarking by an algorithm suffers from low accuracy because there are many data element categories.

Disclosure of Invention

The technical problem mainly solved by this application is how to improve the accuracy of data benchmarking while maintaining its efficiency. To this end, this application provides a data benchmarking-based model training method, a data benchmarking method, a data benchmarking device and a computer readable storage medium.

In order to solve the above technical problem, one technical solution adopted by the application is to provide a data benchmarking-based model training method comprising the following steps: obtaining a sample to be trained; extracting a first vector feature corresponding to the structured data of the sample to be trained by using a structuring module of the data benchmarking model; extracting a second vector feature corresponding to the text data of the sample to be trained by using a text data module of the data benchmarking model; fusing the first vector feature and the second vector feature by using a feature fusion module of the data benchmarking model to obtain a fusion feature; inputting the fusion feature into a classifier of the data benchmarking model to obtain the prediction category of the sample to be trained; and training the data benchmarking model based on the prediction category and the label category of the sample to be trained.

Wherein the structured data comprises data content of a sample to be trained; the structuring module comprises a plurality of single-mode models; the structuring module extracts the data content to obtain a first vector feature, comprising: extracting one piece of data in the data content by using each single-mode model to obtain a plurality of data feature vectors; and fusing a plurality of data feature vectors extracted by the plurality of single-mode models to obtain a first vector feature.

Wherein the number of network layers is different and/or the parameters of the network layers are different among the single-mode models.

Wherein the number of single mode models is the same as the number of pieces of data in the data content.

The text data comprises a field name, a field annotation and a field type of a sample to be trained; the text data module comprises a language model; extracting a second vector feature corresponding to text data of a sample to be trained by using a text data module of a data benchmarking model, wherein the method comprises the following steps: respectively sending the field name, the field annotation and the field type into a language model to obtain corresponding name features, annotation features and type features; and fusing the name feature, the annotation feature and the type feature to obtain a second vector feature.

After the field name, the field annotation and the field type are respectively sent into the language model to obtain the corresponding name feature, the annotation feature and the type feature, the method further comprises the following steps: and respectively calculating the name feature, the annotation feature and the type feature by using the full connection layer to obtain corresponding category features, wherein the category features comprise matching and non-matching.

In order to solve the above technical problem, another technical solution adopted by the application is to provide a data benchmarking method comprising: obtaining data to be benchmarked and preprocessing the data to be benchmarked; inputting the preprocessing result into a pre-trained data benchmarking model, wherein the data benchmarking model is trained by the model training method described above; and obtaining the prediction category output by the data benchmarking model.

In order to solve the above technical problem, another technical solution adopted by the application is to provide a data benchmarking device comprising a processor and a memory coupled to the processor, the memory storing program data, and the processor being configured to execute the program data to implement the model training method and/or the data benchmarking method described above.

In order to solve the technical problems, another technical scheme adopted by the application is as follows: there is provided a computer readable storage medium storing program data which, when executed, is adapted to carry out the model training method and/or the data benchmarking method described above.

The beneficial effects of this application are as follows. Unlike the prior art, the method provided by this application is applied to a data benchmarking device, and the data benchmarking device obtains a sample to be trained; extracts a first vector feature corresponding to the structured data of the sample to be trained by using a structuring module of the data benchmarking model; extracts a second vector feature corresponding to the text data of the sample to be trained by using a text data module of the data benchmarking model; fuses the first vector feature and the second vector feature by using a feature fusion module of the data benchmarking model to obtain a fusion feature; inputs the fusion feature into a classifier of the data benchmarking model to obtain the prediction category of the sample to be trained; and trains the data benchmarking model based on the prediction category and the label category of the sample to be trained. Compared with conventional data benchmarking methods, this application extracts multidimensional data features of the original data items in the data benchmarking device, constructs a corresponding model for each, and uses the trained models to benchmark the data sources, which avoids the accuracy problems caused by using data information of only a single dimension. In addition, the data benchmarking model trained in this application uses several different networks, which helps ensure the comprehensiveness and accuracy of data benchmarking.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Wherein:

FIG. 1 is a flow chart of an embodiment of a data benchmarking-based model training method provided herein;

FIG. 2 is a schematic overall flow chart of the data benchmarking-based model training method provided by the present application;

FIG. 3 is a schematic structural diagram of a single-mode model in the structuring module of the data benchmarking-based model training method provided by the present application;

FIG. 4 is a network schematic diagram of an input layer and a hidden layer in the structuring module of the data benchmarking-based model training method provided by the present application;

FIG. 5 is a schematic diagram of a text data module of the data benchmarking-based model training method provided in the present application;

FIG. 6 is a flow chart of an embodiment of a data benchmarking method provided herein;

FIG. 7 is a schematic flow chart of the data benchmarking method provided in the present application applied to a data benchmarking device;

FIG. 8 is a schematic flow chart of a preprocessing operation in the data benchmarking method provided by the present application;

FIG. 9 is a schematic structural diagram of a first embodiment of a data benchmarking device provided herein;

FIG. 10 is a schematic diagram of a second embodiment of a data benchmarking apparatus provided herein;

FIG. 11 is a schematic structural diagram of an embodiment of a computer readable storage medium provided in the present application.

Detailed Description

The following description of the technical solutions in the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship. Further, "a plurality" herein means two or more than two. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.

The data benchmarking device and the data benchmarking model can serve as execution bodies of the data benchmarking method and/or the data benchmarking-based model training method provided by the embodiments of the application. The data benchmarking model can be deployed on the data benchmarking device and is its core part.

The model training method is mainly applied to a data benchmarking device, wherein the data benchmarking device can be a server or a system formed by mutually matching a server and terminal equipment. Accordingly, each part, such as each unit, sub-unit, module and sub-module, included in the data benchmarking device may be all disposed in the server, or may be disposed in the server and the terminal device respectively.

Further, the server may be hardware or software. When the server is hardware, the server may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules, for example, software or software modules for providing a distributed server, or may be implemented as a single software or software module, which is not specifically limited herein. In some possible implementations, the model training method of the embodiments of the present application may be implemented by way of a processor invoking computer readable instructions stored in a memory.

Referring to fig. 1 to fig. 2, fig. 1 is a schematic flow chart of an embodiment of a data benchmarking-based model training method provided in the present application; fig. 2 is an overall flow diagram of a data benchmarking-based model training method provided in the present application.

Step 11: obtaining a sample to be trained.

Specifically, the sample to be trained may include the field name, field annotation, field type and data content of the data, together with the corresponding data element code. The data element code represents the category of the piece of data. As shown in Table 1, when the data element code of a training sample is DE 1, the corresponding field name is gmsfhm, the field annotation is citizen identity number, the field type is string, and the data content includes four masked values, "654101*2118, 653226*1234, 654101*711x, 653223*3838", where * stands for the masked digits. When the data element code of a training sample is DE 2 x, the corresponding field name is xm, the field annotation is name, the field type is string, and the data content includes four masked personal names such as "Zhang *" and "Li *".

TABLE 1
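As a minimal sketch, the sample layout described above could be held in a small Python record. The class name `FieldSample`, its attribute names and all placeholder values are assumptions introduced for illustration; they do not come from the application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FieldSample:
    """One sample to be trained (class and attribute names are hypothetical)."""
    field_name: str          # e.g. a pinyin acronym
    field_annotation: str    # human-readable comment on the field
    field_type: str          # storage type of the field
    data_content: List[str]  # pieces of data stored in the field
    data_element_code: str   # label category of the sample

    def text_data(self) -> List[str]:
        """Text data: field name, field annotation and field type."""
        return [self.field_name, self.field_annotation, self.field_type]

    def structured_data(self) -> List[str]:
        """Structured data: the data content of the field."""
        return list(self.data_content)


sample = FieldSample(
    field_name="xm",
    field_annotation="name",
    field_type="string",
    data_content=["value_1", "value_2", "value_3", "value_4"],  # placeholders
    data_element_code="DE_PLACEHOLDER",  # placeholder label, not a real code
)
```

Splitting the record into `text_data()` and `structured_data()` mirrors the two inputs that the text data module and the structuring module consume in the later steps.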

Step 12: extracting a first vector feature corresponding to the structured data of the sample to be trained by using a structuring module of the data benchmarking model.

Specifically, the structured data comprises the data content of the sample to be trained, as shown in Table 1. The structuring module comprises several single-mode models, which differ from one another in the number of network layers and/or in the parameters of the network layers. The number of single-mode models equals the number of pieces of data in the data content. Each single-mode model takes one piece of data from the data content and performs feature extraction on it.

Specifically, the data benchmarking device selects N groups of samples from the structured data, each group having M categories, and constructs a training sample set $Train = \{a_1, a_2, \ldots, a_N\}$, where $a_i$ is a one-dimensional array $a_i = [x_1^{(n)}, x_2^{(n)}, \ldots, x_i^{(n)}, y^{(n)}]$. Here $i$ is the dimension of each group of data, $n$ indexes the groups of samples, and $y$ is the data element code category of each group of data.

Referring to FIG. 3 and FIG. 4, FIG. 3 is a schematic structural diagram of a single-mode model in the structuring module of the data benchmarking-based model training method provided in the present application, and FIG. 4 is a network schematic diagram of an input layer and a hidden layer in the structuring module of the data benchmarking-based model training method provided in the present application. The single-mode model includes an input layer and a hidden layer.

Specifically, a piece of data enters the hidden layer after being fed through the input layer of the single-mode model. The hidden layer maps the features of the input data into another dimensional space so as to expose more abstract features of the data, which makes the data easier to separate linearly. The single-mode model may include multiple hidden layers so that different types of data can be separated linearly more easily.

Specifically, the single-mode model further comprises a normalization layer, which normalizes the features. Normalization scales the data to numbers in the range of 0 to 1, and the specific processing method is shown in the following formula:

$$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$$

where a layer has a d-dimensional input $x = (x^{(1)}, \ldots, x^{(d)})$, $d$ is the dimension of the input vector, $k$ indexes each dimension of the input vector, $E[x^{(k)}]$ is the mean of each dimension, and $Var[x^{(k)}]$ is the variance of each dimension.
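A per-dimension normalization of this kind could be sketched as follows, assuming the common form that subtracts the mean E[x^(k)] of each dimension and divides by the square root of its variance Var[x^(k)]. The function name, the small constant `eps` (added only to avoid division by zero) and the sample batch are assumptions for illustration.

```python
import numpy as np


def normalize_per_dimension(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each of the d dimensions of a batch of shape (batch, d)."""
    mean = x.mean(axis=0)  # E[x^(k)] for every dimension k
    var = x.var(axis=0)    # Var[x^(k)] for every dimension k
    return (x - mean) / np.sqrt(var + eps)


# Two dimensions on very different scales (illustrative values).
batch = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
normalized = normalize_per_dimension(batch)
```

Under this assumed form, every dimension of `normalized` has zero mean and approximately unit variance, so dimensions that started on different scales become comparable.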

Optionally, after the data is normalized by the normalization layer in the single-mode model, the data may further enter the hidden layer to re-abstract the data features to obtain more representative features, where the number of hidden layers in the single-mode model may be one or several, and is not limited herein.

Specifically, after the data is processed by the front network layer in the single-mode model, the data also enters a feature extraction module in the structuring module to perform feature extraction on the data content, so as to obtain a corresponding data feature vector.

Specifically, the structuring module extracts the data content to obtain a first vector feature, including: extracting one piece of data in the data content by using each single-mode model to obtain a plurality of data feature vectors; and fusing a plurality of data feature vectors extracted by the plurality of single-mode models to obtain a first vector feature.
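The arrangement of one single-mode model per piece of data can be sketched roughly as below. The character-code encoding in `encode_piece`, the ReLU activation, the layer sizes and the placeholder data are all assumptions chosen for illustration; the application does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)
INPUT_DIM = 18  # assumed fixed input length per piece of data


def encode_piece(piece: str) -> np.ndarray:
    """Hypothetical encoding: scaled character codes, padded or truncated."""
    codes = [ord(ch) / 128.0 for ch in piece[:INPUT_DIM]]
    return np.array(codes + [0.0] * (INPUT_DIM - len(codes)))


def make_single_mode_model(layer_sizes):
    """Build one small feed-forward network as a list of (weights, bias)."""
    dims = [INPUT_DIM] + list(layer_sizes)
    return [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]


def run_single_mode_model(model, x):
    """Forward pass: each hidden layer is a linear map followed by ReLU."""
    for weights, bias in model:
        x = np.maximum(x @ weights + bias, 0.0)
    return x


pieces = ["piece_one", "piece_two", "piece_three", "piece_four"]  # placeholders
# One model per piece of data; the models differ in depth and layer width.
models = [make_single_mode_model(sizes)
          for sizes in ([16, 8], [32, 16, 8], [8], [24, 8])]
data_feature_vectors = [run_single_mode_model(m, encode_piece(p))
                        for m, p in zip(models, pieces)]
# Fuse the per-piece vectors into the first vector feature by concatenation.
first_vector_feature = np.concatenate(data_feature_vectors)
```

Each model here ends in an 8-dimensional layer so that the four outputs concatenate into a 32-dimensional first vector feature; the weights are random stand-ins for parameters that would be learned during training.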

Specifically, feature fusion combines several existing feature sets into new fusion features, so that multiple data features can be used together and complement one another, yielding a more robust and accurate extraction result. Fusion retains the most discriminative information from the original feature sets involved and eliminates the redundant information caused by correlation among different feature sets, which makes subsequent decisions possible.

Optionally, the structuring module may use vector splicing to fuse the data feature vectors extracted by the single-mode models. Vector splicing works as follows. Given feature vectors $v_1 \in R^n$ and $v_2 \in R^m$, where $R^n$ and $R^m$ denote vectors of dimension $n$ and $m$ over the real number domain, the two vectors are spliced end to end in order. For example, with $v_1 = [0.1, 0.3, 0.5]$ and $v_2 = [0.2, 0.3]$, the spliced vector is $v = [0.1, 0.3, 0.5, 0.2, 0.3]$. After splicing, a linear mapping $v' = Wv$ is applied, where $v' \in R^n$, $v \in R^{n+m}$, and $W \in R^{n \times (n+m)}$ is a matrix over the real number domain. The purpose of the linear mapping is to bring the two vectors into a unified dimension.
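The splicing example above can be reproduced in a few lines. The values of `W` are random placeholders (in training they would be learned), and its shape n × (n+m) is chosen here so that the product Wv is defined for v in R^(n+m) and yields v' in R^n.

```python
import numpy as np

v1 = np.array([0.1, 0.3, 0.5])   # v1 in R^n, n = 3
v2 = np.array([0.2, 0.3])        # v2 in R^m, m = 2
v = np.concatenate([v1, v2])     # spliced vector in R^(n+m)

n, m = v1.size, v2.size
rng = np.random.default_rng(0)
W = rng.normal(size=(n, n + m))  # illustrative values; learned in practice
v_prime = W @ v                  # mapped vector in R^n
```

Concatenation keeps every component of both inputs, and the linear mapping then reduces the result to a fixed dimension regardless of how long the second vector was.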

Step 13: extracting a second vector feature corresponding to the text data of the sample to be trained by using a text data module of the data benchmarking model.

Specifically, the text data module of the data benchmarking model includes a language model, which may be a BERT model. BERT (Bidirectional Encoder Representations from Transformers) aims to obtain text representations rich in semantic information by training on a large-scale unlabeled corpus. The BERT model takes the original word vector of each character/word in the input text and converts it into a vector representation of that character/word fused with the semantic information of the full text. The text data includes the field name, field annotation and field type of the sample to be trained.

Specifically, the data benchmarking device respectively sends a field name, a field annotation and a field type into the language model to obtain corresponding name features, annotation features and type features; and fusing the name feature, the annotation feature and the type feature to obtain a second vector feature.

Referring to FIG. 5, FIG. 5 is a schematic structural diagram of the text data module of the data benchmarking-based model training method provided in the present application. The data benchmarking device selects n groups of data such as field names, field annotations and field types, for example citizen identity number and name. The number of data element code categories of each group is M, and a training sample set is constructed: $Train = \{a_1, a_2, \ldots, a_n\}$, where $a_n$ is an $L \times E$ matrix over the real number domain, $a_n \in R^{L \times E}$, $L$ is the text length of the training sample, and $E$ is the word vector dimension.

The data benchmarking model feeds the training sample set through the BERT model to obtain the text semantic vector corresponding to each sample:

$$sem = (cls, token_1, token_2, \ldots, token_L, sep)$$

where $cls = (x_1, x_2, \ldots, x_E)$ represents the semantics of the training sample, and $token_i = (x_1, x_2, \ldots, x_E)$ represents the semantics of each word in the training sample.
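The layout of the text semantic vector can be illustrated with a stand-in array. No real language model is called here: the random matrix only mimics the shape of the output, and the values chosen for the text length and the vector dimension are assumptions for illustration.

```python
import numpy as np

L, E = 6, 8                   # assumed text length and word vector dimension
rng = np.random.default_rng(0)
# Stand-in for the language model output: one E-dimensional row per position,
# ordered as (cls, token_1, ..., token_L, sep). Values are random placeholders.
sem = rng.normal(size=(L + 2, E))

cls_vector = sem[0]           # semantics of the whole training sample
token_vectors = sem[1:L + 1]  # semantics of each word in the sample
sep_vector = sem[-1]          # separator position
```

Because the first row summarizes the whole input, it is a natural candidate for the per-field feature that later fusion steps consume, although the application does not state which rows are used.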

Optionally, after obtaining the text semantic vector corresponding to each sample, the data benchmarking model further applies a fully connected layer to the name feature, the annotation feature and the type feature to compute the corresponding category features, where the category features include matching and non-matching, as shown in the following formula:

$$Z_1 = W_1^T x + b_1, \quad Z_2 = W_2^T x + b_2$$

where $x$ is the feature input to the fully connected layer, and $W_1^T$, $W_2^T$, $b_1$ and $b_2$ are all fully connected layer parameters.

$Z_1$ and $Z_2$ respectively indicate whether or not the text semantic vector computed by the BERT model for the original sample data matches the actual semantics of that data, and are used later to confirm the accuracy of the BERT model's computation.
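A fully connected layer producing the two scores could look like the sketch below. The input feature, the parameter values and the convention that the first score stands for "matching" and the second for "non-matching" are assumptions for illustration; the application does not fix them.

```python
import numpy as np

E = 8                              # assumed feature dimension
rng = np.random.default_rng(0)
feature = rng.normal(size=E)       # placeholder name/annotation/type feature

W1, b1 = rng.normal(size=E), 0.0   # parameters producing Z1 (placeholders)
W2, b2 = rng.normal(size=E), 0.0   # parameters producing Z2 (placeholders)

Z1 = W1 @ feature + b1             # score assumed to stand for "matching"
Z2 = W2 @ feature + b2             # score assumed to stand for "non-matching"
category = "matching" if Z1 >= Z2 else "non-matching"
```

Comparing the two scores directly is the simplest decision rule; passing them through a softmax first would give the same decision, since softmax preserves ordering.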

Specifically, after the text semantic vector corresponding to each training sample is obtained, the data benchmarking model fuses the text semantic vectors corresponding to the name feature, the annotation feature and the type feature with the same data element code to obtain a second vector feature. The fusion method is the same as the feature fusion method in the step 12, and will not be described here.

Step 14: fusing the first vector feature and the second vector feature by using a feature fusion module of the data benchmarking model to obtain a fusion feature.

Specifically, the feature fusion module in the data benchmarking device fuses the structural features and semantic features obtained in step 12 and step 13, that is, the first vector features and the second vector features, according to the feature fusion method described in step 12, which is not described herein.

Step 15: inputting the fusion feature into a classifier of the data benchmarking model to obtain the prediction category of the sample to be trained.

Optionally, the data benchmarking device may apply a softmax function to the fusion feature to obtain the probability that the fusion feature belongs to each data element code category:

$$pro_i = \frac{e^{Z_i}}{\sum_{j=1}^{c} e^{Z_j}}$$

where $Z_i$ is the output of the $i$-th unit of the stage preceding the classifier output, $j$ is the category index, $c$ is the total number of categories, and $pro_i$ is the probability of belonging to each data element code category.

The data targeting device processes the fusion feature through a softmax function to obtain the probability of each data element coding category, and the category corresponding to the maximum probability is selected as the data element category to which the fusion feature belongs, namely the prediction category.
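The softmax step and the selection of the most probable category can be sketched as follows. The input scores are made-up values, and subtracting the maximum before exponentiating is a standard numerical-stability measure that does not change the result.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Turn raw scores into probabilities that sum to 1."""
    shifted = z - z.max()  # stability shift; leaves the probabilities unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()


Z = np.array([1.0, 3.0, 0.5])             # illustrative scores for c = 3 categories
pro = softmax(Z)                          # probability of each category
predicted_category = int(np.argmax(pro))  # index of the most probable category
```

Here the second score is the largest, so `predicted_category` is index 1.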

Step 16: training the data benchmarking model based on the prediction category and the label category of the sample to be trained.

Specifically, the data benchmarking device uses the set of samples to be trained obtained in step 11 to train the whole data benchmarking model, so as to adjust the parameters of each network in the data benchmarking model.

Optionally, the data benchmarking device may compute the cross-entropy loss between the prediction category and the label category of the training sample, and back-propagate this loss value to fine-tune the parameters of the data benchmarking model:

$$Loss = -\sum_{i=1}^{S} y_i \log(p_i)$$

where $S$ is the number of categories and $y$ is the category label: $y_i = 1$ if the sample belongs to category $i$, and $y_i = 0$ otherwise. $p_i$ is the output of the softmax layer in step 15, that is, the probability that the category is $i$.

When the computed cross-entropy loss value is greater than a preset threshold, the parameters of the data benchmarking model are considered to need further adjustment. The preset threshold can be set according to the user's requirements and is not limited herein.
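The loss computation and the threshold check can be sketched as below, assuming the usual cross-entropy form with a one-hot label. The probabilities, the threshold value and the small constant `eps` (which guards against taking the logarithm of zero) are assumptions for illustration.

```python
import numpy as np


def cross_entropy(y: np.ndarray, p: np.ndarray, eps: float = 1e-12) -> float:
    """Cross-entropy between a one-hot label y and predicted probabilities p."""
    return float(-np.sum(y * np.log(p + eps)))


p = np.array([0.1, 0.7, 0.2])      # softmax output for S = 3 categories
y = np.array([0.0, 1.0, 0.0])      # one-hot label: the sample is category 1
loss = cross_entropy(y, p)

THRESHOLD = 0.2                    # hypothetical preset threshold
needs_adjustment = loss > THRESHOLD
```

With a one-hot label only the term for the true category survives, so the loss reduces to the negative logarithm of the probability assigned to that category.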

Referring to FIG. 6 and FIG. 7, FIG. 6 is a flow chart of an embodiment of the data benchmarking method provided in the present application, and FIG. 7 is a flow chart of the data benchmarking method provided in the present application applied to a data benchmarking device.

Step 61: obtaining the data to be benchmarked and preprocessing the data to be benchmarked.

Specifically, the data benchmarking device obtains the data to be benchmarked, that is, data not yet labeled with a data element code, and preprocesses it.

The data to be benchmarked contains the text information of a field (such as the field name, field annotation and field type) and the data content of the field.

For example, the text information of the field gmsfhm is (gmsfhm, citizen identity number, string), and its data content is (654101*2118, 653226*1234, 654101*711x, 653223*3838), where * stands for the masked digits.

The purpose of preprocessing is to clean the text information and data content of the data to be benchmarked, so as to improve the accuracy of the data benchmarking model in subsequent benchmarking.

Referring to FIG. 8, FIG. 8 is a schematic flow chart of the preprocessing operation in the data benchmarking method provided in the present application. The data benchmarking device preprocesses the text information and the data content of the data to be benchmarked in sequence.

The field name of a data item typically takes one of several forms. Taking the data item "criminal suspect name" as an example: the pinyin acronym is fzxyrxm, the full pinyin is fanzuixianyirenxingming, the English form is suspect_name, and the mixed pinyin and English form is fzxyr_name.

Specifically, the field name preprocessing includes cleaning, converting, and/or translating the field name.

Field name cleaning mainly consists of deleting or replacing the special symbols and spaces contained in the field name, for example by replacing each special symbol or space with an empty string. The "_" in the English form suspect_name above is replaced with an empty string, giving suspectname.

Field name conversion mainly unifies the case of the letters in the field name, for example by replacing uppercase letters with lowercase letters.

Field name translation mainly translates English words in the field name into Chinese pinyin, for example translating the English form suspect_name of the criminal suspect name into fanzuixianyirenxingming.

Specifically, the field annotation explains the meaning of the data. Its preprocessing steps are similar to those of field name preprocessing and are not repeated here.

Specifically, preprocessing the data content means cleaning dirty data in it, including deleting or replacing the special symbols and spaces that appear in the data content, for example by replacing each special symbol or space with an empty string.
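The cleaning, conversion and translation operations described above can be sketched with the standard `re` module. The lookup table `ENGLISH_TO_PINYIN`, the function names and the choice to translate before cleaning (so that "_" can still serve as a word boundary) are assumptions for illustration, not details from the application.

```python
import re

# Hypothetical lookup table; a real system would need a fuller dictionary.
ENGLISH_TO_PINYIN = {"suspect": "fanzuixianyiren", "name": "xingming"}


def clean_text(text: str) -> str:
    """Cleaning: replace special symbols and spaces with an empty string."""
    return re.sub(r"[^0-9A-Za-z\u4e00-\u9fff]", "", text)


def translate_field_name(name: str) -> str:
    """Translation: map known English words to pinyin, keep the other parts."""
    parts = re.split(r"[^A-Za-z0-9]+", name)
    return "".join(ENGLISH_TO_PINYIN.get(p.lower(), p) for p in parts if p)


def preprocess_field_name(name: str) -> str:
    """Translate, then clean, then convert to lowercase."""
    return clean_text(translate_field_name(name)).lower()


cleaned_name = clean_text("suspect_name")                # cleaning only
full_pinyin = preprocess_field_name("Suspect_Name")      # all three operations
cleaned_content = clean_text(" 12-34 ")                  # data content cleaning
```

The same `clean_text` helper serves both the field name and the data content, since both cleaning steps replace special symbols and spaces with an empty string.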

Step 62: inputting the preprocessing result into a pre-trained data benchmarking model, wherein the data benchmarking model is trained by the model training method.

Specifically, after the data targeting device preprocesses the to-be-targeted data, the structured data and the text data in the data are respectively sent to a corresponding structured module and a text data module in the data targeting model so as to generate a final fusion feature vector.

Step 63: obtaining the prediction category output by the data benchmarking model.

Specifically, the fusion feature vector is classified with a classification function such as a softmax function or an LR (logistic regression) function, and the category with the highest predicted probability is output as the category of the data to be benchmarked.

Unlike the prior art, the method provided by this application is applied to a data benchmarking device, and the data benchmarking device obtains a sample to be trained; extracts a first vector feature corresponding to the structured data of the sample to be trained by using a structuring module of the data benchmarking model; extracts a second vector feature corresponding to the text data of the sample to be trained by using a text data module of the data benchmarking model; fuses the first vector feature and the second vector feature by using a feature fusion module of the data benchmarking model to obtain a fusion feature; inputs the fusion feature into a classifier of the data benchmarking model to obtain the prediction category of the sample to be trained; and trains the data benchmarking model based on the prediction category and the label category of the sample to be trained. Compared with conventional data benchmarking methods, this application extracts multidimensional data features of the original data items in the data benchmarking device, constructs a corresponding model for each, and uses the trained models to benchmark the data sources, which avoids the accuracy problems caused by using data information of only a single dimension. In addition, the data benchmarking model trained in this application uses several different networks, which helps ensure the comprehensiveness and accuracy of data benchmarking.

The method of the foregoing embodiment may be implemented by using a data alignment device, and is described below with reference to fig. 9, where fig. 9 is a schematic structural diagram of a first embodiment of the data alignment device provided in the present application.

As shown in fig. 9, the data benchmarking device 90 in the embodiment of the present application includes an acquisition module 91, a feature extraction module 92, a feature fusion module 93, and a training module 94.

The acquisition module 91 is configured to acquire a sample to be trained.

The feature extraction module 92 is configured to extract a first vector feature corresponding to the structured data of the sample to be trained by using the structuring module of the data benchmarking model, and extract a second vector feature corresponding to the text data of the sample to be trained by using the text data module of the data benchmarking model.

The feature fusion module 93 is configured to fuse the first vector feature and the second vector feature by using the feature fusion module of the data benchmarking model to obtain fusion features.

The training module 94 is configured to input the fusion features into the classifier of the data benchmarking model to obtain a prediction category of the sample to be trained, and to train the data benchmarking model based on the prediction category and the label category of the sample to be trained.

The method of the foregoing embodiment may also be implemented by a data benchmarking device as shown in fig. 10. Fig. 10 is a schematic structural diagram of a second embodiment of the data benchmarking device provided in the present application. The data benchmarking device 100 includes a memory 101 and a processor 102, where the memory 101 is used for storing program data and the processor 102 is used for executing the program data to implement the following method:

obtaining a sample to be trained; extracting a first vector feature corresponding to the structured data of the sample to be trained by using a structuring module of the data benchmarking model; extracting a second vector feature corresponding to the text data of the sample to be trained by using a text data module of the data benchmarking model; fusing the first vector feature and the second vector feature by using the feature fusion module of the data benchmarking model to obtain fusion features; inputting the fusion features into a classifier of the data benchmarking model to obtain the prediction category of the sample to be trained; and training the data benchmarking model based on the prediction category and the label category of the sample to be trained.

Referring to fig. 11, fig. 11 is a schematic structural diagram of an embodiment of a computer readable storage medium provided in the present application. The computer readable storage medium 110 stores program data 111, and the program data 111, when executed by a processor, implements the model training method and/or the data benchmarking method of the foregoing embodiments.

When the embodiments of the present application are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

The foregoing describes only embodiments of the present application and is not intended to limit the scope of the patent. Any equivalent structure or equivalent process derived from the description and contents of the present application, or applied in other related technical fields, likewise falls within the scope of patent protection of the present application.

Claims

1. A model training method based on data benchmarking, characterized by comprising the following steps:

obtaining a sample to be trained;

extracting a first vector feature corresponding to the structured data of the sample to be trained by using a structuring module of a data benchmarking model;

extracting a second vector feature corresponding to the text data of the sample to be trained by using a text data module of the data benchmarking model;

fusing the first vector feature and the second vector feature by using a feature fusion module of the data benchmarking model to obtain fusion features;

inputting the fusion features into a classifier of the data benchmarking model to obtain the prediction category of the sample to be trained;

and training the data benchmarking model based on the prediction category and the label category of the sample to be trained.

2. The model training method according to claim 1, wherein

the structured data comprises the data content of the sample to be trained;

the structuring module comprises a plurality of single-mode models;

extracting the first vector feature from the data content by using the structuring module comprises:

extracting features from one piece of data in the data content with each single-mode model to obtain a plurality of data feature vectors;

and fusing the plurality of data feature vectors extracted by the single-mode models to obtain the first vector feature.

3. The model training method according to claim 2, wherein

the number of network layers and/or the parameters of the network layers are different among the single-mode models.

4. The model training method according to claim 2, wherein

the number of the single-mode models is the same as the number of pieces of data in the data content.

5. The model training method according to claim 1, wherein

the text data comprises a field name, a field annotation and a field type of the sample to be trained;

the text data module comprises a language model;

extracting the second vector feature corresponding to the text data of the sample to be trained by using the text data module of the data benchmarking model comprises:

feeding the field name, the field annotation and the field type into the language model separately to obtain a corresponding name feature, annotation feature and type feature;

and fusing the name feature, the annotation feature and the type feature to obtain the second vector feature.

6. The model training method according to claim 5, wherein

after the field name, the field annotation and the field type are fed into the language model separately to obtain the corresponding name feature, annotation feature and type feature, the method further comprises:

processing the name feature, the annotation feature and the type feature separately with a fully connected layer to obtain corresponding category features, wherein the category features comprise match and non-match.

7. A data benchmarking method, the method comprising:

obtaining data to be benchmarked and preprocessing the data to be benchmarked;

inputting the preprocessing result into a pre-trained data benchmarking model, wherein the data benchmarking model is trained by the model training method according to any one of claims 1 to 6;

and obtaining the prediction category output by the data benchmarking model.

8. The data benchmarking method according to claim 7, wherein

the preprocessing comprises: cleaning, converting and/or translating the data to be benchmarked.

9. A data benchmarking device, wherein the data benchmarking device comprises a memory and a processor coupled to the memory;

wherein the memory is configured to store program data, and the processor is configured to execute the program data to implement the model training method according to any one of claims 1 to 6, and/or the data benchmarking method according to any one of claims 7 to 8.

10. A computer storage medium for storing program data which, when executed by a computer, implements the model training method according to any one of claims 1 to 6 and/or the data benchmarking method according to any one of claims 7 to 8.