CN115841016A - Model training method, device and equipment based on feature selection - Google Patents

Model training method, device and equipment based on feature selection

Info

Publication number
CN115841016A
Authority
CN
China
Prior art keywords
feature
shared
task
determining
weight vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211275449.XA
Other languages
Chinese (zh)
Inventor
陈忠德
吴睿泽
姜聪
李弘慧
董鑫
龙璨
何勇
程磊
莫林剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202211275449.XA
Publication of CN115841016A
Pending legal-status Current

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiments of the present specification disclose a model training method, apparatus, and device based on feature selection. The shared features Z of the M tasks are obtained. For the kth task, a weight vector of the features for that task is determined according to the shared features Z, and a first prediction result of the kth task is determined according to the shared features and the weight vector. For the ith feature, the ith row of the shared features Z is replaced with a preset value to generate modified shared features, and a second prediction result of the kth task is determined according to the modified shared features. A causal effect factor of the ith feature for the kth task is determined according to the difference between the first prediction result and the second prediction result. The difference between the causal effect factor and the weight vector is then determined, and the multi-task model is trained according to a loss value generated from that difference, so that during training each task selectively learns the features that have a causal relationship with it.

Description

Model training method, device and equipment based on feature selection
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a method, an apparatus, and a device for model training based on feature selection.
Background
In a multi-task model, all features are shared among the tasks without distinguishing how useful each feature is for each task. As a result, some tasks are affected by features that are useful for other tasks but not for themselves, which produces negative transfer and degrades the model. Moreover, when the correlation between tasks and features is computed in the conventional way, spurious correlations may appear because of latent confounders between the features and the task targets, which also causes negative transfer.
Based on this, there is a need for a model training scheme that can accurately express causality between features and tasks.
Disclosure of Invention
The embodiments of the present specification provide a feature-selection-based model training method, apparatus, device, and storage medium, which are used to solve the following technical problem: there is a need for a model training scheme that can accurately express the causal relationship between features and tasks.
To solve the above technical problem, one or more embodiments of the present specification are implemented as follows:
in a first aspect, an embodiment of the present specification provides a feature-selection-based model training method, applied to a multi-task model containing N features and M tasks, where N and M are natural numbers greater than 1, and the method includes: obtaining shared features Z of the M tasks, where Z has size N×M and each row corresponds to one feature; for the kth task, determining a weight vector α_k of the features for the task according to the shared features Z, and determining a first prediction result ŷ_k of the kth task according to the shared features Z and the weight vector α_k, where 1 ≤ k ≤ M; for the ith feature, replacing the ith row of the shared features Z with a preset value to generate modified shared features Z^{-i}, and determining a second prediction result ŷ_k^{-i} of the kth task according to the modified shared features Z^{-i}, where 1 ≤ i ≤ N; determining a causal effect factor γ_k^i of the ith feature for the kth task according to the difference between the first prediction result and the second prediction result; and determining the difference between the causal effect factor γ_k^i and the weight vector α_k, and training the multi-task model according to a loss value generated from the difference.
In a second aspect, an embodiment of the present specification provides a feature-selection-based model training apparatus, applied to a multi-task model containing N features and M tasks, where N and M are natural numbers greater than 1, and the apparatus includes: an obtaining module, which obtains shared features Z of the M tasks, where Z has size N×M and each row corresponds to one feature; a first prediction module, which, for the kth task, determines a weight vector α_k of the features for the task according to the shared features Z, and determines a first prediction result ŷ_k of the kth task according to the shared features Z and the weight vector α_k, where 1 ≤ k ≤ M; a second prediction module, which, for the ith feature, replaces the ith row of the shared features Z with a preset value to generate modified shared features Z^{-i}, and determines a second prediction result ŷ_k^{-i} of the kth task according to the modified shared features Z^{-i}, where 1 ≤ i ≤ N; a causal factor module, which determines a causal effect factor γ_k^i of the ith feature for the kth task according to the difference between the first prediction result and the second prediction result; and a fusion training module, which determines the difference between the causal effect factor γ_k^i and the weight vector α_k, and trains the multi-task model according to a loss value generated from the difference.
In a third aspect, embodiments of the present specification provide an electronic device, including:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
In a fourth aspect, embodiments of the present specification provide a non-volatile computer storage medium storing computer-executable instructions that, when read by a computer, cause one or more processors to perform the method of the first aspect.
At least one technical solution adopted by one or more embodiments of the present specification can achieve the following beneficial effects: the shared features Z of the M tasks are obtained; for the kth task, a weight vector of the features for that task is determined according to the shared features Z, and a first prediction result of the kth task is determined according to the shared features and the weight vector; for the ith feature, the ith row of the shared features Z is replaced with a preset value to generate modified shared features, and a second prediction result of the kth task is determined according to the modified shared features; a causal effect factor of the ith feature for the kth task is determined according to the difference between the first prediction result and the second prediction result; and the difference between the causal effect factor and the weight vector is determined, and the multi-task model is trained according to a loss value generated from that difference. In this way, during training each task selectively learns features that have a stable causal relationship with it rather than spuriously correlated features, which improves the accuracy of the model.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments described in the present specification, and other drawings can be obtained from these drawings by those skilled in the art without creative effort.
FIG. 1 is a schematic diagram illustrating the correlation between a feature and a task provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of a method for training a model based on feature selection according to an embodiment of the present disclosure;
FIG. 3 is a block diagram of a model provided in an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a model training apparatus based on feature selection according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device provided in an embodiment of the present specification.
Detailed Description
The embodiment of the specification provides a model training method, a device, equipment and a storage medium based on feature selection.
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any inventive step based on the embodiments of the present disclosure, shall fall within the scope of protection of the present application.
In current multi-task learning, it is often necessary to determine the correlations between features and tasks. However, relying on correlation tends to make the model select features that only appear useful for a task because of some latent common factor between the features and the tasks. As shown in FIG. 1, FIG. 1 is a schematic diagram of the correlation effects between features and tasks: a strong correlation appears to exist between X1 and Y2 because of a possible common latent factor U.
For example, in summer, ice cream sales and the number of drowning deaths are strongly correlated, because the correlation path "ice cream sales ← temperature → drowning deaths" exists in summer, where temperature is the common latent factor. A conventional multi-task model will therefore assign a very large weight to the ice cream sales feature (that is, X1) when predicting the number of drowning deaths (that is, Y2). In winter, however, the relationship between ice cream sales and temperature becomes weak or almost nonexistent, so the ice cream sales feature obviously cannot accurately predict the number of drowning deaths. This manifests as negative transfer in the model's recognition accuracy on task Y2.
Based on the above, the embodiments of the present specification provide a model training scheme based on feature selection, so as to achieve accurate expression of causality between features and tasks, and improve the accuracy of the model.
In a first aspect, as shown in fig. 2, fig. 2 is a schematic flowchart of a model training method based on feature selection provided in an embodiment of this specification, and is applied to a multitask model including N features and M tasks, where N and M are natural numbers greater than 1, including the following steps:
s201, obtaining the sharing characteristics Z of the M tasks.
For the input training sample, a conventional feature extraction method may be adopted to obtain N embedded features for the multi-task model. And converting the obtained N embedded features into feature representations with the same dimension (denoted as dimension D) through dimension transformation, so that the feature representations of the N embedded features can be combined to obtain a combined feature X, wherein the size of X is N X D.
Furthermore, the merged feature X may be processed by using a corresponding activation function, so as to obtain the shared features Z of the M tasks, where the size of Z is N × M, and each row corresponds to one feature.
For example, X may be processed in such a way that Shared characteristic Z = relu (WsX), where Ws is a Shared parameter in a multitasking model, which may be any multitasking model, for example, a network skeleton of Shared-Bottom is adopted. relu is a commonly used activation function, under which values less than 0 are reset to 0, while values greater than 0 are unchanged.
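For illustration only, the following NumPy sketch shows one way the merged feature X could be mapped to the shared features Z under the stated shapes. The sizes (N, D, M), the shape of W_s, and the orientation of the matrix product are assumptions chosen so that Z comes out N×M; they are not prescribed by this embodiment.

```python
import numpy as np

def relu(x):
    # reset values below 0 to 0, keep values above 0 unchanged
    return np.maximum(x, 0.0)

N, D, M = 8, 16, 3            # assumed sizes: 8 features, embedding dimension 16, 3 tasks
X = np.random.randn(N, D)     # merged feature X, one row per embedded feature
W_s = np.random.randn(D, M)   # shared parameter Ws (shape assumed so that Z is N x M)

Z = relu(X @ W_s)             # shared features Z of the M tasks, size N x M
print(Z.shape)                # (8, 3)
```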
S203, for the kth task, determining a weight vector α_k of the shared features Z for that task, and determining a first prediction result ŷ_k of the kth task according to the shared features Z and the weight vector α_k, where 1 ≤ k ≤ M.
In this process, the input of every task is the same shared features Z, and the weight vector α_k of the features for the task is learned; α_k is a vector of length N.
For example, α_k = softmax(relu(tanh(G_k Z))), where G_k is a learnable weight parameter of length N, and α_k characterizes the degree of importance of the N input features for each task; the softmax activation maps the value of each element of α_k into the interval (0, 1). In this process, the model may also obtain the weight vector α_k that characterizes the importance of the features for each task through learning with a self-attention mechanism, SENET, or MSSM.
Further, the first prediction result ŷ_k of the kth task may be determined according to the shared features Z and the weight vector α_k. Specifically, after the shared features and the weight vector have been obtained, the shared features Z and the weight vector α_k are multiplied to generate the output features A_k, i.e. A_k = α_k · Z.
The output features A_k obtained after this weighting obviously better reflect the importance of each feature to the task. A_k can therefore be used as the input of the kth task, and a preset activation function is applied to the output features A_k to generate the first prediction result ŷ_k; this computation uses the network parameters contained in the kth task tower.
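Continuing the sketch above (all names, shapes, and initializations remain illustrative assumptions), the weight vector α_k and the first prediction result ŷ_k might be computed roughly as follows. Because the exact shape of G_k and the exact form of the task tower are not fully specified here, the sketch uses a length-M gating vector and a single linear layer followed by sigmoid(relu(·)) as stand-ins.

```python
def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# gating parameter for task k; a length-M projection is assumed here so that the
# score relu(tanh(Z @ g_k)) has one entry per feature (the text calls this parameter G_k)
g_k = np.random.randn(M)
alpha_k = softmax(relu(np.tanh(Z @ g_k)))   # weight vector alpha_k, length N, values in (0, 1)

theta_k = np.random.randn(N * M)            # parameters of the kth task tower (assumed shape)

def predict(Z_input, alpha, theta):
    # weight the shared features row by row (one reading of A_k = alpha_k . Z),
    # then apply the tower with the preset activation sigmoid(relu(.))
    A = alpha[:, None] * Z_input
    return sigmoid(relu(theta @ A.ravel()))

y_hat_k = predict(Z, alpha_k, theta_k)      # first prediction result of the kth task
```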
S205, for the ith feature, replacing the ith row of the shared features Z with a preset value to generate the modified shared features Z^{-i}, and determining a second prediction result ŷ_k^{-i} of the kth task according to the modified shared features Z^{-i}.
Here, to avoid the situation that may arise in fig. 1 (i.e. there may be latent common factors between features and tasks), the connection between the feature and the task needs to be cut. Specifically, for the ith feature, the ith row of the shared features Z is replaced with a preset value.
The preset value may be 0, or the values in the ith row of the shared features Z may be synchronously scaled down (for example, the ith row is multiplied by a positive number much smaller than 1, such as 0.01), so that, relative to the other features, the scaled-down ith row contributes almost nothing to the prediction. That is, after being replaced with the preset value, the ith feature effectively does not participate in the computation of the prediction result.
The second prediction result ŷ_k^{-i} of the kth task is then determined according to the modified shared features Z^{-i}. In this process, the same activation function should be used for computing the second prediction result as for the first prediction result, for example sigmoid(relu(·)) in both cases.
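Continuing the same sketch, masking the ith feature and recomputing the prediction could look like this; the choice of 0 as the preset value and 0.01 as the scaling factor follow the examples in the text, while everything else is illustrative.

```python
i = 2                                   # index of the feature being masked (illustrative)
Z_minus_i = Z.copy()
Z_minus_i[i, :] = 0.0                   # replace the ith row of Z with the preset value 0
# alternative from the text: scale the row down instead, e.g. Z_minus_i[i, :] = 0.01 * Z[i, :]

y_hat_k_minus_i = predict(Z_minus_i, alpha_k, theta_k)   # second prediction, without feature i
```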
S207, determining the causal effect factor γ_k^i of the ith feature for the kth task according to the difference between the first prediction result and the second prediction result.
First, the difference between the first prediction result and the second prediction result can be computed: d_k^i = ŷ_k − ŷ_k^{-i}.
The first prediction result is computed with the ith feature included among the features, and the second prediction result is computed with the ith feature excluded, so the difference d_k^i characterizes the influence of the ith feature on the kth task.
Further, for the kth task, after the differences corresponding to the other features have been computed in the same way, the causal effect factor γ_k^i of the ith feature for the kth task can be computed jointly over the N features, where a preset temperature parameter τ controls the resolution of the causal factors as a whole. Obviously, the larger γ_k^i is, the stronger the causal relationship of the ith feature to the kth task, and vice versa; that is, γ_k^i is positively correlated with the causal relationship of the ith feature to the kth task.
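One plausible realization of the causal effect factors, continuing the sketch, is to collect the prediction differences of all N features and normalize them with a temperature-scaled softmax; whether the raw or absolute difference is used, the exact normalization, and the value of τ are assumptions here.

```python
tau = 0.5                               # preset temperature parameter (illustrative value)

diffs = np.empty(N)
for j in range(N):
    Z_minus_j = Z.copy()
    Z_minus_j[j, :] = 0.0               # cut the link between feature j and the task
    diffs[j] = y_hat_k - predict(Z_minus_j, alpha_k, theta_k)

gamma_k = softmax(diffs / tau)          # causal effect factors of the N features for task k
```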
S209, determining the difference between the causal effect factor γ_k^i and the weight vector α_k, and training the multi-task model according to a loss value generated from the difference.
As mentioned above, the weight vector α_k characterizes, in the form of a vector, the degree of importance of each feature contained in the shared features Z for each task. The difference between the causal effect factor γ_k^i and the weight vector α_k can therefore be determined.
For example, the distance between the causal effect factor γ_k^i and the weight vector α_k is computed and taken as the difference. The distance computation here may use various distance values such as KL divergence, JS divergence, Euclidean distance, or cosine similarity.
Further, the distances between the causal effect factors γ_k^i and the weight vector α_k are computed separately for each task, yielding N×M distance values; the N×M distance values are added up as a feature impact loss value L_causal, and the multi-task model is trained according to the feature impact loss value. That is, L_causal = Σ_{k=1}^{M} D(γ_k, α_k), where D denotes the chosen distance computation and γ_k is the vector obtained by combining the N causal effect factors γ_k^i; like α_k, it has length N, and its elements correspond to the features one by one.
Further, the multi-task model may be trained based on the loss values. The specific training includes training the aforementioned G_k and training the parameters contained in each task tower.
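Continuing the sketch, the feature impact loss value L_causal could be assembled as follows, using KL divergence as the distance D (one of the options listed above); summing the per-element KL contributions over the N features of each of the M tasks corresponds to adding the N×M distance values described in the text.

```python
def kl_divergence(p, q, eps=1e-12):
    # per-element contributions p_i * log(p_i / q_i), summed over the N features of one task;
    # summed again over the M tasks this adds up the N x M distance values described above
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

# in practice one (gamma_k, alpha_k) pair is computed per task; placeholders are reused here
gammas = [gamma_k for _ in range(M)]
alphas = [alpha_k for _ in range(M)]

L_causal = sum(kl_divergence(g, a) for g, a in zip(gammas, alphas))   # feature impact loss value
```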
In one embodiment, training the multi-task model according to the feature impact loss value includes: obtaining the sample label y_k, and determining a sample prediction loss value L_pred according to the difference between the first prediction result ŷ_k and the sample label y_k; and fusing the sample prediction loss value L_pred and the feature impact loss value to train the multi-task model.
That is, the overall sample prediction loss value of the multi-task model can first be computed as L_pred = Σ_{k=1}^{M} ω_k L_k(ŷ_k, y_k), where L_k is the loss function of the kth task and ω_k is the loss-value weight of the kth task.
The sample prediction loss value L_pred and the feature impact loss value are then fused to train the multi-task model, i.e. L_total = L_pred + λ_1 L_causal, where λ_1 is a hyper-parameter.
In another embodiment, a regular loss value of the weight vector may further be determined, and the multi-task model is trained by fusing the feature impact loss value and the regular loss value.
That is, a regular loss value of the weight vector is introduced, for example a first regular loss value L_reg determined using the first norm of the weight vector α_k, so that L_total = L_pred + λ_1 L_causal + λ_2 L_reg. Here L_reg characterizes the first norm of the weight vector α_k, λ_2 is likewise a hyper-parameter, and λ_1 and λ_2 are used to control the strengths of the feature impact loss value and the regularization term.
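Putting the pieces together, the total training objective could be sketched as follows; the per-task loss L_k (binary cross-entropy here), the loss weights ω_k, the labels, and the values of λ_1 and λ_2 are all illustrative assumptions.

```python
def bce(y_true, y_pred, eps=1e-12):
    # binary cross-entropy as an example per-task loss L_k (the text leaves L_k open)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(-(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)))

omega = np.ones(M) / M                        # per-task loss weights omega_k (illustrative)
labels = [1.0 for _ in range(M)]              # sample labels y_k (placeholder values)
preds = [y_hat_k for _ in range(M)]           # per-task first predictions (placeholders)

L_pred = sum(omega[k] * bce(labels[k], preds[k]) for k in range(M))
L_reg = sum(np.abs(a).sum() for a in alphas)  # first norm of the weight vectors alpha_k
lambda_1, lambda_2 = 1.0, 0.1                 # hyper-parameters (illustrative values)

L_total = L_pred + lambda_1 * L_causal + lambda_2 * L_reg
```

In an actual implementation, L_total would be backpropagated to update the parameters G_k, the shared parameter W_s, and the parameters of each task tower, as described in S209.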
As shown in fig. 3, fig. 3 is a schematic diagram of a model provided in an embodiment of the present specification. The figure exemplarily presents a multi-task learning model with 2 tasks (i.e. task A and task B). Each task computes its corresponding weight vector and causal effect factor: α_A and γ_A for task A, and α_B and γ_B for task B. D(α_A, γ_A) is then computed for task A and D(α_B, γ_B) for task B, and the loss values are further computed to train Tower A and Tower B.
The shared features Z of the M tasks are obtained; for the kth task, a weight vector of the shared features Z for that task is determined, and a first prediction result of the kth task is determined according to the shared features and the weight vector; for the ith feature, the ith row of the shared features Z is replaced with a preset value to generate modified shared features, and a second prediction result of the kth task is determined according to the modified shared features; a causal effect factor of the ith feature for the kth task is determined according to the difference between the first prediction result and the second prediction result; and the difference between the causal effect factor and the weight vector is determined, and the multi-task model is trained according to a loss value generated from that difference. In this way, during training each task selectively learns the more stable, causally related features rather than spuriously correlated features, which improves the accuracy of the model.
Based on the same idea, one or more embodiments of the present specification further provide an apparatus and a device corresponding to the above method.
in a second aspect, as shown in fig. 4, fig. 4 is a schematic structural diagram of a model training apparatus based on feature selection according to an embodiment of the present disclosure, which is applied to a multitask model including N features and M tasks, where N and M are natural numbers greater than 1, and the apparatus includes:
an obtaining module 401, configured to obtain shared features Z of the M tasks, where Z has size N×M and each row corresponds to one feature;
a first prediction module 403, which, for the kth task, determines a weight vector α_k of the features for the task according to the shared features Z, and determines a first prediction result ŷ_k of the kth task according to the shared features Z and the weight vector α_k, where 1 ≤ k ≤ M;
a second prediction module 405, which, for the ith feature, replaces the ith row of the shared features Z with a preset value to generate modified shared features Z^{-i}, and determines a second prediction result ŷ_k^{-i} of the kth task according to the modified shared features Z^{-i}, where 1 ≤ i ≤ N;
a causal factor module 407, which determines a causal effect factor γ_k^i of the ith feature for the kth task according to the difference between the first prediction result and the second prediction result;
a fusion training module 409, which determines the difference between the causal effect factor γ_k^i and the weight vector α_k, and trains the multi-task model according to a loss value generated from the difference.
Optionally, the first prediction module 403 multiplies the shared features Z and the weight vector α_k to generate output features A_k, and applies a preset activation function to the output features A_k to generate the first prediction result ŷ_k. Correspondingly, the second prediction module 405 determines the second prediction result ŷ_k^{-i} of the kth task according to the modified shared features Z^{-i} by: applying the preset activation function to the modified shared features Z^{-i} to generate the second prediction result ŷ_k^{-i}.
Optionally, the causal factor module 407 computes the distance between the causal effect factor γ_k^i and the weight vector α_k and determines the distance as the difference.
Optionally, the fusion training module 409 separately computes the distances between the causal effect factors γ_k^i and the weight vector α_k of each task to obtain N×M distance values, adds the N×M distance values as a feature impact loss value L_causal, and trains the multi-task model according to the feature impact loss value.
Optionally, the fusion training module 409 obtains a sample label y_k, determines a sample prediction loss value L_pred according to the difference between the first prediction result ŷ_k and the sample label y_k, and trains the multi-task model by fusing the sample prediction loss value L_pred and the feature impact loss value.
Optionally, the fusion training module 409 determines a regular loss value of the weight vector, and trains the multitask model by fusing the feature impact loss value and the regular loss value.
Optionally, the second prediction module replaces the ith row in the shared feature Z with 0; or synchronously reducing the numerical value in the ith row in the shared characteristic Z.
Optionally, in the apparatus, the distance includes KL divergence, JS divergence, euclidean distance, or cosine similarity.
In a third aspect, as shown in fig. 5, fig. 5 is a schematic structural diagram of an electronic device provided in an embodiment of the present specification, where the electronic device includes:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
In a fourth aspect, based on the same idea, the present specification further provides a non-volatile computer storage medium corresponding to the method described above, storing computer-executable instructions that, when read by a computer, cause one or more processors to execute the method according to the first aspect.
In the 1990s, it was clear whether an improvement to a technology was an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). However, as technology has developed, many of today's improvements to method flows can already be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized by a hardware entity module. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer "integrates" a digital system onto a single PLD by programming it, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually making integrated circuit chips, this programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development, and the original code to be compiled must also be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can easily be obtained simply by lightly programming the method flow into an integrated circuit using one of the hardware description languages above.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for implementing various functions may also be regarded as structures within the hardware component. Or even the means for implementing various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, devices, modules or units described in the above embodiments may be implemented by a computer chip or an entity, or by an article of manufacture with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the various elements may be implemented in the same one or more software and/or hardware implementations of the present description.
As will be appreciated by one skilled in the art, the embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the embodiments described herein may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The description has been presented with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the description. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are all described in a progressive manner, and the same and similar parts in the embodiments are referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the apparatus, device, non-volatile computer storage medium embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and in relation to the description, reference may be made to some of the description of the method embodiments.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The above description is merely one or more embodiments of the present disclosure and is not intended to limit the present disclosure. Various modifications and alterations to one or more embodiments of the present description will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of one or more embodiments of the present specification should be included in the scope of the claims of the present specification.

Claims (10)

1. A model training method based on feature selection, applied to a multi-task model containing N features and M tasks, wherein N and M are natural numbers greater than 1, the method comprising:
obtaining shared features Z of the M tasks, wherein Z has size N×M and each row corresponds to one feature;
for the kth task, determining a weight vector α_k of the features for the task according to the shared features Z, and determining a first prediction result ŷ_k of the kth task according to the shared features Z and the weight vector α_k, wherein 1 ≤ k ≤ M;
for the ith feature, replacing the ith row of the shared features Z with a preset value to generate modified shared features Z^{-i}, and determining a second prediction result ŷ_k^{-i} of the kth task according to the modified shared features Z^{-i}, wherein 1 ≤ i ≤ N;
determining a causal effect factor γ_k^i of the ith feature for the kth task according to the difference between the first prediction result and the second prediction result; and
determining the difference between the causal effect factor γ_k^i and the weight vector α_k, and training the multi-task model according to a loss value generated from the difference.
2. The method of claim 1, wherein determining the first prediction result ŷ_k of the kth task according to the shared features Z and the weight vector α_k comprises:
multiplying the shared features Z and the weight vector α_k to generate output features A_k, and activating the output features A_k using a preset activation function to generate the first prediction result ŷ_k;
correspondingly, determining the second prediction result ŷ_k^{-i} of the kth task according to the modified shared features Z^{-i} comprises: activating the modified shared features Z^{-i} using the preset activation function to generate the second prediction result ŷ_k^{-i}.
3. The method of claim 1, wherein determining the difference between the causal effect factor γ_k^i and the weight vector α_k comprises:
calculating the distance between the causal effect factor γ_k^i and the weight vector α_k, and determining the distance as the difference.
4. The method of claim 3, wherein training the multi-task model according to a loss value generated from the difference comprises:
separately calculating the distances between the causal effect factors γ_k^i and the weight vector α_k of each task to obtain N×M distance values; and
adding the N×M distance values as a feature impact loss value L_causal, and training the multi-task model according to the feature impact loss value.
5. The method of claim 4, wherein training the multi-task model according to the feature impact loss value comprises:
obtaining a sample label y_k, and determining a sample prediction loss value L_pred according to the difference between the first prediction result ŷ_k and the sample label y_k; and
fusing the sample prediction loss value L_pred and the feature impact loss value to train the multi-task model.
6. The method of claim 4, wherein training the multi-task model according to the feature impact loss value comprises:
determining a regular loss value of the weight vector, and training the multi-task model by fusing the feature impact loss value and the regular loss value.
7. The method of claim 1, wherein replacing the ith row of the shared features Z with a preset value comprises:
replacing the ith row of the shared features Z with 0; or,
synchronously scaling down the values in the ith row of the shared features Z.
8. The method of claim 3, wherein the distance comprises KL divergence, JS divergence, Euclidean distance, or cosine similarity.
9. A model training apparatus based on feature selection, applied to a multi-task model containing N features and M tasks, wherein N and M are natural numbers greater than 1, the apparatus comprising:
an obtaining module, configured to obtain shared features Z of the M tasks, wherein Z has size N×M and each row corresponds to one feature;
a first prediction module, which, for the kth task, determines a weight vector α_k of the features for the task according to the shared features Z, and determines a first prediction result ŷ_k of the kth task according to the shared features Z and the weight vector α_k, wherein 1 ≤ k ≤ M;
a second prediction module, which, for the ith feature, replaces the ith row of the shared features Z with a preset value to generate modified shared features Z^{-i}, and determines a second prediction result ŷ_k^{-i} of the kth task according to the modified shared features Z^{-i}, wherein 1 ≤ i ≤ N;
a causal factor module, which determines a causal effect factor γ_k^i of the ith feature for the kth task according to the difference between the first prediction result and the second prediction result; and
a fusion training module, which determines the difference between the causal effect factor γ_k^i and the weight vector α_k, and trains the multi-task model according to a loss value generated from the difference.
10. An electronic device, comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 8.
CN202211275449.XA 2022-10-14 2022-10-14 Model training method, device and equipment based on feature selection Pending CN115841016A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211275449.XA CN115841016A (en) 2022-10-14 2022-10-14 Model training method, device and equipment based on feature selection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211275449.XA CN115841016A (en) 2022-10-14 2022-10-14 Model training method, device and equipment based on feature selection

Publications (1)

Publication Number Publication Date
CN115841016A true CN115841016A (en) 2023-03-24

Family

ID=85576396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211275449.XA Pending CN115841016A (en) 2022-10-14 2022-10-14 Model training method, device and equipment based on feature selection

Country Status (1)

Country Link
CN (1) CN115841016A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116958748A (en) * 2023-07-28 2023-10-27 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Image detection method, device, equipment and medium for multitasking causal learning
CN116958748B (en) * 2023-07-28 2024-02-13 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Image detection method, device, equipment and medium for multitasking causal learning

Similar Documents

Publication Publication Date Title
CN108537568B (en) Information recommendation method and device
CN110378400B (en) Model training method and device for image recognition
CN112308113A (en) Target identification method, device and medium based on semi-supervision
CN116304720B (en) Cost model training method and device, storage medium and electronic equipment
CN112101526A (en) Knowledge distillation-based model training method and device
CN111507726B (en) Message generation method, device and equipment
CN115841016A (en) Model training method, device and equipment based on feature selection
CN114332873A (en) Training method and device for recognition model
CN111652286A (en) Object identification method, device and medium based on graph embedding
WO2020005599A1 (en) Trend prediction based on neural network
CN115496162A (en) Model training method, device and equipment
CN108460453B (en) Data processing method, device and system for CTC training
CN110991496B (en) Model training method and device
CN113344590A (en) Method and device for model training and complaint rate estimation
CN113205377A (en) Information recommendation method and device
CN112417275A (en) Information providing method, device storage medium and electronic equipment
CN117576522B (en) Model training method and device based on mimicry structure dynamic defense
CN116434787B (en) Voice emotion recognition method and device, storage medium and electronic equipment
CN116501852B (en) Controllable dialogue model training method and device, storage medium and electronic equipment
CN117350351B (en) Training method of user response prediction system, user response prediction method and device
CN115017915B (en) Model training and task execution method and device
CN116415103B (en) Data processing method, device, storage medium and electronic equipment
CN116991388B (en) Graph optimization sequence generation method and device of deep learning compiler
CN115964633A (en) Model training method, device and equipment based on data augmentation
CN117933343A (en) Method and device for graph data processing and model training

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination