CN111126515B - Model training method based on artificial intelligence and related device - Google Patents

Model training method based on artificial intelligence and related device

Info

Publication number
CN111126515B
CN111126515B
Authority
CN
China
Prior art keywords
loss
network model
training image
target object
determining
Prior art date
Legal status
Active
Application number
CN202010237383.XA
Other languages
Chinese (zh)
Other versions
CN111126515A (en)
Inventor
Song Yibing (宋奕兵)
Liu Wei (刘威)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010237383.XA priority Critical patent/CN111126515B/en
Publication of CN111126515A publication Critical patent/CN111126515A/en
Application granted granted Critical
Publication of CN111126515B publication Critical patent/CN111126515B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Abstract

The embodiments of the present application disclose an artificial-intelligence-based model training method and a related apparatus. A positive label and a negative label are constructed for a training image; partial derivatives of the first loss and the second loss obtained from the positive and negative labels are taken with respect to the input, yielding endogenous response maps that reflect the target object and the background content respectively; the distance between the target object and the background content is then measured with an L2-norm loss.

Description

Model training method based on artificial intelligence and related device
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a model training method and related apparatus based on artificial intelligence.
Background
Target tracking is a typical application scenario for neural network models: with a neural network model, a target can be recognized in video frames and tracked through a video.
One target tracking scenario involves targets that are not predefined, such as tracking suspicious or dangerous targets in security applications, or tracking candidate advertisement spots in advertisement placement.
In such a scenario, the neural network model cannot know all the information about a target before tracking it, so it is difficult to accurately distinguish the tracked target from background interference in the video during tracking; in particular, when an interfering object has features similar to the target's, serious tracking drift results.
Disclosure of Invention
In order to solve the technical problem, the application provides a model training method and a related device based on artificial intelligence, which can improve the resolution capability of a network model to an object to be recognized and background content in an image.
The embodiment of the application discloses the following technical scheme:
in one aspect, an embodiment of the present application provides an artificial intelligence based model training method, where the method is performed by a processing device, and the method includes:
training a network model according to a target image for identifying a target object and a training image including the target object to obtain a recognition result corresponding to the training image, wherein the recognition result is used for identifying a predicted position of the target object in the training image, the training image is provided with a corresponding positive label and a corresponding negative label, the positive label is used for identifying an actual position of the target object in the training image, and the negative label is used for identifying an actual position of background content in the training image;
determining a first loss of the recognition result relative to the positive label and a second loss relative to the negative label;
determining a background disturbance loss according to the first loss and the second loss, wherein the background disturbance loss is used for identifying the influence of the background content on the identification of the target object;
and updating the parameters of the network model according to the background disturbance loss.
On the other hand, the embodiment of the present application provides an artificial intelligence based model training apparatus, the apparatus includes a training unit, a determining unit, and an updating unit:
the training unit is used for training a network model according to a target image for identifying a target object and a training image including the target object to obtain a recognition result corresponding to the training image, wherein the recognition result is used for identifying a predicted position of the target object in the training image, the training image is provided with a corresponding positive label and a corresponding negative label, the positive label is used for identifying an actual position of the target object in the training image, and the negative label is used for identifying an actual position of background content in the training image;
the determining unit is used for determining a first loss of the recognition result relative to the positive label and a second loss relative to the negative label;
the determining unit is further configured to determine a background perturbation loss according to the first loss and the second loss, where the background perturbation loss is used to identify an influence of the background content on identifying the target object;
and the updating unit is used for updating the parameters of the network model according to the background disturbance loss.
In another aspect, an embodiment of the present application provides an artificial intelligence based model training apparatus, where the apparatus includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of the above aspect according to instructions in the program code.
In another aspect, the present application provides a computer-readable storage medium for storing a computer program for executing the method of the above aspect.
According to the technical scheme, the target image and the training image are adopted for training in the training process of the network model for target tracking, the target image marks the target object to be tracked, and the training image contains the target object, so that the network model can obtain the recognition result corresponding to the training image, and the recognition result reflects the predicted position of the target object in the training image. The training image is provided with a positive label identifying the actual position of the target object in the training image and a negative label identifying the actual position of the background content in the training image, and a first loss relative to the positive label and a second loss relative to the negative label can be determined according to the recognition result. And determining background disturbance loss according to the first loss and the second loss to identify the influence of background content on the identification of the target object in the training image, and updating the parameters of the network model by using the background disturbance loss, so that when the network model identifies the object in the image, the negative influence of the background content in the image on the identification is reduced, the resolution capability of the network model on the object to be identified and the background content in the image is enhanced, and the performance of the network model is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic view of an application scenario of a model training method based on artificial intelligence according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a model training method based on artificial intelligence according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an application scenario of another artificial intelligence based model training method according to an embodiment of the present application;
fig. 4 is a schematic diagram illustrating a method for determining contribution information according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an artificial intelligence-based model training apparatus according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
In order to improve the performance of a network model, the embodiment of the application provides a model training method based on artificial intelligence and a related device.
The model training method provided by the embodiment of the application is realized based on Artificial Intelligence (AI), which is a theory, method, technology and application system for simulating, extending and expanding human Intelligence by using a digital computer or a machine controlled by the digital computer, sensing the environment, acquiring knowledge and obtaining the best result by using the knowledge. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
In the embodiment of the present application, the artificial intelligence software technology mainly involved includes the above-mentioned computer vision technology, machine learning/deep learning, and the like.
For example, Image Processing, Image Semantic Understanding (ISU), Video Processing, Video Semantic Understanding (VSU), Face Recognition, and the like in Computer Vision may be involved.
For example, Deep Learning (DL) in Machine Learning (ML) may be involved, including various types of Artificial Neural Networks (ANN).
In order to facilitate understanding of the technical scheme of the present application, the model training method based on artificial intelligence provided by the embodiment of the present application is introduced below with reference to an actual application scenario.
The model training method based on artificial intelligence can be applied to model training equipment with data processing capacity, such as terminal equipment and servers. The terminal device may be a smart phone, a computer, a Personal Digital Assistant (PDA), a tablet computer, or the like; the server may specifically be an independent server, or may also be a cluster server.
The data processing device can have the capability of implementing computer vision technology. Computer vision is the science of studying how to make machines "see"; specifically, it replaces human eyes for recognizing, tracking, and measuring targets, and performs further graphic processing so that the processed result becomes an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, and simultaneous localization and mapping, and also include common biometric technologies such as face recognition and fingerprint recognition.
In an embodiment of the present application, the data processing device may identify, detect, and track different objects in different video frames of a video through computer vision techniques.
The data processing device can have ML capability. ML is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines.
The embodiment of the application provides a model training method based on artificial intelligence, which mainly relates to the application of various artificial neural networks.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a model training method based on artificial intelligence according to an embodiment of the present application. The application scenario includes a server 101, which deploys a pre-constructed neural network model; after training, the neural network model is used in information processing fields such as images and videos. The neural network model may be an adversarial neural network model, a twin (Siamese) network model, or the like. The subsequent embodiments of the present application mainly take the twin network model as an example.
The server 101 may train a pre-constructed network model according to a target image identifying a target object and a training image including the target object, and obtain a recognition result corresponding to the training image. The network model takes the target image and the training image as input, and takes the recognition result as output. The training image is provided with a corresponding positive label and a corresponding negative label, the positive label is used for identifying the actual position of the target object in the training image, and the negative label is used for identifying the actual position of the background content in the training image. It will be appreciated that in the training image, regions outside the target object are considered as background content. As in fig. 1, the positive and negative labels correspond to S1 and S2 in the figure, respectively. And the recognition result is that the network model processes the training image according to the target object identified in the target image to obtain the predicted position of the target object in the training image. In practical applications, the recognition result may be represented in different forms such as a response graph, a probability graph, and the like, which is not limited herein.
As shown in FIG. 1, a target image x and a training image z are input into the twin network and transformed by the mappings φ1 and φ2. A correlation operation is then applied to φ2(φ1(x)) and φ2(φ1(z)) to obtain the recognition result y, i.e., the predicted position of the target object in the training image z. In fig. 1, the recognition result y is represented by a three-dimensional predicted response map: response points of different heights in the predicted response map represent the degree of similarity between the corresponding position in the training image and the target object. For example, the peak of the response map marks the point in the training image most likely to be the target object.
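The correlation step above can be sketched with a minimal NumPy illustration. This is not the patent's actual network: the embeddings φ1 and φ2 are omitted for brevity, and `cross_correlate` is a hypothetical helper that slides the template over the search image.

```python
import numpy as np

def cross_correlate(template, search):
    """Slide the embedded target template over the embedded training
    image and record a similarity score at each offset, producing a
    response map analogous to the recognition result y."""
    th, tw = template.shape
    sh, sw = search.shape
    out = np.zeros((sh - th + 1, sw - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(template * search[i:i + th, j:j + tw])
    return out

template = np.ones((2, 2))      # stand-in for the embedded target image x
search = np.zeros((5, 5))       # stand-in for the embedded training image z
search[1:3, 2:4] = 1.0          # region containing the "target object"
response = cross_correlate(template, search)
peak = np.unravel_index(np.argmax(response), response.shape)  # predicted position
```

The peak of `response` lands at the offset where the template best overlaps the target region, mirroring how the peak of the predicted response map marks the most likely target position.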
Because the target object to be tracked is identified in the target image, the network model can detect the training image by taking the target image as a tracking template, and acquire the recognition result including the predicted position of the target object in the training image.
The server 101 can determine a first loss of the recognition result with respect to the positive tag S1 and a second loss with respect to the negative tag S2, respectively, according to the recognition result y. Wherein the first loss represents a difference between a predicted position of the recognition result for the target object in the training image and an actual position of the target object in the training image; the second loss represents a difference between the predicted position of the recognition result for the background content in the training image and the actual position of the background content in the training image.
It is understood that in deep learning, a loss function, i.e., a cost function, can be used to represent the difference between the predicted value and the actual value of the network model. Therefore, the first loss function and the second loss function may be set in advance in the model training process. The server 101 calculates a first loss and a second loss of the identification result with respect to the positive and negative tags, respectively, based on the first loss function and the second loss function.
Based on the above, the server 101 may determine a background disturbance loss according to the first loss and the second loss, where the background disturbance loss is used to represent an influence of background content in the training image on the target object in the process of identifying the target object by using the network model. For example, when the similarity between the target object in the training image and the background content is large, the influence degree of the background content on the target object is large, the difficulty of identifying the target object by using the network model on the training image is large, and the corresponding background disturbance loss is large; when the similarity between the target object in the training image and the background content is small, the influence degree of the background content on the target object is small, the difficulty of identifying the target object by utilizing the network model on the training image is small, and the corresponding background disturbance loss is small.
The server 101 may update the parameters in the network model according to the obtained background disturbance loss. In the training process of the network model, on the basis of the original loss of the network model, the background disturbance loss is increased, so that the negative influence of the background content in the training image on the identification of the target object is reduced by maximizing the background disturbance loss, the resolution capability of the network model on the target object and the background content in the training image is enhanced, and the network performance is improved.
The model training method based on artificial intelligence provided by the embodiment of the application is introduced below.
Referring to fig. 2, fig. 2 is a schematic flowchart of a model training method based on artificial intelligence according to an embodiment of the present application. For convenience of description, the server is taken as an execution subject, and the artificial intelligence based model training method provided by the embodiment of the application is described below. In the method shown in fig. 2, the following steps are included:
s201, training a network model according to a target image for identifying a target object and a training image comprising the target object to obtain a recognition result corresponding to the training image, wherein the training image has a corresponding positive label and a corresponding negative label.
The server is provided with a pre-constructed network model, whose structure can be set according to the actual application requirements. For example, the network model may be set as a twin network structure for tracking targets in video, or as a generative adversarial network structure for denoising video frames, and so on. The embodiments of the present application mainly take a twin network structure model as an example.
For better understanding, referring to fig. 3, fig. 3 is a schematic view of an application scenario of another artificial intelligence based model training method provided in an embodiment of the present application. As shown in fig. 3, the server may use a target image x identifying a target object and a training image z including the target object as two inputs of the twin network model, and obtain a recognition result y corresponding to the training image z after the target object included in the training image z is verified by the network model. The target object may be any object included in the training image z, such as a pet, a car, a person, and the like. The training image z has corresponding positive label S1 and negative label S2, where the positive label S1 is used to identify the actual location of the target object in the training image z and the negative label S2 is used to identify the actual location of the background content in the training image z. The recognition result y is used to identify the predicted position of the target object in the training image z.
It will be appreciated that the target image x and the training image z may or may not be of different sizes. In the model training process, the sizes of the target image x and the training image z can be set according to actual conditions. As shown in fig. 3, the size of the target image x is 127 × 127, the number of channels is 3, that is, the target image x is an RGB image of 127 × 3, the size of the training image is 255 × 255, and the number of channels is 3, that is, the training image z is an RGB image of 255 × 3.
As shown in fig. 3, when the recognition result is in the form of a predicted response map, the positive label S1 and the negative label S2 may also be represented in the form of response maps. The different response points in the positive label S1 identify how likely the corresponding location in the training image is to be the target object. For example, the peaks in the positive label S1 identify the highest likelihood that the corresponding point in the training image z is the target object, while the valleys identify the lowest likelihood that the corresponding point in the training image z is the target object. Likewise, the different response points in the negative label S2 identify how likely the corresponding location in the training image is to be background content. For example, the peaks in the negative label S2 identify the highest likelihood that the corresponding points in the training image z are background content, while the valleys identify the lowest likelihood that the corresponding points in the training image z are background content.
In constructing the positive label S1 and the negative label S2 of the training image z, the positive label S1 is the actual position of the target object in the training image z, i.e., the ground truth of the target object in the training image z, and the negative label S2 is the opposite of the ground truth. Here, ground truth refers to the correct classification answer used for supervised training. For example, a point whose ground-truth value is 0 in the positive label S1 has the value 1 at the corresponding point in the negative label S2, and a point whose ground-truth value is 1 has the value 0 in the negative label S2. In practical applications, the positive and negative label pair can be constructed by manual labeling.
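The complementary label pair can be illustrated with a small sketch. The helper `make_label_pair` is hypothetical and assumes a rectangular ground-truth region; as noted above, real labels are produced by manual annotation.

```python
import numpy as np

def make_label_pair(shape, target_box):
    """Positive label S1: 1 inside the target's ground-truth box, 0 elsewhere.
    Negative label S2: the exact opposite, marking the background content."""
    s1 = np.zeros(shape)
    r0, r1, c0, c1 = target_box   # (row_start, row_end, col_start, col_end)
    s1[r0:r1, c0:c1] = 1.0
    s2 = 1.0 - s1                 # every 1 in S1 becomes 0 in S2, and vice versa
    return s1, s2

s1, s2 = make_label_pair((4, 4), (1, 3, 1, 3))
```

At every position the two labels sum to one, reflecting the rule that any point is either target object (S1 = 1) or background content (S2 = 1).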
The twin network model may adopt different network structures, for example, Deeper and Wider Siamese Networks (SiamDW), which extract features from the target image x and the training image z layer by layer and obtain the recognition result y corresponding to the training image z after the correlation operation. The network structure used for feature extraction may be set according to experience and practical application requirements, and is not limited here.
Different expression forms can be provided for the recognition result y corresponding to the training image z, such as a response graph, a probability graph, and the like. In fig. 3, the recognition result y is expressed in the form of a predicted response map. The points at different positions in the prediction response graph represent the probability of the point at the position corresponding to the training image being the target object. For example, the point of the predicted response map where the peak indicates the position corresponding to the training image is most likely to be the target object, and the point of the predicted response map where the valley indicates the position corresponding to the training image is least likely to be the target object.
The target image input into the twin network model already identifies the target object to be tracked, so the twin network model takes the target object in the target image as a target template, detects the training image, and obtains the recognition result comprising the predicted position of the target object in the training image through correlation calculation.
S202, determining a first loss of the recognition result relative to the positive label and a second loss relative to the negative label.
The server can determine a first loss with respect to the positive tag S1 and a second loss with respect to the negative tag S2 based on the recognition result y. Wherein the first loss identifies an error between a predicted position and an actual position of the target object in the training image z, i.e. an error between the recognition result y and the positive label; the second loss identifies the error between the predicted and actual position of the background content in the training image z, i.e. the error of the recognition result y and the negative label.
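The patent does not fix a particular form for these two losses. Assuming a per-position binary cross-entropy (consistent with the cross-entropy classification loss the description mentions), the first and second losses might be computed as follows; the response values and label maps here are illustrative.

```python
import numpy as np

def bce_loss(pred, label, eps=1e-7):
    """Per-position binary cross-entropy between a response map whose
    values lie in (0, 1) and a {0, 1} label map, averaged over positions."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(label * np.log(pred) + (1 - label) * np.log(1 - pred)))

y = np.array([[0.9, 0.2], [0.1, 0.8]])    # predicted response map
s1 = np.array([[1.0, 0.0], [0.0, 1.0]])   # positive label: target positions
s2 = 1.0 - s1                             # negative label: background positions

first_loss = bce_loss(y, s1)    # error of the recognition result vs. the positive label
second_loss = bce_loss(y, s2)   # error of the recognition result vs. the negative label
```

Because the sample response map agrees well with S1, the first loss comes out small and the second loss large, matching the intuition that a good prediction fits the positive label and contradicts the negative one.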
And S203, determining background disturbance loss according to the first loss and the second loss.
It can be understood that, in practical applications, when the network model identifies the target object in the training image, since the related information of the target object in the training image cannot be predicted, especially when the similarity between the target object in the training image and the background content is high, the network model is difficult to distinguish the target object from the background content, that is, the background content will affect the network model to accurately identify the target object to a certain extent, so that the difference between the predicted position and the actual position of the target object identified by the identification result y output by the network model is large.
Based on S202, the first loss identifies the error of the recognition result y relative to the positive label S1, i.e., the difference exhibited in recognizing the target object in the training image, and the second loss identifies the error of the recognition result y relative to the negative label S2, i.e., the difference exhibited in recognizing the background content in the training image. The server can therefore determine a background disturbance loss from the first loss and the second loss, which identifies the influence of the background content on recognizing the target object. That is, when the influence of the background content in the training image z on the target object is large, the background disturbance loss is large and the network model has great difficulty distinguishing the target object from the background content; when that influence is small, the background disturbance loss is small and the distinction is easy.
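A minimal sketch of this quantity, under the assumption (suggested by the abstract) that the background disturbance loss is derived from the L2-norm distance between the endogenous response maps attributed to the target object and to the background content. The two maps below are illustrative stand-ins, not outputs of a real network.

```python
import numpy as np

def background_disturbance_loss(target_map, background_map):
    """L2-norm distance between the endogenous response map of the
    target object and that of the background content; a small distance
    means the model finds the two hard to tell apart."""
    return float(np.linalg.norm(target_map - background_map))

# Nearly identical maps: target and background look alike to the model.
similar = background_disturbance_loss(np.ones((3, 3)), np.ones((3, 3)) * 0.9)
# Clearly separated maps: target and background are easy to distinguish.
distinct = background_disturbance_loss(np.ones((3, 3)), np.zeros((3, 3)))
```

Training can then push this distance apart so that the target object and the background content become maximally separable.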
And S204, updating the parameters of the network model according to the background disturbance loss.
Because the background disturbance loss reflects the influence degree of the background content on the position of the target object in the network model prediction training image, the background disturbance loss can be used as an influencing factor in the training process of the network model to participate in the parameter updating process of the network model so as to maximize the difference between the target object and the background content, and thus, the resolution capability of the network model on the target object and the background content is improved.
The embodiment of the application provides a feasible implementation manner for updating the parameters: the server can determine a comprehensive loss according to the background disturbance loss and the original loss (a classification loss), and then update the parameters of the network model through the comprehensive loss. The original loss can be determined by, for example, a cross-entropy loss.

It can be understood that the server determines the comprehensive loss by performing a weighted summation of the background disturbance loss and the original loss; the weight corresponding to the background disturbance loss and the weight corresponding to the original loss may be the same or different, and in practical applications the weights can be set according to the specific problem.

As shown in fig. 3, after the server calculates the background disturbance loss, the background disturbance loss and the original loss calculated during the forward propagation of the network model are directly summed to obtain the comprehensive loss, and the parameters of the network model are then updated using the comprehensive loss.
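The weighted summation described above can be sketched as follows; equal weights reproduce the direct summation of fig. 3, and the particular weight values are assumptions to be tuned per problem.

```python
def comprehensive_loss(background_disturbance_loss, original_loss,
                       w_bg=1.0, w_orig=1.0):
    # Weighted sum of the background disturbance loss and the original
    # (classification) loss; the two weights may be equal or different.
    return w_bg * background_disturbance_loss + w_orig * original_loss

# Direct summation as in fig. 3 (both weights 1.0):
total = comprehensive_loss(0.25, 0.75)               # -> 1.0
# Emphasising the background disturbance term instead (assumed weighting):
weighted = comprehensive_loss(0.25, 0.75, w_bg=2.0)  # -> 1.25
```

The scalar returned here would then be back-propagated through the network model to update its parameters, exactly as with any single-term training loss.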
In the model training method based on artificial intelligence provided by this embodiment, a positive label and a negative label are constructed, the first loss and the second loss obtained from them are back-propagated (by taking partial derivatives) to obtain endogenous response maps respectively reflecting the target object and the background content, and the distance between the two maps is then measured using an L2 norm loss.

Therefore, in the model training process, the background disturbance loss can be determined according to the first loss and the second loss so as to identify the influence of the background content on recognizing the target object in the training image, and the parameters of the network model are updated using the background disturbance loss. In this way, when the network model recognizes an object in an image, the negative influence of the background content on the recognition is reduced, the network model's ability to distinguish the object to be recognized from the background content is enhanced, and the performance of the network model is improved.
In view of how to determine the background disturbance loss according to the first loss and the second loss, the embodiment of the present application provides a possible implementation manner, which may determine first contribution information according to the first loss, determine second contribution information according to the second loss, and further determine the background disturbance loss according to difference information between the first contribution information and the second contribution information.
The first contribution information reflects the magnitude of the contribution of each feature point in the training image to the position prediction when predicting the position of the target object; the second contribution information reflects the magnitude of the contribution of each feature point in the training image to the position prediction when predicting the position of the background content. That is, compared with the first loss, the first contribution information can show more intuitively, at the level of individual feature points, which feature points give rise to the difference (represented by the first loss) caused by the network model when recognizing the target object in the training image. Likewise, compared with the second loss, the second contribution information can show more intuitively which feature points give rise to the difference (represented by the second loss) caused by the network model when recognizing the background content in the training image.
The first contribution information and the second contribution information may take the form of endogenous response maps (e.g., the endogenous response maps Ak⁺ and Ak⁻ in fig. 4, derived from the first loss and the second loss respectively). In the endogenous response map Ak⁺ shown in fig. 4, the feature points corresponding to the darker regions contribute more to predicting the position of the target object, and those in the lighter regions contribute less. The same applies to the endogenous response map Ak⁻.
As previously mentioned, the first contribution information may reflect the attention of the network model when recognizing the target object, and the second contribution information may reflect the attention of the network model when recognizing the background content. From the difference in attention between the two recognition processes, the influence of the background content in the training image on recognizing the target object can be determined. In this way, the aforementioned background disturbance loss can be determined from the first contribution information and the second contribution information.
Based on the first contribution information and the second contribution information, in a possible implementation manner, an embodiment of the present application provides a manner of determining a background disturbance loss, that is, a characteristic distance between the first contribution information and the second contribution information is calculated, and the characteristic distance is used as difference information, so as to determine the background disturbance loss.
The characteristic distance between the first contribution information and the second contribution information may be calculated in different manners to obtain the background disturbance loss. For example, in fig. 3, the Euclidean distance between the endogenous response maps Ak⁺ and Ak⁻ may be calculated and the difference between the two measured using an L2 norm loss, this difference being the quantity to be maximized; alternatively, the background disturbance loss may be determined using other measures such as an L1 norm or a matrix norm.
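Assuming the two endogenous response maps are available as arrays, the L2-norm variant described above can be sketched as:

```python
import numpy as np

def background_disturbance(a_pos, a_neg):
    # Characteristic (Euclidean / L2) distance between the endogenous
    # response map derived from the first loss (a_pos) and the one derived
    # from the second loss (a_neg); this distance is the difference
    # information from which the background disturbance loss is determined.
    return np.linalg.norm(a_pos - a_neg)

# Toy 2x2 maps: attention concentrated on the object vs. on the background.
a_pos = np.array([[1.0, 0.0],
                  [0.0, 0.0]])
a_neg = np.array([[0.0, 0.0],
                  [0.0, 1.0]])
d = background_disturbance(a_pos, a_neg)   # sqrt(2) for these toy maps
```

Since the patent only states that the difference is to be maximized, one plausible convention (an assumption here) is to add the negative of this distance to the comprehensive loss, so that gradient descent pushes the two attention maps apart.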
In view of the above-mentioned manner of determining the first contribution information according to the first loss and determining the second contribution information according to the second loss, in a possible implementation, the first contribution information may be determined by back-propagating the first loss in the network model through an output of the intermediate layer, and the second contribution information may be determined by back-propagating the second loss in the network model through an output of the intermediate layer.
For easy understanding, referring to fig. 4, fig. 4 is a schematic diagram illustrating a method for determining contribution information according to an embodiment of the present application.
As shown in fig. 4, the server may perform back propagation in the network model through the first loss and the second loss, for example, by taking partial derivatives of the first loss and the second loss with respect to the input data of the network model (e.g., the feature maps corresponding to the target image x and the training image z), and determine the first contribution information and the second contribution information respectively from the back-propagated output at an intermediate layer, e.g., the k-th layer, of the network model. The first contribution information identifies the degree of contribution of the feature points in the training image z to predicting the position of the target object; the second contribution information identifies the degree of contribution of the feature points in the training image z to predicting the position of the background content.
In the process of determining the first contribution information, a predicted response map output by the twin network is first extracted, the first loss relative to the positive label S1 is determined according to the predicted response map, the partial derivative of the first loss with respect to the input data of the network model is taken, the partial derivative at the k-th layer of the network model is obtained, and this partial derivative is used as the endogenous response map Ak⁺, which is the first contribution information.

Similarly, in the process of determining the second contribution information, a predicted response map output by the twin network is first extracted, the second loss relative to the negative label S2 is determined according to the predicted response map, the partial derivative of the second loss with respect to the input data of the network model is taken, the partial derivative at the same k-th layer of the network model is obtained, and this partial derivative is used as the endogenous response map Ak⁻, which is the second contribution information.
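The back-propagation to the k-th layer can be illustrated with a toy linear model in which the endogenous response map is the partial derivative of the loss with respect to the layer-k output. The two-stage linear architecture and the squared-error loss here are assumptions for illustration only; a real twin network would obtain the same quantity via automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)         # input features (flattened training image z)
W1 = rng.normal(size=(3, 4))   # layers up to and including the k-th layer
W2 = rng.normal(size=(2, 3))   # layers after the k-th layer

h = W1 @ x                     # output of the k-th (intermediate) layer
y = W2 @ h                     # prediction (stand-in for the response map)
s1 = np.array([1.0, 0.0])      # toy positive label S1

# First loss: squared error of the prediction against the positive label.
first_loss = np.sum((y - s1) ** 2)

# Back-propagation by the chain rule: d(first_loss)/dh = 2 (y - s1) W2.
# This gradient at the k-th layer plays the role of the endogenous
# response map derived from the first loss (Ak+ above).
endogenous_response = 2.0 * (y - s1) @ W2
```

Repeating the same computation with the second loss (against the negative label S2) would yield the second endogenous response map at the same k-th layer.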
The intermediate layer can be any one of the intermediate layers of the network model, and the value of k can be preset and adjusted for different recognition and training scenarios to meet different model training requirements.
For the model training method based on artificial intelligence provided by this embodiment, the server can train the network model by executing any one of the manners provided by the embodiments of the present application, and the trained network model can be applied in various fields. For example, in intelligent video surveillance, suspicious objects in the video frames of a security surveillance video are identified and tracked using the trained network model; in robot visual navigation, i.e., in an intelligent robot, the trained network model can be used to track and photograph an object, calculate the motion track of the object, and so on. The above examples are merely illustrative; in practical applications, a suitable network model structure may be designed for different problems, and the above artificial intelligence-based model training method may then be executed to satisfy different requirements.
Aiming at the above-described artificial intelligence-based model training method, the embodiment of the application also provides a corresponding artificial intelligence-based model training device.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an artificial intelligence based model training apparatus according to an embodiment of the present application. As shown in fig. 5, the model training apparatus 500 includes a training unit 501, a determining unit 502, and an updating unit 503:
the training unit 501 is configured to train a network model according to a target image for identifying a target object and a training image including the target object, to obtain a recognition result corresponding to the training image, where the recognition result is used to identify a predicted position of the target object in the training image, the training image has a corresponding positive label and a corresponding negative label, the positive label is used to identify an actual position of the target object in the training image, and the negative label is used to identify an actual position of background content in the training image;
the determining unit 502 is configured to determine a first loss of the identification result relative to the positive tag and a second loss relative to the negative tag;
the determining unit 502 is further configured to determine a background perturbation loss according to the first loss and the second loss, where the background perturbation loss is used to identify an influence of the background content on identifying the target object;
the updating unit 503 is configured to update the parameters of the network model according to the background disturbance loss.
Wherein the determining unit 502 is further configured to:
determining first contribution information based on the first loss and second contribution information based on the second loss; the first contribution information is used for identifying the degree of contribution of the characteristic point in the training image to the prediction of the position of the target object, and the second contribution information is used for identifying the degree of contribution of the characteristic point in the training image to the prediction of the position of the background content;
determining the background disturbance loss according to difference information between the first contribution information and the second contribution information.
Wherein the determining unit 502 is further configured to:
determining the first contribution information through an output of an intermediate layer by back-propagating the first loss through the network model;
determining the second contribution information through an output of an intermediate layer by back-propagating the second loss through the network model.
Wherein the determining unit 502 is further configured to:
calculating a characteristic distance between the first contribution information and the second contribution information;
and determining the background disturbance loss by using the characteristic distance as the difference information.
Wherein the updating unit 503 is further configured to:
determining the comprehensive loss according to the background disturbance loss and the original loss;
and updating the parameters of the network model through comprehensive loss.
Wherein the network model is a twin network model.
The device further comprises a tracking unit, wherein the tracking unit is used for tracking and identifying the object of the video frame in the video to be processed through the trained network model.
In the artificial intelligence-based model training apparatus provided in the above embodiment, in the training process for the network model for target tracking, the target image and the training image are used for training, and since the target image identifies the target object to be tracked and the training image includes the target object, the network model may obtain the recognition result corresponding to the training image, and the recognition result represents the predicted position of the target object in the training image. The training image is provided with a positive label identifying the actual position of the target object in the training image and a negative label identifying the actual position of the background content in the training image, and a first loss relative to the positive label and a second loss relative to the negative label can be determined according to the recognition result. And determining background disturbance loss according to the first loss and the second loss to identify the influence of background content on the identification of the target object in the training image, and updating the parameters of the network model by using the background disturbance loss, so that when the network model identifies the object in the image, the negative influence of the background content in the image on the identification is reduced, the resolution capability of the network model on the object to be identified and the background content in the image is enhanced, and the performance of the network model is improved.
The embodiment of the application also provides a server and a terminal device for model training based on artificial intelligence, and the server and the terminal device for model training based on artificial intelligence provided by the embodiment of the application are introduced from the perspective of hardware materialization.
Referring to fig. 6, fig. 6 is a schematic diagram of a server 1400 provided by an embodiment of the present application, where the server 1400 may have a relatively large difference due to different configurations or performances, and may include one or more Central Processing Units (CPUs) 1422 (e.g., one or more processors) and a memory 1432, one or more storage media 1430 (e.g., one or more mass storage devices) for storing applications 1442 or data 1444. Memory 1432 and storage media 1430, among other things, may be transient or persistent storage. The program stored on storage medium 1430 may include one or more modules (not shown), each of which may include a sequence of instructions operating on a server. Still further, a central processor 1422 may be disposed in communication with storage medium 1430 for executing a series of instruction operations on storage medium 1430 on server 1400.
The server 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input-output interfaces 1458, and/or one or more operating systems 1441, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, and the like.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 6.
The CPU 1422 is configured to perform the following steps:
training a network model according to a target image for identifying a target object and a training image including the target object to obtain a recognition result corresponding to the training image, wherein the recognition result is used for identifying a predicted position of the target object in the training image, the training image is provided with a corresponding positive label and a corresponding negative label, the positive label is used for identifying an actual position of the target object in the training image, and the negative label is used for identifying an actual position of background content in the training image;
determining a first loss of the identification result relative to the positive tag and a second loss relative to the negative tag;
determining a background disturbance loss according to the first loss and the second loss, wherein the background disturbance loss is used for identifying the influence of the background content on the identification of the target object;
and updating the parameters of the network model according to the background disturbance loss.
Optionally, the CPU 1422 may further execute the method steps of any specific implementation manner of the model training method based on artificial intelligence in the embodiment of the present application.
Aiming at the model training method described above, the embodiment of the present application further provides a terminal device for model training based on artificial intelligence, so that the model training method based on artificial intelligence is implemented and applied in practice.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present application. For convenience of explanation, only the parts related to the embodiments of the present application are shown; details of the specific technology are not disclosed. The terminal device may be any terminal device, including a tablet computer, a Personal Digital Assistant (PDA), and the like:
fig. 7 is a block diagram illustrating a partial structure related to a terminal provided in an embodiment of the present application. Referring to fig. 7, the terminal includes: Radio Frequency (RF) circuit 1510, memory 1520, input unit 1530, display unit 1540, sensor 1550, audio circuit 1560, wireless fidelity (WiFi) module 1570, processor 1580, and power supply 1590. Those skilled in the art will appreciate that the terminal structure shown in fig. 7 does not constitute a limitation of the terminal, which may include more or fewer components than those shown, or combine some components, or arrange the components differently.

The following describes each component of the terminal in detail with reference to fig. 7:
the memory 1520 may be used to store software programs and modules, and the processor 1580 implements various functional applications of the terminal and data processing by operating the software programs and modules stored in the memory 1520. The memory 1520 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 1520 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The processor 1580 is the control center of the terminal; it connects the various parts of the entire terminal using various interfaces and lines, and performs the various functions of the terminal and processes data by running or executing the software programs and/or modules stored in the memory 1520 and calling the data stored in the memory 1520, thereby monitoring the terminal as a whole. Optionally, the processor 1580 may include one or more processing units; preferably, the processor 1580 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communication. It can be appreciated that the modem processor may also not be integrated into the processor 1580.
In the embodiment of the present application, the terminal includes a memory 1520 that can store the program code and transmit the program code to the processor.
The processor 1580 included in the terminal can execute the method for model training based on artificial intelligence provided by the above embodiments according to the instructions in the program code.
The embodiment of the present application further provides a computer-readable storage medium for storing a computer program, where the computer program is used to execute the artificial intelligence based model training method provided in the foregoing embodiment.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium may be at least one of the following media: various media that can store program codes, such as read-only memory (ROM), RAM, magnetic disk, or optical disk.
It should be noted that, in the present specification, all the embodiments are described in a progressive manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus and system embodiments, since they are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described embodiments of the apparatus and system are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. A method for artificial intelligence based model training, the method being performed by a processing device, the method comprising:
training a network model according to a target image for identifying a target object and a training image including the target object to obtain a recognition result corresponding to the training image, wherein the recognition result is used for identifying a predicted position of the target object in the training image, the training image is provided with a corresponding positive label and a corresponding negative label, the positive label is used for identifying an actual position of the target object in the training image, and the negative label is used for identifying an actual position of background content in the training image;
determining a first loss of the identification result relative to the positive tag and a second loss relative to the negative tag;
determining first contribution information based on the first loss and second contribution information based on the second loss; the first contribution information is used for identifying the degree of contribution of the characteristic point in the training image to the prediction of the position of the target object, and the second contribution information is used for identifying the degree of contribution of the characteristic point in the training image to the prediction of the position of the background content;
determining background disturbance loss according to difference information between the first contribution information and the second contribution information, wherein the background disturbance loss is used for identifying the influence of the background content on the identification of the target object;
and updating the parameters of the network model according to the background disturbance loss.
2. The method of claim 1, wherein determining first contribution information based on the first loss comprises:
determining the first contribution information through an output of an intermediate layer by back-propagating the first loss through the network model;
the determining second contribution information according to the second loss includes:
determining the second contribution information through an output of an intermediate layer by back-propagating the second loss through the network model.
3. The method of claim 1, wherein determining the background perturbation loss according to the difference information between the first contribution information and the second contribution information comprises:
calculating a characteristic distance between the first contribution information and the second contribution information;
and determining the background disturbance loss by using the characteristic distance as the difference information.
4. The method of claim 1, wherein the updating the parameters of the network model according to the background perturbation loss comprises:
determining the comprehensive loss according to the background disturbance loss and the original loss;
and updating the parameters of the network model through comprehensive loss.
5. The method of claim 1, wherein the network model is a twin network model.
6. The method of claim 1, further comprising:
and tracking and identifying the object of the video frame in the video to be processed through the trained network model.
7. An artificial intelligence based model training device, characterized in that the device comprises a training unit, a determining unit and an updating unit:
the training unit is used for training a network model according to a target image for identifying a target object and a training image including the target object to obtain a recognition result corresponding to the training image, wherein the recognition result is used for identifying a predicted position of the target object in the training image, the training image is provided with a corresponding positive label and a corresponding negative label, the positive label is used for identifying an actual position of the target object in the training image, and the negative label is used for identifying an actual position of background content in the training image;
the determining unit is used for determining a first loss of the identification result relative to the positive label and a second loss relative to the negative label;
the determining unit is further configured to determine first contribution information according to the first loss, and determine second contribution information according to the second loss; the first contribution information is used for identifying the degree of contribution of the characteristic point in the training image to the prediction of the position of the target object, and the second contribution information is used for identifying the degree of contribution of the characteristic point in the training image to the prediction of the position of the background content; determining background disturbance loss according to difference information between the first contribution information and the second contribution information, wherein the background disturbance loss is used for identifying the influence of the background content on the identification of the target object;
and the updating unit is used for updating the parameters of the network model according to the background disturbance loss.
8. The apparatus of claim 7, wherein the determining unit is further configured to:
determining the first contribution information through an output of an intermediate layer by back-propagating the first loss through the network model;
determining the second contribution information through an output of an intermediate layer by back-propagating the second loss through the network model.
9. The apparatus of claim 7, wherein the determining unit is further configured to:
calculating a characteristic distance between the first contribution information and the second contribution information;
and determining the background disturbance loss by using the characteristic distance as the difference information.
10. The apparatus of claim 7, wherein the updating unit is further configured to:
determining the comprehensive loss according to the background disturbance loss and the original loss;
and updating the parameters of the network model through comprehensive loss.
11. The apparatus of claim 7, wherein the network model is a twin network model.
12. An artificial intelligence based model training apparatus, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of claims 1-6 according to instructions in the program code.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a computer program for performing the method of any of claims 1-6.
CN202010237383.XA 2020-03-30 2020-03-30 Model training method based on artificial intelligence and related device Active CN111126515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010237383.XA CN111126515B (en) 2020-03-30 2020-03-30 Model training method based on artificial intelligence and related device


Publications (2)

Publication Number Publication Date
CN111126515A CN111126515A (en) 2020-05-08
CN111126515B true CN111126515B (en) 2020-07-24

Family

ID=70493900


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819024B (en) * 2020-07-10 2024-02-13 腾讯科技(深圳)有限公司 Model processing method, user data processing method and device and computer equipment
CN112561076B (en) * 2020-12-10 2022-09-20 支付宝(杭州)信息技术有限公司 Model processing method and device
CN115953651B (en) * 2023-03-13 2023-09-12 浪潮电子信息产业股份有限公司 Cross-domain equipment-based model training method, device, equipment and medium

Citations (2)

Publication number Priority date Publication date Assignee Title
CN107239731A (en) * 2017-04-17 2017-10-10 浙江工业大学 A gesture detection and recognition method based on Faster R-CNN
CN108256583A (en) * 2018-01-25 2018-07-06 北京东方科诺科技发展有限公司 A multi-label classification learning method based on coupled learning

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN106934346B (en) * 2017-01-24 2019-03-15 北京大学 A method for optimizing object detection performance
CN106951783B (en) * 2017-03-31 2021-06-01 国家电网公司 Disguised intrusion detection method and device based on deep neural network
CN107729801B (en) * 2017-07-11 2020-12-18 银江股份有限公司 Vehicle color recognition system based on a multi-task deep convolutional neural network
CN110210551B (en) * 2019-05-28 2021-07-30 北京工业大学 Visual target tracking method based on adaptive subject sensitivity

Similar Documents

Publication Publication Date Title
CN110414432B (en) Training method of object recognition model, object recognition method and corresponding device
CN110175527B (en) Pedestrian re-identification method and device, computer equipment and readable medium
CN111444828B (en) Model training method, target detection method, device and storage medium
CN110147836B (en) Model training method, device, terminal and storage medium
CN111126515B (en) Model training method based on artificial intelligence and related device
CN110909630B (en) Abnormal game video detection method and device
CN111310731A (en) Video recommendation method, device and equipment based on artificial intelligence and storage medium
CN111310705A (en) Image recognition method and device, computer equipment and storage medium
CN110765882B (en) Video tag determination method, device, server and storage medium
CN112101329B (en) Video-based text recognition method, model training method and model training device
CN113807399B (en) Neural network training method, neural network detection method and neural network training device
CN111401192B (en) Model training method and related device based on artificial intelligence
CN112132099A (en) Identity recognition method, palm print key point detection model training method and device
CN113378675A (en) Face recognition method for simultaneous detection and feature extraction
CN114550053A (en) Traffic accident responsibility determination method, device, computer equipment and storage medium
CN113822254B (en) Model training method and related device
CN113128526B (en) Image recognition method and device, electronic equipment and computer-readable storage medium
Shi et al. DSFNet: a distributed sensors fusion network for action recognition
CN114299546A (en) Method and device for identifying pet identity, storage medium and electronic equipment
CN114764870A (en) Object positioning model processing method, object positioning device and computer equipment
Ahmad et al. Embedded deep vision in smart cameras for multi-view objects representation and retrieval
CN113706550A (en) Image scene recognition and model training method and device and computer equipment
CN113298158A (en) Data detection method, device, equipment and storage medium
CN111008622A (en) Image object detection method and device and computer readable storage medium
CN111275183A (en) Visual task processing method and device and electronic system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant