CN111556337B - Media content implantation method, model training method and related device - Google Patents

Media content implantation method, model training method and related device

Info

Publication number
CN111556337B
Authority
CN
China
Prior art keywords
image
target
video frame
sub
network model
Prior art date
Legal status
Active
Application number
CN202010412331.1A
Other languages
Chinese (zh)
Other versions
CN111556337A
Inventor
凌永根
张浩贤
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010412331.1A
Publication of CN111556337A
Application granted
Publication of CN111556337B
Legal status: Active (current)
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/23424Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/812Monomedia components thereof involving advertisement data

Abstract

The application discloses a media content implantation method, a model training method and a related device, relates to a computer vision technology, and can be applied to intelligent advertisement implantation application. Inputting a first video frame and a second video frame in a target video into a target network model to obtain image transformation parameters, wherein the target network model comprises a plurality of sequentially associated sub-network layers and can gradually refine the image transformation parameters; and then mapping the region to be implanted into a target implantation region in the second video frame according to the image transformation parameters, and implanting target media content into the target implantation region. The target network model carries out gradual refinement on the implantation area, thereby ensuring the matching between the target implantation area and the template, avoiding the interference of external factors and improving the accuracy of media content implantation.

Description

Media content implantation method, model training method and related device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a media content embedding method, a model training method, and a related apparatus.
Background
With the development of internet technology, more and more media content appears in people's lives, and how to promote it effectively has become a difficult problem; embedding media content in videos has emerged as one approach.
In general, the media content implantation area may be determined using a mathematical model of geometric deformation that describes the geometric changes of the image at different times. Since the geometric model describing the motion process is known or user-defined, its parameters can be solved by constructing optimization equations so that the images at different times are aligned.
However, the method based on geometric changes is easily disturbed by external factors, so the accuracy of the obtained coordinate points is not high, which affects the accuracy of media content implantation.
Disclosure of Invention
In view of this, the present application provides a method for embedding media content, which can effectively avoid an error in an embedding area caused by external factors, and improve the accuracy of the media content embedding process.
A first aspect of the present application provides a method for embedding media content, which may be applied in a system or a program containing a media content embedding function in a terminal device, and specifically includes: acquiring a target video and target media content, wherein the target video comprises a first video frame and a second video frame, the first video frame comprises a template image indicating an initial implantation area, and the second video frame comprises an area to be implanted;
inputting the first video frame and the second video frame into a target network model to obtain image transformation parameters, wherein the target network model comprises a plurality of sequentially associated sub-network layers, the sub-network layers are used for generating sub-transformation parameters under different resolutions, the sub-transformation parameters are associated with each other, and the image transformation parameters are obtained based on the sub-transformation parameters;
mapping the region to be implanted into the target implant region in the second video frame according to the image transformation parameters, the target implant region being associated with the initial implant region;
and implanting the target media content in the target implantation area.
Optionally, in some possible implementations of the present application, the sub-network layer includes a sampling layer and an update layer, and the inputting the first video frame and the second video frame into a target network model to obtain an image transformation parameter includes:
inputting the first video frame and the second video frame into the sampling layer to obtain image pairs at a plurality of resolutions;
acquiring the sub-transformation parameters based on the image pair;
and respectively inputting the sub-transformation parameters into a plurality of updating layers corresponding to the sub-network layers to obtain the image transformation parameters, wherein the updating layers are associated step by step.
Optionally, in some possible implementation manners of the present application, the inputting the sub-transformation parameters into update layers corresponding to a plurality of sub-network layers, respectively, to obtain the image transformation parameters includes:
respectively inputting the sub-transformation parameters into a plurality of updating layers corresponding to the sub-network layers to obtain the sub-transformation parameters under different resolutions;
sequentially acquiring image corresponding parameters corresponding to the sub-transformation parameters according to the resolution order;
and if the corresponding parameters of the image meet preset conditions, determining the corresponding sub-transformation parameters as the image transformation parameters.
Optionally, in some possible implementations of the present application, the inputting the image pair into update layers corresponding to a plurality of sub-network layers respectively to obtain the sub-transformation parameters at different resolutions includes:
inputting the image pairs into a plurality of updating layers corresponding to the sub-network layers respectively to obtain feature maps of the image pairs;
determining motion information of the image pair according to the feature map, wherein the motion information is used for indicating the region matching condition of the image pair;
and determining the sub-transformation parameters according to the motion information.
Optionally, in some possible implementations of the present application, the method further includes:
determining an occlusion region in the image pair according to the motion information;
determining mask parameters according to the occlusion region, wherein the mask parameters are used for indicating the update of the target implantation region.
Optionally, in some possible implementations of the present application, the method further includes:
adjusting the image pair according to a preset resolution to obtain a fine image pair;
performing at least one recursive cycle on the fine image pair to update the mask parameters.
Optionally, in some possible implementations of the present application, the method further includes:
inputting motion information of a plurality of the image pairs into a detection network to obtain confidence parameters;
and triggering a corresponding interface element according to the confidence parameter, wherein the interface element is positioned on an interface where the target implantation area is positioned.
Optionally, in some possible implementation manners of the present application, the target media content is an advertisement, the target network model is a pyramid network model, and the image transformation parameter is a homography matrix.
A second aspect of the present application provides an apparatus for media content implantation, comprising: the device comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring a target video and target media content, the target video comprises a first video frame and a second video frame, the first video frame comprises a template image indicating an initial implantation area, and the second video frame comprises an area to be implanted;
an input unit, configured to input the first video frame and the second video frame into a target network model to obtain an image transformation parameter, where the target network model includes a plurality of sequentially associated sub-network layers, the sub-network layers are configured to generate sub-transformation parameters at different resolutions, the sub-transformation parameters are associated with each other, and the image transformation parameter is obtained based on the sub-transformation parameters;
a mapping unit, configured to map the region to be implanted into the target implantation region in the second video frame according to the image transformation parameter, where the target implantation region is associated with the initial implantation region;
and the implantation unit is used for implanting the target media content in the target implantation area.
Optionally, in some possible implementations of the present application, the input unit is specifically configured to input the first video frame and the second video frame into the sampling layer, so as to obtain image pairs at a plurality of resolutions;
the input unit is specifically configured to acquire the sub-transformation parameters based on the image pair;
the input unit is specifically configured to input the sub-transformation parameters into update layers corresponding to a plurality of sub-network layers, respectively, so as to obtain the image transformation parameters, where the update layers are associated with each other step by step.
Optionally, in some possible implementation manners of the present application, the input unit is specifically configured to input the sub-transformation parameters into update layers corresponding to a plurality of sub-network layers, respectively, so as to obtain the sub-transformation parameters at different resolutions;
the input unit is specifically configured to sequentially obtain image corresponding parameters corresponding to the sub-transform parameters according to the order of the resolution;
the input unit is specifically configured to determine, if the image corresponding parameter satisfies a preset condition, the corresponding sub-transformation parameter as the image transformation parameter.
Optionally, in some possible implementations of the present application, the input unit is specifically configured to input the image pair into update layers corresponding to a plurality of sub-network layers, respectively, so as to obtain a feature map of the image pair;
the input unit is specifically configured to determine motion information of the image pair according to the feature map, where the motion information is used to indicate a region matching condition of the image pair;
the input unit is specifically configured to determine the sub-transformation parameters according to the motion information.
Optionally, in some possible implementations of the present application, the input unit is further configured to determine an occlusion region in the image pair according to the motion information;
the input unit is further configured to determine a mask parameter according to the occlusion region, where the mask parameter is used to indicate an update of the target implantation region.
Optionally, in some possible implementations of the present application, the input unit is further configured to adjust the image pair according to a preset resolution to obtain a fine image pair;
the input unit is further configured to perform at least one recursive cycle on the fine image pair to update the mask parameters.
Optionally, in some possible implementations of the present application, the input unit is further configured to input motion information of a plurality of the image pairs into a detection network to obtain a confidence parameter;
the input unit is further configured to trigger a corresponding interface element according to the confidence parameter, where the interface element is located on an interface where the target implantation region is located.
A third aspect of the present application provides a method for training a network model, including: acquiring a data training set, wherein the data training set comprises a plurality of training images, and the training images comprise template areas;
determining a plurality of deformation images and image transformation parameters according to the template area, wherein the deformation images comprise disturbance areas corresponding to the template area;
inputting the image transformation parameters, the template region and the disturbance region into an initial network model for training to obtain a target network model, wherein the target network model is used for executing the method for implanting the media content in any one of the first aspect.
Optionally, in some possible implementation manners of the present application, the inputting the image transformation parameter, the template region, and the disturbance region into an initial network model for training to obtain a target network model includes:
determining a first loss function according to coordinate displacements corresponding to the template region and the disturbance region, wherein the coordinate displacements are associated with the image transformation parameters;
determining a corresponding shielding area in the template area and the disturbance area to obtain a second loss function;
and updating the parameters of the initial network model according to the first loss function and the second loss function to obtain a target network model.
Optionally, in some possible implementations of the present application, the method further includes:
adjusting image parameters of the training image and the deformation image to obtain a training image pair, wherein the image parameters comprise brightness, chroma, saturation or mask image layers;
and updating the parameters of the target network model according to the training image.
A fourth aspect of the present application provides an apparatus for network model training, including: the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a data training set, the data training set comprises a plurality of training images, and the training images comprise template areas;
the determining unit is used for determining a plurality of deformation images and image transformation parameters according to the template area, wherein the deformation images comprise disturbance areas corresponding to the template area;
a training unit, configured to input the image transformation parameter, the template region, and the perturbation region into an initial network model for training, so as to obtain a target network model, where the target network model is used to execute the method for media content implantation according to any one of the first aspect.
Optionally, in some possible implementations of the present application, the training unit is specifically configured to determine a first loss function according to coordinate displacements corresponding to the template region and the perturbation region, where the coordinate displacements are associated with the image transformation parameters;
the training unit is specifically configured to determine a corresponding occlusion region in the template region and the disturbance region to obtain a second loss function;
the training unit is specifically configured to update parameters of the initial network model according to the first loss function and the second loss function, so as to obtain a target network model.
Optionally, in some possible implementation manners of the present application, the training unit is further configured to adjust image parameters of the training image and the deformation image to obtain a training image pair, where the image parameters include a luminance, a chrominance, a saturation, or a mask image layer;
the training unit is further configured to update parameters of the target network model according to the training image.
A fifth aspect of the present application provides a computer device comprising: a memory, a processor, and a bus system; the memory is used for storing program codes; the processor is configured to perform the method for media content implantation according to the first aspect or any one of the first aspects, or the method for network model training according to any one of the third aspects, according to instructions in the program code.
A sixth aspect of the present application provides a computer-readable storage medium having stored therein instructions, which, when executed on a computer, cause the computer to perform the method for media content implantation of the first aspect or any one of the above first aspects, or the method for network model training of any one of the third aspects or any one of the third aspects.
According to the technical scheme, the embodiment of the application has the following advantages:
acquiring a target video and target media content, wherein the target video comprises a first video frame serving as a template and a second video frame to be implanted with the media content; then inputting the first video frame and the second video frame into a target network model to obtain image transformation parameters, wherein the target network model comprises a plurality of sequentially associated sub-network layers and can gradually refine the image transformation parameters; and then mapping the region to be implanted into a target implantation region in the second video frame according to the image transformation parameters, and implanting target media content into the target implantation region. Therefore, the intelligent implantation process is realized, and the target network model gradually refines the implantation area, so that the matching between the target implantation area and the template is ensured, the interference of external factors is avoided, and the accuracy of media content implantation is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a diagram of a network architecture in which a media content implantation system operates;
fig. 2 is a flowchart of media content implantation according to an embodiment of the present disclosure;
fig. 3 is a flowchart of a method for media content implantation according to an embodiment of the present application;
fig. 4 is a schematic view of a media content implantation scenario provided in an embodiment of the present application;
fig. 5 is a schematic view of another scenario of media content implantation provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of a network model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of another network model provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of another network model provided in an embodiment of the present application;
fig. 9 is a schematic view of another scenario of media content implantation provided in an embodiment of the present application;
FIG. 10 is a flowchart of a method for network model training according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of a media content implanting device according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a network model training apparatus according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a method and a related device for implanting media content, which can be applied to a system or a program containing a media content implanting function in terminal equipment, and can be used for acquiring a target video and target media content, wherein the target video comprises a first video frame serving as a template and a second video frame of the media content to be implanted; then inputting the first video frame and the second video frame into a target network model to obtain image transformation parameters, wherein the target network model comprises a plurality of sequentially associated sub-network layers and can gradually refine the image transformation parameters; and then mapping the region to be implanted into a target implantation region in the second video frame according to the image transformation parameters, and implanting target media content into the target implantation region. Therefore, the intelligent implantation process is realized, and the target network model gradually refines the implantation area, so that the matching between the target implantation area and the template is ensured, the interference of external factors is avoided, and the accuracy of media content implantation is improved.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some nouns that may appear in the embodiments of the present application are explained.
Image alignment: also known as image registration, i.e. for two images in a set of image data sets, one image is mapped to the other image by finding a spatial transformation such that points in the two images corresponding to the same position in space correspond one to one.
Homography matrix: a parameter of variation between images, i.e. mapping points on one projection plane (three-dimensional homogeneous vectors) onto another projection plane and mapping straight lines into straight lines, where homography is a linear transformation on three-dimensional homogeneous vectors.
U-net: a deep convolutional neural network for extracting image features.
Singular Value Decomposition algorithm (SVD): a matrix decomposition algorithm is characterized in that a more complex matrix is represented by multiplying 3 smaller and simpler sub-matrixes, wherein the 3 smaller matrixes describe important characteristics of a large matrix.
It should be understood that the media content implanting method provided by the present application may be applied to a system or a program including a media content implanting function in a terminal device, such as a media content platform. Specifically, the media content implanting system may operate in the network architecture shown in fig. 1, which is a diagram of the network architecture in which the media content implanting system operates. The media content implanting system may provide media content implantation for a plurality of information sources; the terminal establishes a connection with a server through a network to receive a video with implanted media content sent by the server, or to receive media content sent by the server for local implantation. It is understood that fig. 1 shows various terminal devices; in an actual scenario, there may be more or fewer types of terminal devices participating in the media content implantation process, and the specific number and types depend on the actual scenario, which is not limited herein. In addition, fig. 1 shows one server, but in an actual scenario there may also be multiple servers participating, especially in scenarios of multi-content application interaction; the specific number of servers depends on the actual scenario.
In this embodiment, the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
It should be noted that the media content implanting method provided in this embodiment may also be performed offline, that is, without the participation of a server, at this time, the terminal is connected with other terminals locally, and then the process of implanting the media content between the terminals is performed.
It is understood that the media content implantation system described above may be operated in a personal mobile terminal, such as: the application as a media content platform can also run on a server, and can also run on a third-party device to provide media content implantation so as to obtain a media content implantation processing result of an information source; the specific media content implanting system may be operated in the device in the form of a program, may also be operated as a system component in the device, and may also be used as one of cloud service programs, and a specific operation mode is determined according to an actual scene, which is not limited herein.
With the development of internet technology, more and more media content appears in people's lives, and how to promote it effectively has become a difficult problem; embedding media content in videos has emerged as one approach.
The process of embedding media content into a video can be carried out based on a Computer Vision technology (CV) in artificial intelligence, wherein the Computer Vision technology is a science for researching how to enable a machine to see, and further means that a camera and a Computer are used for replacing human eyes to carry out machine Vision such as identification, tracking and measurement on a target, and further graphic processing is carried out, so that the Computer processing becomes an image which is more suitable for human eyes to observe or is transmitted to an instrument to detect; for example, in a media content placement application, a placement area in a target image can be identified through computer vision technology, and then corresponding media content is placed in the placement area. As a scientific discipline, computer vision research-related theories and techniques attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, synchronous positioning, map construction, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
In general, the media content implantation area may be determined using a mathematical model of geometric deformation that describes the geometric changes of the image at different times. Since the geometric model describing the motion process is known or user-defined, its parameters can be solved by constructing optimization equations so that the images at different times are aligned.
However, the method based on geometric changes is easily disturbed by external factors, so the accuracy of the obtained coordinate points is not high, which affects the accuracy of media content implantation.
In order to solve the above problems, the present application provides a method for media content implantation, which is applied to the flow framework for media content implantation shown in fig. 2. As shown in fig. 2, in the flow framework provided in an embodiment of the present application, a target video is acquired, a template frame is set, and the target video and the template frame are then input into a target network model including a plurality of sub-network layers.
It is understood that the method provided in the present application may be a program written as a processing logic in a hardware system, or may be a media content implantation device, and the processing logic is implemented in an integrated or external manner. As one implementation manner, the media content implanting device acquires a target video and a target media content, wherein the target video comprises a first video frame as a template and a second video frame of the media content to be implanted; then inputting the first video frame and the second video frame into a target network model to obtain image transformation parameters, wherein the target network model comprises a plurality of sequentially associated sub-network layers and can gradually refine the image transformation parameters; and then mapping the region to be implanted into a target implantation region in the second video frame according to the image transformation parameters, and implanting target media content into the target implantation region. Therefore, the intelligent implantation process is realized, and the target network model gradually refines the implantation area, so that the matching between the target implantation area and the template is ensured, the interference of external factors is avoided, and the accuracy of media content implantation is improved.
The scheme provided by the embodiment of the application relates to the computer vision technology of artificial intelligence, and is specifically explained by the following embodiment:
with reference to the above flow architecture, a method for embedding media content in the present application will be described below, please refer to fig. 3, where fig. 3 is a flow chart of a method for embedding media content according to an embodiment of the present application, and the embodiment of the present application at least includes the following steps:
301. Acquiring the target video and the target media content.
In this embodiment, the target video includes a first video frame and a second video frame, the first video frame includes a template image indicating an initial implantation region, and the second video frame includes a region to be implanted. As shown in fig. 4, which is a schematic view of a media content embedding scene provided by the embodiment of the present application, the target embedding areas A1-A3 in the video frames before embedding are not identical, and the embedding areas A1-A3 are transformed into the corresponding media content A4-A6 after being processed by the media content embedding method provided by the embodiment of the present application.
It can be understood that the first video frame is generally the first frame of the video, the second video frame is another frame in the target video, and the template image is manually defined in the first frame by the user; the method provided by the application then obtains the mapping area of the template image in different video frames and implants the media content there. Since the area corresponding to the template image changes dynamically across video frames, the detection frame needs to be continuously optimized so that it approaches the actual implantation area. As shown in fig. 5, which is another scene diagram for media content implantation provided in the embodiment of the present application, the diagram includes the template image B1 in the first video frame, the area to be implanted B2 in the second video frame (which may also be referred to as the detection frame), and the target implantation region B3. The embodiment gradually brings the region to be implanted B2 closer to the target implantation region B3 by optimizing the image transformation parameters, thereby ensuring the accuracy of the target implantation region B3.
It should be noted that, in the present application, the media content may be in the form of video, picture or a combination of the two, and specifically, may be implanted as an advertisement, and the specific form depends on the actual scene, and is not limited herein.
302. Inputting the first video frame and the second video frame into a target network model to obtain image transformation parameters.
In this embodiment, the target network model includes a plurality of sequentially associated sub-network layers, the sub-network layers are used for generating sub-transformation parameters at different resolutions, the sub-transformation parameters are associated with each other, and the image transformation parameters are obtained based on the sub-transformation parameters; in a possible scenario, the target network model is a pyramid network model, and a specific structure of the model is described below, where the sub-transformation parameters are described by using homography, and other parameters that can be used for transformation between images are also included in the scope described in this application, and are not limited herein.
Fig. 6 is a schematic structural diagram of a network model provided in the embodiment of the present application, which includes a plurality of sequentially associated sub-network layers. The association means that the image pair of each network layer is obtained from the processing of the previous sub-network layer: the update layer acquires the homography matrix from the previous layer to the current layer and the changed position of the detection frame in the video frame, so that the precision of the homography matrix is gradually improved and the detection frame in the video frame continuously approaches the target implantation region, which ensures the accuracy of determining the target implantation region. In addition, the figure also comprises a refinement network (Refined Network): the image pair is adjusted according to a preset resolution to obtain a fine image pair, and the fine image pair is then subjected to at least one recursive cycle to update the mask parameters, a process primarily intended to further improve the accuracy of the target implantation region at high resolution.
In a possible scenario, the first video frame (template image) and the second video frame are transformed by a homography matrix into images with a resolution of 30x30 and input into the first sub-network layer of the pyramid network; the first sub-network layer then outputs a homography matrix between the two 30x30 images, and this matrix is used to update the position of the region to be implanted, which thereby moves closer to the position of the target implantation region. In the same way, the image of the template area (the green frame in the figure) and the region to be implanted, deformed by the homography matrix to 60x60 and 120x120 size, are input into the second and third sub-network layers of the pyramid network, and the position of the region to be implanted is sequentially updated.
Further, the refinement network immediately following the pyramid network is iterated in the same manner a number of times, for example 3 times, so that the position of the region to be implanted is refined to sub-pixel accuracy. The higher the input resolution of the refinement network, generally between 120x120 and 480x480, the higher the refinement precision, so that the accuracy of the target implantation area is further improved.
It should be noted that the figure shows 3 layers of sub-network layers and sub-network layer inputs at different resolutions, and in an actual scenario, there may be any other number of layers and resolutions, and here, the number is merely an example, and the specific number depends on the actual scenario.
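To make the coarse-to-fine procedure above concrete, the following is a minimal sketch of the pyramid loop, assuming a hypothetical callable predict_residual that stands in for one sub-network layer (it receives the resampled image pair and returns the residual homography between them at that resolution); the refinement network described above would iterate the same update a few more times at a higher resolution.

```python
import numpy as np
import cv2

def coarse_to_fine(template, frame, predict_residual, pyramid_res=(30, 60, 120)):
    """Sketch of the coarse-to-fine loop: H is the running template -> frame
    homography estimate, refined once per pyramid level."""
    h, w = template.shape[:2]
    H = np.eye(3)
    for res in pyramid_res:
        # sampling layer: project the template and the current frame estimate
        # of the region to be implanted onto a res x res image
        S_t = cv2.getPerspectiveTransform(
            np.float32([[0, 0], [w, 0], [w, h], [0, h]]),
            np.float32([[0, 0], [res, 0], [res, res], [0, res]]))
        tpl_s = cv2.warpPerspective(template, S_t, (res, res))
        frm_s = cv2.warpPerspective(frame, S_t @ np.linalg.inv(H), (res, res))
        # sub-network layer: predict the residual homography (sampled template
        # coordinates -> sampled frame coordinates) between the image pair
        dH = predict_residual(tpl_s, frm_s)
        # update layer: fold the residual back into the full-resolution estimate
        H = H @ np.linalg.inv(S_t) @ dH @ S_t
    return H
```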
Specifically, as for the processing flow in the sub-network layer, as shown in fig. 7, a schematic structural diagram of another network model provided in the embodiment of the present application is shown, in which a first video frame (template image) and a second video frame (target video) are input into a sampling layer as an image pair to obtain an image pair with multiple resolution sizes; the image pairs are respectively input into a plurality of Feature Extraction layers (Multi-Scale Feature Extraction layers) corresponding to the sub-network layers to obtain image transformation parameters. Wherein, the sub-transformation parameters are obtained by respectively inputting the image pair into a Feature Extraction Layer (Multi-Scale Feature Extraction Layer) corresponding to a plurality of sub-network layers to obtain a Feature map of the image pair; then, according to a feature map input motion information Construction Layer (Local Cost Volume Construction Layer), determining motion information of the image pair, wherein the motion information is used for indicating the area matching condition of the image pair; and then inputting a Global Motion Estimation Layer (Global Motion Estimation Layer) according to the Motion information to determine sub-transformation parameters.
Specifically, the feature extraction layer is a weight-shared U-net used to extract features from the two input images img1 and img2; other network models, such as ResNet, may also be used for feature extraction.
The motion information construction layer is mainly used for acquiring motion information (Cost Volume), and specifically may be acquired by using the following formula:
c(x_i, x_j) = f_i(x_i)^T · f_j(x_j)
where f_i is the feature map of the template image, f_j is the feature map of the second video frame, x_i is a pixel position in f_i, x_j is a pixel position in f_j, and T denotes the matrix transpose. For example, motion information over a 9x9 window may be constructed for each pixel position x_i of f_i.
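As an illustration of the formula above, the following sketch computes the local cost volume with a plain NumPy loop; the 9x9 window corresponds to radius = 4. Function and variable names are illustrative, not taken from the patent.

```python
import numpy as np

def local_cost_volume(feat_tpl, feat_frame, radius=4):
    """For every pixel x_i of the template feature map f_i, correlate its
    feature vector with f_j over a (2*radius+1) x (2*radius+1) window,
    i.e. c(x_i, x_j) = f_i(x_i)^T f_j(x_j) for x_j near x_i."""
    H, W, C = feat_tpl.shape
    win = 2 * radius + 1
    padded = np.pad(feat_frame, ((radius, radius), (radius, radius), (0, 0)))
    cost = np.zeros((H, W, win, win), dtype=feat_tpl.dtype)
    for dy in range(win):
        for dx in range(win):
            shifted = padded[dy:dy + H, dx:dx + W, :]   # f_j at offset (dy, dx)
            cost[:, :, dy, dx] = np.sum(feat_tpl * shifted, axis=-1)
    return cost.reshape(H, W, win * win)
```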
For the global motion estimation layer, the global motion estimation layer is a Convolutional Neural Network (CNN) with a Visual Geometry Group (VGG) structure, specifically, motion information is used as input, displacements of four vertexes of the template image relative to the second video frame are output, and then, the sub-transformation parameters are obtained by using SVD solution.
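The patent does not spell out how the SVD solution is constructed; the sketch below shows one standard way to recover a homography from the four predicted vertex displacements, via the direct linear transform (DLT) solved with SVD.

```python
import numpy as np

def homography_from_displacements(corners, displacements):
    """Build the 8x9 DLT system from the four vertex correspondences
    (corner -> corner + displacement) and take the SVD null-space vector
    as the homography, up to scale."""
    src = np.asarray(corners, dtype=np.float64)            # 4 x 2 template vertices
    dst = src + np.asarray(displacements, dtype=np.float64)
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]                                     # normalise the scale
```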
Optionally, a visualization layer (Visibility Layer) may be further included in the sub-network layer. It primarily determines an occlusion region in the image pair according to the motion information and determines mask parameters according to the occluded area, where the mask parameters are used to indicate the update of the target implantation region, thereby ensuring its accuracy. The mask parameters mark which part of the region to be implanted corresponds to the template image content (1 for visible, 0 for invisible). On one hand, this effectively assists training, so that the homography matrix of each pyramid layer is solved with higher precision; on the other hand, in the last iteration of the refinement network, the region to be implanted has substantially coincided with the target implantation region and alignment is achieved. At that point, the mask parameters become an occlusion mask: the visible part is not occluded and the rest is occluded, which improves the accuracy of media content implantation.
Further, the image processing procedure of the sampling layer in the above embodiment is described with reference to fig. 8, which is a schematic structural diagram of another network model provided in the embodiment of the present application. The adjustment parameter of the template image (denoted H_i^n here; the original formula images are not reproduced) is a homography that projects the template image onto a fixed-size image, for example at resolutions of 30x30, 60x60 and 120x120 for n = 1, 2, 3. Similarly, the adjustment parameter of the target video frame (denoted H_j^n) is a homography that projects the region to be implanted onto the fixed-size image. The sub-transformation parameter that converts the adjusted target video frame image into the adjusted template image (denoted Ĥ_ij^n) is the homography describing the predicted matching relation between the sampled images. The sub-transformation parameter H_ij that changes the target video frame image into the template image then needs to be obtained from H_i^n, H_j^n and Ĥ_ij^n, and projects the target implantation region onto the region to be implanted.
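The patent's own combination formula is an image that did not survive extraction; under the conventions used above (H_i^n: template to sampled image, H_j^n: region to be implanted to sampled image, Ĥ_ij^n: sampled frame image to sampled template image), one consistent way to combine them, stated here only as an assumption, is:

```python
import numpy as np

def compose_frame_to_template(H_i_n, H_j_n, H_hat_ij_n):
    # frame coords -> sampled frame -> sampled template -> template coords
    return np.linalg.inv(H_i_n) @ H_hat_ij_n @ H_j_n
```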
Specifically, in the step-by-step interaction between the sampling layer and the update layer, the update layer is mainly responsible for dynamically updating H_ij and p_j according to the per-layer output of the generated pyramid network (the output term is a formula image that is not reproduced here), until p_j agrees with p_gt. The update follows
p_j = (H_ij)^-1 · p_i
where p_i is taken as the coordinate positions of the template image, p_j as the coordinate positions of the region to be implanted, and p_gt as the coordinate positions of the target implantation region. The sampling layer is in turn mainly responsible for solving, from p_i and p_j, the per-layer homographies H_i^n and H_j^n of each pyramid level by the SVD method, and for sampling the inputs of each layer of the pyramid network.
Alternatively, the model calculation can be stopped once p_j is consistent with p_gt: the image pairs are respectively input into the update layers corresponding to the plurality of sub-network layers to obtain the sub-transformation parameters at different resolutions; the image corresponding parameters of the sub-transformation parameters are then acquired in sequence; and if an image corresponding parameter satisfies a preset condition, for example the similarity between p_j and p_gt reaches 95%, the corresponding sub-transformation parameter is determined as the image transformation parameter, thereby simplifying the model calculation.
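A small sketch of the update formula and the early-stop check described above; the 95% criterion is stated only loosely in the text, so close_enough is a hypothetical stand-in for the preset condition.

```python
import numpy as np

def update_region(p_i, H_ij):
    """Map the template vertices p_i into the frame: p_j = (H_ij)^-1 * p_i,
    using homogeneous coordinates."""
    p_i_h = np.hstack([np.asarray(p_i, dtype=np.float64), np.ones((len(p_i), 1))])
    p_j_h = (np.linalg.inv(H_ij) @ p_i_h.T).T
    return p_j_h[:, :2] / p_j_h[:, 2:3]

def close_enough(p_j, p_ref, tol=0.05):
    """Hypothetical stopping test: mean vertex error below tol of the region size."""
    size = np.ptp(np.asarray(p_ref, dtype=np.float64), axis=0).max()
    return np.linalg.norm(np.asarray(p_j) - np.asarray(p_ref), axis=1).mean() < tol * size
```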
In one possible scenario, motion information for a plurality of image pairs may also be input into the detection network to obtain confidence parameters; corresponding interface elements are then triggered according to the confidence parameters. The interface elements are located on the interface where the target implantation area is located and are displayed in forms such as rainbow bars or scores, thereby improving the credibility of the implantation result.
303. Mapping the region to be implanted into the target implantation region in the second video frame according to the image transformation parameters.
In this embodiment, the target implant region is associated with the initial implant region; i.e., the target implant region is transformed based on the gradual approach of the initial implant region.
In particular, the placement of the second video frame may be applied to each frame in the target video, thereby enabling the placement of advertisements in the video.
Optionally, it may be detected whether there is an implantation identifier in the second video frame, that is, an advertisement implantation point in the video; if the identifier is detected, the advertisement is embedded within the range of video frames indicated by the embedding identifier, thereby expanding the application range of the media content embedding method.
304. Target media content is implanted in the target implantation area.
In this embodiment, referring to fig. 9, a schematic view of another scenario for embedding media content provided in this embodiment of the application, the process of embedding the target media content is shown. The figure shows that, by inputting the template image and the target video into the target network model, a corresponding implantation area and an occlusion mask indicating occlusion information can be obtained, so that the media content is implanted in the unoccluded area to obtain the implantation result. The implantation result may include a virtual indication element C1 obtained based on the detection network indicated in step 303, i.e., a confidence network (Confidence Network), for representing the confidence level of the implantation result. Even in complex scenes involving scale transformation, angular deformation, motion blur and the like, the target network model in the application can still accurately solve the homography matrix, so the robustness of the target network model can be ensured through data augmentation. In addition, the reliability of the final alignment result can be predicted using the motion information solved at each pyramid layer and indicated visually, which improves the practicability of the implantation process.
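The final compositing step is not spelled out in the patent; the following sketch shows one plausible way to place the media content using the occlusion mask described above (1 = visible, 0 = occluded), assuming the advertisement image is already aligned with the template coordinate system and H_ij maps frame coordinates to template coordinates.

```python
import numpy as np
import cv2

def composite_advertisement(frame, ad_image, H_ij, occlusion_mask):
    """Warp the advertisement into the target implantation region and blend it
    only where the occlusion mask marks the region as visible."""
    h, w = frame.shape[:2]
    M = np.linalg.inv(H_ij)                     # template/ad coords -> frame coords
    warped_ad = cv2.warpPerspective(ad_image, M, (w, h))
    warped_valid = cv2.warpPerspective(
        np.ones(ad_image.shape[:2], dtype=np.float32), M, (w, h))
    alpha = (warped_valid * occlusion_mask.astype(np.float32))[..., None]
    return (frame * (1.0 - alpha) + warped_ad * alpha).astype(frame.dtype)
```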
With reference to the foregoing embodiment, by acquiring a target video and target media content, the target video includes a first video frame serving as a template and a second video frame to be embedded with the media content; then inputting the first video frame and the second video frame into a target network model to obtain image transformation parameters, wherein the target network model comprises a plurality of sequentially associated sub-network layers and can gradually refine the image transformation parameters; and then mapping the region to be implanted into a target implantation region in the second video frame according to the image transformation parameters, and implanting target media content into the target implantation region. Therefore, the intelligent implantation process is realized, and the target network model gradually refines the implantation area, so that the matching between the target implantation area and the template is ensured, the interference of external factors is avoided, and the accuracy of media content implantation is improved.
In the above embodiment, the process of media content embedding is described, and the target network model involved therein is trained, and in the following, the process of training the network model is described, please refer to fig. 10, fig. 10 is a flowchart of a method for training the network model provided by the embodiment of the present application, and the embodiment of the present application at least includes the following steps:
1001. a training set of data is obtained.
In this embodiment, the data training set includes a plurality of training images, and the training images include template regions. For example, the training set is generated using the MS-COCO data set, and to keep the training process parameters consistent, the pictures can be scaled to a fixed size, e.g., 240 × 240 resolution.
1002. A plurality of deformed images and image transformation parameters are determined according to the template region.
In this embodiment, the deformation image includes a disturbance region corresponding to the template region; specifically, the deformation image may be determined by adjusting image parameters of the training image and the deformation image, where the image parameters include brightness, chromaticity, saturation, or mask layer.
In a possible scenario, a 128 × 128 window area is randomly selected as the template area, and the four vertices of the template window are then randomly perturbed, for example within the range [-32, 32], to generate a random homography matrix; a corresponding picture (the deformation image) is generated by this deformation. The deformed template image area in the deformation image is then the target implantation area, which makes it convenient to acquire training samples.
Optionally, the brightness, the chromaticity and the saturation of the template area image and the deformation image can be randomly changed, and the occlusion and the motion blur are randomly added to generate the visual mask. Therefore, the corresponding training samples are generated, and the training process of the network model is more comprehensive.
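The training-pair generation described above can be sketched as follows; the helper names are illustrative, and the photometric jitter, occlusion and motion blur mentioned above are omitted for brevity.

```python
import numpy as np
import cv2

def make_training_pair(image, win=128, max_shift=32):
    """Pick a random 128x128 template window, perturb its four vertices within
    [-32, 32] to obtain a random ground-truth homography, and warp the image."""
    h, w = image.shape[:2]
    x0 = np.random.randint(0, w - win + 1)
    y0 = np.random.randint(0, h - win + 1)
    src = np.float32([[x0, y0], [x0 + win, y0],
                      [x0 + win, y0 + win], [x0, y0 + win]])
    dst = src + np.random.uniform(-max_shift, max_shift, size=(4, 2)).astype(np.float32)
    H_gt = cv2.getPerspectiveTransform(src, dst)        # ground-truth homography
    warped = cv2.warpPerspective(image, H_gt, (w, h))   # deformation image
    template = image[y0:y0 + win, x0:x0 + win]
    return template, warped, dst - src, H_gt            # vertex displacements as labels
```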
1003. Inputting the image transformation parameters, the template area and the disturbance area into an initial network model for training to obtain the target network model.
In this embodiment, the target network model is used to perform the method for media content implantation according to the embodiment shown in fig. 3.
Specifically, the training process comprises optimization of a loss function, namely determining a first loss function according to coordinate displacement corresponding to a template region and a disturbance region, wherein the coordinate displacement is associated with an image transformation parameter; then, corresponding shielding areas in the template area and the disturbance area are determined to obtain a second loss function; and updating the parameters of the initial network model according to the first loss function and the second loss function to obtain the target network model.
In one possible scenario, the training image pair is image 1 (the template image) and image 2 (the deformation image). The true values of the displacements of the four vertices of image 1 with respect to image 2 (denoted d_k^gt here; the original formula images are not reproduced) are obtained by solving from the template image coordinates p_i, the deformation image coordinates p_j and the ground-truth homography matrix, where k indexes the four vertices. With d_k denoting the output of each global motion estimation layer of the pyramid, the first loss function l_d is obtained from d_k and d_k^gt with reference to a formula that likewise did not survive extraction. It is understood that, since the target network model in this embodiment includes a plurality of sub-network layers, the ground-truth displacement d_k^gt of each sub-network layer is updated as p_j is updated.
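Since the formula image for l_d did not survive extraction, the following is only a plausible form of the first loss, assuming an average distance between predicted and ground-truth vertex displacements.

```python
import numpy as np

def vertex_displacement_loss(d_pred, d_gt):
    """Assumed form of l_d: mean Euclidean distance over the four vertex
    displacements (the patent's exact formula is not reproduced)."""
    d_pred = np.asarray(d_pred, dtype=np.float64)   # 4 x 2 predicted displacements
    d_gt = np.asarray(d_gt, dtype=np.float64)       # 4 x 2 ground-truth displacements
    return float(np.mean(np.linalg.norm(d_pred - d_gt, axis=1)))
```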
In addition, the second loss function relates the true value of the visibility mask (denoted m_k^gt here; the original formula images are not reproduced) to the predicted value m_k of the visualization layer, where k represents each pixel location. The process can be seen as a binary classification problem, and the second loss function l_M is obtained accordingly; the ground-truth mask of each layer is updated as p_j is updated, and N_k is the number of pixels.
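Likewise, the formula for l_M is not reproduced in the text; treating the visibility mask as a per-pixel binary classification suggests a cross-entropy averaged over the N_k pixels, sketched below as an assumption.

```python
import numpy as np

def mask_loss(m_pred, m_gt, eps=1e-7):
    """Assumed form of l_M: per-pixel binary cross-entropy between the predicted
    mask m_k and its ground truth, averaged over the N_k pixels."""
    m_pred = np.clip(np.asarray(m_pred, dtype=np.float64), eps, 1.0 - eps)
    m_gt = np.asarray(m_gt, dtype=np.float64)
    return float(np.mean(-(m_gt * np.log(m_pred) + (1.0 - m_gt) * np.log(1.0 - m_pred))))
```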
In summary, the loss function of the target network model combines the two terms over the pyramid levels (the original formula image is not reproduced here), where l_d is the first loss function, l_M is the second loss function, k represents the pyramid level, and λ_d and λ_M are empirical weighting parameters, typically set to 1.0.
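A plausible reading of the combined objective, assuming the two terms are simply summed over the pyramid levels with the stated weights:

```python
def total_loss(l_d_per_level, l_m_per_level, lambda_d=1.0, lambda_m=1.0):
    """Assumed combination of the two losses over the pyramid levels k."""
    return sum(lambda_d * ld + lambda_m * lm
               for ld, lm in zip(l_d_per_level, l_m_per_level))
```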
In addition, the confidence network may also be trained in this embodiment. The confidence is regarded as a binary classification problem: the label is set to 1 when the l_d of the last pyramid level is less than 5, and to 0 otherwise. The loss function of the confidence network is:
l_c = -(p* · log(p) + (1 - p*) · log(1 - p))
where p is the confidence predicted for the target image and p* is the binary label derived from the target implantation area.
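The confidence loss above is a standard binary cross-entropy; a small sketch, with the label derived from the last pyramid level's l_d as described:

```python
import numpy as np

def confidence_loss(p_pred, l_d_last, threshold=5.0, eps=1e-7):
    """l_c = -(p* log p + (1 - p*) log(1 - p)), with p* = 1 when the last
    pyramid level's l_d is below the threshold (5 in the text), else 0."""
    p_star = 1.0 if l_d_last < threshold else 0.0
    p = float(np.clip(p_pred, eps, 1.0 - eps))    # avoid log(0)
    return -(p_star * np.log(p) + (1.0 - p_star) * np.log(1.0 - p))
```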
With the above embodiment, the initial network model is trained on the automatically generated training set, so that the target network model becomes a deep-learning-based network model that integrates planar image alignment, image occlusion detection and confidence estimation of the alignment effect, and is lightweight, robust and highly accurate.
In order to better implement the above-mentioned aspects of the embodiments of the present application, the following also provides related apparatuses for implementing the above-mentioned aspects. Referring to fig. 11, fig. 11 is a schematic structural diagram of a media content embedding device according to an embodiment of the present disclosure, in which the media content embedding device 1100 includes:
an obtaining unit 1101, configured to obtain a target video and a target media content, where the target video includes a first video frame and a second video frame, the first video frame includes a template image indicating an initial implantation region, and the second video frame includes a region to be implanted;
an input unit 1102, configured to input the first video frame and the second video frame into a target network model to obtain an image transformation parameter, where the target network model includes a plurality of sequentially associated sub-network layers, the sub-network layers are configured to generate sub-transformation parameters at different resolutions, the sub-transformation parameters are associated with each other, and the image transformation parameter is obtained based on the sub-transformation parameters;
a mapping unit 1103, configured to map, in the second video frame, the region to be implanted into the target implantation region according to the image transformation parameter, where the target implantation region is associated with the initial implantation region;
an implanting unit 1104 for implanting the target media content in the target implantation area.
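To illustrate how the mapping unit 1103 and the implanting unit 1104 could operate on a single frame, the following Python/OpenCV sketch warps the media content into the second video frame with the estimated homography and composites it. The function name, the use of OpenCV, and the optional 8-bit occlusion mask are assumptions, not the patent's implementation.

```python
import cv2
import numpy as np

def implant_media(frame, media, H, occlusion_mask=None):
    # Warp the media content into the second video frame using the image
    # transformation parameter H (a 3 x 3 homography).
    h, w = frame.shape[:2]
    warped = cv2.warpPerspective(media, H, (w, h))
    # Warp an all-white mask the same way to know which pixels belong to
    # the target implantation area.
    area = cv2.warpPerspective(np.full(media.shape[:2], 255, np.uint8), H, (w, h))
    if occlusion_mask is not None:
        # Keep occluded pixels of the original frame (mask parameter from the
        # occlusion detection step, assumed here to be an 8-bit image).
        area = cv2.bitwise_and(area, occlusion_mask)
    out = frame.copy()
    out[area > 0] = warped[area > 0]
    return out
```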
Optionally, in some possible implementations of the present application, the input unit 1102 is specifically configured to input the first video frame and the second video frame into the sampling layer, so as to obtain image pairs with multiple resolutions;
the input unit 1102 is specifically configured to acquire the sub-transformation parameters based on the image pairs;
the input unit 1102 is specifically configured to input the sub-transformation parameters into update layers corresponding to a plurality of sub-network layers, respectively, so as to obtain the image transformation parameters, where the update layers are associated with each other step by step.
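A rough sketch of the sampling-layer / update-layer idea handled by the input unit 1102: build image pairs at several resolutions and refine the transformation from coarse to fine. The helper `estimate_at_level`, which stands in for one sub-network layer returning an incremental homography, the number of levels, and the omitted coordinate rescaling between levels are all assumptions.

```python
import cv2
import numpy as np

def coarse_to_fine_homography(frame1, frame2, estimate_at_level, levels=3):
    # Sampling layer: image pairs at several resolutions, coarsest first.
    pairs = []
    for lvl in range(levels):
        scale = 1.0 / (2 ** (levels - 1 - lvl))
        size = (int(frame1.shape[1] * scale), int(frame1.shape[0] * scale))
        pairs.append((cv2.resize(frame1, size), cv2.resize(frame2, size)))
    # Update layers: each level refines the estimate of the previous one
    # (coordinate rescaling between levels is omitted for brevity).
    H = np.eye(3)
    for img1, img2 in pairs:
        H = estimate_at_level(img1, img2, H) @ H
    return H
```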
Optionally, in some possible implementation manners of the present application, the input unit 1102 is specifically configured to input the sub-transformation parameters into update layers corresponding to a plurality of sub-network layers, respectively, so as to obtain the sub-transformation parameters at different resolutions;
the input unit 1102 is specifically configured to sequentially obtain image corresponding parameters corresponding to the sub-transform parameters according to the order of the resolution;
the input unit 1102 is specifically configured to determine the corresponding sub-transformation parameter as the image transformation parameter if the image corresponding parameter meets a preset condition.
Optionally, in some possible implementations of the present application, the input unit 1102 is specifically configured to input the image pair into feature extraction layers corresponding to a plurality of sub-network layers, respectively, so as to obtain a feature map of the image pair;
the input unit 1102 is specifically configured to determine, according to the feature map, motion information of the image pair, where the motion information is used to indicate a region matching condition of the image pair;
the input unit 1102 is specifically configured to determine the sub-transformation parameters according to the motion information.
Optionally, in some possible implementations of the present application, the input unit 1102 is further configured to determine an occlusion region in the image pair according to the motion information;
the input unit 1102 is further configured to determine a mask parameter according to the occlusion region, where the mask parameter is used to indicate an update of the target implantation region.
Optionally, in some possible implementations of the present application, the input unit 1102 is further configured to adjust the image pair according to a preset resolution to obtain a fine image pair;
the input unit 1102 is further configured to perform at least one recursive cycle on the fine image pair to update the mask parameters.
Optionally, in some possible implementations of the present application, the input unit 1102 is further configured to input motion information of a plurality of image pairs into a detection network to obtain a confidence parameter;
the input unit 1102 is further configured to trigger a corresponding interface element according to the confidence parameter, where the interface element is located on an interface where the target implantation region is located.
By means of the above apparatus, a target video and target media content are acquired, wherein the target video comprises a first video frame serving as a template and a second video frame into which the media content is to be implanted; the first video frame and the second video frame are then input into a target network model to obtain image transformation parameters, wherein the target network model comprises a plurality of sequentially associated sub-network layers and can gradually refine the image transformation parameters; the region to be implanted is then mapped into a target implantation region in the second video frame according to the image transformation parameters, and the target media content is implanted into the target implantation region. Therefore, an intelligent implantation process is realized, and since the target network model gradually refines the implantation area, the matching between the target implantation area and the template is ensured, interference from external factors is avoided, and the accuracy of media content implantation is improved.
An embodiment of the present application further provides a network model training apparatus 1200, as shown in fig. 12, which is a schematic structural diagram of the network model training apparatus provided in the embodiment of the present application, and specifically includes:
an obtaining unit 1201, configured to obtain a data training set, where the data training set includes a plurality of training images, and the training images include a template region;
a determining unit 1202, configured to determine a plurality of deformation images and image transformation parameters according to the template region, where the deformation images include a disturbance region corresponding to the template region;
a training unit 1203, configured to input the image transformation parameters, the template region, and the perturbation region into an initial network model for training, so as to obtain a target network model, where the target network model is used to execute the method for media content implantation according to any one of the first aspect.
Optionally, in some possible implementations of the present application, the training unit 1203 is specifically configured to determine a first loss function according to coordinate displacements corresponding to the template region and the perturbation region, where the coordinate displacements are associated with the image transformation parameters;
the training unit 1203 is specifically configured to determine a corresponding occlusion region in the template region and the perturbation region to obtain a second loss function;
the training unit 1203 is specifically configured to update parameters of the initial network model according to the first loss function and the second loss function, so as to obtain a target network model.
Optionally, in some possible implementations of the present application, the training unit 1203 is further configured to adjust image parameters of the training image and the deformation image to obtain a training image pair, where the image parameters include a luminance, a chrominance, a saturation, or a mask layer;
the training unit 1203 is further configured to update parameters of the target network model according to the training image.
An embodiment of the present application further provides a terminal device. Fig. 13 is a schematic structural diagram of another terminal device provided in the embodiment of the present application; for convenience of description, only the portion related to the embodiment of the present application is shown, and for specific technical details that are not disclosed, please refer to the method portion of the embodiments of the present application. The terminal may be any terminal device, including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), a point of sale (POS) terminal, a vehicle-mounted computer, and the like; the following takes a mobile phone as an example:
fig. 13 is a block diagram illustrating a partial structure of a mobile phone related to a terminal provided in an embodiment of the present application. Referring to fig. 13, the handset includes: radio Frequency (RF) circuitry 1310, memory 1320, input unit 1330, display unit 1340, sensor 1350, audio circuitry 1360, wireless fidelity (WiFi) module 1370, processor 1380, and power supply 1390. Those skilled in the art will appreciate that the handset configuration shown in fig. 13 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 13:
The RF circuit 1310 may be used for receiving and transmitting signals during message transmission or a call; in particular, downlink information received from a base station is delivered to the processor 1380 for processing, and uplink data is transmitted to the base station. In general, the RF circuit 1310 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 1310 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), etc.
The memory 1320 may be used to store software programs and modules, and the processor 1380 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1320. The memory 1320 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile phone, and the like. Further, the memory 1320 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The input unit 1330 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the cellular phone. Specifically, the input unit 1330 may include a touch panel 1331 and other input devices 1332. Touch panel 1331, also referred to as a touch screen, can collect touch operations by a user on or near the touch panel 1331 (e.g., operations by a user on or near touch panel 1331 using any suitable object or accessory such as a finger, a stylus, etc., and spaced touch operations within a certain range on touch panel 1331), and drive corresponding connected devices according to a preset program. Alternatively, the touch panel 1331 may include two portions of a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, and sends the touch point coordinates to the processor 1380, where the touch controller can receive and execute commands sent by the processor 1380. In addition, the touch panel 1331 may be implemented by various types, such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. The input unit 1330 may include other input devices 1332 in addition to the touch panel 1331. In particular, other input devices 1332 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 1340 may be used to display information input by a user or information provided to the user and various menus of the cellular phone. The display unit 1340 may include a display panel 1341, and optionally, the display panel 1341 may be configured in the form of a Liquid Crystal Display (LCD), an organic light-emitting diode (OLED), or the like. Further, touch panel 1331 can overlay display panel 1341, and when touch panel 1331 detects a touch operation on or near touch panel 1331, processor 1380 can be configured to determine the type of touch event, and processor 1380 can then provide a corresponding visual output on display panel 1341 based on the type of touch event. Although in fig. 13, the touch panel 1331 and the display panel 1341 are two independent components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 1331 and the display panel 1341 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 1350, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor that adjusts the brightness of the display panel 1341 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 1341 and/or the backlight when the mobile phone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
The audio circuit 1360, the speaker 1361 and the microphone 1362 may provide an audio interface between the user and the handset. The audio circuit 1360 may transmit the electrical signal converted from the received audio data to the speaker 1361, and the speaker 1361 converts the electrical signal into a sound signal and outputs it; on the other hand, the microphone 1362 converts the collected sound signal into an electrical signal, which the audio circuit 1360 receives and converts into audio data; the audio data is then output to the processor 1380 for processing and subsequently sent to, for example, another mobile phone via the RF circuit 1310, or output to the memory 1320 for further processing.
WiFi belongs to short-distance wireless transmission technology, and the mobile phone can help a user to receive and send e-mails, browse webpages, access streaming media and the like through the WiFi module 1370, and provides wireless broadband internet access for the user. Although fig. 13 shows the WiFi module 1370, it is understood that it does not belong to the essential constitution of the handset, and may be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 1380 is a control center of the mobile phone, connects various parts of the entire mobile phone using various interfaces and lines, and performs various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 1320 and calling data stored in the memory 1320, thereby integrally monitoring the mobile phone. Optionally, processor 1380 may include one or more processing units; alternatively, processor 1380 may integrate an application processor, which handles primarily the operating system, user interface, and applications, and a modem processor, which handles primarily wireless communications. It will be appreciated that the modem processor described above may not be integrated within processor 1380.
The handset also includes a power supply 1390 (e.g., a battery) for supplying power to the various components. Optionally, the power supply may be logically coupled to the processor 1380 via a power management system, so that charging, discharging, and power consumption are managed through the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In the embodiment of the present application, the processor 1380 included in the terminal further has the function of performing the respective steps of the media content implantation method or the network model training method described above.
Also provided in embodiments of the present application is a computer-readable storage medium having stored therein media content embedding instructions, which when executed on a computer, cause the computer to perform the steps performed by the media content embedding device in the method as described in the foregoing embodiments shown in fig. 3 to 10.
Also provided in the embodiments of the present application is a computer program product including instructions for implanting media content, which when executed on a computer, causes the computer to perform the steps performed by the media content implanting apparatus in the method as described in the embodiments of fig. 3 to 10.
The embodiment of the present application further provides a media content implanting system, where the media content implanting system may include the media content implanting apparatus in the embodiment described in fig. 11, or the network model training apparatus in the embodiment described in fig. 12, or the terminal device described in fig. 13.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a media content implanting device, or a network device) to perform all or part of the steps of the methods according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (16)

1. A method of media content placement, comprising:
acquiring a target video and target media content, wherein the target video comprises a first video frame and a second video frame, the first video frame comprises a template image indicating an initial implantation area, and the second video frame comprises an area to be implanted;
inputting the first video frame and the second video frame into a target network model to obtain image transformation parameters, wherein the target network model comprises a plurality of sequentially associated sub-network layers, the plurality of sequentially associated sub-network layers are respectively used for generating sub-transformation parameters under different resolutions based on the first video frame and the second video frame, and the sub-transformation parameters are associated with each other and used for obtaining the image transformation parameters;
mapping the region to be implanted into a target implant region in the second video frame according to the image transformation parameters, the target implant region being associated with the initial implant region;
and implanting the target media content in the target implantation area.
2. The method of claim 1, wherein the sub-network layers comprise a sampling layer and an update layer, and wherein inputting the first video frame and the second video frame into a target network model to obtain image transformation parameters comprises:
inputting the first video frame and the second video frame into the sampling layer to obtain a plurality of resolution-sized image pairs;
acquiring sub-transformation parameters at the different resolutions based on the image pair;
and respectively inputting the sub-transformation parameters under different resolutions into updating layers corresponding to the sub-network layers to obtain the image transformation parameters, wherein the updating layers are associated step by step.
3. The method according to claim 2, wherein the inputting the sub-transformation parameters at different resolutions into the update layers corresponding to the sub-network layers respectively to obtain the image transformation parameters comprises:
sequentially and respectively inputting the sub-transformation parameters under different resolutions into an updating layer corresponding to the sub-network layer according to the order of the resolutions so as to obtain corresponding parameters of the image;
and if the corresponding parameters of the image meet preset conditions, determining the corresponding sub-transformation parameters as the image transformation parameters.
4. The method of claim 2, wherein the sub-network layer further comprises a feature extraction layer, and wherein the obtaining sub-transform parameters at the different resolutions based on the image pairs comprises:
inputting the image pairs into feature extraction layers corresponding to the sub-network layers respectively to obtain feature maps of the image pairs;
determining motion information of the image pair according to the feature map, wherein the motion information is used for indicating the region matching condition of the image pair;
and determining the sub-transformation parameters according to the motion information.
5. The method of claim 4, further comprising:
determining an occlusion region in the image pair according to the motion information;
determining mask parameters according to the occlusion region, wherein the mask parameters are used for indicating the update of the target implantation region.
6. The method of claim 5, further comprising:
adjusting the image pair according to a preset resolution to obtain a fine image pair;
performing at least one recursive cycle on the fine image pair to update the mask parameters.
7. The method of claim 4, further comprising:
inputting motion information of a plurality of the image pairs into a detection network to obtain confidence parameters;
and triggering a corresponding interface element according to the confidence parameter, wherein the interface element is positioned on an interface where the target implantation area is positioned.
8. The method of claim 2, wherein said inputting the first video frame and the second video frame into the sampling layer to obtain a plurality of resolution-sized image pairs comprises:
inputting the first video frame and the second video frame into the sampling layer; and obtaining a plurality of image pairs with the resolution sizes based on the sub-transformation parameters of the previous layer.
9. The method of claim 1, wherein the target media content is an advertisement, the target network model is a pyramid network model, and the image transformation parameters are homography matrices.
10. A method of network model training, comprising:
acquiring a data training set, wherein the data training set comprises a plurality of training images, and the training images comprise template areas;
determining a plurality of deformation images and image transformation parameters according to the template area, wherein the deformation images comprise disturbance areas corresponding to the template area;
inputting the image transformation parameters, the template region and the perturbation region into an initial network model for training to obtain a target network model, wherein the target network model is used for executing the method for media content implantation according to any one of claims 1 to 9.
11. The method of claim 10, wherein the inputting the image transformation parameters, the template region and the perturbation region into an initial network model for training to obtain a target network model comprises:
determining a first loss function according to coordinate displacements corresponding to the template region and the disturbance region, wherein the coordinate displacements are associated with the image transformation parameters;
determining a corresponding shielding area in the template area and the disturbance area to obtain a second loss function;
and updating the parameters of the initial network model according to the first loss function and the second loss function to obtain a target network model.
12. The method of claim 10, further comprising:
adjusting image parameters of the training image and the deformation image to obtain a training image pair, wherein the image parameters comprise brightness, chroma, saturation or mask image layers;
and updating the parameters of the target network model according to the training image.
13. An apparatus for media content placement, comprising:
the device comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring a target video and target media content, the target video comprises a first video frame and a second video frame, the first video frame comprises a template image indicating an initial implantation area, and the second video frame comprises an area to be implanted;
an input unit, configured to input the first video frame and the second video frame into a target network model to obtain an image transformation parameter, where the target network model includes a plurality of sequentially associated sub-network layers, the plurality of sequentially associated sub-network layers are respectively configured to generate sub-transformation parameters at different resolutions based on the first video frame and the second video frame, and the sub-transformation parameters are associated with each other and are used to obtain the image transformation parameter;
a mapping unit, configured to map the region to be implanted into a target implantation region in the second video frame according to the image transformation parameter, where the target implantation region is associated with the initial implantation region;
and the implantation unit is used for implanting the target media content in the target implantation area.
14. An apparatus for network model training, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a data training set, the data training set comprises a plurality of training images, and the training images comprise template areas;
the determining unit is used for determining a plurality of deformation images and image transformation parameters according to the template area, wherein the deformation images comprise disturbance areas corresponding to the template area;
a training unit, configured to input the image transformation parameters, the template region, and the perturbation region into an initial network model for training, so as to obtain a target network model, where the target network model is used to perform the method for media content implantation according to any one of claims 1 to 9.
15. A computer device, the computer device comprising a processor and a memory:
the memory is used for storing program codes; the processor is configured to perform the method for media content implantation according to any one of claims 1 to 9 or the method for network model training according to any one of claims 10 to 12 according to instructions in the program code.
16. A computer-readable storage medium having stored therein instructions which, when run on a computer, cause the computer to perform the method of media content implantation of any of the above claims 1 to 9, or the method of network model training of any of the above claims 10 to 12.
CN202010412331.1A 2020-05-15 2020-05-15 Media content implantation method, model training method and related device Active CN111556337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010412331.1A CN111556337B (en) 2020-05-15 2020-05-15 Media content implantation method, model training method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010412331.1A CN111556337B (en) 2020-05-15 2020-05-15 Media content implantation method, model training method and related device

Publications (2)

Publication Number Publication Date
CN111556337A CN111556337A (en) 2020-08-18
CN111556337B true CN111556337B (en) 2021-09-21

Family

ID=72004739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010412331.1A Active CN111556337B (en) 2020-05-15 2020-05-15 Media content implantation method, model training method and related device

Country Status (1)

Country Link
CN (1) CN111556337B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113766147B (en) * 2020-09-22 2022-11-08 北京沃东天骏信息技术有限公司 Method for embedding image in video, and method and device for acquiring plane prediction model
CN116761037B (en) * 2023-08-23 2023-11-03 星河视效科技(北京)有限公司 Method, device, equipment and medium for video implantation of multimedia information
CN116939294B (en) * 2023-09-17 2024-03-05 世优(北京)科技有限公司 Video implantation method and device, storage medium and electronic equipment

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106991641A (en) * 2017-03-10 2017-07-28 北京小米移动软件有限公司 It is implanted into the method and device of picture
CN107748858A (en) * 2017-06-15 2018-03-02 华南理工大学 A kind of multi-pose eye locating method based on concatenated convolutional neutral net
CN108229497A (en) * 2017-07-28 2018-06-29 北京市商汤科技开发有限公司 Image processing method, device, storage medium, computer program and electronic equipment
CN108229279A (en) * 2017-04-14 2018-06-29 深圳市商汤科技有限公司 Face image processing process, device and electronic equipment
CN109948611A (en) * 2019-03-14 2019-06-28 腾讯科技(深圳)有限公司 A kind of method and device that method, the information of information area determination are shown
CN110121034A (en) * 2019-05-09 2019-08-13 腾讯科技(深圳)有限公司 A kind of method, apparatus and storage medium being implanted into information in video
CN110163640A (en) * 2018-02-12 2019-08-23 华为技术有限公司 A kind of method and computer equipment of product placement in video
CN110443852A (en) * 2019-08-07 2019-11-12 腾讯科技(深圳)有限公司 A kind of method and relevant apparatus of framing
CN110458820A (en) * 2019-08-06 2019-11-15 腾讯科技(深圳)有限公司 A kind of multimedia messages method for implantation, device, equipment and storage medium
CN110992367A (en) * 2019-10-31 2020-04-10 北京交通大学 Method for performing semantic segmentation on image with shielding area

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014074200A2 (en) * 2012-08-21 2014-05-15 Skybox Imaging, Inc. Multi-resolution pyramid for georeferenced video
CN107197260B (en) * 2017-06-12 2019-09-13 清华大学深圳研究生院 Video coding post-filter method based on convolutional neural networks
CN110532833A (en) * 2018-05-23 2019-12-03 北京国双科技有限公司 A kind of video analysis method and device
US11234666B2 (en) * 2018-05-31 2022-02-01 Canon Medical Systems Corporation Apparatus and method for medical image reconstruction using deep learning to improve image quality in position emission tomography (PET)
US10878299B2 (en) * 2018-09-12 2020-12-29 Cognex Corporation Methods and apparatus for testing multiple fields for machine vision
CN111091127A (en) * 2019-12-16 2020-05-01 腾讯科技(深圳)有限公司 Image detection method, network model training method and related device

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106991641A (en) * 2017-03-10 2017-07-28 北京小米移动软件有限公司 It is implanted into the method and device of picture
CN108229279A (en) * 2017-04-14 2018-06-29 深圳市商汤科技有限公司 Face image processing process, device and electronic equipment
CN107748858A (en) * 2017-06-15 2018-03-02 华南理工大学 A kind of multi-pose eye locating method based on concatenated convolutional neutral net
CN108229497A (en) * 2017-07-28 2018-06-29 北京市商汤科技开发有限公司 Image processing method, device, storage medium, computer program and electronic equipment
CN110163640A (en) * 2018-02-12 2019-08-23 华为技术有限公司 A kind of method and computer equipment of product placement in video
CN109948611A (en) * 2019-03-14 2019-06-28 腾讯科技(深圳)有限公司 A kind of method and device that method, the information of information area determination are shown
CN110121034A (en) * 2019-05-09 2019-08-13 腾讯科技(深圳)有限公司 A kind of method, apparatus and storage medium being implanted into information in video
CN110458820A (en) * 2019-08-06 2019-11-15 腾讯科技(深圳)有限公司 A kind of multimedia messages method for implantation, device, equipment and storage medium
CN110443852A (en) * 2019-08-07 2019-11-12 腾讯科技(深圳)有限公司 A kind of method and relevant apparatus of framing
CN110992367A (en) * 2019-10-31 2020-04-10 北京交通大学 Method for performing semantic segmentation on image with shielding area

Also Published As

Publication number Publication date
CN111556337A (en) 2020-08-18

Similar Documents

Publication Publication Date Title
US11605214B2 (en) Method, device and storage medium for determining camera posture information
WO2020216054A1 (en) Sight line tracking model training method, and sight line tracking method and device
CN111476780B (en) Image detection method and device, electronic equipment and storage medium
CN111556337B (en) Media content implantation method, model training method and related device
CN110852942B (en) Model training method, and media information synthesis method and device
US20210152751A1 (en) Model training method, media information synthesis method, and related apparatuses
CN108681402A (en) Identify exchange method, device, storage medium and terminal device
CN110517339B (en) Animation image driving method and device based on artificial intelligence
CN113426117B (en) Shooting parameter acquisition method and device for virtual camera, electronic equipment and storage medium
CN113822427A (en) Model training method, image matching device and storage medium
WO2022237116A1 (en) Image processing method and apparatus
CN112818733A (en) Information processing method, device, storage medium and terminal
WO2023137923A1 (en) Person re-identification method and apparatus based on posture guidance, and device and storage medium
CN110991325A (en) Model training method, image recognition method and related device
CN111914106B (en) Texture and normal library construction method, texture and normal map generation method and device
CN108765321A (en) It takes pictures restorative procedure, device, storage medium and terminal device
CN113780291A (en) Image processing method and device, electronic equipment and storage medium
CN110971786B (en) Shooting method and electronic equipment
CN116721317A (en) Image processing method, related apparatus, storage medium, and program product
CN115965523A (en) Image completion method and device, electronic equipment and computer readable storage medium
CN116958406A (en) Face three-dimensional reconstruction method and device, electronic equipment and storage medium
CN115834868A (en) Method, device, equipment and storage medium for detecting video playing area
CN117132701A (en) Face driving method and device of face model, electronic equipment and storage medium
CN117839216A (en) Model conversion method and device, electronic equipment and storage medium
CN116935172A (en) Image processing method, related device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40027869

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant