CN113810587B - Image processing method and device - Google Patents

Image processing method and device

Info

Publication number
CN113810587B
Authority
CN
China
Prior art keywords
image
frame
target
current frame
historical
Prior art date
Legal status
Active
Application number
CN202010478673.3A
Other languages
Chinese (zh)
Other versions
CN113810587A
Inventor
彭焕文
宋楠
李宏俏
刘苑文
曾毅华
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010478673.3A
Priority to PCT/CN2021/079103 (published as WO2021238325A1)
Publication of CN113810587A
Application granted
Publication of CN113810587B

Classifications

    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/61 Control of cameras or camera modules based on recognised objects
    • H04N23/80 Camera processing pipelines; Components thereof
    • H04N5/265 Mixing (studio circuits, e.g. for special effects)
    • G06T7/20 Analysis of motion
    • G06T7/254 Analysis of motion involving subtraction of images
    • G06T7/269 Analysis of motion using gradient-based methods
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20221 Image fusion; Image merging
    • G06T2207/20224 Image subtraction
    • G06T2207/30196 Human being; Person
    • G06T2207/30241 Trajectory

Abstract

The application provides an image processing method and device, relates to the technical field of multimedia processing, and is used for solving the problem in the prior art that a motion-trajectory special-effect video of a target shooting object cannot be generated in real time. The method comprises the following steps: acquiring a current frame and N historical action frames, wherein the current frame and the N historical action frames each comprise a target subject, the scenes of the current frame and the N historical action frames overlap, and the positions of the target subject in the N historical action frames are different; performing image segmentation on the N historical action frames to obtain images of N target subjects respectively corresponding to the N historical action frames; determining N reference positions in the current frame according to the positions of the N target subjects in the scenes of the N historical action frames and the scene of the current frame; and fusing the images of the N target subjects at the N reference positions of the current frame to obtain a target frame.

Description

Image processing method and device
Technical Field
The present application relates to the field of multimedia processing technologies, and in particular, to an image processing method and apparatus.
Background
At present, more and more users choose to take photos or videos with the camera of a mobile electronic device such as a mobile phone to record their lives. In the images or videos typically captured by such cameras, the motion trajectory of an object or person cannot be intuitively presented within a single video frame, and the interaction between the person and the background, or between people, is not rich enough and lacks interest.
An existing solution is to post-process the image data of already-generated video frames, add the motion path of the target object to the processed image data, and thereby generate a special-effect video. For example, the actual motion trajectory of the football or of a player can be shown in a football match video; that is, the motion route of the football or the player is visualized at a later stage by image processing, for example by superimposing a curve or straight line that represents the motion route, so that a special-effect video is generated. However, this scheme can only be performed as post-processing and cannot generate the special-effect video in real time.
Disclosure of Invention
The application provides an image processing method and device, and solves the problem that a motion track special-effect video of a target shooting object cannot be generated in real time in the prior art.
To achieve the above objective, the following technical solutions are adopted in this application:
in a first aspect, an image processing method is provided, which includes: acquiring a current frame and N historical action frames, wherein the current frame and the N historical action frames each comprise a target subject, scenes of the current frame and the N historical action frames overlap, the positions of the target subject in the N historical action frames are different, and N is a positive integer greater than or equal to 1; performing image segmentation on the N historical action frames to obtain images of N target subjects respectively corresponding to the N historical action frames; determining N reference positions in the current frame according to the positions of the N target subjects in the scenes of the N historical action frames and the scene of the current frame; and fusing the images of the N target subjects at the N reference positions of the current frame to obtain a target frame.
It should be noted that, after the electronic device receives a shooting start instruction of a user, the electronic device acquires a real-time video stream through a lens, where the real-time video stream is composed of a temporally continuous frame sequence, and each frame of the video stream may be a current frame at a current time. When the electronic device determines a key action frame by a specific method described below, the key action frame may be referred to as a historical action frame with respect to a current frame corresponding to a time after the determination of the key action frame. Taking a time axis t of real-time shooting as an example, the electronic device starts video shooting at a time t0, the electronic device determines a real-time video frame corresponding to the time t1 as a key action frame (historical action frame 1), and then the electronic device determines a real-time video frame corresponding to the time t2 as a key action frame (historical action frame 2), so that for a current frame corresponding to a current time t3, the acquired N historical action frames are the historical action frame 1 and the historical action frame 2.
In the above technical solution, the electronic device determines at least one key action frame in the real-time video frame stream as a historical action frame, and segments the image of the corresponding at least one target subject in the at least one historical action frame. A key action frame refers to an image captured when the target subject makes a specified action or an obvious key action in the video frame stream shot by the electronic device in real time. The image of the target subject in each historical action frame is then displayed simultaneously in the current frame by a multi-frame fusion display method, according to the positional correspondence of objects across the multiple frames. The main application scenario of this technical solution is the segmentation of a portrait and the fused display of a motion trajectory, so that a special-effect image or special-effect video of the motion trajectory of the shot target subject can be generated in real time, enriching the user's shooting experience.
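Purely as an illustration of how the four steps fit together (and not the claimed implementation), the following Python sketch runs the pipeline on tiny synthetic grayscale frames. It assumes a static camera, so a scene position in a historical action frame maps to the same pixel position in the current frame, and it uses simple thresholding as a stand-in for real image segmentation:

```python
import numpy as np

def toy_target_frame(current_frame, historical_action_frames, thresh=128):
    """Toy end-to-end illustration of the four steps on synthetic grayscale
    frames (dark background, bright subject). Simplifying assumptions: the
    camera is static, so a scene position in a historical action frame maps
    to the same pixel position in the current frame, and thresholding stands
    in for real image segmentation."""
    target_frame = current_frame.copy()
    for hist in historical_action_frames:
        # Step 2: "segment" the target subject out of the historical frame
        subject_mask = hist > thresh
        # Step 3: reference position = same pixels (identity scene mapping)
        # Step 4: fuse the subject image into the current frame
        target_frame[subject_mask] = hist[subject_mask]
    return target_frame

# Tiny synthetic example: a 5x5 scene, the subject is a bright pixel that moves.
h1 = np.zeros((5, 5), np.uint8); h1[1, 1] = 255    # historical action frame 1
h2 = np.zeros((5, 5), np.uint8); h2[2, 2] = 255    # historical action frame 2
cur = np.zeros((5, 5), np.uint8); cur[3, 3] = 255  # current frame
print(toy_target_frame(cur, [h1, h2]))             # shows all three positions
```

The later design options below refine the individual steps (segmentation, position mapping, fusion) that this toy simplifies.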
In a possible design, before the obtaining the current frame and the N historical motion frames, the method further includes: and receiving a first selection instruction of a user, wherein the first selection instruction is used for indicating to enter an automatic shooting mode or a manual shooting mode.
In the above possible implementation manner, the electronic device determines the automatic shooting mode or the manual shooting mode by receiving a selection instruction of a user. Therefore, the electronic equipment can automatically detect or manually determine the historical action frames in the currently acquired video frame stream by the user, and the special effect video effect of displaying the motion track is fused according to the plurality of historical action frames, so that the shooting pleasure of the user is increased.
In a possible design, if the first selection instruction is used to instruct to enter the automatic shooting mode, acquiring a historical action frame specifically includes: performing motion detection on the real-time video stream to determine a target subject; detecting the position of the target subject in the scene in each video frame included in the real-time video stream; and determining, as historical action frames, the video frames of the real-time video stream in which the change of the target subject's position in the scene meets a preset threshold.
In this possible implementation, the electronic device may, according to the automatic shooting instruction given by the user, automatically detect the moving target subject from the real-time video frame stream and determine the historical action frames that meet a preset condition according to the image changes of the moving target subject. Fused display is thus automatically performed in real time according to the at least one determined historical action frame and updated into the current frame, a special-effect video is synthesized, and the user's shooting experience is enriched.
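For illustration only, the following sketch shows one simple way such automatic detection could be realized with OpenCV: frame differencing locates the moving subject, and a frame is flagged as a historical action frame when the subject's position in the scene has shifted by more than a preset threshold since the last recorded action frame. The thresholds and the differencing approach are assumptions, not the detection algorithm claimed by the patent:

```python
import cv2
import numpy as np

def subject_centroid(prev_gray, gray, diff_thresh=25):
    """Rough position of the moving subject via frame differencing."""
    diff = cv2.absdiff(gray, prev_gray)
    _, motion = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)
    ys, xs = np.nonzero(motion)
    if xs.size == 0:
        return None
    return np.array([xs.mean(), ys.mean()])

def detect_action_frames(frames, move_thresh=80.0):
    """Return indices of frames whose subject position change, relative to
    the last recorded action frame, exceeds move_thresh (pixels)."""
    action_idx, last_pos = [], None
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for i, frame in enumerate(frames[1:], start=1):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        pos = subject_centroid(prev, gray)
        prev = gray
        if pos is None:
            continue
        if last_pos is None or np.linalg.norm(pos - last_pos) > move_thresh:
            action_idx.append(i)
            last_pos = pos
    return action_idx
```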
In a possible design, if the first selection instruction is used to instruct to enter the manual shooting mode, acquiring a historical action frame specifically includes: receiving a second selection instruction of the user for a video frame included in the real-time video stream; determining the subject at the position in the video frame corresponding to the second selection instruction as the target subject; and determining that video frame as a historical action frame.
In the possible implementation manner, the electronic device can also perform fusion display of multi-frame images in real time through real-time interaction with the user according to the moving target subject in the current video frame stream determined by the user and at least one historical action frame determined by the user, update the fusion display to the current frame to synthesize the special effect video, and enrich the shooting experience of the user.
In a possible design, performing image segmentation on a historical action frame to obtain an image of the target subject corresponding to the historical action frame specifically includes: narrowing down, according to a motion detection technique, the image region of the historical action frame that corresponds to the target subject to obtain a target image region in the historical action frame; and processing the image of the target image region through a deep learning algorithm to obtain a mask image of the target subject corresponding to the historical action frame.
In the possible implementation manner, the electronic device may perform image segmentation according to the historical motion frame to obtain a mask image of the target subject, so as to track and record the motion of the multi-frame target subject, and perform multi-frame image fusion on the current frame according to the mask image of at least one target subject, thereby generating a special effect video of a motion track. In addition, the image area of the image division is reduced before the image division is carried out, so that the accuracy of the image division can be improved, and the complexity of the algorithm can be simplified.
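As a hedged illustration of the two-stage segmentation described above, the sketch below first narrows the search area with frame-difference motion detection and then extracts a subject mask only inside that region; GrabCut is used here merely as a stand-in for the deep-learning segmentation network referred to in the text:

```python
import cv2
import numpy as np

def segment_subject(action_frame, prev_frame, pad=20):
    """Sketch of the two-stage segmentation: first shrink the search area
    with motion detection, then extract a subject mask inside it. GrabCut
    is a stand-in for the deep-learning segmentation network."""
    # Stage 1: motion detection -> bounding box of the moving region
    diff = cv2.absdiff(cv2.cvtColor(action_frame, cv2.COLOR_BGR2GRAY),
                       cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY))
    _, motion = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    ys, xs = np.nonzero(motion)
    if xs.size == 0:
        return None, None
    h, w = motion.shape
    x0, x1 = max(xs.min() - pad, 0), min(xs.max() + pad, w - 1)
    y0, y1 = max(ys.min() - pad, 0), min(ys.max() + pad, h - 1)

    # Stage 2: segment only inside the reduced target image region
    mask = np.zeros((h, w), np.uint8)
    bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
    rect = (int(x0), int(y0), int(x1 - x0), int(y1 - y0))
    cv2.grabCut(action_frame, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
    subject_mask = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD),
                            255, 0).astype(np.uint8)
    return subject_mask, (x0, y0, x1, y1)
```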
In one possible design, if the mask image contains multiple overlapping subjects, the method further includes: separating the mask image of the target subject from the mask image in which the subjects overlap, according to the depth information of the subjects in the historical action frame.
In the possible implementation manner, when there is a problem that the captured image of the target subject and the other subject images are displayed in an overlapping manner, the mask image of the target subject may be obtained by separating the mask image of the target subject according to the depth information of the subjects in the historical motion frame and the mask image in which the multiple persons overlap. Besides the above mask image segmentation according to the depth image, the segmentation of the mask image overlapped by multiple people can be realized by adopting technologies such as binocular visual depth, monocular depth estimation, structured light depth or example segmentation. The mask image of the target subject is divided from the mask images overlapped by a plurality of persons, so that the image processing precision is improved, and the generated motion track special effect video of the target subject is more real and natural.
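A minimal sketch of the depth-based separation, assuming a per-pixel depth map aligned with the action frame (for example from binocular depth, monocular depth estimation or structured light) and a known approximate depth of the target subject, might look as follows; the tolerance is an illustrative assumption:

```python
import numpy as np

def separate_by_depth(overlap_mask, depth_map, subject_depth, tol=0.3):
    """Split the target subject's mask out of a mask covering several
    overlapping people by keeping only pixels whose depth is close to
    the target subject's depth (same units as depth_map)."""
    near = np.abs(depth_map - subject_depth) <= tol * subject_depth
    target_mask = np.where((overlap_mask > 0) & near, 255, 0).astype(np.uint8)
    return target_mask
```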
In a possible design, determining a reference position in the current frame according to the position of the target subject in the scene of a historical action frame and the scene of the current frame specifically includes: obtaining, according to an image registration technique or a simultaneous localization and mapping (SLAM) technique, the correspondence between the position of at least one object in the historical action frame and the position of that object in the current frame; and determining the reference position of the target subject in the current frame according to the correspondence and the position of the target subject in the historical action frame.
In this possible implementation, position mapping across multiple frames is performed through an image registration technique or a simultaneous localization and mapping (SLAM) technique, and the reference position in the current frame that corresponds to the image of the target subject in each historical action frame is determined according to the correspondence of the image positions of different objects across the frames, so that a special-effect video with a real and natural motion trajectory can be generated, improving the user's shooting experience.
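For the image-registration branch, one common realization is to estimate a homography between the two frames from the (largely static) scene and use it to map the subject's position; the ORB-plus-RANSAC choice below is an assumption for illustration, and a SLAM-derived pose could be used instead:

```python
import cv2
import numpy as np

def reference_position(hist_frame, current_frame, subject_xy):
    """Estimate a homography from the historical action frame to the current
    frame and map the subject's (x, y) position through it."""
    orb = cv2.ORB_create(1000)
    k1, d1 = orb.detectAndCompute(cv2.cvtColor(hist_frame, cv2.COLOR_BGR2GRAY), None)
    k2, d2 = orb.detectAndCompute(cv2.cvtColor(current_frame, cv2.COLOR_BGR2GRAY), None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)[:200]
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    pt = np.float32([[subject_xy]])             # shape (1, 1, 2)
    ref = cv2.perspectiveTransform(pt, H)[0, 0]
    return ref                                  # (x, y) in the current frame
```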
In a possible design, fusing images of N target subjects to N reference positions of a current frame, respectively, specifically includes: and respectively carrying out weighted fusion processing on the images of the N target subjects and the pixel information of the image in the current frame at the N reference positions of the current frame.
In the possible implementation manner, after the images of the multiple target subjects are fused and displayed, edge fusion processing may be performed on the images of the target subjects and the background image in the current frame, and the target frame is updated, so that the images of the multiple target subjects and the background image displayed in a fusion manner are in a natural transition.
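As an illustrative sketch of the weighted fusion (not the exact weighting claimed), the subject image can be blended into the current frame using a feathered version of its mask as the per-pixel weight, which also gives the natural edge transition mentioned above:

```python
import cv2
import numpy as np

def fuse_subject(current_frame, subject_img, subject_mask, alpha=1.0):
    """Blend the segmented subject image into the current frame with a
    feathered mask so the subject-to-background transition is smooth.
    subject_img and subject_mask are assumed already placed in
    current-frame coordinates (see the registration sketch above)."""
    # Feather the binary mask to get per-pixel weights in [0, 1]
    weight = cv2.GaussianBlur(subject_mask.astype(np.float32) / 255.0, (21, 21), 0)
    weight = (alpha * weight)[..., None]        # broadcast over 3 channels
    fused = weight * subject_img.astype(np.float32) + \
            (1.0 - weight) * current_frame.astype(np.float32)
    return np.clip(fused, 0, 255).astype(np.uint8)
```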
In one possible design, after fusing the images of the N target subjects on the N reference positions of the current frame, respectively, the method further includes: and adding at least one gray level image to the image of the target subject in the current frame to obtain the target frame, wherein the gray level value of the gray level image is larger if the distance between the gray level image and the image of the target subject in the current frame is shorter.
In this possible implementation, a plurality of afterimages are superimposed behind the target subject, opposite to its direction of motion, in the current frame. The afterimages can be displayed as grayscale images, and the motion trajectory is expressed through different gray values, so that the motion direction and trajectory of the target subject are represented more intuitively, increasing the interest and intuitiveness of the special-effect video and further improving the user's shooting experience.
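A possible way to render such an afterimage trail, sketched below under the assumption that a unit motion-direction vector and the subject mask in current-frame coordinates are available, is to paste progressively fainter translucent gray silhouettes behind the subject, with the gray value larger the closer the silhouette is to the subject:

```python
import numpy as np

def add_ghost_trail(frame, subject_mask, motion_dir, num_ghosts=3, step=30):
    """Draw num_ghosts grayscale silhouettes of the subject behind its
    motion direction; the closer a silhouette is to the subject, the
    larger its gray value. motion_dir is a unit vector (dx, dy)."""
    out = frame.astype(np.float32)
    h, w = subject_mask.shape
    for i in range(1, num_ghosts + 1):
        gray = 255.0 * (1.0 - i / (num_ghosts + 1.0))   # nearer -> brighter
        dx = int(round(-motion_dir[0] * step * i))       # shift opposite to motion
        dy = int(round(-motion_dir[1] * step * i))
        shifted = np.zeros_like(subject_mask)
        ys, xs = np.nonzero(subject_mask)
        xs2, ys2 = xs + dx, ys + dy
        keep = (xs2 >= 0) & (xs2 < w) & (ys2 >= 0) & (ys2 < h)
        shifted[ys2[keep], xs2[keep]] = 1
        weight = 0.4 * shifted[..., None]                # translucent silhouette
        out = (1 - weight) * out + weight * gray
    return np.clip(out, 0, 255).astype(np.uint8)
```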
In a second aspect, there is provided an image processing apparatus comprising: an acquisition module, configured to acquire a current frame and N historical action frames, wherein the current frame and the N historical action frames each comprise a target subject, scenes of the current frame and the N historical action frames overlap, the positions of the target subject in the N historical action frames are different, and N is a positive integer greater than or equal to 1; an image segmentation module, configured to perform image segmentation on the N historical action frames to obtain images of N target subjects respectively corresponding to the N historical action frames; a mapping module, configured to determine N reference positions in the current frame according to the positions of the N target subjects in the scenes of the N historical action frames and the scene of the current frame; and an image fusion module, configured to fuse the images of the N target subjects at the N reference positions of the current frame to obtain a target frame.
In one possible embodiment, the device further comprises: the receiving module is used for receiving a first selection instruction of a user, wherein the first selection instruction is used for indicating to enter an automatic shooting mode or a manual shooting mode.
In a possible design manner, if the first selection instruction is used to instruct to enter the automatic shooting mode, the obtaining module is specifically configured to: performing motion detection on the real-time video stream to determine a target subject; detecting a position of a target subject in a scene in each video frame included in the real-time video stream; and determining the video frames of which the position change of the scene in the video frames included in the real-time video stream meets the preset threshold value as historical action frames.
In a possible design, if the first selection instruction is used to instruct to enter the manual shooting mode, the receiving module is further configured to receive a second selection instruction of the user for a video frame included in the real-time video stream; and the acquisition module is specifically further configured to: determine the subject at the position in the video frame corresponding to the second selection instruction as the target subject, and determine that video frame as a historical action frame.
In a possible design, the image segmentation module is specifically configured to: narrow down, according to a motion detection technique, the image region corresponding to the target subject in the historical action frame to obtain a target image region in the historical action frame; and process the image of the target image region through a deep learning algorithm to obtain a mask image of the target subject corresponding to the historical action frame.
In a possible design, if there are multiple mask images with overlapped subjects in the mask images, the image segmentation module is further specifically configured to: and separating the mask image of the target subject from the mask image overlapped by the plurality of subjects according to the depth information of the plurality of subjects in the historical action frame.
In one possible design, the mapping module is specifically configured to: obtain, according to an image registration technique or a simultaneous localization and mapping (SLAM) technique, the correspondence between the position of at least one object in a historical action frame and the position of that object in the current frame; and determine the reference position of the target subject in the current frame according to the correspondence and the position of the target subject in the historical action frame.
In one possible design, the image fusion module is specifically configured to: and respectively carrying out weighted fusion processing on the images of the N target subjects and the pixel information of the image in the current frame at the N reference positions of the current frame.
In a possible design, the image fusion module is further specifically configured to: and adding at least one gray level image to the image of the target subject in the current frame to obtain the target frame, wherein the gray level value of the gray level image is larger if the distance between the gray level image and the image of the target subject in the current frame is shorter.
In a third aspect, an electronic device is provided, which includes: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement any of the possible embodiments of the first aspect and the first aspect as described above.
In a fourth aspect, a computer-readable storage medium is provided, in which instructions that, when executed by a processor of an electronic device, enable the electronic device to perform any of the possible implementations of the first aspect and the first aspect as described above.
In a fifth aspect, a computer program product is provided, which, when run on a computer, causes the computer to perform any of the possible implementations of the first aspect and the first aspect as described above.
It is understood that any one of the image processing apparatus, the electronic device, the computer readable storage medium and the computer program product provided above can be implemented by the corresponding method provided above, and therefore, the beneficial effects achieved by the method can refer to the beneficial effects in the corresponding method provided above, and are not described herein again.
Drawings
Fig. 1A is a schematic hardware structure diagram of an electronic device according to an embodiment of the present disclosure;
fig. 1B is a diagram illustrating a software architecture of an electronic device according to an embodiment of the present disclosure;
fig. 1C is a schematic flowchart of an image processing method according to an embodiment of the present disclosure;
fig. 2 is a schematic interface diagram of special-effect video shooting of an electronic device according to an embodiment of the present disclosure;
fig. 3 is a schematic interface diagram of special-effect video shooting of another electronic device according to an embodiment of the present application;
fig. 4 is a schematic view of a user interaction of a shooting preview interface according to an embodiment of the present application;
fig. 5 is a schematic flowchart of another image processing method according to an embodiment of the present application;
fig. 6 is a schematic diagram of an algorithm for determining a current frame as a key action frame according to an embodiment of the present application;
fig. 7 is a schematic diagram of an image segmentation processing method according to an embodiment of the present application;
fig. 8 is a schematic diagram of a completion mask image according to an embodiment of the present application;
FIG. 9A is a schematic view of a separated overlapping portrait provided by an embodiment of the present application;
FIG. 9B is a schematic illustration of another separated overlapping human image provided by an embodiment of the present application;
FIG. 10 is a diagram illustrating a multi-frame image mapping according to an embodiment of the present disclosure;
fig. 11 is a schematic flowchart of another image processing method according to an embodiment of the present application;
fig. 12 is a schematic flowchart of another image processing method according to an embodiment of the present application;
fig. 13 is a schematic flowchart of another image processing method according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the following, the terms "first", "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present embodiment, "a plurality" means two or more unless otherwise specified.
It is noted that the words "exemplary" or "such as" are used herein to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "such as" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides an image processing method and device, which can be applied to a video shooting scene and can generate a special effect video or a special effect image of a motion track of a target shooting object in real time based on a video frame stream shot in real time. The motion track special effect can be used for recording key actions of a target shooting object once occurring on a time axis or positions where the target shooting object once occurs, fusing and displaying images of the target shooting object in recorded historical key actions in a current frame, and fusing the images with a background image, the ground and the like of the current frame. A user can see the special effect video shooting effect in real time when shooting the preview picture in the video shooting process, unique user experience of staggered time and space is formed, and meanwhile, the special effect video can be generated in real time. Therefore, the problem that a motion track special-effect video cannot be generated in real time in the prior art is solved, the interestingness of video shooting is enriched, and the shooting and watching experience of a user is improved.
The image processing method provided in the embodiments of the present application may be applied to an electronic device with shooting capability and image processing capability. The electronic device may be a mobile phone, a tablet computer, a desktop computer, a laptop computer, a handheld computer, a notebook computer, a vehicle-mounted device, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), an augmented reality (AR)/virtual reality (VR) device, or the like.
Fig. 1A shows a schematic structural diagram of an electronic device 100. The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a key 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identity Module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It is to be understood that the illustrated structure of the embodiment of the present application does not specifically limit the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or combine certain components, or split certain components, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
Wherein the controller may be a neural center and a command center of the electronic device 100. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system.
In some embodiments, processor 110 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and/or a Universal Serial Bus (USB) interface, etc.
The MIPI interface may be used to connect the processor 110 with peripheral devices such as the display screen 194, the camera 193, and the like. The MIPI interface includes a Camera Serial Interface (CSI), a Display Serial Interface (DSI), and the like. In some embodiments, processor 110 and camera 193 communicate through a CSI interface to implement the capture functionality of electronic device 100. The processor 110 and the display screen 194 communicate through the DSI interface to implement the display function of the electronic device 100.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal and may also be configured as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 110 with the camera 193, the display 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like. The GPIO interface may also be configured as an I2C interface, I2S interface, UART interface, MIPI interface, and the like.
It should be understood that the interface connection relationship between the modules illustrated in the embodiments of the present application is only an illustration, and does not limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The mobile communication module 150 may provide a solution including 2G/3G/4G/5G wireless communication applied to the electronic device 100. The electronic device 100 implements display functions via the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 194 is used to display images, video, and the like. The display screen 194 includes a display panel. The display panel may adopt a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, N being a positive integer greater than 1.
The electronic device 100 may implement a photographing function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, and the application processor, etc.
The ISP is used to process the data fed back by the camera 193. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to be converted into a digital image signal. And the ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV and other formats. In some embodiments, electronic device 100 may include 1 or N cameras 193, N being a positive integer greater than 1.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. The processor 110 executes various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, and the like) required by at least one function, and the like. The storage data area may store data (such as audio data, phone book, etc.) created during use of the electronic device 100, and the like. In addition, the internal memory 121 may include a high speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a Universal Flash Storage (UFS), and the like.
In the embodiment of the present application, the internal memory 121 may store computer program codes for implementing the steps of the embodiment of the method of the present application. The processor 110 may execute the computer program code stored in the memory 121 for the steps of the method embodiments of the present application. The display screen 194 may be used to display a subject of the camera, a real-time video frame referred to in the embodiment of the present application, and the like.
The software system of the electronic device 100 may employ a layered architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture, or a cloud architecture. The embodiment of the present application takes an Android system with a layered architecture as an example, and exemplarily illustrates a software structure of the electronic device 100.
Fig. 1B is a block diagram of a software structure of the electronic device 100 according to the embodiment of the present application.
The layered architecture divides the software into several layers, each layer having a clear role and division of labor. The layers communicate with each other through a software interface. In some embodiments, the Android system is divided into four layers, an application layer, an application framework layer, an Android runtime (Android runtime) and system library, and a kernel layer from top to bottom.
The application layer may include a series of application packages. As shown in fig. 1B, the application package may include applications such as camera, gallery, calendar, phone call, map, navigation, WLAN, bluetooth, music, video, short message, etc. The embodiment of the application is mainly realized by improving the camera application program of the application program layer, for example, adding plug-in to the camera to expand the function of the camera.
The application framework layer provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer includes a number of predefined functions.
As shown in FIG. 1B, the application framework layers may include a window manager, content provider, view system, phone manager, resource manager, notification manager, and the like. In the embodiment of the present application, the application framework layer may improve the program of the camera of the application layer, so that when a shooting object shoots, a special effect image or a special effect video of the motion trajectory of the target object may be displayed in the display screen 194, and the special effect image or the special effect video is synthesized by the background of the electronic device through real-time calculation and processing.
Wherein, the window manager is used for managing the window program. The window manager can obtain the size of the display screen, judge whether a status bar exists, lock the screen, intercept the screen and the like.
The content provider is used to store and retrieve data and make it accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
The view system includes visual controls such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
The phone manager is used to provide communication functions of the electronic device 100. Such as management of call status (including on, off, etc.).
The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and the like.
The notification manager enables the application to display notification information in the status bar, can be used to convey notification-type messages, can disappear automatically after a short dwell, and does not require user interaction. Such as a notification manager used to inform download completion, message alerts, etc. The notification manager may also be a notification that appears in the form of a chart or scroll bar text at the top status bar of the system, such as a notification of a background running application, or a notification that appears on the screen in the form of a dialog window. For example, prompting text information in the status bar, sounding a prompt tone, vibrating the electronic device, flashing an indicator light, etc.
The Android Runtime comprises a core library and a virtual machine. The Android runtime is responsible for scheduling and managing an Android system.
The core library comprises two parts: one part comprises the functions that the Java language needs to call, and the other part is the core library of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object life-cycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include a plurality of functional modules, for example: a surface manager, media libraries, a three-dimensional graphics processing library (e.g., OpenGL ES), a 2D graphics engine (e.g., SGL), and the like.
The surface manager is used to manage the display subsystem and provide fusion of 2D and 3D layers for multiple applications.
The media library supports a variety of commonly used audio, video format playback and recording, and still image files, among others. The media library may support a variety of audio-video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software, and may also be referred to as a driver layer. The inner core layer at least comprises a display driver, a camera driver, an audio driver and a sensor driver.
The following describes exemplary workflow of the software and hardware of the electronic device 100 in connection with capturing a photo scene.
When the touch sensor 180K receives a touch operation, a corresponding hardware interrupt is issued to the kernel layer. The kernel layer processes the touch operation into an original input event (including touch coordinates, a time stamp of the touch operation, and other information). The raw input events are stored at the kernel layer. And the application program framework layer acquires the original input event from the kernel layer and identifies the control corresponding to the input event. Taking the touch operation as a touch click operation, and taking a control corresponding to the click operation as a control of a camera application icon as an example, the camera application calls an interface of an application framework layer, starts the camera application, further starts a camera drive by calling a kernel layer, and captures a still image or a video through the camera 193.
In the embodiment of the present application, in the process of capturing a video by using an electronic device, when a still image or a video is captured by using the camera 193, the captured image or video may be temporarily stored in the content provider, and when a photographing operation or a video capturing operation is performed, a shot picture or video may be displayed by the view system.
Based on the above hardware and software, embodiments of the present application will be described in detail below with reference to the accompanying drawings. As shown in fig. 1C, the method may include:
s01: the electronic device acquires a current frame and a historical action frame, both of which include a target subject.
First, it should be noted that in the shooting scenario to which the embodiments of the present application apply, a user opens the camera application of the electronic device to shoot a video of a target subject. The target subject is a shooting object of the electronic device that moves relative to the shooting scene, and may be, for example, a person, an animal, or a moving device. The motion may specifically refer to movement of the target subject's position, rotation, jumping, limb stretching, a specified action, or the like. The camera of the electronic device follows the moving target subject in real time, so the technical method provided in this application can process images from the real-time video stream during shooting, generate a motion-trajectory special-effect video in real time, and preview it in real time.
The electronic device may obtain a current frame and N historical motion frames according to the obtained real-time video stream, where N may be a positive integer greater than or equal to 1. The real-time video stream refers to a stream of image frames captured by a camera of the electronic device in real time, and may also be referred to as a video frame stream, and may include a plurality of historical action frames. Depending on the nature of the real-time acquisition of the real-time video stream, the frame currently displayed or currently processed by the electronic device may be referred to as the current frame.
The real-time video stream comprises a plurality of images. When the target subject is judged to have made a key action, such as dancing, jumping, turning around, or stretching a limb, the current frame is recorded as a key action frame, which may be referred to as an action frame for short. Key action frames determined before the current frame may all be referred to as historical action frames.
The target subject refers to a photographic subject which has a motion state and is determined as a motion target subject, among one or more photographic subjects photographed by a camera of the electronic device. The target subject can be determined by automatic detection of the electronic device or manually by the user.
Therefore, in one embodiment, before the electronic device acquires the current frame and the at least one historical action frame, the method further comprises: receiving a first selection instruction of a user, wherein the first selection instruction may include an automatic shooting instruction or a manual shooting instruction, which is used for instructing the electronic device to enter an automatic shooting mode or a manual shooting mode, respectively.
If the first selection instruction is used for indicating the electronic equipment to enter the automatic shooting mode, the electronic equipment can automatically detect a target shooting object and automatically detect a key action frame to generate a special effect video of a motion track. If the first selection instruction is used for instructing the electronic device to enter the manual shooting mode, the electronic device determines the target shooting object by further receiving a second selection instruction of the user, namely, the user manually operates the electronic device, and determines an instruction of a specified shooting action frame of the target shooting object, namely, the electronic device can receive at least one second selection instruction input by the user. Next, a scene of application will be described in detail with reference to the drawings.
In one embodiment, the first selection instruction of the user may include an automatic shooting instruction, and the user may determine to automatically shoot the special effect video by operating the electronic device, that is, to start the automatic shooting mode.
For example, taking the electronic device as a mobile phone as an example, a user may open a camera application of the mobile phone through a touch or click operation, as shown in fig. 2, click a "special effect video shooting" icon to switch to a shooting interface of the special effect video. The electronic device may pre-configure the default state of special effect video shooting as automatic shooting, or may manually select "automatic shooting" or "manual shooting" by the user, that is, shooting of the special effect video may be started and the target shot image may be viewed in real time on the preview interface.
Furthermore, after the special effect video shooting icon is clicked, a typical motion track special effect video segment can be displayed through a thumbnail on the preview interface of the electronic equipment to be played, and a user can click to view the special effect video segment, so that the user can be familiar with a shooting operation method, a shooting effect and the like of the special effect video in advance.
In the automatic shooting mode, the electronic device can automatically detect a target subject according to a real-time shot image and according to technologies such as a moving object detection technology or a frame difference method, and determine at least one key action frame. The specific methods of determining the target subject, determining at least one historical motion frame, and determining an image of the target subject in the historical motion frame will be described in detail below and will not be described in detail herein.
In another embodiment, the first selection instruction of the user may include a manual shooting instruction, and the user may determine to manually shoot the special effect video by operating the electronic device, that is, to start a manual shooting mode, and determine, according to at least one second selection instruction input by the user, at least one target subject and at least one key action frame corresponding to the at least one second selection instruction. Specifically, the electronic device may determine the corresponding target subject according to the corresponding position of the second selection instruction in the video frame, and determine that the video frame is the key action frame.
For example, taking the electronic device as a mobile phone as an example, a user may open a camera application of the mobile phone through a touch or click operation, as shown in fig. 3, click a "special effect video shooting" icon, switch to a shooting interface of a special effect video, and click and select a "manual shooting" option, that is, shooting of the special effect video may be started and a target shooting image may be viewed in real time on a preview interface.
Further, in order to prompt the user to operate the electronic device to determine the target subject and the key action frame, after receiving an operation of "manually shooting" by the user, the electronic device may display a prompt message "please click on the selection subject portrait" on the interface to instruct the user to input the second selection instruction. When the user clicks or touches the display area of the electronic device to select a target body, the electronic device may continuously display a prompt message on the interface, such as "please click a favorite action", prompting the user to continue inputting at least one second selection instruction through a touch operation or a click operation, and further determining a plurality of key action frames.
In the manual shooting mode, a user can determine a target subject according to prompt information or actively clicking a certain portrait or object in a preview picture in the process of previewing a video frame stream. The user may also click on the preview screen to determine a number of key action frames during the subsequent continuous stream of video frames.
In addition, after the user manually determines the target subject, in the subsequent shooting process, when more than one subject appears in the shooting interface, the user can also freely switch to another target subject. At this time, the electronic device may display a prompt message such as "tap to switch subject" on the interface. Illustratively, as shown in fig. 4, the user initially determines person A as the target subject, and then clicks person B in the shooting preview interface to select person B as the target subject, so that a special-effect video of target subject B is subsequently generated.
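One plausible way (an assumption for illustration, not the patented logic) to resolve such a tap, whether it is the second selection instruction or a subject switch, is to test which candidate subject mask contains the tapped pixel:

```python
import numpy as np

def pick_subject_by_tap(candidate_masks, tap_xy):
    """Resolve a tap to a target subject by choosing the candidate subject
    mask that contains the tapped pixel. candidate_masks is a dict
    {subject_id: HxW bool mask}; returns the selected subject id, or None
    if the tap hits the background."""
    x, y = tap_xy
    for subject_id, mask in candidate_masks.items():
        if mask[y, x]:          # note: row = y, column = x
            return subject_id
    return None
```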
The image of the target subject in a historical action frame (key action frame) refers to the partial image region in which the target subject is displayed, and specifically refers to the region corresponding to the target subject that is obtained by performing image segmentation or matting processing on the historical action frame. For example, as shown in fig. 2, besides the background image and still objects in the captured picture, an image of a moving target subject in the current frame is detected and determined to be a portrait. Specifically, the image of the target subject in the key action frame may be separated out through an image segmentation technique.
It should be noted that the scenes of the current frame and the multiple historical action frames acquired by the electronic device overlap, and the positions of the target subject in the scenes of the multiple historical action frames are different. That is, any one of the historical action frames contains a portion that overlaps with the shooting scene in the current frame, where the shooting scene may refer to shooting objects, such as a tree, lawn, or building, that are present around the target subject in the video frame.
Overlapping means that the same part of the scene in the current frame also exists in any one of the historical action frames. Exemplarily, as shown in fig. 4, the same tree in the historical action frame is also displayed at the same or a different position in the shooting scene of the current frame, and the building in the historical action frame is likewise displayed at the same or a different position in the current frame; the target subject A is at the front left of the tree in the historical action frame, and has moved to the front right of the building in the current frame. Therefore, the embodiment of the application is implemented on the premise that every determined historical action frame has a portion overlapping with the scene in the current frame; if the scene of a historical action frame has no overlapping scene or object with the current frame, the electronic device cannot obtain an image mapping relationship between that historical action frame and the current frame, and multi-frame fusion display cannot be performed.
In summary, after the electronic device receives a shooting start instruction from a user, the electronic device acquires a real-time video stream through a lens, where each frame of video frame included in the real-time video stream may be regarded as a current frame at a corresponding time. Whether the electronic device automatically acquires the key action frame or determines the key action frame according to a method of user instruction acquisition in a manual mode, the key action frame may be referred to as a historical action frame with respect to a current frame corresponding to a time after the determination of the key action frame. Taking a time axis t of real-time shooting as an example shown in fig. 5, the electronic device starts video shooting at a time t0, the electronic device determines a real-time video frame corresponding to the time t1 as a key action frame (a first action frame 01), and then the electronic device determines a real-time video frame corresponding to the time t2 as a key action frame (a second action frame 02), so that for a current frame corresponding to the current time t3, the acquired N historical action frames are the first action frame 01 and the second action frame 02.
S02: and the electronic equipment performs image segmentation on the historical action frame to obtain an image of the target main body corresponding to the historical action frame.
In the shooting process, when the electronic device acquires each historical motion frame, in order to obtain an image of a target subject in each historical motion frame according to the historical motion frame, the electronic device may perform image segmentation on the historical motion frames one by one, and determine a target subject image in the historical motion frame, specifically, a mask image. Thus, the electronic device can record the N historical motion frames included in the real-time video stream and the images of the N target subjects corresponding to the N historical motion frames one by one.
Image segmentation is a technique and process of dividing an original image into a plurality of specific regions with distinctive properties and extracting the target object of interest. Image segmentation is a key step from image processing to image recognition and analysis. Specifically, image segmentation performed on the portrait in an original image may also be referred to as a portrait segmentation technique, by which the portrait portion in the original image can be extracted.
The mask image is used to mark a specific target area in the image by using different mask (mask) values, for example, to mark an image area of a target subject by using a mask value different from that of the background image, so as to separate the image area of the target subject from other background image areas. For example, in a common mask image, the mask value of the pixel point in the target subject image region may be set to 255, and the mask values of the pixel points in the other regions may be set to 0. So that the image of the target subject in the historical action frame can be separated according to the mask image.
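As an illustration only (not part of the patented method), the following minimal sketch assumes a NumPy representation of a frame and its mask and shows how such a 255/0 mask image can be used to separate the target subject's pixels from the background:

```python
import numpy as np

def split_subject(frame: np.ndarray, mask: np.ndarray):
    """Separate an H x W x 3 frame into subject and background images using an
    H x W mask whose subject pixels are 255 and all other pixels are 0."""
    subject = np.where(mask[..., None] == 255, frame, 0)    # keep only subject pixels
    background = np.where(mask[..., None] == 0, frame, 0)   # keep only non-subject pixels
    return subject, background
```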
For example, the target image region of each historical motion frame may be processed by a deep learning algorithm to obtain a mask image of a target subject corresponding to each historical motion frame, for example, by a neural network algorithm or a support vector machine algorithm, and the application does not specifically limit the algorithm for implementing image segmentation.
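For illustration, the sketch below shows one possible way to obtain such a portrait mask with an off-the-shelf deep learning segmentation network; the patent does not name a specific model, so the torchvision DeepLabV3 model and the VOC "person" class index used here are assumptions serving only as a stand-in:

```python
import torch
import torchvision
from torchvision import transforms

# Assumed stand-in model; any portrait or instance segmentation network could be used.
model = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def person_mask(frame_rgb):
    """Return a 0/255 mask of all 'person' pixels in an RGB image (H x W x 3)."""
    x = preprocess(frame_rgb).unsqueeze(0)
    with torch.no_grad():
        out = model(x)["out"][0]               # [num_classes, H, W] class scores
    labels = out.argmax(0)                      # per-pixel class index
    return ((labels == 15).to(torch.uint8) * 255).numpy()   # 15 = 'person' in VOC labels
```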
S03: the electronic equipment determines a reference position in the current frame according to the position of the target main body in the scene of the historical action frame and the scene of the current frame.
The electronic device can map the reference positions of the N target bodies in the current frame respectively according to the positions of the N target bodies in the scenes of the N historical action frames and in combination with the scene of the current frame.
Specifically, the electronic device may obtain an image mapping relationship between each historical action frame and the current frame according to the position of the background image in each historical action frame and the position of the background image in the current frame, so that the relative position of the image of the target subject in the target frame may be obtained according to the image position of the target subject in the historical action frame and the mapping relationship, and the image of the target subject may be fused into the current frame at the determined relative position. The relative position indicates where, in the target frame, the image of the target subject from the historical action frame is to be placed.
S04: and the electronic equipment respectively fuses the images of the target main body on the reference positions of the current frame to obtain the target frame.
After the electronic device determines at least one historical motion frame, the images of the plurality of target subjects obtained in S02 may be drawn into the current frame by an image fusion technique, and fused to generate the target frame.
Illustratively, as shown in fig. 5, a first action frame 01 and a second action frame 02 are determined in the real-time video frame stream, and each frame displayed in real time after the first action frame 01 is displayed fused with the image of the first target subject in the first action frame 01. Taking the second action frame 02 as an example, after fusion it is displayed as shown in fig. 5, that is, it includes the image (1) of the first target subject in the first action frame 01 and the entire image of the second action frame 02. A current frame determined after the N-th action frame 0N is displayed after fusion as shown in fig. 5, that is, it includes the image (1) of the first target subject in the first action frame 01, the image (2) of the second target subject in the second action frame 02, ..., and the entire image of the N-th action frame 0N, that is, the image (N) of the N-th target subject in the N-th action frame 0N. When N is 5, the image (1) of the first target subject corresponding to the first action frame 01, the image (2), ..., and the image (5) of the 5th target subject corresponding to the 5th action frame 05 are displayed in a fused manner at the corresponding reference positions in the current frame. The specific multi-frame image fusion process, i.e., the algorithm, will be described in detail below and is not repeated here.
Further, after the shooting of the special effect video is finished, the electronic device may store the generated special effect video in the gallery. In order to distinguish it from a common video, a specific mark can be displayed at one corner of the thumbnail of the special effect video; for example, the words "motion track" can be superimposed on the play button of the special effect video, so that the motion track special effect video file is distinguished from common video files and can be conveniently found by the user.
According to the embodiment of the application, at least one key action frame is automatically detected or manually determined in a real-time video frame stream, and an image of at least one target subject in the at least one key action frame is displayed in a current frame simultaneously through a multi-frame fusion display method, so that a special effect image or video of a motion track of the target subject can be generated in real time. Meanwhile, the currently generated target image can be transmitted to a shooting preview picture of the mobile phone and a video generation stream in real time, so that a user can preview the effect of the motion track in real time on line, the complete motion track special-effect video can be checked after shooting is finished, and the shooting experience of the user is enriched.
In one embodiment, in the step S01, if the first selection instruction of the user includes an automatic shooting instruction, that is, the electronic device is instructed to enter an automatic shooting mode, the electronic device can automatically detect a moving target subject according to an algorithm and automatically detect at least one historical motion frame (key motion frame).
First, the electronic device may determine a target subject for a video frame in a real-time video stream according to a motion detection technique. The motion detection of the target subject can be determined through portrait recognition or other target recognition technologies, and can automatically detect moving objects in real-time video frames, such as people, animals, moving devices, vehicles or football. Since the main application scene of the present application is special effect shooting of a motion trajectory of a person, the embodiment is described by taking portrait recognition and detection as an example.
Specifically, the electronic device determines a target subject in the real-time video frame, and may obtain a mask image of the target subject by performing image segmentation on the image, such as portrait segmentation or instance segmentation. If only one portrait mask is obtained, that portrait mask is determined as the target subject; if a plurality of portrait masks are obtained by segmentation, the electronic device can determine the one with the largest mask area as the target subject; if no portrait mask is obtained, the electronic device can prompt the user, by displaying prompt information on the preview interface, that no portrait has been detected, and ask the user to move the camera closer to the person being shot.
Then, the electronic device may detect a position of the target subject in each video frame included in the real-time video stream, and obtain a scene position change of the target subject between the multiple frames. The scene position change of the target subject may be a position change of the target subject with respect to the shooting scene, or a change in the body posture, body angle, or body position of the target subject.
After the electronic device determines the target subject, it determines one by one, during the continuous shooting, which frames are key action frames. The electronic device can determine the key action frames in the real-time video by a frame difference method, where the frame difference method obtains information such as scene position changes between adjacent video frames by comparing the positions of pixel points in the adjacent video frames. That is, the electronic device may determine, as a key action frame, a video frame included in the real-time video stream in which the change in the position of the target subject in the scene satisfies a preset threshold.
The electronic device may determine the first frame image in which the target subject is successfully segmented as the first key action frame, because the first key action frame has no preceding reference frame. Alternatively, to allow for the time delay of the image processing algorithm, the electronic device may determine the third or fourth frame after the first frame in which the target subject is successfully segmented as the first key action frame.
The second and subsequent key action frames may be determined by comparison with the previous key action frame. Specifically, the electronic device may determine that an image of a target subject in the real-time video frame simultaneously satisfies the following two conditions as a key action frame:
the first condition is as follows: the image position area of the target subject in the current frame does not coincide with the position area mapped to the current frame by the image of the target subject in the previous key action frame.
And a second condition: the image change of the target subject in the current frame and the image change of the target subject in the previous key action frame meet a preset threshold value.
That is, through motion detection, the electronic device may automatically determine, as a historical action frame, a video frame in the real-time video stream in which the image of the target subject in the current frame does not coincide with the image of the target subject in the previous key action frame, and in which the image change of the target subject in the current frame satisfies a preset threshold.
When the detection determines that the image change of the target subject in the current video frame satisfies the preset threshold, the current video frame is determined to be a key action frame (historical action frame). For example, when the detection determines that the image change of the target subject in the current video frame is greater than or equal to the preset threshold, the current video frame is determined to be a key action frame; when the detection determines that the image change of the target subject in the current video frame is smaller than the preset threshold, the current video frame is determined not to be a key action frame.
For example, whether the change between the target subject image in the current frame and the target subject image in the previous key action frame meets a preset threshold may be determined through a centroid overlapping algorithm. The specific algorithm is as follows:
the electronic device calculates the barycentric coordinates of the target subject mask image of the previous key action frame and the barycentric coordinates of the target subject mask image of the current frame, makes the barycenters of the two images coincide, and then calculates the non-overlapping area of the two target subject mask images. When the proportion of the non-overlapping area exceeds a preset threshold, the current frame is determined to be a key action frame; otherwise, it is not. The preset threshold may be configured as a certain proportion, for example, 30% of the area of the union of the two target subject mask images.
It should be noted that the setting of the preset threshold may be preset by a person skilled in the art according to the image detection accuracy and in combination with the requirement and technical experience of the special effect video, and this is not specifically limited in this application.
The formula for calculating the barycentric coordinates is as follows (the barycentric coordinates can be rounded):
barycentric coordinate x0 = (x1 + x2 + ... + xn) / n, barycentric coordinate y0 = (y1 + y2 + ... + yn) / n, where (x1, y1), ..., (xn, yn) are the coordinates of the n pixel points in the target subject mask region.
the specific calculation of the barycenter coincidence may be as follows: the coordinate offset (Δx, Δy) is the offset that, when added to the barycentric coordinates of the target subject of the current frame, makes them equal to the barycentric coordinates of the target subject of the previous key action frame. The coordinates of all pixel points in the target subject region of the current frame are then shifted by (Δx, Δy) to obtain a new coordinate set for the target subject region of the current frame, and the number of pixel points of the target subject region coordinate set of the previous key action frame that do not coincide with the new coordinate set of the current frame is counted. See the following formulas for the specific calculation.
The new set of coordinates for the current frame target subject region is:
new coordinates (x', y') = original coordinates (x, y) + (Δx, Δy),
where (Δx, Δy) = barycentric coordinates (x0, y0) of the previous key action frame − barycentric coordinates (x0, y0) of the current frame.
After the barycenters coincide, the proportion of the non-overlapping area between the target subject mask image of the current frame and the target subject mask image of the previous key action frame is calculated, that is, the area of the non-overlapping region of the two mask images is computed relative to the area of the union of the two target subject mask images. The non-overlapping area proportion is calculated as follows:
non-overlapping area proportion = [area(A ∪ B) − area(A ∩ B)] / area(A ∪ B),
where A denotes the target subject region in the previous key action frame, B denotes the (barycenter-aligned) target subject region in the current frame, A ∩ B represents the intersection of the two regions, and A ∪ B represents their union.
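For illustration, the following minimal sketch (an assumed NumPy implementation, not the patented code) puts the centroid-coincidence check together: it aligns the two subject mask regions by their barycenters and tests whether the non-overlapping proportion of their union exceeds the threshold:

```python
import numpy as np

def exceeds_non_overlap_threshold(prev_mask: np.ndarray, cur_mask: np.ndarray,
                                  threshold: float = 0.3) -> bool:
    """Condition two: after barycenter alignment, is the non-overlapping area
    of the two 0/255 subject masks more than `threshold` of their union?"""
    prev_pts = np.argwhere(prev_mask == 255)                 # (y, x) pixel coordinates
    cur_pts = np.argwhere(cur_mask == 255)
    if len(prev_pts) == 0 or len(cur_pts) == 0:
        return False
    # Barycentric coordinates are the (rounded) mean of the mask pixel coordinates.
    offset = np.round(prev_pts.mean(axis=0) - cur_pts.mean(axis=0)).astype(int)
    shifted = {tuple(p) for p in (cur_pts + offset)}         # current region, barycenter-aligned
    prev_set = {tuple(p) for p in prev_pts}
    union = prev_set | shifted
    non_overlap = len(union) - len(prev_set & shifted)
    return non_overlap / len(union) > threshold              # e.g. 30% of the union area
```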
As shown in fig. 6, if the target subject region in the current key action frame (i.e., the most recently determined key action frame) overlaps with the target subject region in current frame 1, condition one described above is not satisfied, and current frame 1 is not a key action frame. If, after the barycenter of the target subject in that key action frame is made to coincide with the barycenter of the target subject in current frame 2, the proportion of the non-overlapping area does not satisfy the preset threshold, condition two is not satisfied, and current frame 2 is not a key action frame. If the target subject region in that key action frame does not overlap with the target subject region in current frame 3, and, after the two barycenters are made to coincide, the proportion of the non-overlapping area exceeds the preset threshold, then current frame 3 satisfies both condition one and condition two and is determined to be a key action frame.
In the embodiment, through the algorithm, the electronic device can automatically detect the target moving object in the video in real time and automatically detect and determine the key action frame, so that a special effect video of a motion track can be generated in real time according to the target main body in the recorded key action frame, the interestingness and flexibility of video shooting are increased, and the shooting experience of a user is improved.
In one embodiment, before performing image segmentation on the historical motion frame, a moving target subject may be identified by a motion detection technique, and then an image region of the corresponding target subject in the historical motion frame is reduced, that is, only a partial image region of the moving subject of interest in the historical motion frame is captured to perform the image segmentation algorithm. This reduces the image area to be subjected to the image segmentation processing, thereby improving the accuracy of image segmentation and simplifying the data processing complexity of the image segmentation algorithm.
The motion detection technique may be implemented by a frame difference method, a background difference method, an optical flow method, or the like. For example, in the three-frame difference method, a difference is computed between each pair of adjacent frames among three consecutive frames, and the two resulting difference images are then combined to approximately detect the moving object in the image.
Illustratively, as shown in fig. 7, an image region of interest, such as the portrait region in fig. 7, may be first reduced by motion detection. And then, the portrait is divided according to the roughly obtained portrait area to obtain a mask image of the target subject.
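A hedged sketch of such a three-frame difference step is given below (the threshold value and the OpenCV usage are assumptions); it only returns a rough bounding box of the moving region, such as the portrait region in fig. 7, which is then cropped and passed to the segmentation algorithm:

```python
import cv2
import numpy as np

def coarse_motion_roi(f_prev, f_cur, f_next, thresh: int = 25):
    """Roughly localize the moving subject using three consecutive frames."""
    g = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in (f_prev, f_cur, f_next)]
    d1 = cv2.absdiff(g[1], g[0])                       # difference of first adjacent pair
    d2 = cv2.absdiff(g[2], g[1])                       # difference of second adjacent pair
    motion = cv2.bitwise_and(d1, d2)                   # combine the two difference images
    _, motion = cv2.threshold(motion, thresh, 255, cv2.THRESH_BINARY)
    ys, xs = np.nonzero(motion)
    if len(xs) == 0:
        return None                                     # no motion detected
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())   # x0, y0, x1, y1
```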
Through the implementation mode, the mask image of the target main body in the historical action frame can be obtained through separation according to the historical action frame, the mask image of the target main body can be accurately separated, and the motion tracking and recording of the target main body are realized, so that multi-frame image fusion is performed on the current frame according to the mask image of at least one target main body, a special effect video of a motion track is generated, and the shooting experience of a user is improved.
In the above embodiment, in the process of image segmentation of the key motion frame, the mask image of the segmented target subject may be incomplete or missing, as shown in fig. 7. In order to obtain a complete mask image of the target subject, the mask image of the target subject may be supplemented in combination with motion detection.
The specific processing procedure for completing the target subject mask image may be: after a moving target subject in the key action frame is detected, separating an image area of the target subject in the key action frame image by selecting a proper threshold value; and repairing the mask image of the segmented target main body by using the image area of the target main body, thereby obtaining the complete mask image of the target main body. Illustratively, as shown in fig. 8, a mask image a of a target portrait is obtained according to portrait segmentation, and the mask image a is complemented according to the target portrait in an adjacent frame to obtain a mask image B.
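A minimal sketch of such a repair step is shown below; the exact repair rule (merging only motion pixels near the segmented subject) is an assumption, not the patented algorithm:

```python
import cv2

def complete_mask(seg_mask, motion_region):
    """Complete an incomplete 0/255 segmentation mask (mask A in fig. 8) with the
    motion-detected subject region, yielding a fuller mask (mask B)."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (15, 15))
    near_subject = cv2.dilate(seg_mask, kernel)           # neighborhood of the segmented subject
    patch = cv2.bitwise_and(motion_region, near_subject)  # motion pixels close to the subject
    return cv2.bitwise_or(seg_mask, patch)                # fill holes / missing parts
```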
In one embodiment, the scene captured in the real-time video frames may include more than one moving subject, and several photographed objects may overlap with the image of the target subject. For example, the target subject is portrait 1, and in a key action frame, portrait 1 and portrait 2 partially overlap or occlude each other. Therefore, the electronic device needs to separate the mask image of the target subject from a mask image in which multiple subjects overlap, so as to continuously and automatically track and record the same target subject. Specifically, overlapped photographed objects can be separated in the following ways.
In the first method, a mask image in which a plurality of subjects overlap is divided from a depth map.
The electronic device can obtain the mask image of the target subject from the mask image in which multiple subjects overlap in the historical action frame, by combining the depth map corresponding to the two-dimensional image, i.e., the depth information corresponding to the multiple subjects. That is, the electronic device may separate the mask image of the target subject from the overlapped mask image according to the depth information of the multiple subjects and the depth information of the target subject in the historical action frame.
The depth map is an image or an image channel containing information on the distance between the shooting point and the surface of the target shooting object. The depth map is similar to a grayscale image except that each pixel value of the depth map reflects the actual distance of the shot point from the target photographic object. Usually, the RGB image and the depth map are registered, so there is a one-to-one correspondence between the pixels of the RGB image and the pixels of the depth map.
The depth map may be obtained by a ranging camera based on Time of Flight (ToF), or it may be computed from the original two-dimensional image by an artificial neural network algorithm that estimates a depth value for each pixel point, thereby recovering the depth map of the original two-dimensional image.
By processing the depth map, multiple different photographed objects can be distinguished. For example, as shown in fig. 9A, when the electronic device needs to separate the portrait of the target subject from multiple overlapped portraits, it may place the pixel points of the obtained depth map in one-to-one correspondence with the pixel points of the current key action frame, and compute the average or median of the depth values of the pixel points within the mask region of the target subject portrait in the depth map. The electronic device processes the depth map according to this average or median depth value, extracts the depth value range covered by the subject portrait in the depth map, and then intersects the corresponding region with the portrait mask, thereby separating the portrait mask of the target subject from the multiple overlapped portrait masks. This ensures that the separated portrait mask of the target subject is always a single portrait.
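For illustration, a hedged sketch of this first method follows; the tolerance around the target subject's median depth and the use of a rough seed mask for the target subject are assumptions:

```python
import numpy as np

def separate_by_depth(overlap_mask, depth_map, target_seed_mask, tolerance=0.3):
    """Keep only the pixels of the overlapped 0/255 portrait mask whose depth is
    close to the target subject's median depth (depth_map is registered 1:1
    with the frame; target_seed_mask roughly marks the target subject)."""
    target_depths = depth_map[target_seed_mask == 255]
    if target_depths.size == 0:
        return overlap_mask
    d_med = np.median(target_depths)                     # median depth of the target subject
    in_range = np.abs(depth_map - d_med) <= tolerance * d_med
    return np.where((overlap_mask == 255) & in_range, 255, 0).astype(np.uint8)
```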
And secondly, example segmentation of the overlapped target shooting objects.
An instance refers to an individual object, that is, a specific individual within a class of photographed objects.
Instance segmentation means that, on the basis of classifying each pixel in an image into its corresponding category (i.e., pixel-level classification), different instances within the same category are further distinguished. For example, each pixel in an image is classified as person or background, and then different persons, e.g. a, b, and c, are distinguished from one another; distinguishing the individual persons is instance segmentation.
Specifically, the electronic device may perform instance segmentation through a deep learning algorithm. Referring to fig. 9B, in the example segmentation mask, mask values of different faces are different, and the face mask region of the target subject can be directly separated.
It should be noted that, besides the above-mentioned technology for separating the multi-person overlapping masks, the existing methods of binocular visual depth, monocular depth estimation, structured light depth, etc. may also be used for separating the multi-person overlapping masks, and this application is not described herein again.
Through the embodiment, the electronic equipment can separate the target main body mask from a plurality of overlapped target shooting objects, so that the target main bodies of different frames are accurately tracked and recorded, and the special-effect video of the motion track of the specific target main body is generated.
In an embodiment, in step S03, the electronic device determines, according to the position of the target subject in the scene of each historical motion frame and the scene of the current frame, the reference position in the current frame, which may specifically include:
the electronic device can obtain the correspondence between the position of at least one object in each historical action frame and the position of that object in the current frame according to an image registration technique or a simultaneous localization and mapping technique, and then obtain the image position region, i.e., the reference position, corresponding to each target subject in the current frame according to this correspondence and the image position of each target subject in its historical action frame. The electronic device can then draw the image of the target subject corresponding to each historical action frame at the corresponding reference position in the current frame, thereby obtaining the target frame.
For example, the historical action frames including the first action frame 01 and the second action frame 02 will be described below with reference to fig. 5.
As shown in fig. 5, if the recorded historical action frames include the first action frame 01, and the target subject corresponding to the first action frame 01 is the first target subject. Then, each subsequent frame of image draws the image of the first target subject into the current frame 03 according to the mapping relationship between the position of the at least one object in the first action frame 01 and the position of the at least one object in the current frame.
As shown in fig. 5, if the recorded historical action frames further include a second action frame 02, and the target subject corresponding to the second action frame 02 is a second target subject, then for each frame after the second action frame 02, the image of the first target subject and the image of the second target subject are drawn into the current frame 03 according to the mapping relationship between the position of at least one object in the first action frame 01 and its position in the current frame 03, and the mapping relationship between the position of at least one object in the second action frame 02 and its position in the current frame 03.
The drawing refers to a process of generating a two-dimensional image by a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU) of the electronic device according to a drawing instruction, pixel point information, and the like. After the electronic equipment finishes drawing the image, the target image can be displayed on a display screen of the electronic equipment through the display device.
According to the described embodiment, the electronic device performs the fusion rendering processing on the determined key action frames one by one, and displays the determined key action frames in real time, that is, the generated motion track special effect video can be previewed online, and a final motion track special effect video is generated.
In the above embodiment, all the historical motion frames recorded during the real-time video frame stream need to be mapped to the corresponding positions of the current frame, and a specific Mapping method that can be adopted is an image registration technique or a Simultaneous Localization And Mapping (SLAM) technique. Therefore, the electronic device may draw the image of the target subject in each historical action frame into the current frame according to the image mapping relationship between at least one historical action frame and the current frame, and specifically, may generate the target image through the following processing.
Step1: and obtaining the corresponding relation between the image position of at least one object in each historical action frame and the image position of at least one object in the current frame according to an image registration technology or an SLAM technology.
The image registration is a process of matching, mapping or superimposing a plurality of images acquired at different times and under different imaging devices or under different conditions (such as weather, brightness, camera positions or angles and the like), and can be widely applied to the fields of data analysis, computer vision, image processing and the like.
As shown in fig. 10, the electronic device may obtain a corresponding relationship, which may also be referred to as a mapping relationship, between the position of the object in the first action frame and the position of the object in the current frame according to the position of at least one object in the first action frame and the position of the same object in the current frame. The electronic device may obtain the reference position of the target subject in the current frame again according to the position of the target subject in the first action frame and by combining the corresponding relationship of the positions, and the position indicated by the dotted line in fig. 10 may be the reference position.
When the image registration technique is adopted, features need to be extracted from the historical action frame, for example Semantic Kernel Binary (SKB) features, which are descriptors of image features. The features extracted from the historical action frame are matched with those of the current frame to compute a homography matrix, and the historical key action frame is finally mapped to the corresponding position in the current frame according to the obtained homography matrix. Image registration techniques can achieve mapping and matching between two-dimensional images.
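Since SKB features have no widely available open implementation, the sketch below uses ORB features purely as an assumed stand-in to illustrate the registration step: match features between a historical action frame and the current frame, estimate a homography, and map a point (e.g. a position of the target subject) into the current frame:

```python
import cv2
import numpy as np

def map_point_to_current(hist_frame, cur_frame, point_xy):
    """Map an (x, y) point from a historical action frame into the current frame."""
    g1 = cv2.cvtColor(hist_frame, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY)
    orb = cv2.ORB_create(1000)                             # ORB is a stand-in for SKB features
    k1, d1 = orb.detectAndCompute(g1, None)
    k2, d2 = orb.detectAndCompute(g2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    if len(matches) < 4:
        return None, None                                   # not enough overlap to register
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)    # historical -> current mapping
    pt = np.float32([[point_xy]])
    return cv2.perspectiveTransform(pt, H)[0, 0], H         # mapped point and the homography
```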
The SLAM technique is a technique that allows a device to trace out three-dimensional positional information of the surrounding environment while moving. Specifically, the device starts from an unknown place of an unknown environment, positions its own position and posture through repeatedly observed map features (such as wall corners, columns and the like) in the moving process, and builds a map incrementally according to its own position, so that the purposes of synchronous positioning and map building are achieved.
When the SLAM technology is adopted, the three-dimensional position information of the object in the historical action frame needs to be obtained through calculation by a SLAM module in the electronic equipment, and the historical action frame is mapped to the corresponding position in the current frame according to the three-dimensional position information of the object.
Since the SLAM technique performs position mapping based on three-dimensional position information, the three-dimensional position information is applicable to inter-frame three-dimensional motion. Therefore, when the motion trajectory of the target subject photographed by the electronic device involves a three-dimensional motion, the mapping may be performed using the SLAM technique.
Step2: and obtaining the reference position of each target subject in the current frame according to the image position and the corresponding relation of each target subject in each historical action frame.
I.e. maps the image of each target subject in each historical action frame to the corresponding image location area in the current frame.
Step3: and drawing an image of each target subject in each historical action frame to a corresponding reference position of each target subject in the current frame.
And drawing the image of each target subject to the corresponding reference position in the current frame according to the reference position of the image of each target subject in the current frame obtained by mapping, thereby obtaining a fused image of the multi-frame images, and updating and displaying the fused image as the current frame.
Illustratively, as shown in fig. 5, a first target subject in a first action frame 01 is mapped to a corresponding reference position in a second action frame 02 and is drawn into the second action frame 02; and mapping the first target body in the first action frame 01 to a corresponding reference position in the current frame and drawing the first target body into the current frame, mapping the second target body in the second action frame 02 to a corresponding reference position in the current frame and drawing the second target body into the current frame, and updating the current frame.
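A minimal sketch of this drawing step follows, assuming the homography H between the historical action frame and the current frame has been obtained as above; the subject image and its mask are warped into the current frame and pasted at the mapped reference position:

```python
import cv2

def draw_subject(cur_frame, hist_frame, hist_mask, H):
    """Draw the historical subject (0/255 mask) into the current frame."""
    h, w = cur_frame.shape[:2]
    warped_img = cv2.warpPerspective(hist_frame, H, (w, h))    # subject pixels at reference position
    warped_mask = cv2.warpPerspective(hist_mask, H, (w, h))
    out = cur_frame.copy()
    out[warped_mask == 255] = warped_img[warped_mask == 255]   # overwrite current-frame pixels
    return out, warped_img, warped_mask
```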
In the embodiment, the image registration technology or the SLAM technology is used for mapping the multiple frames of images, so that the fusion display of the target main body image in the multiple frames of images is completed, the motion track of the target main body can be accurately and naturally displayed at the corresponding position in the same frame of image, the time-staggered and space-staggered motion track special-effect video is formed, and the shooting experience of a user is enriched.
In one embodiment, after all the historical action frames are mapped to the corresponding positions of the current frame by an image registration technique or the SLAM technique, and the mask image of the target subject in each historical action frame is mapped to its corresponding position in the current frame, in order to make the added images of the target subject transition more naturally into the background image of the current frame, the method may further include: performing edge fusion processing on the image of the target subject of each historical action frame in the target image, and updating the target image, so that the image of the target subject transitions naturally into the background image.
The fusion processing of the multi-frame images fuses an image that does not belong to the current frame (the image of a target subject in a historical action frame) into the current frame for display. Therefore, it is necessary to further perform weighted fusion processing on the images of the N target subjects and the pixel information of the current frame at the N reference positions, so that the fused-in images of the target subjects and the original content of the current frame are displayed naturally and the boundary transitions look more realistic.
Illustratively, the weighted fusion technique employed may be alpha fusion. The specific processing may be to adjust the mask values at the boundary between the target subject image (edge mask value 255) and the background image (edge mask value 0) from the original abrupt 255-to-0 transition to a gentle 255-to-0 transition; for example, the transition mask values may be adjusted by a linear or non-linear function. The adjusted, smoothly transitioning mask values are then used as weights to perform a weighted superposition of the image of the target subject and the background image. Optionally, the boundary line may also be weakened by processing the edge region with Gaussian filtering, a linear smoothing filter whose weights are chosen according to the shape of the Gaussian function.
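A hedged sketch of this alpha-fusion step is given below (Gaussian blurring of the mask is one of the options mentioned above; the kernel size is an assumption):

```python
import cv2
import numpy as np

def alpha_blend(cur_frame, warped_subject, warped_mask, blur_ksize=21):
    """Blend the warped subject into the current frame with a softened mask edge."""
    alpha = cv2.GaussianBlur(warped_mask.astype(np.float32),
                             (blur_ksize, blur_ksize), 0) / 255.0
    alpha = alpha[..., None]                       # H x W x 1 weight, smooth 1 -> 0 transition
    blended = alpha * warped_subject + (1.0 - alpha) * cur_frame
    return blended.astype(np.uint8)
```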
In addition to the alpha fusion technique, image fusion techniques such as Poisson fusion (Poisson Blending) technique and Laplacian fusion (Laplacian Blending) technique may also be used in the above embodiments, and the present application does not limit the specific image fusion techniques.
In an embodiment, after the images of the multiple key action frames are fused and displayed to obtain the target image, in order to show the motion trajectory of the target subject in the current frame more intuitively, the method may further include: superimposing at least one afterimage of the target subject in the current frame. An afterimage is generated from the images of the target subject in several consecutive frames before the current frame.
Specifically, each afterimage can be represented by a grayscale image, where the gray values of the afterimages may be the same or different.
For example, as shown in fig. 11, at least one afterimage may be superimposed behind the second target subject image in the second action frame 02, and a plurality of afterimages may be superimposed behind the direction of motion of the target subject in the current frame 03. The farther an afterimage is from the image of the target subject in the current frame 03, the weaker its intensity may be; the closer an afterimage is to the image of the target subject in the current frame 03, the stronger its intensity may be. The intensity of the afterimages may gradually decrease, down to 0, as the distance from the target subject image in the current frame 03 increases.
The number of afterimages is not limited, and can be set by a person skilled in the art according to design requirements.
When the afterimages are represented by grayscale images, the closer a grayscale image is to the image of the target subject in the current frame, the larger its gray value; the farther a grayscale image is from the image of the target subject in the current frame, the smaller its gray value.
According to this embodiment, a plurality of afterimages are superimposed behind the direction of motion of the target subject in the current frame, so that the motion direction and trajectory of the target subject can be shown more intuitively, the interest and intuitiveness of the special effect video are increased, and the shooting experience of the user is further improved.
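For illustration, a minimal sketch of superimposing such afterimages follows; the linear weight schedule (older afterimages fainter, newer ones stronger) is an assumption:

```python
import numpy as np

def add_afterimages(cur_frame, past_subjects, past_masks, max_weight=0.6):
    """Blend earlier subject images (ordered oldest to newest) into the current
    frame with increasing weight, so the trail fades behind the subject."""
    out = cur_frame.astype(np.float32)
    n = len(past_subjects)
    for i, (img, mask) in enumerate(zip(past_subjects, past_masks)):
        w = max_weight * (i + 1) / n                        # older -> smaller weight
        a = (mask.astype(np.float32) / 255.0)[..., None] * w
        out = a * img + (1.0 - a) * out
    return out.astype(np.uint8)
```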
According to any of the embodiments described above, after the image of the target subject in all the recorded historical motion frames is mapped into the image of the current frame in real time, the video frame stream is continuously updated, and the image output by the current frame is displayed to the video capture preview screen of the electronic device. As shown in fig. 12, after the user starts shooting the special effect video, the user can see the shooting effect of the special effect video in real time in the video shooting preview screen of the electronic device. In addition, the video frames generated in real time can be output to the final video generation stream, and the generated complete motion track special effect video can be watched after the user finishes shooting.
With reference to any one of the foregoing possible implementation manners, as shown in fig. 13, a detailed implementation flow for generating a motion trajectory special effect video is provided in this application embodiment. The process mainly comprises the following steps: 1. shooting preview interface interaction, and determining a target main body and a key action frame; 2. obtaining an image of a target main body by image segmentation; 3. mapping the key action frame to the current frame, and drawing an image of a target subject in the key action frame to the current frame; 4. a stream of video frames is generated for online preview and real-time.
Not all of the processing flows shown in fig. 13 are mandatory, and some of them are optional; a person skilled in the art can adjust the detailed processing procedures and their order according to design requirements. Meanwhile, the technical solution of the application is applicable not only to generating motion trajectory special effect videos, but can also be used to rapidly develop other similar special effect videos, such as multi-person image special effect synthesis or growth special effects, which the application does not specifically limit.
An embodiment of the present application further provides an image processing apparatus, as shown in fig. 14, the apparatus 1400 may include: an acquisition module 1401, an image segmentation module 1402, a mapping module 1403, and an image fusion module 1404.
An obtaining module 1401, configured to obtain a current frame and N historical action frames, where the current frame and the N historical action frames both include a target main body, scenes of the current frame and the N historical action frames overlap, the target main body is located at different positions in the N historical action frames, and N is a positive integer greater than or equal to 1.
An image segmentation module 1402, configured to perform image segmentation on the N historical motion frames to obtain N images of the target subject corresponding to the N historical motion frames, respectively.
A mapping module 1403, configured to determine N reference positions in the current frame according to the positions of the N target subjects in the scenes of the N historical action frames and the scene of the current frame, respectively.
An image fusion module 1404, configured to fuse the images of the N target subjects on the N reference positions of the current frame, respectively, to obtain a target frame.
In one possible embodiment, the device may further include: the receiving module is used for receiving a first selection instruction of a user, and the first selection instruction is used for indicating to enter an automatic shooting mode or a manual shooting mode.
In a possible design manner, if the first selection instruction is used to instruct to enter the automatic shooting mode, the obtaining module 1401 is specifically configured to: performing motion detection on the real-time video stream to determine a target subject; detecting a position of a target subject in a scene in each video frame included in the real-time video stream; and determining the video frames of which the position change of the scene in the video frames included in the real-time video stream meets the preset threshold value as historical action frames.
In a possible design manner, if the first selection instruction is used to instruct to enter the manual shooting mode, the receiving module is further configured to receive a second selection instruction of the user for a video frame included in the real-time video stream; the obtaining module 1401 is further specifically configured to: and determining that the main body of the corresponding position of the second selection instruction in the video frame is a target main body, and determining that the video frame is a historical action frame.
In one possible design, the image segmentation module 1402 is specifically configured to: reducing an image area corresponding to a target subject in the historical action frame according to a motion detection technology to obtain a target image area in the historical action frame; and processing the image of the target image area through a deep learning algorithm to obtain a mask image of the target main body corresponding to the historical action frame.
In a possible design, if there are multiple mask images with overlapped subjects in the mask images, the image segmentation module 1402 is further specifically configured to: and separating the mask image of the target subject from the mask image overlapped by the subjects according to the depth information of the subjects in the historical action frame.
In one possible design, the mapping module 1403 is specifically configured to: obtaining the corresponding relation between the position of at least one object in a historical action frame and the position of the object in the current frame according to an image registration technology or a synchronous positioning and mapping SLAM technology; and determining the reference position of the target main body in the current frame according to the corresponding relation and the position of the target main body in the historical action frame.
In one possible design, the image fusion module 1404 is specifically configured to: and respectively carrying out weighted fusion processing on the images of the N target subjects and the pixel information of the image in the current frame at the N reference positions of the current frame.
In one possible design, the image fusion module 1404 is specifically further configured to: and adding at least one gray level image to the image of the target subject in the current frame to obtain the target frame, wherein the gray level value of the gray level image is larger if the distance between the gray level image and the image of the target subject in the current frame is shorter.
In addition, for the specific implementation process and embodiment of the apparatus 1400, reference may be made to the steps executed by the electronic device in the foregoing method embodiment and the related description, and for the technical problem to be solved and the technical effect to be brought about, reference may also be made to the contents described in the foregoing embodiment, and details are not repeated here.
In the present embodiment, the apparatus is presented in a form in which the respective functional modules are divided in an integrated manner. A "module" here may refer to a specific circuit, a processor and memory that execute one or more software or firmware programs, an integrated logic circuit, and/or other devices that can provide the functions described above. In a simple embodiment, one skilled in the art will appreciate that the apparatus may take the form shown in fig. 15 below.
Fig. 15 is a schematic structural diagram of an electronic device 1500 according to an exemplary embodiment, where the electronic device 1500 may be used to generate a motion trajectory special effect video of a photographic subject according to the foregoing embodiments. As shown in fig. 15, the electronic device 1500 may include at least one processor 1501, communication lines 1502, and memory 1503.
The processor 1501 may be a central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the present disclosure.
Communication link 1502 may include a path, such as a bus, for communicating information between the aforementioned components.
The memory 1503 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a Random Access Memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disk read-only memory (CD-ROM) or other optical disk storage, optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be separate and coupled to the processor via a communication link 1502. The memory may also be integral to the processor. The memory provided by the embodiments of the present disclosure may generally have a non-volatile property. The memory 1503 is used for storing computer-executable instructions related to the implementation of the present disclosure, and is controlled by the processor 1501 to execute the instructions. The processor 1501 is configured to execute computer-executable instructions stored in the memory 1503, thereby implementing the methods provided by the embodiments of the present disclosure.
Optionally, the computer-executable instructions in the embodiments of the present disclosure may also be referred to as application program codes, which are not specifically limited in the embodiments of the present disclosure.
In particular implementations, processor 1501 may include one or more CPUs, such as CPU0 and CPU1 in fig. 15, as one embodiment.
In particular implementations, electronic device 1500 may include multiple processors, such as processor 1501 and processor 1507 in fig. 15, for example, as an example. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores that process data (e.g., computer program instructions).
In particular implementations, electronic device 1500 may also include communications interface 1504, as one embodiment. The communication interface 1504 may be any device, such as a transceiver, for communicating with other devices or communication networks, such as an ethernet interface, a Radio Access Network (RAN), a Wireless Local Area Network (WLAN), etc.
In particular implementations, as one embodiment, the electronic device 1500 may also include an output device 1505 and an input device 1506. The output device 1505 is in communication with the processor 1501 and may display information in a variety of ways. For example, the output device 1505 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, a projector, or the like. The input device 1506 communicates with the processor 1501 and may receive user input in a variety of ways. For example, the input device 1506 may be a mouse, a keyboard, a touch screen device, a sensing device, or the like.
In a specific implementation, the electronic device 1500 may be a desktop, a laptop, a web server, a Personal Digital Assistant (PDA), a mobile phone, a tablet, a wireless terminal device, an embedded device, or a device with a similar structure as in fig. 15. The disclosed embodiments do not limit the type of electronic device 1500.
In some embodiments, the processor 1501 in fig. 15 may cause the electronic device 1500 to perform the methods in the above-described method embodiments by calling the computer-executable instructions stored in the memory 1503.
Illustratively, the functions/implementation processes of the acquisition module 1401, the image segmentation module 1402, the mapping module 1403, and the image fusion module 1404 in fig. 14 may be implemented by the processor 1501 in fig. 15 calling computer-executable instructions stored in the memory 1503.
In an exemplary embodiment, there is also provided a storage medium comprising instructions, such as the memory 1503 comprising instructions, which are executable by the processor 1501 of the electronic device 1500 to perform the above-described method.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented using a software program, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the present application are all or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
Finally, it should be noted that: the above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (16)

1. An image processing method, characterized in that the method comprises:
receiving a first selection instruction from a user, wherein the first selection instruction is used to indicate entering an automatic shooting mode or a manual shooting mode;
acquiring a current frame and N historical action frames from an acquired real-time video stream, wherein the current frame and the N historical action frames each comprise a target subject, the scenes of the current frame and the N historical action frames overlap, the positions of the target subject in the scenes of the N historical action frames are different from one another, and N is a positive integer greater than or equal to 1;
performing image segmentation on the N historical action frames to obtain images of N target subjects respectively corresponding to the N historical action frames;
determining N reference positions in the current frame according to the positions of the N target subjects in the scenes of the N historical action frames and the scene of the current frame;
fusing the images of the N target subjects at the N reference positions of the current frame, respectively, to obtain a target frame;
if the first selection instruction is used to indicate entering the automatic shooting mode, acquiring the historical action frames specifically includes:
performing motion detection on the real-time video stream to determine the target subject;
detecting the position of the target subject in the scene in each video frame of the real-time video stream to obtain the change of the target subject's position in the scene across multiple frames;
determining, as a key action frame, a video frame in which the image of the target subject simultaneously satisfies the following two conditions:
condition 1: the image position area of the target subject in the current frame does not overlap the position area to which the image of the target subject in the previous key action frame is mapped in the current frame;
condition 2: the image change between the target subject in the current frame and the target subject in the previous key action frame meets a preset threshold;
wherein the historical action frames are the key action frames determined before the current frame.
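For illustration only (an editor's sketch, not part of the claim text): the two key-action-frame conditions above can be checked roughly as below. It assumes the subject's bounding box from the previous key action frame has already been mapped into current-frame coordinates, that the two subject patches have the same size, and that a normalised mean absolute pixel difference stands in for the unspecified "image change" metric; the threshold value is likewise an assumption.

```python
import numpy as np

def boxes_overlap(box_a, box_b):
    """True if two (x1, y1, x2, y2) boxes share any area."""
    return not (box_a[2] <= box_b[0] or box_b[2] <= box_a[0] or
                box_a[3] <= box_b[1] or box_b[3] <= box_a[1])

def is_key_action_frame(cur_box, cur_patch, prev_box_mapped, prev_patch,
                        change_threshold=0.3):
    # Condition 1: the subject's region in the current frame does not overlap
    # the region mapped from the previous key action frame.
    no_overlap = not boxes_overlap(cur_box, prev_box_mapped)
    # Condition 2: the image change versus the previous key action frame meets
    # a preset threshold (normalised mean absolute difference here).
    diff = np.abs(cur_patch.astype(np.float32) - prev_patch.astype(np.float32))
    change = float(diff.mean()) / 255.0
    return no_overlap and change >= change_threshold
```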
2. The method according to claim 1, wherein if the first selection instruction is used to indicate entering the manual shooting mode, acquiring the historical action frame specifically includes:
receiving a second selection instruction from the user on a video frame of the real-time video stream;
and determining the subject at the position in the video frame corresponding to the second selection instruction as the target subject, and determining the video frame as the historical action frame.
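For illustration only (not part of the claim text): a minimal sketch of the manual mode, assuming candidate subject masks for the selected video frame are already available from segmentation and that the second selection instruction is a tap expressed in pixel coordinates.

```python
def pick_target_subject(tap_xy, candidate_masks):
    """Return the first candidate mask that contains the tapped pixel,
    or None if the tap does not hit any segmented subject."""
    x, y = tap_xy
    for mask in candidate_masks:  # each mask: HxW array, non-zero on the subject
        if mask[y, x] > 0:
            return mask
    return None
```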
3. The method according to claim 1 or 2, wherein performing image segmentation on the historical action frame to obtain the image of the target subject corresponding to the historical action frame specifically includes:
narrowing down the image area corresponding to the target subject in the historical action frame by using a motion detection technology, to obtain a target image area in the historical action frame;
and processing the image of the target image area by using a deep learning algorithm to obtain a mask image of the target subject corresponding to the historical action frame.
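For illustration only (not part of the claim text): one way to narrow the search area with motion detection before running a segmentation network, sketched with OpenCV background subtraction applied frame by frame to the stream; `segment_fn` stands in for whatever deep-learning segmentation model is used and is an assumption of this sketch.

```python
import cv2
import numpy as np

bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=50)

def subject_mask(history_frame, segment_fn):
    # Coarse motion mask, cleaned up with a morphological opening.
    motion = bg_subtractor.apply(history_frame)
    motion = cv2.morphologyEx(motion, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    ys, xs = np.where(motion > 0)
    mask = np.zeros(history_frame.shape[:2], np.uint8)
    if xs.size == 0:
        return mask                                    # no motion found
    y1, y2, x1, x2 = ys.min(), ys.max(), xs.min(), xs.max()
    roi = history_frame[y1:y2 + 1, x1:x2 + 1]          # narrowed target image area
    mask[y1:y2 + 1, x1:x2 + 1] = segment_fn(roi)       # 0/255 mask from the network
    return mask
```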
4. The method of claim 3, wherein if, among the mask images, there is a mask image in which multiple subjects overlap, the method further comprises:
separating the mask image of the target subject from the mask image of the overlapping subjects according to depth information of the subjects in the historical action frame.
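For illustration only (not part of the claim text): a minimal sketch of separating the target subject's mask from an overlapping-subjects mask by depth; the depth map, the estimate of the target subject's depth, and the tolerance are assumptions.

```python
import numpy as np

def split_mask_by_depth(overlap_mask, depth_map, target_depth, tolerance=0.3):
    """Keep only the overlapping-mask pixels whose depth is close to the
    target subject's depth (all arrays share the same HxW shape)."""
    keep = (overlap_mask > 0) & (np.abs(depth_map - target_depth) < tolerance)
    return keep.astype(np.uint8) * 255
```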
5. The method according to any one of claims 1 to 4, wherein determining a reference position in the current frame according to the position of the target subject in the scene of the historical action frame and the scene of the current frame specifically includes:
obtaining, by using an image registration technology or a simultaneous localization and mapping (SLAM) technology, a correspondence between the position of at least one object in the historical action frame and the position of that object in the current frame;
and determining the reference position of the target subject in the current frame according to the correspondence and the position of the target subject in the historical action frame.
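For illustration only (not part of the claim text): a feature-based image-registration sketch with OpenCV (ORB features and a RANSAC homography) that maps the subject's box from a historical action frame into the current frame; a SLAM pose could supply the same correspondence instead. The feature count and RANSAC threshold are assumptions.

```python
import cv2
import numpy as np

def map_reference_position(hist_frame, cur_frame, hist_box):
    orb = cv2.ORB_create(1000)
    g1 = cv2.cvtColor(hist_frame, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY)
    k1, d1 = orb.detectAndCompute(g1, None)
    k2, d2 = orb.detectAndCompute(g2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)   # frame-to-frame correspondence
    x1, y1, x2, y2 = hist_box                               # subject box in the historical frame
    corners = np.float32([[x1, y1], [x2, y1], [x2, y2], [x1, y2]]).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(corners, H).reshape(-1, 2)  # reference position in the current frame
```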
6. The method according to any one of claims 1 to 5, wherein the fusing the images of the N target subjects on the N reference positions of the current frame respectively comprises:
performing weighted fusion processing on the images of the N target subjects and the pixel information of the current frame at the N reference positions of the current frame, respectively.
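For illustration only (not part of the claim text): a simple weighted (alpha) fusion of each segmented subject image with the current-frame pixels at its reference position; the fixed 0.8 weight and the (mask, image, top-left) packaging of each subject are assumptions of this sketch.

```python
import numpy as np

def fuse_subjects(cur_frame, subjects, weight=0.8):
    """subjects: list of (mask, image, (x, y)) already expressed in
    current-frame coordinates; mask and image share the same height/width."""
    out = cur_frame.astype(np.float32)
    for mask, img, (x, y) in subjects:
        h, w = mask.shape[:2]
        roi = out[y:y + h, x:x + w]
        alpha = (mask[..., None].astype(np.float32) / 255.0) * weight
        roi[:] = alpha * img.astype(np.float32) + (1.0 - alpha) * roi
    return out.astype(np.uint8)
```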
7. The method according to any one of claims 1-6, wherein after fusing the images of the N target subjects on the N reference positions of the current frame, respectively, the method further comprises:
adding at least one grayscale image beside the image of the target subject in the current frame to obtain the target frame, wherein a grayscale image closer to the image of the target subject in the current frame has a larger grayscale value.
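For illustration only (not part of the claim text): a sketch of adding grey-level copies of the subject behind its image in the target frame, with copies nearer the subject given a larger grey value; the offsets, the 50/50 blending weight, and the use of np.roll (which wraps at the image borders) are assumptions.

```python
import numpy as np

def add_grey_trail(target_frame, subject_mask, trail_offsets):
    """trail_offsets: (dx, dy) pixel offsets ordered from farthest to closest
    to the subject's image in the current frame."""
    out = target_frame.astype(np.float32)
    n = len(trail_offsets)
    for i, (dx, dy) in enumerate(trail_offsets):
        grey_value = 255.0 * (i + 1) / (n + 1)     # closer copy -> larger grey value
        shifted = np.roll(np.roll(subject_mask, dy, axis=0), dx, axis=1) > 0
        out[shifted] = 0.5 * out[shifted] + 0.5 * grey_value
    return out.astype(np.uint8)
```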
8. An image processing apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire a current frame and N historical action frames from an acquired real-time video stream, wherein the current frame and the N historical action frames each comprise a target subject, the scenes of the current frame and the N historical action frames overlap, the positions of the target subject in the N historical action frames are different from one another, and N is a positive integer greater than or equal to 1;
an image segmentation module, configured to perform image segmentation on the N historical action frames to obtain images of N target subjects respectively corresponding to the N historical action frames;
a mapping module, configured to determine N reference positions in the current frame according to the positions of the N target subjects in the scenes of the N historical action frames and the scene of the current frame;
and an image fusion module, configured to fuse the images of the N target subjects at the N reference positions of the current frame, respectively, to obtain a target frame;
the device further comprises:
a receiving module, configured to receive a first selection instruction from a user, wherein the first selection instruction is used to indicate entering an automatic shooting mode or a manual shooting mode;
if the first selection instruction is used to indicate entering the automatic shooting mode, the acquisition module is specifically configured to:
performing motion detection on a real-time video stream to determine the target subject;
detecting the position of the target subject in the scene in each video frame of the real-time video stream to obtain the change of the target subject's position in the scene across multiple frames;
determining, as a key action frame, a video frame in which the image of the target subject simultaneously satisfies the following two conditions:
condition 1: the image position area of the target subject in the current frame does not overlap the position area to which the image of the target subject in the previous key action frame is mapped in the current frame;
condition 2: the image change between the target subject in the current frame and the target subject in the previous key action frame meets a preset threshold;
the historical action frame is the key action frame determined before the current frame.
9. The apparatus according to claim 8, wherein if the first selection instruction is used to instruct entering the manual shooting mode, the receiving module is further configured to receive a second selection instruction of a user for a video frame included in a real-time video stream;
the acquisition module is further specifically configured to: determine the subject at the position in the video frame corresponding to the second selection instruction as the target subject, and determine the video frame as the historical action frame.
10. The apparatus according to claim 8 or 9, wherein the image segmentation module is specifically configured to:
narrowing down the image area corresponding to the target subject in the historical action frame by using a motion detection technology, to obtain a target image area in the historical action frame;
and processing the image of the target image area by using a deep learning algorithm to obtain a mask image of the target subject corresponding to the historical action frame.
11. The apparatus according to claim 10, wherein if, among the mask images, there is a mask image in which multiple subjects overlap, the image segmentation module is further configured to:
separate the mask image of the target subject from the mask image of the overlapping subjects according to depth information of the subjects in the historical action frame.
12. The apparatus according to any one of claims 8-11, wherein the mapping module is specifically configured to:
obtaining, by using an image registration technology or a simultaneous localization and mapping (SLAM) technology, a correspondence between the position of at least one object in the historical action frame and the position of that object in the current frame;
and determining the reference position of the target subject in the current frame according to the correspondence and the position of the target subject in the historical action frame.
13. The apparatus according to any one of claims 8-12, wherein the image fusion module is specifically configured to:
and respectively carrying out weighted fusion processing on the images of the N target subjects and the pixel information of the image in the current frame at the N reference positions of the current frame.
14. The apparatus according to any one of claims 8 to 13, wherein the image fusion module is further configured to:
adding at least one grayscale image beside the image of the target subject in the current frame to obtain the target frame, wherein a grayscale image closer to the image of the target subject in the current frame has a larger grayscale value.
15. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the image processing method of any one of claims 1 to 7.
16. A computer-readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the image processing method of any one of claims 1 to 7.
CN202010478673.3A 2020-05-29 2020-05-29 Image processing method and device Active CN113810587B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010478673.3A CN113810587B (en) 2020-05-29 2020-05-29 Image processing method and device
PCT/CN2021/079103 WO2021238325A1 (en) 2020-05-29 2021-03-04 Image processing method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010478673.3A CN113810587B (en) 2020-05-29 2020-05-29 Image processing method and device

Publications (2)

Publication Number Publication Date
CN113810587A CN113810587A (en) 2021-12-17
CN113810587B true CN113810587B (en) 2023-04-18

Family

ID=78745570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010478673.3A Active CN113810587B (en) 2020-05-29 2020-05-29 Image processing method and device

Country Status (2)

Country Link
CN (1) CN113810587B (en)
WO (1) WO2021238325A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114401360A (en) * 2021-12-07 2022-04-26 影石创新科技股份有限公司 Multi-frame delay special effect generation method, device, equipment and medium of video
CN114302071B (en) * 2021-12-28 2024-02-20 影石创新科技股份有限公司 Video processing method and device, storage medium and electronic equipment
CN114302234B (en) * 2021-12-29 2023-11-07 杭州当虹科技股份有限公司 Quick packaging method for air skills
CN114288647B (en) * 2021-12-31 2022-07-08 深圳方舟互动科技有限公司 Artificial intelligence game engine based on AI Designer, game rendering method and device
CN114494328B (en) * 2022-02-11 2024-01-30 北京字跳网络技术有限公司 Image display method, device, electronic equipment and storage medium
CN114531553B (en) * 2022-02-11 2024-02-09 北京字跳网络技术有限公司 Method, device, electronic equipment and storage medium for generating special effect video
CN115567633A (en) * 2022-02-24 2023-01-03 荣耀终端有限公司 Photographing method, medium, program product and electronic device
CN115037992A (en) * 2022-06-08 2022-09-09 中央广播电视总台 Video processing method, device and storage medium
CN115175005A (en) * 2022-06-08 2022-10-11 中央广播电视总台 Video processing method and device, electronic equipment and storage medium
CN115273565A (en) * 2022-06-24 2022-11-01 苏州数智源信息技术有限公司 Airplane apron early warning method, device and terminal based on AI big data
CN116048379B (en) * 2022-06-30 2023-10-24 荣耀终端有限公司 Data recharging method and device
CN115147441A (en) * 2022-07-31 2022-10-04 江苏云舟通信科技有限公司 Cutout special effect processing system based on data analysis
CN114863036B (en) * 2022-07-06 2022-11-15 深圳市信润富联数字科技有限公司 Data processing method and device based on structured light, electronic equipment and storage medium
CN115689963B (en) * 2022-11-21 2023-06-06 荣耀终端有限公司 Image processing method and electronic equipment
CN116229337B (en) * 2023-05-10 2023-09-26 瀚博半导体(上海)有限公司 Method, apparatus, system, device and medium for video processing

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102480598A (en) * 2010-11-19 2012-05-30 信泰伟创影像科技有限公司 Imaging apparatus, imaging method and computer program
CN104113693A (en) * 2014-07-22 2014-10-22 深圳市中兴移动通信有限公司 Shooting method and shooting device
CN104125407A (en) * 2014-08-13 2014-10-29 深圳市中兴移动通信有限公司 Object motion track shooting method and mobile terminal
CN104159033A (en) * 2014-08-21 2014-11-19 深圳市中兴移动通信有限公司 Method and device of optimizing shooting effect
CN104751488A (en) * 2015-04-08 2015-07-01 努比亚技术有限公司 Photographing method for moving track of moving object and terminal equipment
CN107077720A (en) * 2016-12-27 2017-08-18 深圳市大疆创新科技有限公司 Method, device and the equipment of image procossing
CN107943837A (en) * 2017-10-27 2018-04-20 江苏理工学院 A kind of video abstraction generating method of foreground target key frame
CN109922294A (en) * 2019-01-31 2019-06-21 维沃移动通信有限公司 A kind of method for processing video frequency and mobile terminal
CN110536087A (en) * 2019-05-06 2019-12-03 珠海全志科技股份有限公司 Electronic equipment and its motion profile picture synthesis method, device and embedded equipment
CN111105434A (en) * 2018-10-25 2020-05-05 中兴通讯股份有限公司 Motion trajectory synthesis method and electronic equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5750864B2 (en) * 2010-10-27 2015-07-22 ソニー株式会社 Image processing apparatus, image processing method, and program
KR101804383B1 (en) * 2014-01-14 2017-12-04 한화테크윈 주식회사 System and method for browsing summary image
JP2015167676A (en) * 2014-03-06 2015-09-28 株式会社横浜DeNAベイスターズ pitching analysis support system
CN104243819B (en) * 2014-08-29 2018-02-23 小米科技有限责任公司 Photo acquisition methods and device
KR102375864B1 (en) * 2015-02-10 2022-03-18 한화테크윈 주식회사 System and method for browsing summary image
CN105049674A (en) * 2015-07-01 2015-11-11 中科创达软件股份有限公司 Video image processing method and system

Also Published As

Publication number Publication date
WO2021238325A1 (en) 2021-12-02
CN113810587A (en) 2021-12-17

Similar Documents

Publication Publication Date Title
CN113810587B (en) Image processing method and device
KR20220030263A (en) texture mesh building
CN111726536A (en) Video generation method and device, storage medium and computer equipment
EP3383036A2 (en) Information processing device, information processing method, and program
CN111833461B (en) Method and device for realizing special effect of image, electronic equipment and storage medium
CN113630545B (en) Shooting method and equipment
US20230224574A1 (en) Photographing method and apparatus
CN115061770A (en) Method and electronic device for displaying dynamic wallpaper
CN114926351A (en) Image processing method, electronic device, and computer storage medium
EP4109879A1 (en) Image color retention method and device
CN116916151B (en) Shooting method, electronic device and storage medium
CN113747044A (en) Panoramic shooting method and device
US20230353701A1 (en) Removing objects at image capture time
WO2022143311A1 (en) Photographing method and apparatus for intelligent view-finding recommendation
CN115225756A (en) Method for determining target object, shooting method and device
CN114359335A (en) Target tracking method and electronic equipment
CN115880348B (en) Face depth determining method, electronic equipment and storage medium
CN116091572B (en) Method for acquiring image depth information, electronic equipment and storage medium
CN116546274B (en) Video segmentation method, selection method, synthesis method and related devices
WO2023072113A1 (en) Display method and electronic device
WO2023004682A1 (en) Height measurement method and apparatus, and storage medium
CN115797815B (en) AR translation processing method and electronic equipment
WO2022110059A1 (en) Video processing method, scene recognition method, terminal device, and photographic system
WO2024046162A1 (en) Image recommendation method and electronic device
KR20180075222A (en) Electric apparatus and operation method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant