WO2023131327A1 - Video synthesis method, device and system - Google Patents

Video synthesis method, device and system

Info

Publication number
WO2023131327A1
WO2023131327A1 · PCT/CN2023/071293
Authority
WO
WIPO (PCT)
Prior art keywords
imaging
target object
video
camera
target
Prior art date
Application number
PCT/CN2023/071293
Other languages
English (en)
French (fr)
Inventor
张莉娜
张明
屈小刚
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2023131327A1 publication Critical patent/WO2023131327A1/zh

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 - Details of television systems
    • H04N 5/222 - Studio circuitry; studio devices; studio equipment
    • H04N 5/262 - Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; cameras specially adapted for the electronic generation of special effects
    • H04N 5/2624 - ... for obtaining an image which is composed of whole input images, e.g. split screen
    • H04N 5/265 - Mixing

Definitions

  • the present application relates to the technical field of video processing, in particular to a video synthesis method, device and system.
  • the present application provides a video synthesis method, device and system.
  • With this method, a video stream synthesized for a single moving object can provide a clear picture of the object's activity throughout a scene.
  • a video synthesis method is provided, and the method is applied to a management device.
  • the management device acquires N frames of video images respectively collected by N cameras deployed in the target scene at each of the multiple collection moments, where N ⁇ 2.
  • the management device acquires a frame of target video image from N frames of video images corresponding to each collection moment, where the target video image includes the imaging of the target object.
  • the management device synthesizes a video stream corresponding to the target object based on multiple frames of target video images corresponding to multiple acquisition moments, and the video stream is used to reflect activity information of the target object in the target scene.
  • In the present application, a plurality of cameras are fixedly deployed in the target scene. Their shooting areas differ, so each camera can capture clear video images of a different area of the target scene.
  • The management device selects, from the multiple frames of video images collected by the multiple cameras at the same collection moment, one frame of video image containing the imaging of the target object for video synthesis. Because each camera can capture clear video images of its corresponding area in the target scene, whenever the target object moves within the shooting areas of different cameras there is always a camera that can capture a clear moving picture of the target object. The synthesized video stream can therefore provide a clear picture of the target object moving throughout the target scene; that is, the clarity of the target object's moving picture in the synthesized video stream is guaranteed.
  • In addition, because the cameras are fixedly deployed, the camera parameters can be preset according to the required shooting areas, and there is no need to adjust them during shooting, which keeps the implementation simple.
  • Optionally, the management device obtains one frame of target video image from the N frames of video images corresponding to each collection moment as follows: the management device obtains, from the N frames of video images corresponding to each collection moment, all candidate video images that include the imaging of the target object, and then obtains the target video image from the candidate video images.
  • the N cameras include a first camera and a second camera, and the first camera and the second camera have a common viewing area.
  • Optionally, the management device obtains all candidate video images that include the imaging of the target object among the N frames of video images corresponding to each collection moment as follows: when the target object is located in the common-view area of the first camera and the second camera at a first collection moment, the management device uses both the first video image captured by the first camera at the first collection moment and the second video image captured by the second camera at the first collection moment as candidate video images corresponding to the first collection moment.
  • Correspondingly, the management device may obtain the target video image from the candidate video images as follows: the management device obtains the first imaging of the target object in the first video image and the second imaging of the target object in the second video image. In response to the imaging effect of the first imaging being better than that of the second imaging, the management device uses the first video image as the target video image corresponding to the first collection moment.
  • In this application, among the N frames of video images acquired at the same collection moment, the management device may use the video image that includes the imaging of the target object and in which the target object's imaging effect is best as the target video image for synthesizing the video stream corresponding to the target object.
  • This can further improve the clarity of the target object's moving picture in the synthesized video stream, so that the synthesized video stream better reflects the target object's activity characteristics, which is beneficial to analyzing those characteristics.
  • Optionally, the imaging effect of the first imaging is determined to be better than that of the second imaging when one or more of the following conditions are met: the imaging area of the first imaging is larger than that of the second imaging; the number of skeleton points included in the first imaging is greater than the number included in the second imaging; or the confidence of the skeleton data of the first imaging is greater than that of the second imaging.
  • A larger imaging area usually reflects more detail, and more visible skeleton points or a higher skeleton-data confidence better reflects the target object's activity characteristics. Therefore, the larger the imaging area, the more skeleton points the imaging includes, and the higher the confidence of its skeleton data, the better the imaging effect is judged to be.
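  • The following is a minimal sketch (not taken from the patent; the data structure, the field names, and the lexicographic priority of the three criteria are illustrative assumptions) of selecting the candidate imaging with the best imaging effect:

```python
from dataclasses import dataclass

@dataclass
class Imaging:
    camera_id: int
    area: float                 # imaging area of the target object, in pixels
    num_skeleton_points: int    # skeleton points directly visible in the imaging
    skeleton_confidence: float  # overall confidence of the skeleton data

def better(a: Imaging, b: Imaging) -> bool:
    """Judge imaging `a` better than `b` by comparing, in turn, imaging area,
    number of visible skeleton points, and skeleton-data confidence."""
    if a.area != b.area:
        return a.area > b.area
    if a.num_skeleton_points != b.num_skeleton_points:
        return a.num_skeleton_points > b.num_skeleton_points
    return a.skeleton_confidence > b.skeleton_confidence

def select_target_imaging(candidates: list[Imaging]) -> Imaging:
    """Return the candidate imaging with the best imaging effect."""
    best = candidates[0]
    for c in candidates[1:]:
        if better(c, best):
            best = c
    return best
```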
  • Optionally, the management device obtains the second imaging of the target object in the second video image as follows: after obtaining the first imaging of the target object in the first video image, the management device obtains the first imaging position of a first key point of the target object in the first video image. Based on the pixel-coordinate mapping relationship between the first camera and the second camera, the management device determines, according to the first imaging position, the second imaging position of the first key point in the second video image. The management device then determines the second imaging of the target object in the second video image according to the second imaging position.
  • In this application, by pre-determining the pixel-coordinate mapping relationship between two adjacent cameras, when the target object moves into their common-view area the management device can use the correlation of the imaging's geometric positions in the video images collected by the two cameras to realize cross-camera tracking and identification of the target object.
  • the solution of this application does not depend on the unique characteristics of the target object, and can be applied to various scenarios through the flexible deployment and calibration of the camera.
  • Optionally, M cameras are deployed in the target scene, any two adjacent cameras among the M cameras have a common-view area, M ≥ N, and the N cameras belong to the M cameras.
  • Multiple homography matrices are stored in the management device, each of which reflects the pixel-coordinate mapping relationship between one pair of adjacent cameras among the M cameras.
  • In this application, the accuracy of cross-camera tracking and identification of the target object can be improved by deploying more cameras in the target scene, while the fluency of the synthesized video stream can be improved by selecting the video images collected by fewer cameras for synthesis. That is, M > N can be chosen, which guarantees both the accuracy and the smoothness of the synthesized video stream.
  • the management device may perform cropping processing on the target video image, so that the imaging of the target object is located in a central area of the cropped video image. Then, based on the multiple collection moments, the management device arranges the cropped video images of multiple frames in chronological order, so as to obtain the video stream corresponding to the target object.
  • In this application, the management device may crop each acquired frame of target video image so that, in all video images of the finally synthesized video stream, the imaging of the target object is in the central area. This achieves a focus-following effect on the target object, gives the synthesized video stream a better display effect, and makes the playback smoother, thereby improving the user's viewing experience.
  • Optionally, the management device may also determine, according to the imaging position of a second key point of the target object in the target video image, the horizontal position of the second key point in the world coordinate system, and generate the movement trajectory of the target object from the horizontal positions of the second key point in the world coordinate system at the multiple collection moments.
  • In this application, after the management device acquires the skeleton data of the target object, it can also perform motion analysis on the target object based on that data, including but not limited to determining the target object's trajectory, counting its steps, and calculating its displacement or movement speed.
  • Optionally, the management device may also acquire the imaging positions of the target object's skeleton points in the target video image and display the playback picture of the video stream on a playback interface, with the skeleton points displayed on the imaging of the target object in the playback picture.
  • In this application, when synthesizing the video stream corresponding to the target object, the management device can encode the imaging positions of the target object's skeleton points together with the corresponding video images; when the playback picture of the video stream is displayed, the skeleton points can then be shown on the imaging of the target object, which helps analyze the target object's activity.
  • In a second aspect, a management device is provided. The management device includes multiple functional modules that interact to implement the method in the above first aspect and its implementations.
  • the multiple functional modules can be implemented based on software, hardware or a combination of software and hardware, and the multiple functional modules can be combined or divided arbitrarily based on specific implementations.
  • In a third aspect, a management device is provided, including: a processor and a memory;
  • the memory is used to store a computer program, and the computer program includes program instructions
  • the processor is configured to invoke the computer program to implement the methods in the above first aspect and various implementation manners thereof.
  • In a fourth aspect, a computer-readable storage medium is provided. Instructions are stored on the computer-readable storage medium; when the instructions are executed by a processor, the method in the above first aspect and its implementations is implemented.
  • In a fifth aspect, a computer program product is provided, including a computer program. When the computer program is executed by a processor, the method in the above first aspect and its implementations is implemented.
  • In a sixth aspect, a chip is provided. The chip includes a programmable logic circuit and/or program instructions; when the chip runs, the method in the above first aspect and its implementations is implemented.
  • FIG. 1 is a schematic structural diagram of a video synthesis system provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of the relative positions of two adjacent cameras provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the distribution positions of cameras provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a video synthesis method provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the distribution of human skeleton points provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of pixel-coordinate mapping between two cameras provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a video image before and after cropping provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a playback interface provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a movement trajectory of a target object provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a management device provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a management device provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a management device provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a management device provided by an embodiment of the present application.
  • FIG. 14 is a block diagram of a management device provided by an embodiment of the present application.
  • FIG. 1 is a schematic structural diagram of a video synthesis system provided by an embodiment of the present application. As shown in FIG. 1 , the video synthesis system includes: a media source 101 and a management device 102 .
  • the media source 101 is used to provide multiple video streams.
  • a media source 101 includes a plurality of cameras 1011 .
  • Each camera 1011 is used to collect one video stream.
  • The multiple cameras 1011 collect images at the same moments and at the same frequency.
  • a camera synchronization technology may be used to realize synchronous shooting by multiple cameras 1011 .
  • the number of cameras in FIG. 1 is only used as an example, and is not intended to limit the video composition system provided in the embodiment of the present application.
  • any two adjacent cameras in the plurality of cameras 1011 have a common viewing area.
  • the two cameras have a common viewing area, which means that the shooting areas of the two cameras have overlapping areas.
  • FIG. 2 is a schematic diagram of relative positions of two adjacent cameras provided in an embodiment of the present application. As shown in FIG. 2 , the shooting area of camera A is area a, and the shooting area of camera B is area b. Area a and area b have an overlapping area c, which is the common view area of camera A and camera B.
  • the plurality of cameras 1011 may be arranged in a circular arrangement, in a fan shape, in a straight line, or in other irregular arrangements, and a corresponding camera arrangement may be designed according to actual deployment scenarios. For example, if multiple cameras are used to collect motion videos of athletes on a circular speed skating track, multiple cameras may be deployed around the speed skating track in a circular arrangement.
  • FIG. 3 is a schematic diagram of camera distribution positions provided by an embodiment of the present application. As shown in FIG. 3, 20 cameras, recorded as cameras 1-20, are deployed near the speed skating track. The 20 cameras are arranged in a circle, and the shooting directions of all 20 cameras face the speed skating track.
  • The entire set of the 20 cameras' shooting areas can completely cover the speed skating track; that is, when an athlete moves on the track, there is always at least one of the 20 cameras that can capture video images containing the athlete's imaging.
  • The management device 102 is configured to analyze and process the multiple video streams from the multiple cameras 1011 in the media source 101, extract from them the video images containing the imaging of the target object, and synthesize the video stream corresponding to the target object.
  • Each frame of video image in the video stream includes the imaging of the target object, and the video stream may also be referred to as a synthesized video stream corresponding to the target object.
  • the frame rate of the video stream synthesized by the management device 102 is the same as the frequency at which the camera 1011 collects images.
  • That is, the management device 102 can acquire, from the multiple video streams, one frame of video image containing the target object for each collection moment, and finally synthesize a video stream whose frame rate is the same as the camera's image collection frequency.
  • the management device 102 may be one device or multiple devices.
  • the management device 102 may be a server, or a server cluster composed of several servers, or a cloud computing service center.
  • the management device 102 may use a target detection algorithm to identify a target object in a video image captured by a single camera, and use a target tracking algorithm to determine the imaging of the target object in video images subsequently captured by the camera.
  • When the target object moves into the common-view area of two adjacent cameras, the management device 102 may use the correlation of the imaging's geometric positions to determine the imaging of the target object in the video images collected by the adjacent camera, thereby realizing cross-camera tracking and identification of the target object.
  • multiple cameras 1011 in the media source 101 are fixedly deployed, and camera parameters of each camera are preset.
  • The shooting area and focus of each camera are fixed, so each camera's image coordinate system is fixed; consequently, the imaging pixel coordinates of the common-view area of two adjacent cameras in the two cameras' video images have a fixed mapping relationship.
  • Multiple homography matrices may be stored in the management device 102, and each homography matrix is used to reflect a set of pixel coordinate mapping relationships between two adjacent cameras.
  • the homography matrix here can be understood as the transformation matrix between the image coordinate systems of two adjacent cameras.
  • The management device 102 can generate the homography matrix reflecting the pixel-coordinate mapping relationship between two adjacent cameras based on the pixel coordinates, in each of the two cameras' image coordinate systems, of multiple pixel points located in the cameras' common-view area. For example, referring to FIG. 3, the homography matrix from camera 1 to camera 2 is H12, and a marker point M in the common-view area of camera 1 and camera 2 has imaging pixel coordinates in the video images collected by both camera 1 and camera 2; pairs of such corresponding coordinates are used to solve for H12, as sketched below.
  • the image coordinate system is a coordinate system with the upper left vertex of the image collected by the camera as the coordinate origin.
  • the x-axis and y-axis of the image coordinate system are respectively the length and width directions of the collected images.
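  • The following is a minimal sketch of estimating such a homography matrix from marker points in two adjacent cameras' common-view area, assuming the use of OpenCV (the point values and the choice of cv2.findHomography are illustrative assumptions, not the patent's implementation):

```python
import numpy as np
import cv2

# Pixel coordinates of the same marker points as observed by camera 1 and
# camera 2 in their common-view area (at least 4 non-collinear points).
pts_cam1 = np.array([[100, 200], [400, 210], [390, 500], [120, 480]], dtype=np.float32)
pts_cam2 = np.array([[80, 190], [385, 205], [370, 495], [95, 470]], dtype=np.float32)

# H12 maps pixel coordinates in camera 1's image to camera 2's image.
H12, _ = cv2.findHomography(pts_cam1, pts_cam2, method=cv2.RANSAC)

# Mapping a point p = (x, y) from camera 1 to camera 2 uses homogeneous
# coordinates: p2 ~ H12 * [x, y, 1]^T, normalized by the third component.
p1 = np.array([150.0, 300.0, 1.0])
p2h = H12 @ p1
p2 = p2h[:2] / p2h[2]
```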
  • Optionally, the management device 102 can also select any one of the M cameras and generate a transformation matrix from that camera's image coordinate system to the two-dimensional world coordinate system; this transformation matrix is the homography matrix from the camera's image coordinate system to the two-dimensional world coordinate system.
  • For example, multiple markers can be placed in the shooting area of the selected camera and their horizontal positions in the world coordinate system measured; the management device can then calculate the transformation matrix from the markers' world positions and their pixel coordinates in the camera's image.
  • The management device can then calculate, for each of the M cameras, the transformation matrix from its image coordinate system to the two-dimensional world coordinate system, based on the selected camera's transformation matrix and the multiple homography matrices that reflect the pixel-coordinate mapping relationships between adjacent cameras. For example, if the transformation matrix from camera i's image coordinate system to the two-dimensional world coordinate system is known as Hiw, and the homography matrix from camera i to camera j is Hij, then the transformation matrix from camera j's image coordinate system to the two-dimensional world coordinate system is Hjw = Hiw·Hij⁻¹. Both i and j are positive integers; camera i represents the i-th camera among the M cameras, and camera j represents the j-th camera among the M cameras.
  • the world coordinate system can describe the position of the camera in the real world, and can also describe the position of the object in the image collected by the camera in the real world.
  • the x-axis and y-axis of the world coordinate system are on the horizontal plane, and the z-axis is perpendicular to the horizontal plane.
  • the two-dimensional world coordinate system in the embodiment of the present application refers to a horizontal coordinate system composed of an x-axis and a y-axis.
  • the horizontal position in the world coordinate system can be represented by two-dimensional horizontal coordinates (x, y).
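  • Given these definitions, the image-to-world transformation of any camera can be obtained by composing pairwise homographies along a chain of adjacent cameras, following Hjw = Hiw·Hij⁻¹. A minimal sketch follows (the function names are illustrative assumptions, not the patent's code):

```python
import numpy as np

def chain_to_world(H_iw: np.ndarray, H_pairwise: list[np.ndarray]) -> np.ndarray:
    """H_iw: homography from reference camera i's image plane to the
    two-dimensional (horizontal) world coordinate system.
    H_pairwise: pairwise homographies H_{i,i+1}, ..., H_{j-1,j} along the
    chain of adjacent cameras from camera i to camera j.
    Returns H_jw, mapping camera j's pixel coordinates to world coordinates."""
    H_ij = np.eye(3)
    for H in H_pairwise:
        H_ij = H @ H_ij                # accumulate camera i -> camera j
    return H_iw @ np.linalg.inv(H_ij)  # camera j -> camera i -> world

def pixel_to_world(H_w: np.ndarray, x: float, y: float) -> tuple[float, float]:
    """Map a pixel (x, y) to a horizontal world position via homography H_w."""
    p = H_w @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]
```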
  • FIG. 4 is a schematic flowchart of a video synthesis method provided by an embodiment of the present application. The method can be applied to the management device 102 in the video synthesis system shown in FIG. 1. As shown in FIG. 4, the method includes:
  • Step 401: the management device acquires N frames of video images respectively collected, at each of multiple collection moments, by N cameras deployed in the target scene, where N ≥ 2.
  • M cameras are deployed in the target scene, and any two adjacent cameras among the M cameras have a common-view area, M ⁇ N.
  • Multiple homography matrices are stored in the management device, each of which reflects the pixel-coordinate mapping relationship between one pair of adjacent cameras among the M cameras.
  • the selected N cameras are evenly deployed in the target scene, so that the entire set of shooting areas of the N cameras can cover the entire target scene as much as possible.
  • For example, the target scene is the speed skating track shown in FIG. 3, and 8 of the 20 deployed cameras are selected, for example camera 2, camera 4, camera 6, camera 9, camera 12, camera 14, camera 16 and camera 19; that is, roughly one camera can be selected every 50 meters.
  • When cameras are deployed more densely, the homography matrix calculated to reflect the pixel-coordinate mapping relationship between two adjacent cameras is generally more accurate. Therefore, more cameras can be deployed in the target scene; increasing the camera deployment density improves the accuracy of the calculated homography matrices, and thereby the accuracy of cross-camera tracking and identification of the target object.
  • When the management device synthesizes the video stream, if the selected video images switch camera positions too frequently, the angle of view changes too quickly, resulting in poor video fluency and a worse viewing experience. Therefore, using the video images captured by fewer cameras to synthesize the video stream can improve the fluency of the synthesized video stream and thereby the user's viewing experience.
  • In this way, deploying more cameras in the target scene improves the accuracy of cross-camera tracking and identification of the target object, while selecting the video images collected by fewer of those cameras for synthesis improves the fluency of the synthesized video stream. Both the accuracy and the smoothness of the synthesized video stream are thus guaranteed.
  • Step 402: the management device acquires one frame of target video image from the N frames of video images corresponding to each collection moment, where the target video image includes the imaging of the target object.
  • the N frames of video images corresponding to each acquisition moment come from N cameras respectively.
  • the target object is located in the shooting area of at least one camera among the N cameras at each collection moment.
  • step 402 includes the following steps 4021 to 4022.
  • Step 4021: the management device acquires all candidate video images that include the imaging of the target object among the N frames of video images corresponding to each collection moment.
  • If the target object is located in the shooting area of only one camera at a collection moment, the N frames of video images corresponding to that moment include one candidate video image; if the target object is located in the common-view area of two or more cameras, the N frames include two or more candidate video images.
  • the above N cameras include a first camera and a second camera.
  • the first camera and the second camera have a common viewing area.
  • In this case, when the target object is located in the common-view area of the first camera and the second camera at the first collection moment, the management device uses both the first video image collected by the first camera at the first collection moment and the second video image collected by the second camera at the first collection moment as candidate video images corresponding to the first collection moment.
  • Step 4022: the management device acquires one frame of target video image from all the candidate video images.
  • For example, the management device may select, among all the candidate video images corresponding to a collection moment, the candidate video image with the best imaging effect of the target object as the target video image.
  • the management device may also use any candidate video image among all candidate video images corresponding to the collection moment as the target video image.
  • For example, the management device may acquire the first imaging of the target object in the first video image and the second imaging of the target object in the second video image.
  • In response to the imaging effect of the first imaging being better than that of the second imaging, the management device uses the first video image as the target video image corresponding to the first collection moment.
  • Optionally, the imaging effect of the first imaging is determined to be better than that of the second imaging when one or more of the following conditions are met: the imaging area of the first imaging is larger than that of the second imaging; the number of skeleton points included in the first imaging is greater than the number included in the second imaging; or the confidence of the skeleton data of the first imaging is greater than that of the second imaging.
  • the imaging area of the first imaging refers to the imaging area of the target object in the first video image
  • the imaging area of the second imaging refers to the imaging area of the target object in the second video image.
  • It should be noted that the skeleton points included in the first imaging and the second imaging both refer to skeleton points directly visible in the imaging, excluding skeleton points whose positions are merely inferred.
  • The confidence of the skeleton data refers to the overall confidence of all skeleton points, including both the skeleton points directly visible in the imaging and those that are not. The positions of skeleton points that are not visible in the imaging can be inferred by relevant algorithms, and the confidence of such inferred skeleton points is generally low.
  • A larger imaging area usually reflects more detail, and more visible skeleton points or a higher skeleton-data confidence better reflects the target object's activity characteristics. Therefore, the larger the imaging area, the more skeleton points the imaging includes, and the higher the confidence of its skeleton data, the better the imaging effect is judged to be.
  • Optionally, the target object is a human body. Skeleton points of the human body include but are not limited to the nose, eyes, ears, shoulders, elbows, wrists, hips, knees and ankles.
  • FIG. 5 is a schematic diagram of the distribution of human skeleton points provided by the embodiment of the present application.
  • As shown in FIG. 5, the human body can include 17 skeleton points, namely nose 0, left eye 1, right eye 2, left ear 3, right ear 4, left shoulder 5, right shoulder 6, left elbow 7, right elbow 8, and so on through the wrists, hips, knees and ankles.
  • The following embodiments of the present application are described by taking a human body as the target object.
  • In this application, the management device may use, among the N frames of video images acquired at the same collection moment, the video image that includes the imaging of the target object and has the best imaging effect of the target object as the target video image, so as to synthesize the video stream corresponding to the target object.
  • This improves the clarity of the target object's moving picture in the synthesized video stream, so that the synthesized video stream better reflects the target object's activity characteristics, which is beneficial to analyzing those characteristics.
  • The following takes as an example the case where the target object first arrives at the shooting area of the first camera and then arrives at the shooting area of the second camera, and describes the implementation process by which the management device acquires the first imaging of the target object in the first video image and the second imaging of the target object in the second video image. The implementation process includes the following steps S11 to S14.
  • Step S11: the management device acquires the first imaging of the target object in the first video image.
  • If the first camera is the first camera to track and identify the target object, the management device may identify the target object in the collected video images using a target detection algorithm. After identifying the target object, the management device may also assign it a globally unique identifier and use this identifier to distinguish the imaging of the target object in each camera's video images. Finally, based on the target object's identifiers, a global identity mapping can be unified following the union-find algorithm idea, realizing multi-camera tracking and identification of the target object; a sketch of this idea follows below.
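  • A minimal sketch of that union-find idea (the track-identifier scheme is an illustrative assumption):

```python
class UnionFind:
    """Merge per-camera track identifiers that belong to the same target
    object into one globally unified identity."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

# Example: track 7 in camera 1 and track 3 in camera 2 are recognized as the
# same person, so their identifiers are merged under one global identity.
uf = UnionFind()
uf.union(("cam1", 7), ("cam2", 3))
assert uf.find(("cam2", 3)) == uf.find(("cam1", 7))
```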
  • Otherwise, the implementation process by which the management device first acquires the imaging of the target object in the video images captured by the first camera may refer to the process, described below, by which the management device acquires the second imaging of the target object in the second video image collected by the second camera.
  • After the target object arrives at the shooting area of the first camera and the management device obtains the imaging of the target object in a video image collected by the first camera, the management device may use a target tracking algorithm to determine the imaging of the target object in video images subsequently captured by the first camera while the target object moves within that camera's shooting area.
  • The management device may also determine the imaging position of each skeleton point of the target object, and encode and package the skeleton points' imaging positions with the corresponding video image for subsequent analysis.
  • the imaging position of the skeleton point can be represented by pixel coordinates.
  • Step S12: the management device acquires the first imaging position of the first key point of the target object in the first video image.
  • the first key point of the target object may be obtained based on one or more bone points of the target object.
  • For example, the height of the human hips is used as the calibration height, and the midpoint of the left and right hips of the human body (i.e., the center point of the human body) can be used as the first key point.
  • the first key point may generally refer to one or more key points.
  • Step S13: based on the pixel-coordinate mapping relationship between the first camera and the second camera, the management device determines, according to the first imaging position, the second imaging position of the first key point in the second video image.
  • The first camera and the second camera may be two adjacent cameras among the M cameras, or they may not be adjacent. In either case, the management device can determine the second imaging position of the first key point in the second video image based on the stored homography matrices.
  • For example, suppose the first camera is camera 1, the second camera is camera 3, the homography matrix from camera 1 to camera 2 is H12, the homography matrix from camera 2 to camera 3 is H23, and the first imaging position is (x'1p, y'1p). The second imaging position of the first key point in the second video image can then be obtained by applying the composed homography H23·H12 to (x'1p, y'1p), as sketched below.
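  • A minimal sketch of this composition (an illustrative assumption, not the patent's code):

```python
import numpy as np

def map_keypoint(H12: np.ndarray, H23: np.ndarray, x1p: float, y1p: float):
    """Map the first key point's imaging position (x'1p, y'1p) in camera 1's
    image to its expected imaging position in camera 3's image by composing
    the homographies camera 1 -> camera 2 -> camera 3."""
    H13 = H23 @ H12                       # composed homography, camera 1 -> 3
    p3 = H13 @ np.array([x1p, y1p, 1.0])  # homogeneous pixel coordinates
    return p3[0] / p3[2], p3[1] / p3[2]
```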
  • Step S14: the management device determines the second imaging of the target object in the second video image according to the second imaging position.
  • Optionally, the management device may determine a human-body detection frame according to the second imaging position, and use the human body image within that detection frame in the second video image as the second imaging.
  • Alternatively, the management device may detect all human body images in the second video image and use, as the second imaging, the human body image whose first-key-point imaging position is closest to the second imaging position. If the management device detects only one human body image in the second video image, that image can be used directly as the second imaging of the target object, without performing the above steps S12 to S14.
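  • A minimal sketch of this nearest-detection matching (the detection structure and names are illustrative assumptions):

```python
import math

def match_detection(detections, p2):
    """`detections`: list of (person_id, (x, y)) pairs, giving the first key
    point's imaging position for every human body detected in the second
    video image. `p2`: the mapped (x, y) second imaging position.
    Returns the person_id whose key point is closest to p2."""
    if len(detections) == 1:
        return detections[0][0]  # only one person: use it directly
    best_id, best_dist = None, math.inf
    for person_id, (x, y) in detections:
        d = math.hypot(x - p2[0], y - p2[1])
        if d < best_dist:
            best_id, best_dist = person_id, d
    return best_id
```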
  • FIG. 6 is a schematic diagram of pixel coordinate mapping between two cameras provided by an embodiment of the present application.
  • the left picture is the first video image
  • the right picture is the second video image.
  • the first video image includes a first imaging of the target object.
  • The first imaging position of the first key point p in the first video image is p1. Based on the pixel-coordinate mapping relationship between the first camera and the second camera, the second imaging position of the first key point p in the second video image is obtained as p2. The second imaging of the target object in the second video image can then be determined based on the second imaging position p2.
  • When the target scene is an athletes' training or competition ground, the video images captured by a single camera may include multiple human images, and in many cases the camera cannot capture faces, so it is difficult to track and identify a single target through face recognition.
  • In this application, by pre-determining the pixel-coordinate mapping relationship between adjacent cameras, the management device can use the correlation of the imaging's geometric positions in the video images collected by adjacent cameras to realize cross-camera tracking and identification of the target object, without relying on face recognition.
  • the solution of this application does not depend on the unique characteristics of the target object, and can be applied to various scenarios through the flexible deployment and calibration of the camera.
  • Step 403: the management device synthesizes the video stream corresponding to the target object according to the multiple frames of target video images corresponding to the multiple collection moments.
  • the video stream of the target object is used to reflect the activity information of the target object in the target scene.
  • the video images in the video stream corresponding to the target object are arranged in chronological order.
  • the management device may perform cropping processing on the target video image, so that the imaging of the target object is located in a central area of the cropped video image.
  • the implementation process of step 403 includes: the management device arranges the cropped video images of multiple frames in chronological order based on the multiple collection moments, so as to obtain the video stream corresponding to the target object.
  • the size of the cropping window may be preset, and the management device uses the imaging of the center point of the human body of the target object as the center of the cropping window to perform cropping processing on the original video image.
  • FIG. 7 is a schematic diagram of a video image before and after cropping provided by an embodiment of the present application. As shown in FIG. 7, the imaging of the target object in the cropped video image is located in the central area of the image, so the cropped video image better highlights the imaging of the target object.
  • In this application, the management device may crop each acquired frame of target video image so that, in all video images of the finally synthesized video stream, the imaging of the target object is in the central area. This not only achieves a focus-following effect on the target object, but also gives the synthesized video stream a better display effect and makes its playback smoother, thereby improving the user's viewing experience. A sketch of the cropping step follows below.
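  • A minimal sketch of this cropping step (the window size and the clamping behavior at the image border are illustrative assumptions):

```python
import numpy as np

def crop_centered(frame: np.ndarray, center_xy, win_w: int, win_h: int) -> np.ndarray:
    """Crop a fixed-size window from `frame`, centered on the imaging of the
    target object's body center, clamped so the window stays inside the frame.
    Assumes win_w <= frame width and win_h <= frame height."""
    h, w = frame.shape[:2]
    cx, cy = center_xy
    x0 = int(min(max(cx - win_w / 2, 0), w - win_w))
    y0 = int(min(max(cy - win_h / 2, 0), h - win_h))
    return frame[y0:y0 + win_h, x0:x0 + win_w]
```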
  • the management device may also perform smoothing and filtering processing on the cropped video image.
  • In the method provided by this embodiment, multiple cameras are fixedly deployed in the target scene; their shooting areas differ, and each camera can capture clear video images of a different area of the target scene.
  • The management device selects, from the multiple frames of video images collected by the multiple cameras at the same collection moment, one frame of video image containing the imaging of the target object for video synthesis. Because each camera can capture clear video images of its corresponding area in the target scene, there is always a camera that can capture a clear moving picture of the target object while it moves within different cameras' shooting areas, so the synthesized video stream can provide a clear picture of the target object moving throughout the target scene; that is, the clarity of the target object's moving picture in the synthesized video stream is guaranteed.
  • In addition, because the cameras are fixedly deployed, the camera parameters can be preset according to the required shooting areas, and there is no need to adjust them during shooting, which keeps the implementation simple.
  • Optionally, after step 403, the following step 404 may also be performed.
  • Step 404: the management device outputs the video stream corresponding to the target object.
  • Optionally, the management device has a display function. In this case, outputting the video stream corresponding to the target object may mean that the management device displays the playback picture of the video stream corresponding to the target object on a playback interface.
  • Optionally, the management device can also obtain the imaging positions of the target object's skeleton points in the target video image; when the management device displays the playback picture of the video stream corresponding to the target object on the playback interface, the skeleton points of the target object can be displayed on the imaging of the target object in the playback picture.
  • Alternatively, the management device may output the video stream corresponding to the target object by sending it to a terminal, so that the terminal displays the playback picture of the video stream corresponding to the target object on a playback interface.
  • the management device in response to receiving a play request from the terminal, sends the video stream corresponding to the target object to the terminal.
  • the play request may carry the identifier of the target object.
  • FIG. 8 is a schematic diagram of a playback interface provided by an embodiment of the present application.
  • the playback interface Z displays the playback screen of the video stream corresponding to the target object.
  • multiple skeletal points are displayed on the imaging of the target object (only 9 skeletal points are shown in the figure for schematic illustration).
  • In this application, when synthesizing the video stream corresponding to the target object, the management device can encode the imaging positions of the target object's skeleton points together with the corresponding video images; the skeleton points of the target object can then be displayed on the imaging of the target object in the playback picture of the video stream, which helps analyze the target object's activity.
  • Optionally, after obtaining the skeleton data of the target object, the management device can also perform motion analysis on the target object based on that data, including but not limited to determining the target object's trajectory, counting its steps, and calculating its displacement or movement speed.
  • Optionally, when the management device or the terminal displays the playback picture of the video stream corresponding to the target object, it can also superimpose the target object's real-time motion analysis results on the playback picture, for example the real-time trajectory, step count, displacement and speed, to further facilitate motion analysis of the target object.
  • Optionally, the management device may also send the skeleton data of the target object to an analysis device, which then performs the motion analysis. That is, the synthesis of the video stream and the motion analysis of the target object may be completed by one device or divided among multiple devices; this is not limited in the embodiments of the present application.
  • the implementation process of the management device determining the movement track of the target object includes: the management device determines the horizontal position of the second key point in the world coordinate system according to the imaging position of the second key point of the target object in the target video image.
  • the management device generates the movement trajectory of the target object according to the horizontal positions of the second key point in the world coordinate system at multiple acquisition moments.
  • the second key point of the target object can be obtained based on one or more bone points of the target object.
  • For example, the ground is used as the calibration height, and the midpoint of the left and right ankles of the human body is used as the second key point.
  • the second key point may generally refer to one or more key points.
  • The management device can determine the horizontal position of the second key point in the world coordinate system based on the transformation matrix from the image coordinate system of the camera that captured the target video image to the two-dimensional world coordinate system, according to the imaging position of the second key point of the target object in the target video image.
  • FIG. 9 is a schematic diagram of a movement trajectory of a target object provided in an embodiment of the present application.
  • the target object is moving on the speed skating track.
  • Assume that the two-dimensional horizontal coordinates of the second key point of the target object are (x_t1, y_t1) at collection moment t1, (x_t2, y_t2) at t2, (x_t3, y_t3) at t3, (x_t4, y_t4) at t4, and (x_t5, y_t5) at t5; connecting these positions in time order finally yields a motion trajectory on the horizontal plane.
  • the two-dimensional horizontal coordinates reflect the horizontal position in the world coordinate system.
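  • A minimal sketch of this trajectory generation (the data layout and names are illustrative assumptions):

```python
import numpy as np

def trajectory(samples, H_w_by_camera):
    """`samples`: chronologically ordered list of (camera_id, (x, y)) pairs,
    giving the second key point's imaging position in the target video image
    at each collection moment. `H_w_by_camera`: maps a camera_id to that
    camera's image-to-world homography. Returns the horizontal world
    positions (x, y) in time order."""
    track = []
    for camera_id, (x, y) in samples:
        p = H_w_by_camera[camera_id] @ np.array([x, y, 1.0])
        track.append((p[0] / p[2], p[1] / p[2]))
    return track
```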
  • Optionally, based on the synthesized video stream of the target object, the management device may determine each crossing of the left and right ankles as one step, so as to calculate the number of steps; a sketch follows below.
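  • A minimal sketch of this step counting (treating a sign change of the left-right ankle offset along the direction of travel as one crossing is an assumption about one way to detect crossings):

```python
def count_steps(left_ankle, right_ankle):
    """`left_ankle` and `right_ankle`: per-frame coordinates of the two
    ankles along the direction of motion. Each sign change of (left - right)
    between consecutive frames counts as one crossing, i.e. one step."""
    steps = 0
    prev = left_ankle[0] - right_ankle[0]
    for l, r in zip(left_ankle[1:], right_ankle[1:]):
        cur = l - r
        if cur * prev < 0:  # the ankles crossed between these two frames
            steps += 1
        prev = cur
    return steps
```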
  • the order of the steps of the video synthesis method provided in the embodiment of the present application can be adjusted appropriately, and the steps can also be increased or decreased according to the situation.
  • Any variation readily conceivable by a person familiar with the technical field, within the technical scope disclosed in this application, shall be covered by the protection scope of this application.
  • For example, the solution of this application can be applied not only to athletes' training or competition scenes to synthesize an athlete's full-course motion video, but also to emergency-escape command scenes, where synthesizing a real-time video of each person helps formulate individual escape routes and increases the chances of escape; it can also be applied to tourist attractions to synthesize a full-course tour video of a tourist in a scenic area.
  • the above-mentioned target objects may also be animals, and the solution of the present application may also be applied to animal protection scenarios, and so on.
  • the embodiment of the present application does not limit the application scenarios of the above methods, and details are not described here one by one.
  • Optionally, the cameras deployed in the target scene may also be implemented in other ways, for example by remotely controlled drones.
  • To sum up, in the video synthesis method provided by the embodiments of this application, multiple cameras are fixedly deployed in the target scene; their shooting areas differ, and each camera can capture clear video images of a different area of the target scene.
  • The management device selects, from the multiple frames of video images collected by the multiple cameras at the same collection moment, one frame of video image containing the imaging of the target object for video synthesis. Because each camera can capture clear video images of its corresponding area in the target scene, there is always a camera that can capture a clear moving picture of the target object while it moves within different cameras' shooting areas, so the synthesized video stream can provide a clear picture of the target object moving throughout the target scene; that is, the clarity of the target object's moving picture in the synthesized video stream is guaranteed. In addition, because the cameras are fixedly deployed, the camera parameters can be preset according to the required shooting areas, and there is no need to adjust them during shooting, which keeps the implementation simple.
  • In addition, the management device pre-determines the pixel-coordinate mapping relationship between adjacent deployed cameras. When the target object moves into the common-view area of two adjacent cameras, the management device can use the correlation of the imaging's geometric positions in the video images captured by the two cameras to realize cross-camera tracking and identification of the target object.
  • the solution of this application does not depend on the unique characteristics of the target object, and can be applied to various scenarios through the flexible deployment and calibration of the camera.
  • FIG. 10 is a schematic structural diagram of a management device provided by an embodiment of the present application.
  • the management device may be the management device 102 in the video synthesis system shown in FIG. 1 .
  • the management device 1000 includes:
  • the first acquiring module 1001 is configured to acquire N frames of video images respectively acquired by N cameras deployed in the target scene at each of multiple acquisition moments, where N ⁇ 2.
  • the second obtaining module 1002 is configured to obtain a frame of target video image from N frames of video images corresponding to each collection moment, where the target video image includes the imaging of the target object.
  • the video synthesis module 1003 is configured to synthesize a video stream corresponding to the target object according to multiple frames of target video images corresponding to multiple acquisition moments, and the video stream is used to reflect the activity information of the target object in the target scene.
  • Optionally, the second obtaining module 1002 is configured to: obtain all candidate video images that include the imaging of the target object among the N frames of video images corresponding to each collection moment, and obtain the target video image from the candidate video images.
  • the N cameras include a first camera and a second camera, and the first camera and the second camera have a common viewing area.
  • Optionally, the second obtaining module 1002 is configured to: when the target object is located in the common-view area of the first camera and the second camera at the first collection moment, use both the first video image collected by the first camera at the first collection moment and the second video image collected by the second camera at the first collection moment as candidate video images corresponding to the first collection moment.
  • the second obtaining module 1002 is configured to: obtain a first imaging of the target object in the first video image and a second imaging of the target object in the second video image.
  • In response to the imaging effect of the first imaging being better than that of the second imaging, use the first video image as the target video image corresponding to the first collection moment.
  • Optionally, the imaging effect of the first imaging is determined to be better than that of the second imaging when one or more of the following conditions are met: the imaging area of the first imaging is larger than that of the second imaging; the number of skeleton points included in the first imaging is greater than the number included in the second imaging; or the confidence of the skeleton data of the first imaging is greater than that of the second imaging.
  • the second obtaining module 1002 is configured to: after obtaining the first imaging of the target object in the first video image, obtain a first imaging position of the first key point of the target object in the first video image. Based on the pixel coordinate mapping relationship between the first camera and the second camera, the second imaging position of the first key point in the second video image is determined according to the first imaging position. A second imaging of the target object in the second video image is determined according to the second imaging position.
  • Optionally, M cameras are deployed in the target scene, any two adjacent cameras among the M cameras have a common-view area, M ≥ N, and the N cameras belong to the M cameras; multiple homography matrices are stored in the management device, each of which reflects the pixel-coordinate mapping relationship between one pair of adjacent cameras among the M cameras.
  • the management device 1000 further includes: an image processing module 1004, configured to perform cropping processing on the target video image, so that the imaging of the target object is located in the central area of the cropped video image.
  • the video synthesis module 1003 is configured to arrange multiple frames of video images that have been cropped respectively in chronological order based on multiple acquisition moments, so as to obtain a video stream.
  • Optionally, the management device 1000 further includes: a determination module 1005, configured to determine the horizontal position of the second key point in the world coordinate system according to the imaging position of the second key point of the target object in the target video image.
  • the trajectory generation module 1006 is configured to generate the movement trajectory of the target object according to the horizontal positions of the second key point in the world coordinate system at multiple acquisition moments.
  • the management device 1000 further includes: a third acquiring module 1007, configured to acquire the imaging position of the skeletal point of the target object in the target video image.
  • The display module 1008 is configured to display the playback picture of the video stream on the playback interface, with the skeleton points of the target object displayed on the imaging of the target object in the playback picture.
  • the embodiment of the present application also provides a video synthesis system, including: a management device and multiple cameras.
  • the camera is used to collect video images
  • the management device is used to execute the method steps shown in FIG. 4 .
  • Fig. 14 is a block diagram of a management device provided by an embodiment of the present application.
  • a management device 1400 includes: a processor 1401 and a memory 1402 .
  • memory 1402 configured to store computer programs, the computer programs including program instructions
  • the processor 1401 is configured to call the computer program to implement the method steps shown in FIG. 4 in the above method embodiment.
  • the management device 1400 further includes a communication bus 1403 and a communication interface 1404 .
  • the processor 1401 includes one or more processing cores, and the processor 1401 executes various functional applications and data processing by running computer programs.
  • Memory 1402 may be used to store computer programs.
  • the memory may store an operating system and application program units required for at least one function.
  • the operating system can be an operating system such as a real-time operating system (Real Time eXecutive, RTX), LINUX, UNIX, WINDOWS or OS X.
  • The communication interface 1404 may be used to communicate with other devices. For example, the communication interface of the management device 1400 may be used to send a video stream to a terminal.
  • the memory 1402 and the communication interface 1404 are respectively connected to the processor 1401 through the communication bus 1403 .
  • the embodiment of the present application also provides a computer-readable storage medium, where instructions are stored on the computer-readable storage medium, and when the instructions are executed by a processor, the method steps shown in FIG. 4 are implemented.
  • An embodiment of the present application also provides a computer program product, including a computer program, and when the computer program is executed by a processor, the method steps shown in FIG. 4 are implemented.
  • All or part of the steps of the above embodiments may be implemented by a program instructing relevant hardware; the program can be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Studio Devices (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The present application discloses a video synthesis method, device and system, belonging to the technical field of video processing. A management device acquires N frames of video images respectively collected, at each of multiple acquisition moments, by N cameras deployed in a target scene. From the N frames of video images corresponding to each acquisition moment, the management device acquires one frame of target video image that includes the imaging of a target object, and then synthesizes a video stream corresponding to the target object based on the multiple frames of target video images corresponding to the multiple acquisition moments. The multiple cameras each collect clear video images of a different region of the target scene. For each acquisition moment, the management device selects one frame of video image containing the imaging of the target object for video synthesis. Since some camera can always capture a clear picture of the target object as it moves through the shooting areas of different cameras, the synthesized video stream can provide a clear picture of the target object's activity over the whole course in the target scene.

Description

Video synthesis method, device and system
This application claims priority to Chinese Patent Application No. 202210022166.8, filed on January 10, 2022 and entitled "Video synthesis method, device and system", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of video processing, and in particular to a video synthesis method, device and system.
Background
Videos are usually recorded during athletes' training and competitions so that coaches can analyze an athlete's motion data and formulate a targeted training plan for the individual athlete. However, because sports venues are usually large, a camera at a single fixed position cannot clearly capture an athlete's movement over the whole course, so a video stream collected by one fixed-position camera cannot guarantee the clarity of a single athlete's full-course motion picture.
Summary
The present application provides a video synthesis method, device and system, in which the video stream synthesized for a single moving object can provide a clear picture of that object's activity over the whole course in a scene.
In a first aspect, a video synthesis method is provided, and the method is applied to a management device. The management device acquires N frames of video images respectively collected, at each of multiple acquisition moments, by N cameras deployed in a target scene, where N ≥ 2. The management device acquires one frame of target video image from the N frames of video images corresponding to each acquisition moment, the target video image including imaging of a target object. The management device synthesizes, based on the multiple frames of target video images corresponding to the multiple acquisition moments, a video stream corresponding to the target object, and the video stream is used to reflect activity information of the target object in the target scene.
In the present application, multiple cameras are fixedly deployed in the target scene. These cameras have different shooting areas and can each capture clear video images of a different region of the target scene. From the multiple frames of video images respectively collected by the multiple cameras at the same acquisition moment, the management device selects one frame that contains the imaging of the target object for video synthesis. Since each camera can capture clear video images of its corresponding region of the target scene, there is always some camera that can capture a clear picture of the target object as it moves through the shooting areas of different cameras. The synthesized video stream can therefore provide a clear picture of the target object's activity over the whole course in the target scene, that is, the clarity of the target object's activity picture in the synthesized video stream is guaranteed. In addition, because the cameras are fixedly deployed, the camera parameters can be preset according to the required shooting areas and do not need to be adjusted during shooting, so the implementation is simple.
Optionally, an implementation in which the management device acquires one frame of target video image from the N frames of video images corresponding to each acquisition moment includes: the management device acquires, from the N frames of video images corresponding to each acquisition moment, all candidate video images that include the imaging of the target object, and then acquires the target video image from all the candidate video images.
Optionally, the N cameras include a first camera and a second camera, and the first camera and the second camera have a co-view area. An implementation in which the management device acquires, from the N frames of video images corresponding to each acquisition moment, all candidate video images that include the imaging of the target object includes: when the target object is located in the co-view area of the first camera and the second camera at a first acquisition moment, the management device takes both a first video image collected by the first camera at the first acquisition moment and a second video image collected by the second camera at the first acquisition moment as candidate video images corresponding to the first acquisition moment.
Correspondingly, an implementation in which the management device acquires the target video image from all the candidate video images may include: the management device acquires first imaging of the target object in the first video image and second imaging of the target object in the second video image. In response to the imaging effect of the first imaging being better than that of the second imaging, the management device takes the first video image as the target video image corresponding to the first acquisition moment.
In the present application, among the N frames of video images acquired at the same acquisition moment, the management device may take the video image that includes the imaging of the target object with the best imaging effect as the target video image for synthesizing the video stream corresponding to the target object. This can further improve the clarity of the target object's activity picture in the synthesized video stream, so that the synthesized video stream better reflects the activity characteristics of the target object, which is beneficial to analyzing those characteristics.
Optionally, the imaging effect of the first imaging being better than that of the second imaging satisfies one or more of the following conditions: the imaging area of the first imaging is larger than that of the second imaging; the number of skeleton points included in the first imaging is greater than that included in the second imaging; the confidence of the skeleton data of the first imaging is greater than that of the second imaging.
A larger imaging area can usually show more detail, and more skeleton points or higher skeleton-data confidence can better reflect the activity characteristics of the target object. Therefore, the larger the imaging area, the more skeleton points the imaging includes, and the higher the confidence of its skeleton data, the better the imaging effect can be judged to be.
Optionally, an implementation in which the management device acquires the second imaging of the target object in the second video image includes: after acquiring the first imaging of the target object in the first video image, the management device acquires a first imaging position of a first key point of the target object in the first video image. Based on the pixel coordinate mapping relationship between the first camera and the second camera, the management device determines a second imaging position of the first key point in the second video image according to the first imaging position. The management device determines the second imaging of the target object in the second video image according to the second imaging position.
In the present application, the pixel coordinate mapping relationship between two adjacent cameras is determined in advance. When the target object moves into the co-view area of two adjacent cameras, the management device can realize cross-camera tracking and identification of the target object according to the correlation between the geometric imaging positions of the target object in the video images collected by the two adjacent cameras. The solution of the present application does not depend on any unique feature of the target object and, through flexible deployment and calibration of cameras, can be applied to various scenes.
Optionally, M cameras are deployed in the target scene. Any two adjacent cameras among the M cameras have a co-view area. M ≥ N, and the N cameras belong to the M cameras. The management device stores multiple homography matrices, each of which is used to reflect the pixel coordinate mapping relationship between one group of two adjacent cameras among the M cameras.
In the present application, the accuracy of cross-camera tracking and identification of the target object can be improved by deploying more cameras in the target scene, and the fluency of the synthesized video stream can be improved by selecting the video images collected by fewer of those cameras for synthesizing the video stream. That is, M > N, which can guarantee both the accuracy and the fluency of the synthesized video stream.
Optionally, after acquiring the target video image, the management device may crop the target video image so that the imaging of the target object is located in the central region of the cropped video image. The management device then arranges the multiple cropped frames of video images in chronological order based on the multiple acquisition moments to obtain the video stream corresponding to the target object.
In the present application, the management device may crop each acquired frame of target video image so that the imaging of the target object is in the central region of every video image of the finally synthesized video stream. This achieves a focus-following effect on the target object and also gives the synthesized video stream a better display effect, making the playing picture of the video stream more fluent and smooth, thereby improving the user's viewing experience.
Optionally, the management device may further determine the horizontal position of a second key point of the target object in the world coordinate system according to the imaging position of the second key point in the target video image, and generate a movement trajectory of the target object according to the horizontal positions of the second key point in the world coordinate system at the multiple acquisition moments.
In the present application, after acquiring the skeleton data of the target object, the management device may further perform motion analysis on the target object based on the skeleton data, including but not limited to determining the target object's movement trajectory, counting its steps, and calculating its displacement or movement speed.
Optionally, after acquiring the target video image, the management device may further acquire the imaging positions of the target object's skeleton points in the target video image, and display the playing picture of the video stream on a playing interface, with the skeleton points of the target object displayed on the imaging of the target object in the playing picture.
In the present application, when synthesizing the video stream corresponding to the target object, the management device may encode and encapsulate the imaging positions of the target object's skeleton points together with the corresponding video images, so that when the playing picture of the video stream is displayed, the skeleton points of the target object can be displayed on its imaging in the playing picture, which helps analyze the target object's activity.
In a second aspect, a management device is provided. The management device includes multiple functional modules that interact to implement the methods in the first aspect and its implementations. The multiple functional modules may be implemented based on software, hardware, or a combination of software and hardware, and may be arbitrarily combined or divided based on the specific implementation.
In a third aspect, a management device is provided, including: a processor and a memory;
the memory is configured to store a computer program, the computer program including program instructions;
the processor is configured to call the computer program to implement the methods in the first aspect and its implementations.
In a fourth aspect, a computer-readable storage medium is provided, with instructions stored thereon that, when executed by a processor, implement the methods in the first aspect and its implementations.
In a fifth aspect, a computer program product is provided, including a computer program that, when executed by a processor, implements the methods in the first aspect and its implementations.
In a sixth aspect, a chip is provided. The chip includes a programmable logic circuit and/or program instructions, and when the chip runs, the methods in the first aspect and its implementations are implemented.
Brief Description of the Drawings
Fig. 1 is a schematic structural diagram of a video synthesis system provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of the relative positions of two adjacent cameras provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of camera distribution positions provided by an embodiment of the present application;
Fig. 4 is a schematic flowchart of a video synthesis method provided by an embodiment of the present application;
Fig. 5 is a schematic diagram of the distribution of human skeleton points provided by an embodiment of the present application;
Fig. 6 is a schematic diagram of pixel coordinate mapping between two cameras provided by an embodiment of the present application;
Fig. 7 is a schematic before-and-after comparison of the cropping of a video image provided by an embodiment of the present application;
Fig. 8 is a schematic diagram of a playing interface provided by an embodiment of the present application;
Fig. 9 is a schematic diagram of a movement trajectory of a target object provided by an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a management device provided by an embodiment of the present application;
Fig. 11 is a schematic structural diagram of a management device provided by an embodiment of the present application;
Fig. 12 is a schematic structural diagram of a management device provided by an embodiment of the present application;
Fig. 13 is a schematic structural diagram of a management device provided by an embodiment of the present application;
Fig. 14 is a block diagram of a management device provided by an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the implementations of the present application are further described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic structural diagram of a video synthesis system provided by an embodiment of the present application. As shown in Fig. 1, the video synthesis system includes: a media source 101 and a management device 102.
The media source 101 is used to provide multiple video streams. Referring to Fig. 1, the media source 101 includes multiple cameras 1011. Each camera 1011 is used to collect one video stream. The multiple cameras 1011 collect images at the same moments and at the same frequency. Optionally, camera synchronization technology may be used to realize synchronized shooting by the multiple cameras 1011. The number of cameras in Fig. 1 is merely illustrative and does not limit the video synthesis system provided by the embodiment of the present application.
Optionally, any two adjacent cameras among the multiple cameras 1011 have a co-view area. Two cameras having a co-view area means that the shooting areas of the two cameras overlap. For example, Fig. 2 is a schematic diagram of the relative positions of two adjacent cameras provided by an embodiment of the present application. As shown in Fig. 2, the shooting area of camera A is area a, and the shooting area of camera B is area b. Area a and area b have an overlapping area c, which is the co-view area of camera A and camera B.
Optionally, the multiple cameras 1011 may be arranged in a ring, a fan, a straight line, or some other irregular layout, and the camera layout may be designed according to the actual deployment scene. For example, if multiple cameras are used to collect videos of athletes moving on a ring-shaped speed-skating track, the cameras may be deployed around the track in a ring layout. Fig. 3 is a schematic diagram of camera distribution positions provided by an embodiment of the present application. As shown in Fig. 3, 20 cameras, denoted cameras 1-20, are deployed near the speed-skating track. The 20 cameras are arranged in a ring, and their shooting directions all face the track. Optionally, the union of the shooting areas of the 20 cameras can completely cover the entire track; that is, when an athlete moves on the track, at every acquisition moment there is always at least one of the 20 cameras that can collect a video image containing the imaging of that athlete.
The management device 102 is used to analyze and process the multiple video streams from the multiple cameras 1011 of the media source 101, so as to extract the video images containing the imaging of a target object from the multiple video streams and then synthesize the video stream corresponding to the target object. Every frame of video image in this video stream includes the imaging of the target object, and the video stream may also be called the synthesized video stream corresponding to the target object. Optionally, the frame rate of the video stream synthesized by the management device 102 is the same as the frequency at which the cameras 1011 collect images. Since at every acquisition moment there is always at least one camera that can collect a video image containing the imaging of the target object while it moves in the scene where the multiple cameras 1011 are deployed, the management device 102 can acquire, from the multiple video streams, one frame of video image containing the imaging of the target object for each acquisition moment, and finally synthesize a video stream whose frame rate equals the cameras' image collection frequency. Optionally, the management device 102 may be one device or multiple devices. For example, the management device 102 may be a server, a server cluster composed of several servers, or a cloud computing service center.
Optionally, the management device 102 may use a target detection algorithm to identify the target object in the video images collected by a single camera, and use a target tracking algorithm to determine the imaging of the target object in the video images subsequently collected by that camera. When the target object moves from that camera's shooting area into the co-view area of that camera and its adjacent camera, the management device 102 can determine the imaging of the target object in the video images collected by the adjacent camera according to the correlation between the geometric imaging positions of the target object in the video images collected by the two adjacent cameras, thereby realizing cross-camera tracking and identification of the target object.
In the embodiment of the present application, the multiple cameras 1011 of the media source 101 are all fixedly deployed, and the camera parameters of each camera are preset. During shooting, each camera's shooting area and focus are fixed, so each camera's image coordinate system is fixed, and the pixel coordinates of the imaging of the co-view area of two adjacent cameras in those two cameras have a fixed mapping relationship. The management device 102 may store multiple homography matrices, each of which is used to reflect the pixel coordinate mapping relationship between one group of two adjacent cameras. A homography matrix here can be understood as the transformation matrix between the image coordinate systems of two adjacent cameras.
After the deployment and calibration of the multiple cameras are completed, the management device 102 may generate the homography matrix reflecting the pixel coordinate mapping relationship between two adjacent cameras based on the pixel coordinates, in the image coordinate systems of the two adjacent cameras, of multiple pixel points in the co-view area of the two cameras. For example, referring to the example shown in Fig. 3, the homography matrix from camera 1 to camera 2 is H12, and there is a marker point M in the co-view area of camera 1 and camera 2. The pixel coordinates of marker point M in the video image collected by camera 1 are (x1m, y1m), and its pixel coordinates in the video image collected by camera 2 are (x2m, y2m), satisfying: (x2m, y2m) = H12 * (x1m, y1m). The homography matrix from camera 2 to camera 3 is H23, and there is a marker point N in the co-view area of camera 2 and camera 3. The pixel coordinates of marker point N in the video image collected by camera 2 are (x2n, y2n), and its pixel coordinates in the video image collected by camera 3 are (x3n, y3n), satisfying: (x3n, y3n) = H23 * (x2n, y2n). The image coordinate system takes the upper-left vertex of the image collected by the camera as its origin, and its x axis and y axis run along the length and width directions of the collected image.
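As an illustration of this calibration step, the following Python sketch (using OpenCV, assuming at least four marker points in the co-view area have been labeled in both images; all coordinate values here are hypothetical) estimates H12 and applies it to a point:

    import cv2
    import numpy as np

    # Pixel coordinates of the same marker points in the co-view area,
    # as seen by camera 1 and camera 2 (hypothetical calibration data).
    pts_cam1 = np.array([[512, 388], [640, 402], [705, 550], [560, 610]], dtype=np.float32)
    pts_cam2 = np.array([[102, 395], [230, 410], [298, 556], [151, 615]], dtype=np.float32)

    # H12 maps pixel coordinates in camera 1's image to camera 2's image.
    H12, _ = cv2.findHomography(pts_cam1, pts_cam2, method=cv2.RANSAC)

    def map_point(H, pt):
        # Apply a 3x3 homography to a single (x, y) pixel coordinate.
        src = np.array([[pt]], dtype=np.float32)   # shape (1, 1, 2)
        dst = cv2.perspectiveTransform(src, H)
        return float(dst[0, 0, 0]), float(dst[0, 0, 1])

    x2m, y2m = map_point(H12, (600.0, 480.0))      # marker M as seen by camera 1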
Optionally, the management device 102 may also select any one of the M cameras and generate the transformation matrix from that camera's image coordinate system to the two-dimensional world coordinate system, which is the homography matrix from the camera's image coordinate system to the two-dimensional world coordinate system. For example, multiple markers may be placed in that camera's shooting area and their horizontal positions in the world coordinate system recorded; the management device then calculates the transformation matrix from these markers' horizontal positions in the world coordinate system and their pixel coordinates in the image coordinate system. Then, based on this camera's image-to-world transformation matrix and the multiple homography matrices that respectively reflect the pixel coordinate mapping relationships between adjacent cameras, the management device can calculate the transformation matrix from each camera's image coordinate system to the two-dimensional world coordinate system. For example, referring to the example shown in Fig. 3, the homography matrix from camera 1 to camera 2 is H12, the homography matrix from camera 2 to camera 3 is H23, and the transformation matrix from camera 2's image coordinate system to the two-dimensional world coordinate system is known to be H2w. Then the transformation matrix H1w from camera 1's image coordinate system to the two-dimensional world coordinate system satisfies: H1w = H12 * H2w, and the transformation matrix H3w from camera 3's image coordinate system to the two-dimensional world coordinate system satisfies: H3w = H32 * H2w, where H32 is the inverse matrix of H23. That is, given that the transformation matrix from camera i's image coordinate system to the two-dimensional world coordinate system is Hiw, the transformation matrix Hjw from camera j's image coordinate system to the two-dimensional world coordinate system satisfies: Hjw = Hji * Hiw, where if i > j, Hji = Hj * Hj+1 * ... * Hi, and if i < j, Hji = Hj * Hj-1 * ... * Hi. Both i and j are positive integers, camera i denotes the i-th camera among the M cameras, and camera j denotes the j-th camera among the M cameras. The world coordinate system can describe the position of a camera in the real world, and likewise the position in the real world of an object in an image collected by a camera. The x axis and y axis of the world coordinate system lie on the horizontal plane, and the z axis is perpendicular to it. The two-dimensional world coordinate system in the embodiment of the present application refers to the horizontal coordinate system formed by the x axis and the y axis. A horizontal position in the world coordinate system can be expressed by two-dimensional horizontal coordinates (x, y).
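In code, the chaining above reduces to a few matrix products. A minimal Python sketch follows; the identity matrices stand in for real calibration results, and note that under the column-vector convention p' ~ H·p used here, transforms compose right-to-left:

    import numpy as np

    # Hypothetical known calibration results (stand-ins for real values):
    H12 = np.eye(3)   # camera 1 image -> camera 2 image
    H23 = np.eye(3)   # camera 2 image -> camera 3 image
    H2w = np.eye(3)   # camera 2 image -> 2D world ground plane

    H1w = H2w @ H12                  # camera 1 image -> world
    H3w = H2w @ np.linalg.inv(H23)   # camera 3 image -> world (H32 is the inverse of H23)

    def pixel_to_world(Hiw, pt):
        # Project a pixel coordinate onto the horizontal world plane.
        p = np.array([pt[0], pt[1], 1.0])
        w = Hiw @ p
        return w[0] / w[2], w[1] / w[2]   # 2D horizontal coordinates (x, y)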
Fig. 4 is a schematic flowchart of a video synthesis method provided by an embodiment of the present application. The method may be applied to the management device 102 in the video synthesis system shown in Fig. 1. As shown in Fig. 4, the method includes:
Step 401: The management device acquires N frames of video images respectively collected, at each of multiple acquisition moments, by N cameras deployed in a target scene.
Here, N ≥ 2. Optionally, M cameras are deployed in the target scene, any two adjacent cameras among the M cameras have a co-view area, and M ≥ N. The management device stores multiple homography matrices, each of which is used to reflect the pixel coordinate mapping relationship between one group of two adjacent cameras among the M cameras. The N cameras belong to the M cameras. If M = N, the N cameras include all cameras deployed in the target scene. If M > N, the N cameras include some of the cameras deployed in the target scene. In the case of M > N, two adjacent cameras among the N cameras may or may not have a co-view area. The selected N cameras are deployed evenly in the target scene so that the union of their shooting areas covers the entire target scene as completely as possible.
For example, referring to the example shown in Fig. 3, the target scene is a speed-skating venue, and 20 cameras are deployed near the track (i.e., M = 20). The video images collected by 8 of those cameras may be selected for synthesizing the video stream (i.e., N = 8); the 8 cameras may include, for example, camera 2, camera 4, camera 6, camera 9, camera 12, camera 14, camera 16, and camera 19. Assuming the track is 400 meters long, one camera may be selected every 50 meters.
The larger the co-view area of two adjacent cameras, the more accurate the computed homography matrix reflecting the pixel coordinate mapping relationship between them usually is. Therefore, more cameras may be deployed in the target scene: increasing the deployment density improves the accuracy of the computed homography matrices and thus the accuracy of cross-camera tracking and identification of the target object. On the other hand, when the management device synthesizes the video stream, switching camera positions too frequently when selecting video images would make the viewing angle change too quickly, degrading the fluency of the video and the user's viewing experience. Selecting the video images collected by fewer cameras for synthesizing the video stream therefore improves the fluency of the synthesized video stream and thus the viewing experience.
In the embodiment of the present application, the accuracy of cross-camera tracking and identification of the target object can be improved by deploying more cameras in the target scene, and the fluency of the synthesized video stream can be improved by selecting the video images collected by fewer of those cameras for synthesis. This improves both the accuracy and the fluency of the synthesized video stream.
Step 402: The management device acquires one frame of target video image from the N frames of video images corresponding to each acquisition moment, the target video image including the imaging of the target object.
The N frames of video images corresponding to each acquisition moment come respectively from the N cameras. Optionally, at each acquisition moment the target object is located in the shooting area of at least one of the N cameras.
Optionally, the implementation of step 402 includes the following steps 4021 and 4022.
In step 4021, the management device acquires, from the N frames of video images corresponding to each acquisition moment, all candidate video images that include the imaging of the target object.
Optionally, when the target object is located in the shooting area of only one camera at a certain acquisition moment, the N frames of video images corresponding to that acquisition moment include one candidate video image. When the target object is located in the shooting areas of two or more cameras at a certain acquisition moment, the N frames of video images corresponding to that acquisition moment include two or more candidate video images.
Optionally, the N cameras include a first camera and a second camera that have a co-view area. When the target object is located in the co-view area of the first camera and the second camera at a first acquisition moment, the management device takes both the first video image collected by the first camera at the first acquisition moment and the second video image collected by the second camera at the first acquisition moment as candidate video images corresponding to the first acquisition moment.
In step 4022, the management device acquires one frame of target video image from all the candidate video images.
Optionally, if the number of candidate video images corresponding to a certain acquisition moment is greater than 1, the management device may take, among all candidate video images corresponding to that acquisition moment, the one in which the imaging effect of the target object is best as the target video image. Alternatively, the management device may take any one of the candidate video images corresponding to that acquisition moment as the target video image.
Optionally, with reference to the description of step 4021, after acquiring the first video image and the second video image, the management device may acquire the first imaging of the target object in the first video image and the second imaging of the target object in the second video image. In response to the imaging effect of the first imaging being better than that of the second imaging, the management device takes the first video image as the target video image corresponding to the first acquisition moment.
Optionally, the imaging effect of the first imaging being better than that of the second imaging satisfies one or more of the following conditions: the imaging area of the first imaging is larger than that of the second imaging; the number of skeleton points included in the first imaging is greater than that included in the second imaging; the confidence of the skeleton data of the first imaging is greater than that of the second imaging. Here, the imaging area of the first imaging refers to the area of the target object's imaging in the first video image, and the imaging area of the second imaging refers to the area of the target object's imaging in the second video image. The skeleton points included in the first imaging and the second imaging are those directly visible in the imaging, excluding inferred skeleton points. The confidence of skeleton data refers to the overall confidence of all skeleton points, including both the skeleton points directly visible in the imaging and those that cannot be seen in it; the positions of the latter can be inferred by related algorithms, and the confidence of skeleton points whose positions are inferred is generally low.
A larger imaging area can usually show more detail, and more skeleton points or higher skeleton-data confidence can better reflect the activity characteristics of the target object. Therefore, the larger the imaging area, the more skeleton points the imaging includes, and the higher the confidence of its skeleton data, the better the imaging effect can be judged to be.
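One possible encoding of this comparison is sketched below in Python; the lexicographic ordering over the three criteria is an assumption, since the embodiment lists the conditions without fixing how ties between them are broken:

    from dataclasses import dataclass

    @dataclass
    class CandidateImaging:
        area: float              # imaging area of the target object in the frame
        visible_skeleton: int    # skeleton points directly visible in the imaging
        confidence: float        # overall confidence of the skeleton data
        frame_id: int            # identifies the candidate video image

    def imaging_key(c):
        # Larger area, more visible skeleton points, and higher confidence
        # are all treated as a better imaging effect.
        return (c.area, c.visible_skeleton, c.confidence)

    def pick_target_frame(candidates):
        # Return the candidate video image with the best imaging effect.
        return max(candidates, key=imaging_key)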
Optionally, the target object is a human body. Human skeleton points include but are not limited to the nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles. For example, Fig. 5 is a schematic diagram of the distribution of human skeleton points provided by an embodiment of the present application. As shown in Fig. 5, a human body may include 17 skeleton points: nose 0, left eye 1, right eye 2, left ear 3, right ear 4, left shoulder 5, right shoulder 6, left elbow 7, right elbow 8, left wrist 9, right wrist 10, left hip 11, right hip 12, left knee 13, right knee 14, left ankle 15, and right ankle 16. The following embodiments of the present application are described by taking the target object being a person as an example.
In the embodiment of the present application, among the N frames of video images acquired at the same acquisition moment, the management device may take the video image that includes the imaging of the target object with the best imaging effect as the target video image for synthesizing the video stream corresponding to the target object. This can improve the clarity of the target object's activity picture in the synthesized video stream, so that the synthesized video stream better reflects the activity characteristics of the target object, which is beneficial to analyzing those characteristics.
Optionally, taking the case where the target object first reaches the shooting area of the first camera and then reaches the shooting area of the second camera as an example, the embodiment of the present application describes the implementation process by which the management device acquires the first imaging of the target object in the first video image and the second imaging of the target object in the second video image. The implementation process includes the following steps S11 to S14.
In step S11, the management device acquires the first imaging of the target object in the first video image.
Optionally, if the first camera is the first camera to track and identify the target object, the management device may use a target detection algorithm to identify the target object in the collected video images. After identifying the target object, the management device may also assign a globally unique identifier to the target object and use this identifier to distinguish the imaging of the target object in the video images of the various cameras. Finally, based on the target object's identifier, the global task mapping relationship can be unified following the idea of the union-find algorithm, realizing multi-camera tracking and identification of the target object. Alternatively, if the first camera is not the first camera to track and identify the target object, that is, the target object has moved from another camera's shooting area into the first camera's shooting area, the implementation process by which the management device first acquires the imaging of the target object in the video images collected by the first camera may refer to the implementation process, described below, by which the management device acquires the second imaging of the target object in the second video image collected by the second camera.
After the target object reaches the first camera's shooting area and the management device has acquired the imaging of the target object in the video images collected by the first camera, while the target object moves within the first camera's shooting area, the management device may use a target tracking algorithm to determine the imaging of the target object in the video images subsequently collected by the first camera.
In the embodiment of the present application, after acquiring the imaging of the target object in a video image, the management device may also determine the imaging positions of the target object's skeleton points and encode and encapsulate the skeleton points' imaging positions together with the corresponding video image for later analysis. The imaging position of a skeleton point may be expressed in pixel coordinates.
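The embodiment does not fix how the skeleton positions travel with the frames; one simple container, sketched in Python with hypothetical field names, would be:

    def package_frame(frame_id, camera_id, image_bytes, skeleton):
        # skeleton: list of (point_id, x, y, confidence) in pixel coordinates.
        # Carrying the skeleton data alongside the encoded image lets a
        # player later draw the points on the imaging of the target object.
        return {
            "frame_id": frame_id,
            "camera_id": camera_id,
            "image": image_bytes,
            "skeleton": skeleton,
        }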
In step S12, the management device acquires the first imaging position of the first key point of the target object in the first video image.
Optionally, the first key point of the target object may be derived from one or more skeleton points of the target object. For example, taking the height of the human hips as the calibration height, the midpoint of the left and right hips (i.e., the center point of the human body) may be used as the first key point. The first key point may generally refer to one or more key points.
In step S13, based on the pixel coordinate mapping relationship between the first camera and the second camera, the management device determines the second imaging position of the first key point in the second video image according to the first imaging position.
Optionally, the first camera and the second camera are two adjacent cameras among the M cameras. The management device may determine the second imaging position of the first key point in the second video image according to the first imaging position, based on the homography matrix from the first camera to the second camera. For example, referring to the example shown in Fig. 3, the first camera is camera 1, the second camera is camera 2, and the homography matrix from camera 1 to camera 2 is H12. If the first imaging position is (x1p, y1p), the second imaging position (x2p, y2p) satisfies: (x2p, y2p) = H12 * (x1p, y1p).
Alternatively, the first camera and the second camera are not two adjacent cameras among the M cameras. Assuming there is a third camera between the first camera and the second camera, the management device may determine the second imaging position of the first key point in the second video image according to the first imaging position, based on the homography matrix from the first camera to the third camera and the homography matrix from the third camera to the second camera. For example, referring to the example shown in Fig. 3, the first camera is camera 1 and the second camera is camera 3, the homography matrix from camera 1 to camera 2 is H12, and the homography matrix from camera 2 to camera 3 is H23. If the first imaging position is (x'1p, y'1p), the second imaging position (x'2p, y'2p) satisfies: (x'2p, y'2p) = H12 * H23 * (x'1p, y'1p).
In step S14, the management device determines the second imaging of the target object in the second video image according to the second imaging position.
Optionally, the management device may determine a human detection box according to the second imaging position and take the human imaging located within that detection box in the second video image as the second imaging. Alternatively, the management device may detect all human imagings in the second video image and take the human imaging whose first key point's imaging position is closest to the second imaging position as the second imaging. If the management device detects only one human imaging in the second video image, that human imaging may be taken directly as the second imaging of the target object in the second video image, without performing the above steps S12 to S14.
For example, Fig. 6 is a schematic diagram of pixel coordinate mapping between two cameras provided by an embodiment of the present application. As shown in Fig. 6, the left picture is the first video image and the right picture is the second video image. The first video image includes the first imaging of the target object. The first imaging position of the first key point p in the first video image is p1, and the second imaging position of the first key point p in the second video image, obtained based on the pixel coordinate mapping relationship between the first camera and the second camera, is p2. Referring to Fig. 6, the second imaging of the target object in the second video image can be determined based on the second imaging position p2.
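A Python sketch of steps S12 to S14 under the nearest-match variant follows; the detector interface (a list of per-person key-point coordinates for the second video image) is a hypothetical stand-in for whatever detection algorithm is actually used:

    import numpy as np
    import cv2

    def find_second_imaging(H12, keypoint_cam1, persons_cam2):
        # keypoint_cam1: (x, y) of the target's first key point in the first video image.
        # persons_cam2: list of (person_id, (x, y)) key points detected in the second image.
        src = np.array([[keypoint_cam1]], dtype=np.float32)
        p2 = cv2.perspectiveTransform(src, H12)[0, 0]   # second imaging position
        best_id, best_dist = None, float("inf")
        for person_id, kp in persons_cam2:
            dist = float(np.hypot(kp[0] - p2[0], kp[1] - p2[1]))
            if dist < best_dist:
                best_id, best_dist = person_id, dist
        return best_id, (float(p2[0]), float(p2[1]))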
Optionally, the target scene is an athletes' training venue or competition venue. There are usually multiple athletes active in the target scene, so the video images collected by a single camera may include multiple human imagings. Limited by the scale of the venue and the shooting angles, in many cases the cameras cannot capture faces, so it is difficult to track and identify a single target by face recognition. In the embodiment of the present application, the pixel coordinate mapping relationship between two adjacent cameras is determined in advance. When the target object moves into the co-view area of two adjacent cameras, the management device can realize cross-camera tracking and identification of the target object according to the correlation between the geometric imaging positions of the target object in the video images collected by the two adjacent cameras. The solution of the present application does not depend on any unique feature of the target object and, through flexible deployment and calibration of cameras, can be applied to various scenes.
Step 403: The management device synthesizes the video stream corresponding to the target object based on the multiple frames of target video images corresponding to the multiple acquisition moments.
The target object's video stream is used to reflect the activity information of the target object in the target scene. The video images in the video stream corresponding to the target object are arranged in chronological order.
Optionally, after acquiring a target video image, the management device may crop it so that the imaging of the target object is located in the central region of the cropped video image. Correspondingly, the implementation process of step 403 includes: based on the multiple acquisition moments, the management device arranges the multiple cropped frames of video images in chronological order to obtain the video stream corresponding to the target object. Optionally, the size of the cropping window may be preset, and the management device crops the original video image with the imaging of the target object's body center point as the center of the cropping window. For example, Fig. 7 is a schematic before-and-after comparison of the cropping of a video image provided by an embodiment of the present application. As shown in Fig. 7, the imaging of the target object in the cropped video image is located in the central region of the image; compared with the image before cropping, the cropped image highlights the imaging of the target object better. The management device may crop each acquired frame of target video image so that the imaging of the target object is in the central region of every video image of the finally synthesized video stream. This achieves a focus-following effect on the target object and also gives the synthesized video stream a better display effect, making the playing picture more fluent and smooth and thereby improving the user's viewing experience. In addition, the management device may also apply smoothing filtering to the cropped video images.
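The cropping step can be sketched as follows in Python, assuming a preset window size and clamping the window at the image borders so the crop always stays inside the frame:

    def crop_centered(frame, center, win_w, win_h):
        # Center a fixed-size cropping window on the imaging of the
        # target object's body center point, clamped to the image borders.
        h, w = frame.shape[:2]
        x = int(round(center[0] - win_w / 2))
        y = int(round(center[1] - win_h / 2))
        x = max(0, min(x, w - win_w))
        y = max(0, min(y, h - win_h))
        return frame[y:y + win_h, x:x + win_w]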
In the embodiment of the present application, multiple cameras are fixedly deployed in the target scene. These cameras have different shooting areas and can each capture clear video images of a different region of the target scene. From the multiple frames of video images respectively collected by the multiple cameras at the same acquisition moment, the management device selects one frame containing the imaging of the target object for video synthesis. Since each camera can capture clear video images of its corresponding region of the target scene, there is always some camera that can capture a clear picture of the target object as it moves through the shooting areas of different cameras, so the synthesized video stream can provide a clear picture of the target object's activity over the whole course in the target scene; that is, the clarity of the target object's activity picture in the synthesized video stream is guaranteed. In addition, because the cameras are fixedly deployed, the camera parameters can be preset according to the required shooting areas and do not need to be adjusted during shooting, so the implementation is simple.
Optionally, after synthesizing the target object's video stream, the management device may further perform the following step 404.
Step 404: The management device outputs the video stream corresponding to the target object.
Optionally, the management device has a display function. In that case, outputting the video stream corresponding to the target object may mean that the management device displays the playing picture of the video stream on a playing interface. Optionally, the management device may also acquire the imaging positions of the target object's skeleton points in the target video images; then, when the management device displays the playing picture of the target object's video stream on the playing interface, the skeleton points of the target object may be displayed on the imaging of the target object in the playing picture.
Optionally, outputting the video stream corresponding to the target object may also mean that the management device sends the video stream to a terminal, for the terminal to display the playing picture of the video stream on a playing interface. For example, in response to receiving a playing request from the terminal, the management device sends the video stream corresponding to the target object to the terminal. The playing request may carry the identifier of the target object.
For example, Fig. 8 is a schematic diagram of a playing interface provided by an embodiment of the present application. As shown in Fig. 8, the playing picture of the video stream corresponding to the target object is displayed on the playing interface Z, and multiple skeleton points are displayed on the imaging of the target object (only 9 skeleton points are shown in the figure for illustration).
In the embodiment of the present application, when synthesizing the video stream corresponding to the target object, the management device may encode and encapsulate the imaging positions of the target object's skeleton points together with the corresponding video images, so that when the playing picture of the video stream is displayed, the target object's skeleton points can be displayed on its imaging in the playing picture, which helps analyze the target object's activity.
Optionally, after acquiring the skeleton data of the target object, the management device may further perform motion analysis on the target object based on the skeleton data, including but not limited to determining the target object's movement trajectory, counting its steps, and calculating its displacement or movement speed. When displaying the playing picture of the target object's video stream, the management device or terminal may also overlay the real-time motion analysis results on the playing picture, for example the target object's real-time movement trajectory, step count, displacement, and speed, to further assist motion analysis of the target object. Alternatively, after acquiring the target object's skeleton data, the management device may send the skeleton data to an analysis device, which performs the motion analysis. That is, the synthesis of the video stream and the motion analysis of the target object may be completed by one device, or divided among multiple devices, which is not limited in the embodiment of the present application.
Optionally, the implementation process by which the management device determines the target object's movement trajectory includes: the management device determines the horizontal position of the second key point in the world coordinate system according to the imaging position of the target object's second key point in the target video image, and generates the target object's movement trajectory according to the horizontal positions of the second key point in the world coordinate system at the multiple acquisition moments.
Optionally, the second key point of the target object may be derived from one or more skeleton points of the target object. For example, taking the ground as the calibration height, the midpoint of the left and right ankles may be used as the second key point. The second key point may generally refer to one or more key points. The management device may determine the horizontal position of the second key point in the world coordinate system according to its imaging position in the target video image, based on the transformation matrix from the image coordinate system of the camera that collected the target video image to the two-dimensional world coordinate system.
For example, Fig. 9 is a schematic diagram of a movement trajectory of a target object provided by an embodiment of the present application. As shown in Fig. 9, the target object moves on the speed-skating track. The two-dimensional horizontal coordinates of the target object's second key point are (xt1, yt1) at acquisition moment t1, (xt2, yt2) at acquisition moment t2, (xt3, yt3) at acquisition moment t3, (xt4, yt4) at acquisition moment t4, and (xt5, yt5) at acquisition moment t5, finally yielding a movement trajectory on the horizontal plane. The two-dimensional horizontal coordinates reflect the horizontal position in the world coordinate system.
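Trajectory generation then reduces to projecting the second key point at each acquisition moment and ordering the results by time. A sketch, reusing the pixel_to_world helper from the earlier chaining sketch; the per-sample (moment, camera, key point) records are assumed inputs:

    def build_trajectory(samples, Hiw_by_camera):
        # samples: list of (t, camera_id, (x, y)) for the target object's
        # second key point, one record per acquisition moment.
        trajectory = []
        for t, cam, kp in sorted(samples):
            x, y = pixel_to_world(Hiw_by_camera[cam], kp)
            trajectory.append((t, x, y))
        return trajectory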
Optionally, when counting the target object's steps, the management device may, based on the synthesized video stream of the target object, judge each crossing of the left and right ankles as one step, thereby realizing step counting.
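One simple reading of this crossing rule is sketched below in Python; the per-frame ankle coordinates along the direction of travel are assumed inputs, and real skeleton data would likely need smoothing first:

    def count_steps(ankle_x_pairs):
        # ankle_x_pairs: per-frame (left_ankle_x, right_ankle_x) along the
        # direction of travel. Each sign change of (left - right) is one
        # crossing of the ankles, judged as one step.
        steps, prev_sign = 0, None
        for lx, rx in ankle_x_pairs:
            sign = 1 if lx - rx > 0 else -1
            if prev_sign is not None and sign != prev_sign:
                steps += 1
            prev_sign = sign
        return steps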
The order of the steps of the video synthesis method provided by the embodiment of the present application can be adjusted appropriately, and steps can be added or removed according to the situation. Any variant readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. For example, besides being applied to athletes' training or competition scenes to synthesize an athlete's full-course motion video, the solution of the present application can also be applied to emergency escape command scenes: by synthesizing a real-time video of each person, an escape route can be formulated for each individual's actual situation to improve the chance of escape. It can also be applied to tourist attractions to synthesize a full-course tour video of a visitor. Besides a person, the target object may also be an animal, and the solution can also be applied to animal protection scenes, and so on. The embodiment of the present application does not limit the application scenes of the above method, which are not enumerated here one by one. In addition, the cameras deployed in the target scene may also be implemented by remotely controlled drones or the like.
In summary, in the video synthesis method provided by the embodiment of the present application, multiple cameras are fixedly deployed in the target scene. These cameras have different shooting areas and can each capture clear video images of a different region of the target scene. From the multiple frames of video images respectively collected by the multiple cameras at the same acquisition moment, the management device selects one frame containing the imaging of the target object for video synthesis. Since each camera can capture clear video images of its corresponding region of the target scene, there is always some camera that can capture a clear picture of the target object as it moves through the shooting areas of different cameras, so the synthesized video stream can provide a clear picture of the target object's activity over the whole course in the target scene; that is, the clarity of the target object's activity picture in the synthesized video stream is guaranteed. In addition, because the cameras are fixedly deployed, the camera parameters can be preset according to the required shooting areas and do not need to be adjusted during shooting, so the implementation is simple. By determining in advance the pixel coordinate mapping relationship between two deployed adjacent cameras, when the target object moves into the co-view area of two adjacent cameras, the management device can realize cross-camera tracking and identification of the target object according to the correlation between the geometric imaging positions of the target object in the video images collected by the two adjacent cameras. The solution of the present application does not depend on any unique feature of the target object and, through flexible deployment and calibration of cameras, can be applied to various scenes.
Fig. 10 is a schematic structural diagram of a management device provided by an embodiment of the present application. The management device may be the management device 102 in the video synthesis system shown in Fig. 1. As shown in Fig. 10, the management device 1000 includes:
a first acquisition module 1001, configured to acquire N frames of video images respectively collected, at each of multiple acquisition moments, by N cameras deployed in a target scene, N ≥ 2;
a second acquisition module 1002, configured to acquire one frame of target video image from the N frames of video images corresponding to each acquisition moment, the target video image including imaging of a target object;
a video synthesis module 1003, configured to synthesize, based on the multiple frames of target video images corresponding to the multiple acquisition moments, a video stream corresponding to the target object, the video stream being used to reflect activity information of the target object in the target scene.
Optionally, the second acquisition module 1002 is configured to: acquire, from the N frames of video images corresponding to each acquisition moment, all candidate video images that include the imaging of the target object; and acquire the target video image from all the candidate video images.
Optionally, the N cameras include a first camera and a second camera that have a co-view area. The second acquisition module 1002 is configured to: when the target object is located in the co-view area of the first camera and the second camera at a first acquisition moment, take both the first video image collected by the first camera at the first acquisition moment and the second video image collected by the second camera at the first acquisition moment as candidate video images corresponding to the first acquisition moment.
Optionally, the second acquisition module 1002 is configured to: acquire first imaging of the target object in the first video image and second imaging of the target object in the second video image; and, in response to the imaging effect of the first imaging being better than that of the second imaging, take the first video image as the target video image corresponding to the first acquisition moment.
Optionally, the imaging effect of the first imaging being better than that of the second imaging satisfies one or more of the following conditions: the imaging area of the first imaging is larger than that of the second imaging; the number of skeleton points included in the first imaging is greater than that included in the second imaging; the confidence of the skeleton data of the first imaging is greater than that of the second imaging.
Optionally, the second acquisition module 1002 is configured to: after acquiring the first imaging of the target object in the first video image, acquire the first imaging position of the first key point of the target object in the first video image; determine, based on the pixel coordinate mapping relationship between the first camera and the second camera, the second imaging position of the first key point in the second video image according to the first imaging position; and determine the second imaging of the target object in the second video image according to the second imaging position.
Optionally, M cameras are deployed in the target scene, any two adjacent cameras among the M cameras have a co-view area, M ≥ N, and the N cameras belong to the M cameras. The management device stores multiple homography matrices, each of which is used to reflect the pixel coordinate mapping relationship between one group of two adjacent cameras among the M cameras.
Optionally, as shown in Fig. 11, the management device 1000 further includes: an image processing module 1004, configured to crop the target video image so that the imaging of the target object is located in the central region of the cropped video image. The video synthesis module 1003 is configured to arrange the multiple cropped frames of video images in chronological order based on the multiple acquisition moments to obtain the video stream.
Optionally, as shown in Fig. 12, the management device 1000 further includes: a determination module 1005, configured to determine the horizontal position of the second key point in the world coordinate system according to the imaging position of the second key point of the target object in the target video image; and a trajectory generation module 1006, configured to generate the movement trajectory of the target object according to the horizontal positions of the second key point in the world coordinate system at the multiple acquisition moments.
Optionally, as shown in Fig. 13, the management device 1000 further includes: a third acquisition module 1007, configured to acquire the imaging positions of the target object's skeleton points in the target video image; and a display module 1008, configured to display the playing picture of the video stream on a playing interface, with the skeleton points of the target object displayed on the imaging of the target object in the playing picture.
Regarding the devices in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.
An embodiment of the present application also provides a video synthesis system, including a management device and multiple cameras. The cameras are used to collect video images, and the management device is used to execute the method steps shown in Fig. 4.
Fig. 14 is a block diagram of a management device provided by an embodiment of the present application. As shown in Fig. 14, the management device 1400 includes: a processor 1401 and a memory 1402.
The memory 1402 is configured to store a computer program, the computer program including program instructions;
the processor 1401 is configured to call the computer program to implement the method steps shown in Fig. 4 in the above method embodiment.
Optionally, the management device 1400 further includes a communication bus 1403 and a communication interface 1404.
The processor 1401 includes one or more processing cores, and the processor 1401 executes various functional applications and data processing by running the computer program.
The memory 1402 may be used to store the computer program. Optionally, the memory may store an operating system and application program units required for at least one function. The operating system may be a real-time operating system (Real Time eXecutive, RTX), LINUX, UNIX, WINDOWS, OS X, or the like.
There may be multiple communication interfaces 1404, and the communication interfaces 1404 are used to communicate with other devices. For example, in the embodiment of the present application, the communication interface of the management device 1400 may be used to send the video stream to a terminal.
The memory 1402 and the communication interface 1404 are respectively connected to the processor 1401 through the communication bus 1403.
An embodiment of the present application also provides a computer-readable storage medium with instructions stored thereon that, when executed by a processor, implement the method steps shown in Fig. 4.
An embodiment of the present application also provides a computer program product, including a computer program that, when executed by a processor, implements the method steps shown in Fig. 4.
A person of ordinary skill in the art can understand that all or part of the steps for implementing the above embodiments may be completed by hardware, or by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In the embodiments of the present application, the terms "first", "second", and "third" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance.
The term "and/or" in the present application merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may indicate: A exists alone, both A and B exist, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
The above are only optional embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, improvement, and the like made within the concept and principles of the present application shall be included in the protection scope of the present application.

Claims (24)

  1. A video synthesis method, applied to a management device, wherein the method comprises:
    acquiring N frames of video images respectively collected, at each of multiple acquisition moments, by N cameras deployed in a target scene, N ≥ 2;
    acquiring one frame of target video image from the N frames of video images corresponding to each acquisition moment, the target video image comprising imaging of a target object;
    synthesizing, based on multiple frames of the target video images corresponding to the multiple acquisition moments, a video stream corresponding to the target object, the video stream being used to reflect activity information of the target object in the target scene.
  2. The method according to claim 1, wherein the acquiring one frame of target video image from the N frames of video images corresponding to each acquisition moment comprises:
    acquiring, from the N frames of video images corresponding to each acquisition moment, all candidate video images that comprise the imaging of the target object;
    acquiring the target video image from all the candidate video images.
  3. The method according to claim 2, wherein the N cameras comprise a first camera and a second camera, the first camera and the second camera have a co-view area, and the acquiring, from the N frames of video images corresponding to each acquisition moment, all candidate video images that comprise the imaging of the target object comprises:
    when the target object is located in the co-view area of the first camera and the second camera at a first acquisition moment, taking both a first video image collected by the first camera at the first acquisition moment and a second video image collected by the second camera at the first acquisition moment as candidate video images corresponding to the first acquisition moment.
  4. The method according to claim 3, wherein the acquiring the target video image from all the candidate video images comprises:
    acquiring first imaging of the target object in the first video image and second imaging of the target object in the second video image;
    in response to an imaging effect of the first imaging being better than an imaging effect of the second imaging, taking the first video image as the target video image corresponding to the first acquisition moment.
  5. The method according to claim 4, wherein the imaging effect of the first imaging being better than the imaging effect of the second imaging satisfies one or more of the following conditions:
    an imaging area of the first imaging is larger than an imaging area of the second imaging;
    a number of skeleton points comprised in the first imaging is greater than a number of skeleton points comprised in the second imaging;
    a confidence of skeleton data of the first imaging is greater than a confidence of skeleton data of the second imaging.
  6. The method according to claim 4 or 5, wherein acquiring the second imaging of the target object in the second video image comprises:
    after acquiring the first imaging of the target object in the first video image, acquiring a first imaging position of a first key point of the target object in the first video image;
    determining, based on a pixel coordinate mapping relationship between the first camera and the second camera, a second imaging position of the first key point in the second video image according to the first imaging position;
    determining the second imaging of the target object in the second video image according to the second imaging position.
  7. The method according to claim 6, wherein M cameras are deployed in the target scene, any two adjacent cameras among the M cameras have a co-view area, M ≥ N, the N cameras belong to the M cameras, and the management device stores multiple homography matrices, each homography matrix being used to reflect a pixel coordinate mapping relationship between one group of two adjacent cameras among the M cameras.
  8. The method according to any one of claims 1 to 7, wherein after acquiring the target video image, the method further comprises:
    cropping the target video image so that the imaging of the target object is located in a central region of the cropped video image;
    wherein the synthesizing, based on the multiple frames of the target video images corresponding to the multiple acquisition moments, the video stream corresponding to the target object comprises:
    arranging the multiple cropped frames of video images in chronological order based on the multiple acquisition moments to obtain the video stream.
  9. The method according to any one of claims 1 to 8, wherein the method further comprises:
    determining a horizontal position of a second key point of the target object in a world coordinate system according to an imaging position of the second key point in the target video image;
    generating a movement trajectory of the target object according to the horizontal positions of the second key point in the world coordinate system at the multiple acquisition moments.
  10. The method according to any one of claims 1 to 9, wherein after acquiring the target video image, the method further comprises:
    acquiring imaging positions of skeleton points of the target object in the target video image;
    displaying a playing picture of the video stream on a playing interface, the skeleton points of the target object being displayed on the imaging of the target object in the playing picture.
  11. A management device, wherein the management device comprises:
    a first acquisition module, configured to acquire N frames of video images respectively collected, at each of multiple acquisition moments, by N cameras deployed in a target scene, N ≥ 2;
    a second acquisition module, configured to acquire one frame of target video image from the N frames of video images corresponding to each acquisition moment, the target video image comprising imaging of a target object;
    a video synthesis module, configured to synthesize, based on multiple frames of the target video images corresponding to the multiple acquisition moments, a video stream corresponding to the target object, the video stream being used to reflect activity information of the target object in the target scene.
  12. The management device according to claim 11, wherein the second acquisition module is configured to:
    acquire, from the N frames of video images corresponding to each acquisition moment, all candidate video images that comprise the imaging of the target object;
    acquire the target video image from all the candidate video images.
  13. The management device according to claim 12, wherein the N cameras comprise a first camera and a second camera, the first camera and the second camera have a co-view area, and the second acquisition module is configured to:
    when the target object is located in the co-view area of the first camera and the second camera at a first acquisition moment, take both a first video image collected by the first camera at the first acquisition moment and a second video image collected by the second camera at the first acquisition moment as candidate video images corresponding to the first acquisition moment.
  14. The management device according to claim 13, wherein the second acquisition module is configured to:
    acquire first imaging of the target object in the first video image and second imaging of the target object in the second video image;
    in response to an imaging effect of the first imaging being better than an imaging effect of the second imaging, take the first video image as the target video image corresponding to the first acquisition moment.
  15. The management device according to claim 14, wherein the imaging effect of the first imaging being better than the imaging effect of the second imaging satisfies one or more of the following conditions:
    an imaging area of the first imaging is larger than an imaging area of the second imaging;
    a number of skeleton points comprised in the first imaging is greater than a number of skeleton points comprised in the second imaging;
    a confidence of skeleton data of the first imaging is greater than a confidence of skeleton data of the second imaging.
  16. The management device according to claim 14 or 15, wherein the second acquisition module is configured to:
    after acquiring the first imaging of the target object in the first video image, acquire a first imaging position of a first key point of the target object in the first video image;
    determine, based on a pixel coordinate mapping relationship between the first camera and the second camera, a second imaging position of the first key point in the second video image according to the first imaging position;
    determine the second imaging of the target object in the second video image according to the second imaging position.
  17. The management device according to claim 16, wherein M cameras are deployed in the target scene, any two adjacent cameras among the M cameras have a co-view area, M ≥ N, the N cameras belong to the M cameras, and the management device stores multiple homography matrices, each homography matrix being used to reflect a pixel coordinate mapping relationship between one group of two adjacent cameras among the M cameras.
  18. The management device according to any one of claims 11 to 17, wherein the management device further comprises:
    an image processing module, configured to crop the target video image so that the imaging of the target object is located in a central region of the cropped video image;
    wherein the video synthesis module is configured to arrange the multiple cropped frames of video images in chronological order based on the multiple acquisition moments to obtain the video stream.
  19. The management device according to any one of claims 11 to 18, wherein the management device further comprises:
    a determination module, configured to determine a horizontal position of a second key point of the target object in a world coordinate system according to an imaging position of the second key point in the target video image;
    a trajectory generation module, configured to generate a movement trajectory of the target object according to the horizontal positions of the second key point in the world coordinate system at the multiple acquisition moments.
  20. The management device according to any one of claims 11 to 19, wherein the management device further comprises:
    a third acquisition module, configured to acquire imaging positions of skeleton points of the target object in the target video image;
    a display module, configured to display a playing picture of the video stream on a playing interface, the skeleton points of the target object being displayed on the imaging of the target object in the playing picture.
  21. A video synthesis system, comprising: a management device and multiple cameras, wherein the cameras are used to collect video images, and the management device is used to execute the video synthesis method according to any one of claims 1 to 10.
  22. A management device, comprising: a processor and a memory;
    the memory is configured to store a computer program, the computer program comprising program instructions;
    the processor is configured to call the computer program to implement the video synthesis method according to any one of claims 1 to 10.
  23. A computer-readable storage medium, wherein instructions are stored on the computer-readable storage medium, and when the instructions are executed by a processor, the video synthesis method according to any one of claims 1 to 10 is implemented.
  24. A computer program product, comprising a computer program, wherein when the computer program is executed by a processor, the video synthesis method according to any one of claims 1 to 10 is implemented.
PCT/CN2023/071293 2022-01-10 2023-01-09 Video synthesis method, device and system WO2023131327A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210022166.8A CN116456039A (zh) Video synthesis method, device and system
CN202210022166.8 2022-01-10

Publications (1)

Publication Number Publication Date
WO2023131327A1 true WO2023131327A1 (zh) 2023-07-13

Family

ID=87073279

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/071293 WO2023131327A1 (zh) Video synthesis method, device and system

Country Status (2)

Country Link
CN (1) CN116456039A (zh)
WO (1) WO2023131327A1 (zh)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007133660A * 2005-11-10 2007-05-31 Nippon Hoso Kyokai <Nhk> Multi-viewpoint video synthesis apparatus and multi-viewpoint video synthesis system
JP2011077820A * 2009-09-30 2011-04-14 Casio Computer Co Ltd Image synthesis apparatus, image synthesis method, and program
JP2017011598A * 2015-06-25 2017-01-12 株式会社日立国際電気 Surveillance system
CN113115110A * 2021-05-20 2021-07-13 广州博冠信息科技有限公司 Video synthesis method and apparatus, storage medium, and electronic device

Also Published As

Publication number Publication date
CN116456039A (zh) 2023-07-18

Similar Documents

Publication Publication Date Title
US5850352A (en) Immersive video, including video hypermosaicing to generate from multiple video views of a scene a three-dimensional video mosaic from which diverse virtual video scene images are synthesized, including panoramic, scene interactive and stereoscopic images
JP6621063B2 Camera selection method and video distribution system
JP7132730B2 Information processing apparatus and information processing method
WO2019111817A1 Generation device, generation method, and program
US20080192116A1 (en) Real-Time Objects Tracking and Motion Capture in Sports Events
CN110544301A Three-dimensional human motion reconstruction system and method, and motion training system
KR20010074508A (ko) 스포츠 경기의 가상시야를 생성하기 위한 방법 및 장치
US20100271368A1 (en) Systems and methods for applying a 3d scan of a physical target object to a virtual environment
WO2010011317A1 (en) View point representation for 3-d scenes
JP2019106144A System, method, and program for generating a virtual viewpoint image
KR20190136042A Three-dimensional model generation device, generation method, and program
JP2020086983A Image processing apparatus, image processing method, and program
JP4881178B2 Travel distance video generation device and travel distance video generation program
JP2020042407A Information processing apparatus, information processing method, and program
WO2023131327A1 Video synthesis method, device and system
CN111970434A Multi-camera multi-target athlete tracking and shooting video generation system and method
JP2023100805A Imaging device, imaging method, and imaging program
CN116523962A Visual tracking method, apparatus, system, device, and medium for a target object
CN107683604A Generation device
JP2020067815A Image processing apparatus, image processing method, and program
KR20200047267A Image processing method and apparatus based on an artificial neural network
US20230328355A1 (en) Information processing apparatus, information processing method, and program
KR102343267B1 Apparatus and method for providing a 360-degree video service using videos captured at multiple locations
JP2009519539A Method and system for creating event data and making it ready for service provision
JP6450305B2 Information acquisition device, information acquisition method, and information acquisition program

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23737189

Country of ref document: EP

Kind code of ref document: A1