WO2024124437A1 - Video data processing method and apparatus, display device, and storage medium - Google Patents

Video data processing method and apparatus, display device, and storage medium

Info

Publication number: WO2024124437A1
Application number: PCT/CN2022/139009
Authority: WO (WIPO (PCT))
Prior art keywords: audio, sound source, stream, image, source object
Other languages: English (en), French (fr)
Inventors: 姜庆兴, 高伟标
Original Assignee: 惠州视维新技术有限公司
Application filed by 惠州视维新技术有限公司
Priority to PCT/CN2022/139009
Publication of WO2024124437A1


Description

  • the present application relates to the technical field of audio and video data processing, and specifically to a video data processing method, apparatus, display device, and non-volatile computer-readable storage medium (storage medium for short).
  • the display screens of display devices such as televisions continue to grow larger, while the sound output position of these devices remains at the bottom or on both sides of the TV.
  • the audio in most existing video data is two-channel audio.
  • when the display screen plays video data, the audio sound field in the vertical direction is therefore missing: the spatial sense of the audio is weak and difficult to match to the video picture, so users experience little immersion when using the display device.
  • the embodiments of the present application provide a method, apparatus, display device, and storage medium for processing video data, so as to improve the spatial sense of audio to match the video picture.
  • the present application provides a method for processing video data, which is applied to a display device.
  • the method includes:
  • a curved grid that matches the display screen of the display device is constructed, and a coordinate conversion relationship between the display screen and the curved grid is obtained;
  • an image stream and an audio stream in the video data are obtained, and the motion trajectory coordinates in the image stream of the sound source objects corresponding to different audio elements in the audio stream are identified according to the image stream and the audio stream;
  • the spatial trajectory coordinates of the sound source object corresponding to each audio element on the curved grid are obtained according to the motion trajectory coordinates of the sound source objects and the coordinate conversion relationship; and
  • a stereo video is constructed based on the image stream, each audio element in the audio stream, and the spatial trajectory coordinates of the sound source object corresponding to each audio element.
  • the present application provides a video data processing device, which is applied to a display device, and the device includes:
  • a curved grid construction module is used to construct a curved grid that matches the display screen of the display device and obtain the coordinate conversion relationship between the display screen and the curved grid;
  • a motion trajectory acquisition module is used to acquire the image stream and the audio stream in the video data, and identify the motion trajectory coordinates of the sound source objects corresponding to different audio elements in the audio stream in the image stream according to the image stream and the audio stream;
  • a spatial trajectory acquisition module is used to acquire the spatial trajectory coordinates of the sound source object corresponding to each audio element on the curved grid according to the motion trajectory coordinates of the sound source object and the coordinate conversion relationship;
  • the stereo video construction module is used to construct a stereo video based on the image stream, each audio element in the audio stream and the spatial trajectory coordinates of the sound source object corresponding to each audio element.
  • the embodiment of the present application further provides a display device, which includes: one or more processors; a memory; and one or more computer-readable instructions, wherein the one or more computer-readable instructions are stored in the memory and configured to be executed by the processor to implement the following steps:
  • a curved grid that matches the display screen of the display device is constructed, and a coordinate conversion relationship between the display screen and the curved grid is obtained;
  • an image stream and an audio stream in the video data are obtained, and the motion trajectory coordinates in the image stream of the sound source objects corresponding to different audio elements in the audio stream are identified according to the image stream and the audio stream;
  • the spatial trajectory coordinates of the sound source object corresponding to each audio element on the curved grid are obtained according to the motion trajectory coordinates of the sound source objects and the coordinate conversion relationship; and
  • a stereo video is constructed based on the image stream, each audio element in the audio stream, and the spatial trajectory coordinates of the sound source object corresponding to each audio element.
  • the embodiment of the present application also provides one or more non-volatile computer-readable storage media storing computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • a curved grid that matches the display screen of the display device is constructed, and a coordinate conversion relationship between the display screen and the curved grid is obtained;
  • an image stream and an audio stream in the video data are obtained, and the motion trajectory coordinates in the image stream of the sound source objects corresponding to different audio elements in the audio stream are identified according to the image stream and the audio stream;
  • the spatial trajectory coordinates of the sound source object corresponding to each audio element on the curved grid are obtained according to the motion trajectory coordinates of the sound source objects and the coordinate conversion relationship; and
  • a stereo video is constructed based on the image stream, each audio element in the audio stream, and the spatial trajectory coordinates of the sound source object corresponding to each audio element.
  • the beneficial effects of the present application are as follows: a curved grid matching the display screen of a display device is constructed; after the motion trajectory coordinates in the image stream of the sound source object corresponding to each audio element in the audio stream are obtained, the spatial trajectory coordinates of each audio element on the corresponding curved grid are determined based on the motion trajectory coordinates, and finally a stereo video containing spatial audio is constructed based on these spatial trajectory coordinates.
  • compared with the original audio stream, the spatial trajectory coordinates on the curved grid supplement the audio sound field information of the audio stream in the vertical direction, so that the spatial sense of the audio in the stereo video matches the video picture, thereby enhancing the user's sense of immersion when watching the stereo video.
  • FIG. 1 is a diagram showing an application scenario of a method for processing video data according to one or more embodiments.
  • FIG. 2 is a flowchart of a method for processing video data according to one or more embodiments.
  • FIG. 3 is a schematic diagram of a display screen and a curved grid according to one or more embodiments.
  • FIG. 4A is a flowchart of steps for acquiring coordinates of a motion trajectory of an audio element corresponding to a sound source object according to one or more embodiments.
  • FIG. 4B is another schematic diagram of the step of acquiring the coordinates of the motion trajectory of the audio element corresponding to the sound source object according to one or more embodiments.
  • FIG. 5A is another schematic diagram of motion trajectory coordinates of an audio element corresponding to a sound source object according to one or more embodiments.
  • FIG. 5B is another schematic diagram of the motion trajectory coordinates of an audio element corresponding to a sound source object according to one or more embodiments.
  • FIG. 6 is a schematic diagram of the structure of a video data processing apparatus according to one or more embodiments.
  • FIG. 7 is a schematic diagram of the structure of a computer device according to one or more embodiments.
  • the terms “first” and “second” are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as “first” or “second” may explicitly or implicitly include one or more of the features. In the description of this application, the meaning of “plurality” is two or more, unless otherwise clearly and specifically defined.
  • the video data processing method provided in the present application can be applied in the application environment shown in FIG. 1.
  • the terminal 110 communicates with the server 120 through the network to receive the video data sent by the server 120.
  • the terminal 110 constructs a curved grid matching the display screen, obtains the coordinate conversion relationship between the display screen and the curved grid, obtains the image stream and the audio stream in the video data, identifies the motion trajectory coordinates of the sound source object corresponding to different audio elements in the audio stream in the image stream according to the image stream and the audio stream, obtains the spatial trajectory coordinates of the sound source object corresponding to each audio element on the curved grid according to the motion trajectory coordinates and the coordinate conversion relationship of the sound source object, and finally, constructs a stereo video based on the spatial trajectory coordinates of each audio element in the image stream and the audio stream and the sound source object corresponding to each audio element.
  • the terminal 110 is a computer device with a display screen, which can be, but is not limited to, various personal computers, laptops, smart phones, tablet computers, and portable wearable devices, and the server 120 can be implemented as an independent server or as a server cluster composed of multiple servers.
  • an embodiment of the present application provides a method for processing video data, which is mainly illustrated by applying the method to the terminal 110 shown in FIG. 1 .
  • the method includes steps S210 to S240, which are specifically as follows:
  • Step S210 constructing a curved surface grid that matches the display screen of the display device, and obtaining a coordinate conversion relationship between the display screen and the curved surface grid.
  • the curved grid is a virtual grid constructed based on the plane where the display screen of the display device is located; it is used to simulate the spatial position of the sound source object corresponding to an audio element, so as to supplement the sound field information of the audio element in the vertical direction.
  • the coordinate conversion relationship between the display screen and the curved grid refers to the conversion relationship between the two-dimensional coordinates corresponding to the display screen and the three-dimensional coordinates corresponding to the curved grid. It can be understood that compared with the two-dimensional coordinates corresponding to the display screen, the three-dimensional coordinates corresponding to the curved grid add coordinate information in the vertical direction.
  • a sphere can be constructed based on the display screen of the display device, and then the spherical surface corresponding to the hemisphere where the display screen is located is used as the curved grid, and the coordinate conversion relationship between the display screen and the curved grid is obtained based on the spherical coordinate expression.
  • the size of the display screen of the television is generally 16:9, with a height of 9 units and a width of 16 units.
  • according to the optimal viewing length ratio, the distance between the display screen and the audience is set to 70 units.
  • therefore, based on the display screen (assuming its coordinate information is (0,0,0)) and the audience's position (assuming its coordinates are (0,0,-70)), a spherical surface centered on the audience's position and passing through the vertices of the display screen can be constructed, and the spherical surface corresponding to the hemisphere where the display screen is located is used as the curved grid; see FIG. 3, in which the two-dimensional plane 310 is the display screen, and the three-dimensional surface 320 is the curved grid matching the display screen.
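  • as a concrete illustration (not part of the original disclosure), the following minimal Python sketch constructs such a hemispherical curved grid for the 16:9 screen and 70-unit viewing distance assumed above; the grid resolution is an arbitrary assumption.

    import numpy as np

    def build_curved_grid(screen_w=16.0, screen_h=9.0, viewer_dist=70.0, n=32):
        """Build a hemispherical 'curved grid' matching a flat screen.

        The viewer sits at the origin and the screen center is at
        (0, 0, viewer_dist); the sphere is centered on the viewer and
        passes through the screen's corner vertices.
        """
        # Radius = distance from the viewer to a screen corner, e.g. (8, 4.5, 70).
        radius = np.sqrt((screen_w / 2) ** 2 + (screen_h / 2) ** 2 + viewer_dist ** 2)
        # Sample the hemisphere on the screen's side (z > 0) with spherical angles.
        theta = np.linspace(-np.pi / 2, np.pi / 2, n)  # azimuth
        phi = np.linspace(-np.pi / 2, np.pi / 2, n)    # elevation
        t, p = np.meshgrid(theta, phi)
        x = radius * np.cos(p) * np.sin(t)
        y = radius * np.sin(p)
        z = radius * np.cos(p) * np.cos(t)
        return x, y, z, radius

    x, y, z, r = build_curved_grid()
    print(f"curved grid radius: {r:.2f} units")  # about 70.6 for this screen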
  • Step S220 obtaining an image stream and an audio stream in the video data, and identifying the motion trajectory coordinates of the sound source objects corresponding to different audio elements in the audio stream in the image stream according to the image stream and the audio stream.
  • the video data refers to the video content received by the display device in real time
  • the image stream refers to the picture data in the video data
  • the audio stream refers to the audio data in the video data.
  • the display device can separate the image data and the audio data in the video data to obtain the image stream and the audio stream, so as to facilitate the subsequent processing of the image stream and the audio stream respectively.
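  • for instance (a hedged sketch, not from the disclosure), such a separation could be performed with the FFmpeg command-line tool before further processing; the file names here are placeholders.

    import subprocess

    def demux(video_path: str) -> None:
        """Split video data into an image stream (video-only file) and an
        audio stream (PCM WAV) using the ffmpeg CLI."""
        # -an drops the audio, keeping only the picture data (the image stream).
        subprocess.run(["ffmpeg", "-y", "-i", video_path, "-an", "-c:v", "copy",
                        "image_stream.mp4"], check=True)
        # -vn drops the video, decoding the audio stream to 16-bit PCM.
        subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn",
                        "-acodec", "pcm_s16le", "audio_stream.wav"], check=True)

    demux("input_video.mp4")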
  • An audio stream often includes one or more different audio elements.
  • a sound source object refers to the sound-emitting object of an audio element in each frame image in an image stream.
  • for example, when an audio stream includes a person's voice, the audio stream includes a human voice element; when an audio stream includes a car engine sound, the audio stream includes a car sound element.
  • after the display device obtains the audio stream, it can use an audio element separation technology to separate the audio elements in the audio stream to obtain multiple independent audio elements; it is understandable that the audio element separation technology includes but is not limited to a human voice separation technology, a musical instrument sound separation technology, and the like.
  • the sound source object of the audio element refers to the sound object or sound point corresponding to the audio element in the frame image of the image stream;
  • the motion trajectory coordinates refer to the moving trajectory of the sound source object corresponding to the audio element in the image stream.
  • for example, suppose the sound source object corresponding to an audio element is in the lower left corner of the first frame image in the image stream and in the upper right corner of the third frame image in the image stream.
  • the motion trajectory coordinates of the sound source object corresponding to the audio element from the first frame image to the third frame image then move from the coordinate point in the lower left corner to the coordinate point in the upper right corner; more specifically, the motion trajectory coordinates may include the image coordinates of the sound source object in each frame image in the image stream.
  • specifically, after the audio stream and the image stream in the video data are obtained, for any audio element in the audio stream, the image coordinates of the sound source object corresponding to that audio element in each frame image of the image stream can be identified, and the motion trajectory coordinates of the sound source object in the image stream can then be obtained based on its image coordinates in each frame image.
  • Step S230 obtaining the spatial trajectory coordinates of the sound source object corresponding to each audio element on the curved grid according to the motion trajectory coordinates of the sound source object and the coordinate conversion relationship.
  • the spatial trajectory coordinates refer to the coordinate information of the audio element corresponding to the sound source object on the curved grid. It can be understood that, compared with the motion trajectory coordinates, the spatial trajectory coordinates supplement the sound field information of the audio element corresponding to the sound source object in the vertical direction.
  • the image coordinates corresponding to each frame image in the motion trajectory coordinates can be converted into spatial coordinates on the curved grid based on the coordinate conversion relationship between the two-dimensional plane coordinates of the display screen and the three-dimensional plane coordinates of the curved grid to obtain the spatial trajectory coordinates on the curved grid.
  • Step S240 construct a stereo video based on the image stream, each audio element in the audio stream, and the spatial trajectory coordinates of the sound source object corresponding to each audio element.
  • after the spatial trajectory coordinates of the sound source object corresponding to each audio element in the audio stream are obtained, audio rendering processing can be performed on the audio data of each audio element based on those spatial trajectory coordinates to obtain stereo audio data, and then the stereo audio data and the image stream are combined to generate a stereo video.
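  • the disclosure does not fix a particular rendering algorithm; as one plausible illustration, the sketch below applies simple per-frame constant-power panning (plus a crude height gain) to a mono audio element according to its spatial trajectory coordinates. The sample rate, frame rate, and gain mapping are assumptions.

    import numpy as np

    def render_panned_stereo(mono, traj, radius, sr=48000, fps=25):
        """Pan a mono audio element according to its per-frame (x, y, z)
        spatial trajectory on the curved grid; returns (n, 2) stereo samples."""
        spf = sr // fps                 # audio samples per video frame
        left = mono.astype(float)
        right = mono.astype(float)
        for i, (x, y, z) in enumerate(traj):
            s0, s1 = i * spf, min((i + 1) * spf, len(mono))
            if s0 >= len(mono):
                break
            pan = np.clip(x / radius, -1.0, 1.0)    # -1 = far left, +1 = far right
            angle = (pan + 1.0) * np.pi / 4.0       # 0 .. pi/2 for constant power
            height_gain = 1.0 + 0.2 * (y / radius)  # crude vertical sound field cue
            left[s0:s1] *= np.cos(angle) * height_gain
            right[s0:s1] *= np.sin(angle) * height_gain
        return np.stack([left, right], axis=1)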
  • a curved grid matching the display screen of the display device is constructed, and the coordinate conversion relationship between the display screen and the curved grid is obtained. Then, the image stream and the audio stream in the video data are obtained, and the motion trajectory coordinates of the sound source objects corresponding to different audio elements in the audio stream in the image stream are identified according to the image stream and the audio stream. According to the motion trajectory coordinates of the sound source object and the coordinate conversion relationship, the spatial trajectory coordinates of the sound source objects corresponding to each audio element on the curved grid are obtained; based on the spatial trajectory coordinates of each audio element in the image stream and the audio stream and the sound source objects corresponding to each audio element, a stereo video is constructed.
  • the spatial trajectory coordinates of each audio element in the corresponding curved grid are determined based on the motion trajectory coordinates, and finally a stereo video containing spatial audio is constructed based on the spatial trajectory coordinates of each audio element in the corresponding curved grid.
  • the spatial trajectory coordinates on the curved grid are used to supplement the audio sound field information of the audio stream in the vertical direction, so that the spatial sense of the audio in the stereo video matches the video picture, thereby enhancing the user's immersion when watching the stereo video.
  • the step of identifying the motion trajectory coordinates of the sound source objects corresponding to different audio elements in the audio stream in the image stream according to the image stream and the audio stream includes:
  • S410 Perform audio data separation on the audio stream to obtain audio elements. Specifically, an audio element separation technology can be used to separate the audio elements in the audio stream to obtain multiple independent audio elements; it can be understood that audio element separation technology includes but is not limited to human voice separation technology, musical instrument sound separation technology, etc.
  • S420 For a target audio element in the audio stream, intercept a first image stream synchronized with the target audio element in the image stream.
  • after the audio elements in the audio stream are obtained, each audio element can be taken in turn as the target audio element for subsequent processing. It can be understood that the sound source position information of the sound source object corresponding to an audio element can be located in the image stream only while that audio element exists; therefore, after the target audio element is obtained, the image stream within the duration of the target audio element can be obtained first, that is, the first image stream synchronized with the target audio element.
  • S430 Input the target audio element and each frame image of the first image stream into a sound source localization model to obtain the sound source position coordinates of the sound source object corresponding to the target audio element in each frame image.
  • the sound source localization model is a trained model used to predict the position information of the sound source object corresponding to the target audio element in the frame image of the first image stream. It can be understood that the sound source localization model can be a neural network model, a machine learning model, etc.
  • after the target audio element and the first image stream corresponding to the target audio element are obtained, the target audio element and each frame image in the first image stream can be input into the sound source localization model.
  • the sound source localization model is used to predict the predicted position coordinates of the sound source object corresponding to the target audio element in the frame image and the confidence corresponding to each predicted position. Then, based on each predicted position coordinate and its confidence, the sound source position coordinates of the sound source object corresponding to the target audio element are determined from the predicted position.
  • the step of inputting the target audio element and each frame image of the first image stream into the sound source localization model, and obtaining the sound source position coordinates of the sound source object corresponding to the target audio element in each frame image can specifically include: obtaining the target frame image and the historical frame image corresponding to the current prediction step from the first image stream; inputting the target audio element and the historical frame image into the sound source localization model, and obtaining the confidence of the sound source object corresponding to the target audio element in different prediction areas in the target frame image; if the maximum confidence among the confidences of each prediction area is greater than a preset confidence threshold, determining the sound source position coordinates of the sound source object corresponding to the target audio element in the target frame image according to the position information of the prediction area corresponding to the maximum confidence; if the maximum confidence among the confidences of each prediction area is less than or equal to the preset confidence threshold, setting the sound source position coordinates of the sound source object corresponding to the target audio element in the target frame image to a null value.
  • the sound source localization model processes the frame images in the first image stream frame by frame, that is, the sound source localization model predicts the position information of the sound source object of the target audio element in a frame image at each prediction step.
  • the target frame image of the current prediction step refers to the frame image currently processed by the sound source localization model in the first image stream
  • the historical frame image refers to the frame image of the historical time period corresponding to the target frame image in the first image stream.
  • the target frame image of the current prediction step is the frame image of the nth frame in the first image stream
  • the historical frame image corresponding to the target frame image can be the frame image of the (n-5)th frame to the (n-1)th frame in the first image stream.
  • the prediction region refers to a possible location of the sound source object corresponding to the target audio element in the current frame image, that is, a candidate sound source position of the target audio element; the confidence of a prediction region refers to the probability that the prediction region is the location of the sound source object corresponding to the target audio element.
  • the historical frame image can be input into the sound source localization model, and the sound source localization model is used to predict the prediction region of the target audio element in the current frame image, as well as the confidence of each prediction region.
  • determine the target prediction area with the largest confidence among the prediction areas.
  • when the confidence of the target prediction area is greater than the preset confidence threshold, determine the target prediction area as the sound source position of the sound source object corresponding to the target audio element.
  • when the confidences of all prediction areas in the target frame image are less than or equal to the preset confidence threshold, determine that there is no sound source position of the sound source object corresponding to the target audio element in the target frame image, and set the sound source position information of the sound source object corresponding to the target audio element in the target frame image to a null value.
  • it can be understood that, when the sound source position information of the sound source object corresponding to the target audio element is null in all frame images of the first image stream, the target audio element is background audio, and the target audio element is not processed subsequently.
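  • the selection logic above can be summarized by the following sketch; `model` stands for a trained sound source localization network, and its call signature is a hypothetical interface assumed for illustration.

    def locate_sound_source(model, audio_element, frames, history=5, conf_thresh=0.5):
        """For each prediction step, pick the prediction area with the highest
        confidence; record None (a null value) when no area is confident enough.

        model(audio, history_frames) is assumed to return a list of
        (x, y, confidence) prediction areas for the current target frame.
        """
        positions = []
        for n in range(len(frames)):
            history_frames = frames[max(0, n - history):n]  # frames n-5 .. n-1
            areas = model(audio_element, history_frames)
            best = max(areas, key=lambda a: a[2]) if areas else None
            if best is not None and best[2] > conf_thresh:
                positions.append((best[0], best[1]))  # sound source coordinates
            else:
                positions.append(None)                # null value: no source found
        # If every entry is None, the element can be treated as background audio.
        return positions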
  • S440 Determine the coordinates of the movement trajectory of the sound source object corresponding to the audio element in the first image stream according to the coordinates of the sound source position in each frame image of the first image stream.
  • the sound source position coordinates corresponding to each frame image can be determined as the motion trajectory coordinates of the sound source object corresponding to the audio element in the first image stream.
  • the step of determining the motion trajectory coordinates of the sound source object corresponding to the target audio element in the image stream according to the sound source position coordinates in each frame image includes: obtaining an invalid frame image in which the sound source position coordinates of the sound source object corresponding to the target audio element are null values; if the invalid frame image includes consecutive invalid frame images whose number is less than a preset value, obtaining the sound source position coordinates in the invalid frame image according to the sound source position coordinates of the sound source object corresponding to the target audio element in the previous frame image and the sound source position coordinates in the subsequent frame image.
  • the preceding frame image refers to the frame image at the moment immediately before the invalid frame images, and the subsequent frame image refers to the frame image at the moment immediately after the invalid frame images.
  • for example, if the invalid frame images are the frame images at moments (n-1), n, and (n+1), the preceding frame image corresponding to the invalid frame images is the frame image at moment (n-2), and the subsequent frame image is the frame image at moment (n+2).
  • specifically, the invalid frame images whose sound source position coordinates are null values are obtained, and the consecutive invalid frame images among all invalid frame images are identified.
  • when the number of consecutive invalid frame images is greater than or equal to a preset value, it is determined that the target audio element is background audio during the time period corresponding to these consecutive invalid frame images; when the number of consecutive invalid frame images is less than the preset value, it is determined that the target audio element is not background audio during that time period.
  • in the latter case, the sound source position coordinates in the invalid frame images can be calculated through an interpolation algorithm based on the sound source position coordinates of the sound source object corresponding to the target audio element in the preceding frame image and in the subsequent frame image.
  • by predicting the sound source position coordinates of the target audio element in the invalid frame images from its sound source position coordinates in the preceding and subsequent frame images, the motion trajectory coordinates in the first image stream are completed, thereby ensuring the integrity of the motion trajectory coordinates of the target audio element. Object-based stereo rendering is subsequently realized based on these motion trajectory coordinates, which can effectively improve the authenticity of the target audio element.
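  • a minimal sketch of this gap-filling step follows; the preset value (the longest run of invalid frames that is still repaired) is an assumed parameter.

    def fill_invalid_frames(positions, preset_value=4):
        """Linearly interpolate short runs of null sound source positions from
        the preceding and subsequent valid frames; runs of preset_value or more
        are left null (background audio over that period)."""
        out = list(positions)
        i = 0
        while i < len(out):
            if out[i] is None:
                j = i
                while j < len(out) and out[j] is None:
                    j += 1                  # the run of invalid frames is [i, j)
                if j - i < preset_value and i > 0 and j < len(out):
                    (x0, y0), (x1, y1) = out[i - 1], out[j]
                    for k in range(i, j):
                        t = (k - i + 1) / (j - i + 1)  # interpolation weight
                        out[k] = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
                i = j
            else:
                i += 1
        return out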
  • FIG. 4B shows the process of obtaining the motion trajectory coordinates of the sound source objects corresponding to different audio elements in the image stream.
  • specifically, after the video data is received, it can be decoded on the CPU to obtain the audio stream and the image stream.
  • the audio stream and the first image stream synchronized with the audio stream are input into the sound source localization model, and the sound source position coordinates of the sound source objects corresponding to the audio elements are marked in each frame image in the first image stream by the sound source localization model.
  • finally, based on the sound source position coordinates of the sound source objects in each frame image, the motion trajectory coordinates of the sound source objects corresponding to the audio elements in the first image stream are determined.
  • the step of identifying the motion trajectory coordinates of the sound source objects corresponding to different audio elements in the audio stream in the image stream according to the image stream and the audio stream includes:
  • Step S510 Separate the audio elements of the audio stream to obtain multiple audio elements, and identify the sound source object type of the sound source object corresponding to each audio element.
  • the sound source object type refers to the type of the object that emits the audio element, including but not limited to a portrait type, a musical instrument type, an animal type, a mechanical type, etc.
  • specifically, an audio element separation technology can be used to separate the audio elements in the audio stream to obtain multiple independent audio elements; after each audio element is obtained, the sound source object type of the sound source object corresponding to the audio element can be identified through a sound source object recognition model, wherein the sound source object recognition model can be a pre-trained neural network model for identifying the sound source object types of the sound source objects corresponding to different audio elements.
  • Step S520 identifying the plane coordinates and image element type of each image element in each frame image in the image stream, and acquiring the trajectory information of each image element in the image stream according to the plane coordinates of each image element in each frame image.
  • the image elements refer to different objects in the frame image, including but not limited to portraits, musical instruments, animals, machinery, etc.; the plane coordinates refer to the position information of the image elements in the frame image; and the image element type refers to the information used to identify the object type corresponding to the image element.
  • the image element recognition model can be used to identify the position information (i.e., plane coordinates) and category information (i.e., image element type) of each image element in the frame image.
  • the image element recognition model can be a pre-trained neural network model for object detection.
  • after the plane coordinates of the different image elements in each frame image of the image stream are obtained, the motion information of the different image elements in the image stream is determined based on the plane coordinates in each frame image.
  • Step S530 for the target audio element in the audio stream, according to the sound source object type of the sound source object corresponding to the target audio element and the image element type of each image element, determine the target image element that matches the sound source object corresponding to the target audio element from the image elements.
  • after the audio elements are determined, each audio element is determined as the target audio element in turn, and the target image element corresponding to that target audio element is then determined among the image elements; specifically, the sound source object type of the sound source object corresponding to the target audio element can be matched against the image element type of each image element. If the sound source object type of the target audio element is the same as a certain image element type, the image element of that type can be determined as the target image element of the target audio element, that is, that image element is the object that emits the target audio element.
  • Step S540 If the sound source object corresponding to the target audio element matches the target image element, the motion trajectory coordinates of the sound source object corresponding to the target audio element in the image stream are generated according to the trajectory information of the target image element.
  • the trajectory information of the target image element in the image stream is determined as the motion trajectory coordinates of the sound source object corresponding to the target audio element in the image stream.
  • further, if the sound source object corresponding to the target audio element cannot be matched to any target image element, the target audio element is background audio.
  • by decoupling the audio data and the image data, the sound source object types of different audio elements and the image element types of different image elements in the frame images are identified respectively; the image element corresponding to each audio element is then determined based on the sound source object type and the image element type, and the trajectory information of the corresponding image element is determined as the motion trajectory coordinates of the sound source object corresponding to the audio element in the image stream. This can improve the efficiency and accuracy of determining the motion trajectory coordinates of the sound source object corresponding to the target audio element in the image stream.
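  • as an illustration of this matching step (assumed inputs: per-element type labels from the separation model and per-frame detections from the object detector):

    def match_trajectories(audio_types, detections):
        """Match each audio element to an image element of the same type and
        take that element's per-frame plane coordinates as its motion trajectory.

        audio_types: {audio_element_id: type}, e.g. {"voice_0": "portrait"}
        detections:  one dict per frame, {image_element_id: (type, (x, y))}
        Returns {audio_element_id: [(x, y) or None per frame]}; an element with
        no match in any frame would be treated as background audio.
        """
        trajectories = {}
        for aid, a_type in audio_types.items():
            traj = []
            for frame in detections:
                match = next(((x, y) for (e_type, (x, y)) in frame.values()
                              if e_type == a_type), None)
                traj.append(match)
            trajectories[aid] = traj
        return trajectories

    frames = [{"p1": ("portrait", (100, 200))}, {"p1": ("portrait", (120, 190))}]
    print(match_trajectories({"voice_0": "portrait"}, frames))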
  • FIG. 5B shows the process of obtaining the coordinates of the motion trajectories of the sound source objects corresponding to different audio elements in the image stream.
  • specifically, after the video data is received, it can be decoded on the CPU to obtain the audio stream and the image stream.
  • the audio elements of the audio stream are separated into multiple audio elements of preset sound source object types through a neural network model for audio element separation, and the sound source object type of the sound source object corresponding to each audio element is marked, wherein the sound source object type includes human voice, musical instrument, animal, machinery and other types.
  • for the image stream, a neural network model for object detection is used to identify multiple image elements of preset image element types in each frame image, and the plane coordinates of each image element in each frame image and its image element type are marked, wherein the image element types include portrait, musical instrument, animal, mechanical product, and other types.
  • the element matching module is used to match audio elements and image elements one by one based on the sound source object type corresponding to the audio element and the image element type corresponding to the image element, so as to obtain image elements corresponding to different audio elements, such as human voice matching human portrait, mechanical sound matching mechanical products, etc.
  • finally, based on the plane coordinates in each frame image of the image element corresponding to each audio element, the motion trajectory coordinates of the sound source objects corresponding to the audio elements in the image stream are determined.
  • the step of constructing a curved grid that matches the display screen of the display device specifically includes: enlarging the equivalent plane corresponding to the display screen based on preset enlargement parameters to obtain a reference two-dimensional plane, and determining the reference origin of the reference two-dimensional plane based on the screen center of the display screen; constructing a spherical grid according to the reference origin of the reference two-dimensional plane and a preset center distance, and determining the spherical grid corresponding to the hemisphere where the reference two-dimensional plane is located as the curved grid.
  • the reference two-dimensional plane refers to the plane obtained by scaling the equivalent plane corresponding to the display screen.
  • specifically, the plane center of the equivalent plane corresponding to the display screen can be taken as the center point, and the equivalent plane of the display screen can be enlarged based on preset enlargement parameters; still taking a TV as the display device, the display screen of the TV is equivalent to a 9-by-16 plane, and this equivalent plane can be enlarged to obtain a 20-by-20 plane as the reference two-dimensional plane.
  • the preset center distance can be set according to the optimal viewing length ratio; for example, the display screen is a TV display screen, and its size is 16:9, then the preset center distance can be set to 70.
  • specifically, after the reference two-dimensional plane is determined, a spherical grid is constructed according to the reference origin of the reference two-dimensional plane and the preset center distance; for example, assuming that the preset center distance is 70 and the sphere center coordinates are (0,0,0), the coordinate information of the reference origin of the reference two-dimensional plane (i.e., the screen center of the display screen) is (0,0,70); a spherical grid passing through the four vertices of the reference two-dimensional plane is then constructed around the sphere center, and the spherical grid of the hemisphere where the reference two-dimensional plane is located is determined as the curved grid.
  • still referring to FIG. 3, the two-dimensional plane 310 is the display screen, the two-dimensional plane 330 is the reference two-dimensional plane, and the three-dimensional surface 320 is the curved grid.
  • the step of obtaining the spatial trajectory coordinates of the sound source object corresponding to each audio element on the curved grid according to the motion trajectory coordinates of the sound source object and the coordinate conversion relationship includes: scaling the motion trajectory coordinates according to the enlargement parameters to obtain the target trajectory coordinates of the sound source object corresponding to each audio element on the reference two-dimensional plane; and calculating, according to the target trajectory coordinates, the spatial trajectory coordinates of the sound source object corresponding to the audio element on the curved grid.
  • after the motion trajectory coordinates of the sound source objects corresponding to the different audio elements in the image stream are obtained, the target trajectory coordinates of the sound source objects on the reference two-dimensional plane, that is, the X-axis and Y-axis coordinates of the sound source objects on the curved grid, can first be calculated based on the preset enlargement parameters; then, the value of each sound source object in the vertical direction on the curved grid, that is, its Z-axis coordinate on the curved grid, can be calculated based on the following formula (1):
  • Z_sp = √(R² - X_sp² - Y_sp²)  (1)
  • where X_sp and Y_sp are the X-axis and Y-axis coordinates of the sound source object corresponding to the audio element on the curved grid (or, equivalently, on the reference two-dimensional plane); Z_sp is the Z-axis coordinate of the sound source object on the curved grid; and R is the radius of the curved grid, that is, the distance from the sphere center to the vertices of the reference two-dimensional plane.
  • by converting the image coordinates corresponding to each frame image in the motion trajectory coordinates into spatial coordinates on the curved grid, the spatial trajectory coordinates on the curved grid are obtained, and the position of the audio element is located based on the spatial trajectory coordinates, supplementing the sound field information in the vertical direction.
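  • putting the scaling and formula (1) together, a minimal sketch of the conversion follows; the 16:9 screen, the 20-by-20 reference plane, and the center distance of 70 follow the example in the text.

    import math

    def to_spatial_coords(px, py, screen_w=16.0, screen_h=9.0,
                          ref_size=20.0, center_dist=70.0):
        """Convert a sound source position (px, py) on the screen (screen
        center as origin) to (X_sp, Y_sp, Z_sp) on the curved grid."""
        # Scale screen coordinates onto the enlarged reference two-dimensional plane.
        x_sp = px * (ref_size / screen_w)
        y_sp = py * (ref_size / screen_h)
        # Grid radius: the sphere passes through the reference plane's vertices,
        # e.g. (10, 10, 70) for a 20-by-20 plane at center distance 70.
        r = math.sqrt((ref_size / 2) ** 2 + (ref_size / 2) ** 2 + center_dist ** 2)
        # Formula (1): Z_sp = sqrt(R^2 - X_sp^2 - Y_sp^2).
        z_sp = math.sqrt(r * r - x_sp ** 2 - y_sp ** 2)
        return x_sp, y_sp, z_sp

    # The top-right screen corner (8, 4.5) maps to the grid vertex (10, 10, 70).
    print(to_spatial_coords(8.0, 4.5))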
  • it should be understood that although the steps in the flowcharts of FIG. 2, FIG. 4, and FIG. 5 are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they can be executed in other orders. Moreover, at least some of the steps in FIG. 2, FIG. 4, and FIG. 5 may include a plurality of sub-steps or a plurality of stages; these sub-steps or stages are not necessarily completed at the same time, but can be executed at different times, and their execution order is not necessarily sequential, but can be performed in turn or alternately with at least a part of other steps or of the sub-steps or stages of other steps.
  • referring to FIG. 6, the video data processing device 600 includes:
  • the curved grid construction module 610 is used to construct a curved grid matching the display screen of the display device and obtain the coordinate conversion relationship between the display screen and the curved grid;
  • the motion trajectory acquisition module 620 is used to acquire the image stream and the audio stream in the video data, and identify the motion trajectory coordinates of the sound source objects corresponding to different audio elements in the audio stream in the image stream according to the image stream and the audio stream;
  • the spatial trajectory acquisition module 630 is used to acquire the spatial trajectory coordinates of the sound source object corresponding to each audio element on the curved grid according to the motion trajectory coordinates of the sound source object and the coordinate conversion relationship;
  • the stereo video construction module 640 is used to construct a stereo video based on the image stream, each audio element in the audio stream, and the spatial trajectory coordinates of the sound source object corresponding to each audio element.
  • the motion trajectory acquisition module 620 is used to separate the audio data of the audio stream to obtain audio elements; for the target audio element in the audio stream, intercept the first image stream synchronized with the target audio element in the image stream; input the target audio element and each frame image of the first image stream into the sound source localization model to obtain the sound source position coordinates of the sound source object corresponding to the target audio element in each frame image; determine the motion trajectory coordinates of the sound source object corresponding to the target audio element in the first image stream according to the sound source position coordinates in each frame image of the first image stream.
  • the motion trajectory acquisition module 620 is used to obtain the target frame image and the historical frame image corresponding to the current prediction step from the first image stream; input the target audio element and the historical frame image into the sound source localization model to obtain the confidence of the sound source object corresponding to the target audio element in different prediction areas in the target frame image; if the maximum confidence among the confidences of each prediction area is greater than a preset confidence threshold, determine the sound source position coordinates of the sound source object corresponding to the target audio element in the target frame image according to the position information of the prediction area corresponding to the maximum confidence; if the maximum confidence among the confidences of each prediction area is less than or equal to the preset confidence threshold, set the sound source position coordinates of the sound source object corresponding to the target audio element in the target frame image to a null value.
  • the motion trajectory acquisition module 620 is used to obtain an invalid frame image in which the sound source position coordinates of the sound source object corresponding to the target audio element are empty values; if the invalid frame image includes consecutive invalid frame images whose number is less than a preset value, the sound source position coordinates in the invalid frame image are obtained based on the sound source position coordinates of the sound source object corresponding to the target audio element in the previous frame image and the sound source position coordinates in the subsequent frame image.
  • the motion trajectory acquisition module 620 is used to separate the audio elements of the audio stream to obtain multiple audio elements, and identify the sound source object type of the sound source object corresponding to each audio element; identify the plane coordinates and image element type of each image element in each frame image in the image stream, and obtain the trajectory information of each image element in the image stream according to the plane coordinates of each image element in each frame image; for the target audio element in the audio stream, determine the target image element that matches the sound source object corresponding to the target audio element from the image elements according to the sound source object type of the sound source object corresponding to the target audio element and the image element type of each image element; if the sound source object corresponding to the target audio element matches the target image element, generate the motion trajectory coordinates of the sound source object corresponding to the target audio element in the image stream according to the trajectory information of the target image element.
  • the curved grid construction module 610 is used to enlarge the equivalent plane corresponding to the display screen based on preset enlargement parameters to obtain a reference two-dimensional plane, and determine the reference origin of the reference two-dimensional plane based on the screen center of the display screen; construct a spherical grid according to the reference origin of the reference two-dimensional plane and a preset center distance, and determine the spherical grid corresponding to the hemisphere where the reference two-dimensional plane is located as the curved grid.
  • the spatial trajectory acquisition module 630 is used to scale the motion trajectory coordinates according to the enlargement parameters to obtain the target trajectory coordinates of the sound source object corresponding to each audio element on the reference two-dimensional plane, and to calculate, based on the target trajectory coordinates, the spatial trajectory coordinates of the sound source object corresponding to the audio element on the curved grid.
  • Each module in the above video data processing device can be implemented in whole or in part by software, hardware and a combination thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or can be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • the video data processing device 600 can be implemented in the form of computer-readable instructions, and the computer-readable instructions can be run on a computer device as shown in FIG. 7.
  • the memory of the computer device can store the various program modules constituting the video data processing device 600, such as the curved grid construction module 610, the motion trajectory acquisition module 620, the spatial trajectory acquisition module 630, and the stereo video construction module 640 shown in FIG. 6.
  • the computer-readable instructions constituted by each program module enable the processor to execute the steps of the video data processing method of each embodiment of the present application described in this specification.
  • the computer device shown in FIG. 7 can perform step S210 through the curved grid construction module 610 in the video data processing device 600 shown in FIG. 6.
  • the computer device can perform step S220 through the motion trajectory acquisition module 620.
  • the computer device can perform step S230 through the spatial trajectory acquisition module 630.
  • the computer device can perform step S240 through the stereo video construction module 640.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus. Among them, the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer-readable instructions.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external computer device through a network connection.
  • when the computer-readable instructions are executed by the processor, a method for processing video data is implemented.
  • FIG. 7 is merely a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • the specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • a display device comprising one or more processors; a memory; and one or more computer-readable instructions, wherein the one or more computer-readable instructions are stored in the memory and configured to be executed by the processor to perform the following steps:
  • a curved grid that matches the display screen of the display device is constructed, and a coordinate conversion relationship between the display screen and the curved grid is obtained;
  • an image stream and an audio stream in the video data are obtained, and the motion trajectory coordinates in the image stream of the sound source objects corresponding to different audio elements in the audio stream are identified according to the image stream and the audio stream;
  • the spatial trajectory coordinates of the sound source object corresponding to each audio element on the curved grid are obtained according to the motion trajectory coordinates of the sound source objects and the coordinate conversion relationship; and
  • a stereo video is constructed based on the image stream, each audio element in the audio stream, and the spatial trajectory coordinates of the sound source object corresponding to each audio element.
  • a non-volatile computer-readable storage medium on which computer-readable instructions are stored.
  • the computer-readable instructions are loaded by a processor, so that the processor performs the following steps:
  • a curved grid that matches the display screen of the display device is constructed, and a coordinate conversion relationship between the display screen and the curved grid is obtained;
  • an image stream and an audio stream in the video data are obtained, and the motion trajectory coordinates in the image stream of the sound source objects corresponding to different audio elements in the audio stream are identified according to the image stream and the audio stream;
  • the spatial trajectory coordinates of the sound source object corresponding to each audio element on the curved grid are obtained according to the motion trajectory coordinates of the sound source objects and the coordinate conversion relationship; and
  • a stereo video is constructed based on the image stream, each audio element in the audio stream, and the spatial trajectory coordinates of the sound source object corresponding to each audio element.
  • Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory or optical memory, etc.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM can be in various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).

Landscapes

  • Stereophonic System (AREA)

Abstract

The present application provides a video data processing method and apparatus, a display device, and a storage medium. The method constructs a curved grid matching the display screen of a display device and obtains the coordinate conversion relationship between the display screen and the curved grid; obtains the image stream and the audio stream in the video data, and identifies, according to the image stream and the audio stream, the motion trajectory coordinates in the image stream of the sound source objects corresponding to different audio elements in the audio stream; obtains, according to the motion trajectory coordinates of the sound source objects and the coordinate conversion relationship, the spatial trajectory coordinates of the sound source object corresponding to each audio element on the curved grid; and constructs a stereo video based on the image stream, each audio element in the audio stream, and the spatial trajectory coordinates of the sound source object corresponding to each audio element. By adding the spatial trajectory coordinates of the audio elements on the curved grid, the audio sound field information in the vertical direction is supplemented, improving the user's sense of immersion when watching the stereo video.

Description

视频数据的处理方法、装置、显示设备以及存储介质 技术领域
本申请涉及音视频数据处理技术领域,具体涉及一种视频数据的处理方法、装置、显示设备以及非易失性计算机可读存储介质(简称存储介质)。
背景技术
随着科技的发展,电视机等显示设备的显示屏幕越来越大,而显示设备的出声位置仍然设置于电视机的底部或两侧位置,而且现有大部分的视频数据中音频为双声道音频,显示屏在播放视频数据时在垂直方向上音频声场缺失,音频的空间感弱,难以与视频画面匹配,用户在使用显示设备过程中沉浸感低。
技术问题
本申请实施例提供一种视频数据的处理方法、装置、显示设备以及存储介质,用以提高音频的空间感以匹配视频画面。
技术解决方案
本申请实施例提供了一种视频数据的处理方法,应用于显示设备,该方法包括:
构建与显示设备的显示屏幕匹配的弧面栅格,并获取显示屏幕与弧面栅格间的坐标转换关系;
获取视频数据中的图像流以及音频流,根据图像流以及音频流识别音频流中不同音频元素对应声源对象在图像流中的运动轨迹坐标;
根据声源对象的运动轨迹坐标以及坐标转换关系,获取各音频元素对应声源对象在弧面栅格上的空间轨迹坐标;及
基于图像流、音频流中各音频元素以及各音频元素对应声源对象的空间轨迹坐标,构建立体声视频。
本申请提供一种视频数据的处理装置,应用于显示设备,该装置包括:
弧面栅格构建模块,用于构建与显示设备的显示屏幕匹配的弧面栅格,并获取显示屏幕与弧面栅格间的坐标转换关系;
运动轨迹获取模块,用于获取视频数据中的图像流以及音频流,根据图像 流以及音频流识别音频流中不同音频元素对应声源对象在图像流中的运动轨迹坐标;
空间轨迹获取模块,用于根据声源对象的运动轨迹坐标以及坐标转换关系,获取各音频元素对应声源对象在弧面栅格上的空间轨迹坐标;及
立体声视频构建模块,用于基于图像流、音频流中各音频元素以及各音频元素对应声源对象的空间轨迹坐标,构建立体声视频。
本申请实施例还提供一种显示装置,该显示装置包括:一个或多个处理器;存储器;以及一个或多个计算机可读指令,其中一个或多个计算机可读指令被存储于存储器中,并配置为由处理器执行以实现以下步骤:
构建与显示设备的显示屏幕匹配的弧面栅格,并获取显示屏幕与弧面栅格间的坐标转换关系;
获取视频数据中的图像流以及音频流,根据图像流以及音频流识别音频流中不同音频元素对应声源对象在图像流中的运动轨迹坐标;
根据声源对象的运动轨迹坐标以及坐标转换关系,获取各音频元素对应声源对象在弧面栅格上的空间轨迹坐标;及
基于图像流、音频流中各音频元素以及各音频元素对应声源对象的空间轨迹坐标,构建立体声视频。
本申请实施例还提供一个或多个存储有计算机可读指令的非易失性计算机可读存储介质,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行以下步骤:
构建与显示设备的显示屏幕匹配的弧面栅格,并获取显示屏幕与弧面栅格间的坐标转换关系;
获取视频数据中的图像流以及音频流,根据图像流以及音频流识别音频流中不同音频元素对应声源对象在图像流中的运动轨迹坐标;
根据声源对象的运动轨迹坐标以及坐标转换关系,获取各音频元素对应声源对象在弧面栅格上的空间轨迹坐标;及
基于图像流、音频流中各音频元素以及各音频元素对应声源对象的空间轨迹坐标,构建立体声视频。
有益效果
本申请的有益效果为:通过构建与显示设备的显示屏幕匹配的弧面栅格,在获取到音频流中各个音频元素对应声源对象在图像流中的运行轨迹坐标后,基于运动轨迹坐标确定各个音频元素在对应弧面栅格的空间轨迹坐标,最终基于各个音频元素在对应弧面栅格的空间轨迹坐标构建包含空间音频的立体声视频,相较于原有的音频流,通过弧面栅格的空间轨迹坐标补全音频流在垂直方向上音频声场信息,使得立体声视频中的音频的空间感匹配视频画面,提升用户观看立体声视频时的沉浸感。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为根据一个或多个实施例中视频数据的处理方法的应用场景图。
图2是根据一个或多个实施例中视频数据的处理方法的流程示意图。
图3是根据一个或多个实施例中显示屏幕与弧面栅格的示意图。
图4A是根据一个或多个实施例中音频元素对应声源对象的运动轨迹坐标获取步骤的流程示意图。
图4B是根据一个或多个实施例中音频元素对应声源对象的运动轨迹坐标获取步骤的另一个示意图。
图5A是根据一个或多个实施例中音频元素对应声源对象的运动轨迹坐标的又一个示意图。
图5B根据一个或多个实施例中音频元素对应声源对象的运动轨迹坐标的再一个示意图。
图6是根据一个或多个实施例中视频数据的处理装置的结构示意图。
图7是根据一个或多个实施例中计算机设备的结构示意图。
本发明的实施方式
这里所公开的具体结构和功能细节仅仅是代表性的,并且是用于描述本申 请的示例性实施例的目的。但是本申请可以通过许多替换形式来具体实现,并且不应当被解释成仅仅受限于这里所阐述的实施例。
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
在本申请的描述中,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个所述特征。在本申请的描述中,“多个”的含义是两个或两个以上,除非另有明确具体的限定。
在本申请的描述中,“例如”一词用来表示“用作例子、例证或说明”。本申请中被描述为“例如”的任何实施例不一定被解释为比其它实施例更优选或更具优势。为了使本领域任何技术人员能够实现和使用本发明,给出了以下描述。在以下描述中,为了解释的目的而列出了细节。应当明白的是,本领域普通技术人员可以认识到,在不使用这些特定细节的情况下也可以实现本发明。在其它实例中,不会对公知的结构和过程进行详细阐述,以避免不必要的细节使本发明的描述变得晦涩。因此,本发明并非旨在限于所示的实施例,而是与符合本申请所公开的原理和特征的最广范围相一致。
本申请提供的视频数据的处理方法,可以应用于如图1所示的应用环境中。其中,终端110通过网络与服务器120通过网络进行通信,以接收服务器120发送的视频数据,同时,终端110构建与显示屏幕匹配的弧面栅格,并获取显示屏幕与弧面栅格间的坐标转换关系,获取视频数据中的图像流以及音频流,根据图像流以及音频流识别音频流中不同音频元素对应声源对象在图像流中的运动轨迹坐标,根据声源对象的运动轨迹坐标以及坐标转换关系,获取各音频元素对应声源对象在弧面栅格上的空间轨迹坐标,最终,基于图像流、音频流中各音频元素以及各音频元素对应声源对象的空间轨迹坐标,构建立体声视频。其中,终端110是带有显示屏幕的计算机设备,可以但不限于是各种个人计算机、笔记本电脑、智能手机、平板电脑和便携式可穿戴设备,服务器120 可以用独立的服务器或者是多个服务器组成的服务器集群来实现。
参阅图2,本申请实施例提供了一种视频数据的处理方法,主要以该方法应用于如图1所示的终端110来举例说明,该方法包括步骤S210至S240,具体如下:
步骤S210,构建与显示设备的显示屏幕匹配的弧面栅格,并获取显示屏幕与弧面栅格间的坐标转换关系。
其中,弧面栅格为基于显示设备的显示屏幕所在平面构建的虚拟栅格,用于模拟音频元素对应声源对象的空间位置,以补全音频元素在垂直方向上的声场信息。其中,显示屏幕与弧面栅格间的坐标转换关系,是指显示屏幕对应的二维坐标与弧面栅格对应三维坐标间的转换关系。可以理解的是,相较于显示屏幕对应二维坐标,弧面栅格对应三维坐标增加了垂直方向上的坐标信息。
具体地,构建与显示设备的显示屏幕匹配的弧面栅格,可以是基于显示设备的显示屏幕构建一个球面,进而,以显示屏幕所在的半球对应的球面作为弧面栅格,并基于球面坐标表达式获取显示屏幕与弧面栅格间的坐标转换关系。
例如,以显示设备是电视机为例,电视机的显示屏幕的尺寸一般为16:9,去显示屏幕的高为9个单位、宽为16个单位,按照最佳观影长度比,显示屏幕与观众间的距离设置为70个单位,因此,可以基于显示屏幕(假设其坐标信息为(0,0,0))以及观众所在位置信息(假设坐标信息为(0,0,-70)),构建以观众所在位置信息为球面中心且经过显示屏幕对于的顶点的球面,并将显示屏幕所在的半球对应的球面作为弧面栅格;参见图3,图3中二维平面310为显示屏幕,三维平面320为与显示屏幕匹配的弧面栅格。
步骤S220,获取视频数据中的图像流以及音频流,根据图像流以及音频流识别音频流中不同音频元素对应声源对象在图像流中的运动轨迹坐标。
其中,视频数据是指显示设备实时接收到的视频内容,图像流是指视频数据中的画面数据,音频流是指视频数据中的音频数据。具体地,显示设备在接收到视频数据后,可将视频数据中的图像数据以及音频数据进行分离,得到图像流以及音频流,便于后续分别对图像流以及音频流进行处理。
音频流中往往包括一个或多个不同的音频元素对应的音频元素,声源对象是指音频元素在图像流中各个帧图像中的发声对象。例如,当音频流中包括有 人物声音,则音频流中包括有人物声音元素;当音频流中包括有汽车引擎声,则音频流中包括有汽车声音元素。具体地,显示设备获取到音频流后,可采用音频元素分离技术,对音频流中的音频元素进行分离,以获取音频流中的多个独立的音频元素;可理解的是,音频元素分离技术包括但不限于人声分离技术、乐器声分离技术等。
其中,音频元素的声源对象是指在图像流的帧图像中,音频元素对应的发声对象或发声点;运动轨迹坐标是指音频元素对应声源对象在图像流中的移动轨迹,例如,音频元素对应声源对象在图像流中第一帧帧图像的左下角,在图像流的第三帧帧图像的右上角,则音频元素对应声源对象在第一帧帧图像至第三帧帧图像的图像流中的运动轨迹坐标为从左下角的坐标点移动至右上角的坐标点;更具体地,运动轨迹坐标可以包括声源对象在图像流中各个帧图像的图像坐标。
具体地,在获取到视频数据中的音频流以及图像流后,针对音频流中的任意一个音频元素,可以识别该音频元素对应声源对象在图像流中各个帧图像中的图像坐标,进而基于该音频元素对应声源对象在各个帧图像对应的图像坐标获取其在图像流中的运动轨迹坐标。
步骤S230,根据声源对象的运动轨迹坐标以及坐标转换关系,获取各音频元素对应声源对象在弧面栅格上的空间轨迹坐标。
其中,空间轨迹坐标是指音频元素对应声源对象在弧面栅格上的坐标信息,可以理解的是,相较于运动轨迹坐标,空间轨迹坐标补充了音频元素对应声源对象在垂直方向上声场信息。
在确定到声源对象的运行轨迹坐标后,具体可以基于显示屏幕的二维平面坐标与弧面栅格的三维平面坐标间的坐标转换关系,将运动轨迹坐标中与各个帧图像对应的图像坐标转换为弧面栅格上的空间坐标,获得在弧面栅格上的空间轨迹坐标。
步骤S240,基于图像流、音频流中各音频元素以及各音频元素对应声源对象的空间轨迹坐标,构建立体声视频。
在获取到音频流中各个音频元素对应声源对象的空间轨迹坐标后,可基于各个音频元素对应声源对象的控件轨迹坐标,对各个音频元素的音频数据进行 音频渲染处理,以获得立体声音频数据,进而结合立体声音频数据以及图像流,生成立体声视频。
上述视频数据的处理方法中,通过构建与显示设备的显示屏幕匹配的弧面栅格,并获取显示屏幕与弧面栅格间的坐标转换关系,然后,获取视频数据中的图像流以及音频流,根据图像流以及音频流识别音频流中不同音频元素对应声源对象在图像流中的运动轨迹坐标,根据声源对象的运动轨迹坐标以及坐标转换关系,获取各音频元素对应声源对象在弧面栅格上的空间轨迹坐标;基于图像流、音频流中各音频元素以及各音频元素对应声源对象的空间轨迹坐标,构建立体声视频。通过构建与显示设备的显示屏幕匹配的弧面栅格,在获取到音频流中各个音频元素对应声源对象在图像流中的运行轨迹坐标后,基于运动轨迹坐标确定各个音频元素在对应弧面栅格的空间轨迹坐标,最终基于各个音频元素在对应弧面栅格的空间轨迹坐标构建包含空间音频的立体声视频,相较于原有的音频流,通过弧面栅格的空间轨迹坐标补全音频流在垂直方向上音频声场信息,使得立体声视频中的音频的空间感匹配视频画面,提升用户观看立体声视频时的沉浸感。
在其中一个实施例中,参见图4A以及图4B,如图4A所示,根据图像流以及音频流识别音频流中不同音频元素对应声源对象在图像流中的运动轨迹坐标的步骤,包括:
S410,对音频流进行音频数据分离,得到音频元素。
其中,可采用音频元素分离技术,对音频流中的音频元素进行分离,以获取音频流中的多个独立的音频元素;可理解的是,音频元素分离技术包括但不限于人声分离技术、乐器声分离技术等。
S420,针对音频流中的目标音频元素,在图像流中截取与目标音频元素同步的第一图像流。
其中,在获取到音频流中的音频元素后,可依次将任意音频元素作为目标音频元素,以进行后续的处理。可以理解的是,当存在音频元素时才可能在图像流中定位音频元素对应声源对象的声源位置信息,因此,在获取到目标音频元素后,可先获取在目标音频元素持续时间段内的图像流,即获取与目标音频元素同步的第一图像流。
S430,将目标音频元素以及第一图像流的各个帧图像输入至声源定位模型,获取目标音频元素对应声源对象在各帧图像中的声源位置坐标。
其中,声源定位模型为已经过训练模型,用于在第一图像流的帧图像中预测目标音频元素对应声源对象的位置信息。可以理解的是,声源定位模型可以是神经网络模型、机器学习模型等。
在获取到目标音频元素以及目标音频元素对应的第一图像流后,可以将目标音频元素以及第一图像流中的各个帧图像输入至声源定位模块中,通过声源定位模型在帧图像中预测目标音频元素对应声源对象所在的预测位置坐标以及各个预测位置对应的置信度,进而基于各个预测位置坐标及其置信度从预测位置中确定目标音频元素对应声源对象的声源位置坐标。
在其中一个实施例中,将目标音频元素以及第一图像流的各个帧图像输入至声源定位模型,获取目标音频元素对应声源对象在各帧图像中的声源位置坐标的步骤,具体可以包括:从第一图像流中获取当前预测步序对应的目标帧图像以及历史帧图像;将目标音频元素以及历史帧图像输入至声源定位模型,获取目标音频元素对应声源对象在目标帧图像中不同预测区域的置信度;若各预测区域的置信度中的最大置信度大于预设置信度阈值,根据最大置信度对应的预测区域的位置信息确定目标音频元素对应声源对象在目标帧图像的声源位置坐标;若各预测区域的置信度中的最大置信度小于或等于预设置信度阈值,将目标音频元素对应声源对象在目标帧图像的声源位置坐标置为空值。
可以理解的是,声源定位模型是对第一图像流中的帧图像进行逐帧处理,即声源定位模型在每一个预测步序预测目标音频元素在一个帧图像中声源对象的位置信息。其中,当前预测步的目标帧图像是指第一图像流中声源定位模型当前处理的帧图像,历史帧图像是指第一图像流中目标帧图像对应的历史时间段的帧图像。例如,当前预测步的目标帧图像为第一图像流中第n帧的帧图像,该目标帧图像对应的历史帧图像可以是第一图像流中第(n-5)帧的帧图像至第(n-1)帧的帧图像。
其中,预测区域是指在当前帧图像中,可能为目标音频元素对应声源对象所处的位置,即目标音频元素的声源位置;预测区域的置信度是指预测区域为目标音频元素对应声源对象所处位置的概率值。具体地,在获取到历史帧图像 后,可以将历史帧图像输入至声源定位模型中,通过声源定位模型预测目标音频元素在当前帧图像中的预测区域,以及各个预测区域的置信度。
确定各个预测区域中置信度最大的目标预测区域,当目标预测区域的置信度大于预设置信度阈值,将该目标预测区域确定为目标音频元素对应声源对象的声源位置,当目标预测区域的置信度小于或等于预设置信度阈值,则目标帧图像中所有预测区域的置信度均小于或等于预设置信度阈值,确定目标帧图像中无目标音频元素对应声源对象的声源位置,将目标音频元素对应声源对象在目标帧图像的声源位置信息置为空值。
可以理解的是,当目标音频元素对应声源对象在第一图像流中的所有帧图像的声源位置信息置均为空值时,目标音频元素为背景音频,后续不对该目标音效元素进行处理。
S440,根据在第一图像流各帧图像中的声源位置坐标确定音频元素对应声源对象在第一图像流中的运动轨迹坐标。
在获取到目标音频元素对应声源对象在第一图像流各个帧图像中的声源位置坐标后,可以将各个帧图像对应的声源位置坐标确定为音频元素对应声源对象在第一图像流中的运动轨迹坐标。
考虑到视频数据中可能存在上一帧帧图像中存在目标音频元素对应声源对象,而当前帧图像不存在目标音频元素对应声源对象,但下一帧帧图像中再次出现目标音频元素对应声源对象的情况,为了保证目标音频元素的运动轨迹的连续性,在其中一个实施例中,根据在各帧图像中的声源位置坐标确定目标音频元素对应声源对象在图像流中的运动轨迹坐标的步骤,包括:获取目标音频元素对应声源对象声源位置坐标为空值的无效帧图像;若无效帧图像中包括数量小于预设数值的连续无效帧图像,根据目标音频元素对应声源对象在前序帧图像的声源位置坐标以及在后序帧图像中的声源位置坐标,获取在无效帧图像中的声源位置坐标。
其中,前序帧图像是指无效帧图像对应的前序时刻的帧图像,后续帧图像是指无效帧图像对应的后序时刻的帧图像;例如,无效帧图像为(n-1)时刻、n时刻以及(n+1)时刻的帧图像,则无效帧图像对应的前序帧图像是指(n-2)时刻的帧图像,无效帧图像对应的后序帧图像是指(n+2)时刻的帧图像。
Specifically, the invalid frame images whose sound source position coordinates are null are obtained, and the consecutive invalid frame images among them are identified. When the number of consecutive invalid frame images is greater than or equal to the preset value, the target audio element is determined to be background audio within the time period corresponding to those frames; when the number is smaller than the preset value, the target audio element is determined not to be background audio within that time period, and the sound source position coordinates in the invalid frame images can then be calculated by an interpolation algorithm based on the sound source position coordinates of the target audio element's sound source object in the preceding frame image and in the succeeding frame image.
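A hedged sketch of this gap-filling step (not part of the original disclosure), assuming the trajectory is stored as a list of per-frame coordinates with None marking invalid frames, and plain linear interpolation as the interpolation algorithm.

```python
def fill_short_gaps(traj: list, max_gap: int) -> list:
    """Linearly interpolate runs of None shorter than max_gap, using the sound
    source positions in the preceding and succeeding frame images."""
    traj = list(traj)
    i = 0
    while i < len(traj):
        if traj[i] is None:
            j = i
            while j < len(traj) and traj[j] is None:
                j += 1
            gap = j - i
            # Only fill short gaps bounded by valid positions on both sides.
            if gap < max_gap and i > 0 and j < len(traj):
                (x0, y0), (x1, y1) = traj[i - 1], traj[j]
                for k in range(gap):
                    t = (k + 1) / (gap + 1)
                    traj[i + k] = (x0 + (x1 - x0) * t, y0 + (y1 - y0) * t)
            i = j
        else:
            i += 1
    return traj
```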
By predicting the sound source position coordinates of the target audio element in the invalid frame images from its sound source position coordinates in the preceding and succeeding frame images, the motion trajectory coordinates in the first image stream are completed, guaranteeing the integrity of the target audio element's motion trajectory coordinates; when stereo objectification is subsequently performed based on those motion trajectory coordinates, the realism of the target audio element can be effectively improved.
Referring to FIG. 4B, FIG. 4B shows the process of obtaining the motion trajectory coordinates in the image stream of the sound source objects corresponding to different audio elements. Specifically, after the video data is received, it can be decoded at the CPU to obtain the audio stream and the image stream. The audio stream and the first image stream synchronized with it are then input into the sound source localization model, which marks the sound source position coordinates of the audio element's sound source object in each frame image of the first image stream; finally, the motion trajectory coordinates of the audio element's sound source object in the first image stream are determined from those per-frame sound source position coordinates.
In one embodiment, referring to FIG. 5A and FIG. 5B, as shown in FIG. 5A, the step of identifying, according to the image stream and the audio stream, the motion trajectory coordinates in the image stream of the sound source objects corresponding to different audio elements in the audio stream includes:
Step S510: perform audio element separation on the audio stream to obtain multiple audio elements, and identify the sound source object type of each audio element's sound source object.
Here, the sound source object type refers to the type of object that emits the audio element, including but not limited to a portrait type, an instrument type, an animal type, a machinery type, and the like. Specifically, an audio element separation technique may be used to separate the audio elements in the audio stream so as to obtain multiple independent audio elements; after each audio element is obtained, the sound source object type of its sound source object can be identified by a sound source object recognition model, which may be a pre-trained neural network model for identifying the sound source object types of different audio elements' sound source objects.
Step S520: identify the plane coordinates and image element type of each image element in each frame image of the image stream, and obtain the trajectory information of each image element in the image stream according to the plane coordinates of each image element in each frame image.
Here, an image element refers to a distinct object in a frame image, including but not limited to a portrait, an instrument, an animal, a machine, and the like; the plane coordinates refer to the position information of the image element in the frame image; and the image element type is information identifying the type of object to which the image element corresponds. Specifically, the position information (i.e., plane coordinates) and category information (i.e., image element type) of each image element in a frame image can be identified by an image element recognition model, which may be a pre-trained neural network model for object detection.
After the plane coordinates of the different image elements in each frame image of the image stream are obtained, the trajectory information of the different image elements in the image stream is determined based on their plane coordinates in each frame image.
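A compact sketch of assembling per-frame detections into per-element trajectory information (not part of the original disclosure); the detection format is an assumption made for illustration.

```python
from collections import defaultdict

def build_trajectories(per_frame_detections: list) -> dict:
    """per_frame_detections: one dict per frame image, mapping an image element
    id to its plane coordinates in that frame; returns each image element's
    trajectory information across the image stream."""
    tracks = defaultdict(list)
    for frame_idx, detections in enumerate(per_frame_detections):
        for element_id, (x, y) in detections.items():
            tracks[element_id].append((frame_idx, x, y))
    return dict(tracks)
```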
Step S530: for a target audio element in the audio stream, determine, from the image elements, a target image element matching the target audio element's sound source object according to the sound source object type of the target audio element's sound source object and the image element type of each image element.
After the audio elements are determined, each audio element is in turn determined as the target audio element, and the target image element corresponding to that target audio element is then determined among the image elements. Specifically, the sound source object type of the target audio element's sound source object can be matched against the image element type of each image element; if the sound source object type of the target audio element is the same as an image element type, the image element of that type can be determined as the target image element of the target audio element, i.e., the object that emits the target audio element.
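The type matching can be sketched as follows (not part of the original disclosure); the type labels and the mapping between them are hypothetical example values.

```python
def match_target_image_element(source_type: str, image_elements: dict):
    """Match a target audio element's sound source object type against the image
    element types; returns the matching element id, or None when the target
    audio element has no matching image element (i.e., is background audio)."""
    # Hypothetical mapping from sound source object types to image element types.
    type_map = {"voice": "portrait", "instrument": "instrument",
                "animal": "animal", "machine": "machinery"}
    wanted = type_map.get(source_type)
    for element_id, element_type in image_elements.items():
        if element_type == wanted:
            return element_id
    return None
```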
Step S540: if the target audio element's sound source object is matched to a target image element, generate the motion trajectory coordinates of the target audio element's sound source object in the image stream according to the trajectory information of the target image element.
After the target audio element's sound source object is matched to the target image element, the trajectory information of the target image element in the image stream is determined as the motion trajectory coordinates of the target audio element's sound source object in the image stream.
Further, if no target image element is matched to the target audio element's sound source object, the target audio element is background audio.
By decoupling the audio data from the image data, separately identifying the sound source object types of the different audio elements and the image element types of the different image elements in the frame images, then determining the image element corresponding to each audio element based on these two types and taking that image element's trajectory information as the motion trajectory coordinates of the audio element's sound source object in the image stream, both the efficiency and the accuracy of determining those motion trajectory coordinates can be improved.
Referring to FIG. 5B, FIG. 5B shows the process of obtaining the motion trajectory coordinates in the image stream of the sound source objects corresponding to different audio elements. Specifically, after the video data is received, it can be decoded at the CPU to obtain the audio stream and the image stream. For the audio stream, a neural network model for audio element separation separates the audio stream into multiple audio elements of preset sound source object types and labels the sound source object type of each audio element's sound source object, the types including human voice, instrument, animal, machinery, and others. For the image stream, a neural network model for object detection identifies, in each frame image, the image elements of multiple preset image element types and labels each image element's plane coordinates in each frame image as well as its image element type, the types including portrait, instrument, animal, machinery, and the like. Finally, an element matching module matches the audio elements to the image elements one by one based on the sound source object types and the image element types, yielding the image element corresponding to each audio element, e.g., a human voice is matched to a portrait and a mechanical sound to a machine; the motion trajectory coordinates of each audio element's sound source object in the image stream are then determined from the corresponding image element's plane coordinates in each frame image.
In one embodiment, the step of constructing the arc grid matching the display screen of the display device specifically includes: enlarging the equivalent plane corresponding to the display screen based on a preset enlargement parameter to obtain a reference two-dimensional plane, and determining the reference origin of the reference two-dimensional plane based on the screen center of the display screen; and constructing a spherical grid from the reference origin of the reference two-dimensional plane and a preset center distance, and determining the spherical grid corresponding to the hemisphere on which the reference two-dimensional plane lies as the arc grid.
Here, the reference two-dimensional plane is the plane obtained by scaling the equivalent plane corresponding to the display screen. Specifically, the equivalent plane of the display screen can be enlarged based on the preset enlargement parameter, with the plane center of the equivalent plane as the center point. Taking a television as the display device again, if the display screen of the television is equivalent to a 16-by-9 plane, the equivalent plane can be enlarged to obtain a 20-by-20 plane as the reference two-dimensional plane.
Here, the preset center distance can be set according to an optimal viewing-distance ratio; for example, if the display screen is a television display screen with a 16:9 aspect ratio, the preset center distance may be set to 70.
Specifically, after the reference two-dimensional plane is determined, the spherical grid is constructed from the reference origin of the reference two-dimensional plane and the preset center distance. For example, assuming the preset center distance is 70 and the sphere center coordinates are (0, 0, 0), the coordinates of the reference origin of the reference two-dimensional plane (i.e., the screen center of the display screen) are (0, 0, 70); a spherical grid passing through the four vertices of the reference two-dimensional plane is then constructed around the sphere center, and the spherical grid of the hemisphere on which the reference two-dimensional plane lies is determined as the arc grid. Referring again to FIG. 3, the two-dimensional plane 310 in FIG. 3 is the display screen, the two-dimensional plane 330 is the reference two-dimensional plane, and the curved surface 320 is the arc grid.
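A numeric sketch of this construction (not part of the original disclosure), using the example values above; the 20-by-20 reference plane gives half-extents of 10.

```python
import math

half_w, half_h = 10.0, 10.0   # half-extents of the 20-by-20 reference plane
center_dist = 70.0            # preset center distance from (0, 0, 0) to the plane

# The spherical grid passes through the four vertices (±10, ±10, 70) of the
# reference two-dimensional plane, so its radius is the center-to-vertex distance.
radius = math.sqrt(half_w ** 2 + half_h ** 2 + center_dist ** 2)  # ≈ 71.41
```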
In one embodiment, the step of obtaining, according to the motion trajectory coordinates of the sound source object and the coordinate conversion relationship, the spatial trajectory coordinates of each audio element's sound source object on the arc grid includes: scaling the motion trajectory coordinates according to the enlargement parameter to obtain the target trajectory coordinates of each audio element's sound source object on the reference two-dimensional plane; and calculating, from the target trajectory coordinates, the spatial trajectory coordinates of the audio element's sound source object on the arc grid.
After the motion trajectory coordinates of the different audio elements' sound source objects in the image stream are obtained, the target trajectory coordinates of each sound source object on the reference two-dimensional plane, i.e., its X-axis and Y-axis coordinates on the arc grid, can first be calculated based on the preset enlargement parameter; the value of the sound source object in the vertical direction of the arc grid, i.e., its Z-axis coordinate on the arc grid, can then be calculated by the following formula (1):
Z_sp = √(R² − X_sp² − Y_sp²)      (1)
where X_sp and Y_sp are the X-axis and Y-axis coordinates of the audio element's sound source object on the arc grid (equivalently, on the reference two-dimensional plane), Z_sp is its Z-axis coordinate on the arc grid, and R is the radius of the spherical grid.
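Combining the scaling step with formula (1), a hedged end-to-end sketch (not part of the original disclosure) of the coordinate conversion might look like this; image coordinates are assumed to be already expressed relative to the screen center.

```python
import math

def to_arc_grid(traj_2d: list, scale: float, radius: float) -> list:
    """Convert motion trajectory coordinates (per-frame image coordinates) into
    spatial trajectory coordinates on the arc grid."""
    traj_3d = []
    for x_img, y_img in traj_2d:
        # Scale onto the reference two-dimensional plane (X and Y on the grid).
        x_sp, y_sp = x_img * scale, y_img * scale
        # Lift onto the spherical grid via formula (1): Z = sqrt(R^2 - X^2 - Y^2).
        z_sp = math.sqrt(max(radius ** 2 - x_sp ** 2 - y_sp ** 2, 0.0))
        traj_3d.append((x_sp, y_sp, z_sp))
    return traj_3d
```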
By converting the image coordinates corresponding to each frame image in the motion trajectory coordinates into spatial coordinates on the arc grid, the spatial trajectory coordinates on the arc grid are obtained; the position of the audio element is then located based on those spatial trajectory coordinates, completing the sound field information in the vertical direction.
It should be understood that, although the steps in the flowcharts of FIG. 2, FIG. 4 and FIG. 5 are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and the steps may be executed in other orders. Moreover, at least some of the steps in FIG. 2, FIG. 4 and FIG. 5 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments, and whose execution order is not necessarily sequential but may alternate or interleave with other steps or with at least part of the sub-steps or stages of other steps.
In order to better implement the video data processing method provided by the embodiments of the present application, an apparatus for processing video data is further provided in the embodiments of the present application on the basis of that method. As shown in FIG. 6, the video data processing apparatus 600 includes:
an arc grid construction module 610, configured to construct an arc grid matching the display screen of the display device and obtain the coordinate conversion relationship between the display screen and the arc grid;
a motion trajectory acquisition module 620, configured to obtain the image stream and the audio stream in the video data, and identify, according to the image stream and the audio stream, the motion trajectory coordinates in the image stream of the sound source objects corresponding to different audio elements in the audio stream;
a spatial trajectory acquisition module 630, configured to obtain, according to the motion trajectory coordinates of the sound source object and the coordinate conversion relationship, the spatial trajectory coordinates of each audio element's sound source object on the arc grid; and
a stereo video construction module 640, configured to construct a stereo video based on the image stream, each audio element in the audio stream, and the spatial trajectory coordinates of the sound source object corresponding to each audio element.
In some embodiments of the present application, the motion trajectory acquisition module 620 is configured to: perform audio data separation on the audio stream to obtain audio elements; for a target audio element in the audio stream, cut from the image stream a first image stream synchronized with the target audio element; input the target audio element and each frame image of the first image stream into a sound source localization model to obtain the sound source position coordinates of the target audio element's sound source object in each frame image; and determine the motion trajectory coordinates of the target audio element's sound source object in the first image stream according to the sound source position coordinates in each frame image of the first image stream.
In some embodiments of the present application, the motion trajectory acquisition module 620 is configured to: obtain, from the first image stream, the target frame image and the historical frame images corresponding to the current prediction step; input the target audio element and the historical frame images into the sound source localization model to obtain the confidences of different prediction regions for the target audio element's sound source object in the target frame image; if the maximum confidence among the prediction regions is greater than a preset confidence threshold, determine the sound source position coordinates of the target audio element's sound source object in the target frame image according to the position information of the prediction region with the maximum confidence; and, if the maximum confidence among the prediction regions is less than or equal to the preset confidence threshold, set the sound source position coordinates of the target audio element's sound source object in the target frame image to a null value.
In some embodiments of the present application, the motion trajectory acquisition module 620 is configured to: obtain the invalid frame images in which the sound source position coordinates of the target audio element's sound source object are null; and, if the invalid frame images include consecutive invalid frame images whose number is smaller than a preset value, obtain the sound source position coordinates in the invalid frame images according to the sound source position coordinates of the target audio element's sound source object in the preceding frame image and in the succeeding frame image.
In some embodiments of the present application, the motion trajectory acquisition module 620 is configured to: perform audio element separation on the audio stream to obtain multiple audio elements, and identify the sound source object type of each audio element's sound source object; identify the plane coordinates and image element type of each image element in each frame image of the image stream, and obtain the trajectory information of each image element in the image stream according to the plane coordinates of each image element in each frame image; for a target audio element in the audio stream, determine, from the image elements, a target image element matching the target audio element's sound source object according to the sound source object type of the target audio element's sound source object and the image element type of each image element; and, if the target audio element's sound source object is matched to a target image element, generate the motion trajectory coordinates of the target audio element's sound source object in the image stream according to the trajectory information of the target image element.
In some embodiments of the present application, the arc grid construction module 610 is configured to: enlarge the equivalent plane corresponding to the display screen based on a preset enlargement parameter to obtain a reference two-dimensional plane, and determine the reference origin of the reference two-dimensional plane based on the screen center of the display screen; and construct a spherical grid from the reference origin of the reference two-dimensional plane and a preset center distance, and determine the spherical grid corresponding to the hemisphere on which the reference two-dimensional plane lies as the arc grid.
In some embodiments of the present application, the spatial trajectory acquisition module 630 is configured to: scale the motion trajectory coordinates according to the enlargement parameter to obtain the target trajectory coordinates of each audio element's sound source object on the reference two-dimensional plane; and calculate, from the target trajectory coordinates, the spatial trajectory coordinates of the audio element's sound source object on the arc grid.
For the specific limitations of the video data processing apparatus, reference may be made to the limitations of the video data processing method above, which are not repeated here. Each module in the above video data processing apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of a processor in a computer device in hardware form, or stored in a memory in the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In some embodiments of the present application, the video data processing apparatus 600 may be implemented in the form of computer-readable instructions, which can run on a computer device as shown in FIG. 7. The memory of the computer device may store the program modules constituting the video data processing apparatus 600, such as the arc grid construction module 610, the motion trajectory acquisition module 620, the spatial trajectory acquisition module 630, and the stereo video construction module 640 shown in FIG. 6. The computer-readable instructions constituted by these program modules cause the processor to execute the steps of the video data processing method of the embodiments of the present application described in this specification.
For example, the computer device shown in FIG. 7 may execute step S210 through the arc grid construction module 610 in the video data processing apparatus 600 shown in FIG. 6, step S220 through the motion trajectory acquisition module 620, step S230 through the spatial trajectory acquisition module 630, and step S240 through the stereo video construction module 640. The computer device includes a processor, a memory, and a network interface connected via a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer-readable instructions, and the internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The network interface of the computer device is used to communicate with external computer devices via a network connection. When executed by the processor, the computer-readable instructions implement a method for processing video data.
Those skilled in the art can understand that the structure shown in FIG. 7 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In some embodiments of the present application, a display device is provided, including one or more processors; a memory; and one or more computer-readable instructions, wherein the one or more computer-readable instructions are stored in the memory and configured to be executed by the processor to perform the following steps:
constructing an arc grid matching the display screen of the display device, and obtaining the coordinate conversion relationship between the display screen and the arc grid;
obtaining the image stream and the audio stream in the video data, and identifying, according to the image stream and the audio stream, the motion trajectory coordinates in the image stream of the sound source objects corresponding to different audio elements in the audio stream;
obtaining, according to the motion trajectory coordinates of the sound source object and the coordinate conversion relationship, the spatial trajectory coordinates of each audio element's sound source object on the arc grid; and
constructing a stereo video based on the image stream, each audio element in the audio stream, and the spatial trajectory coordinates of the sound source object corresponding to each audio element.
In some embodiments of the present application, a non-volatile computer-readable storage medium is provided, on which computer-readable instructions are stored; the computer-readable instructions are loaded by a processor so that the processor performs the following steps:
constructing an arc grid matching the display screen of the display device, and obtaining the coordinate conversion relationship between the display screen and the arc grid;
obtaining the image stream and the audio stream in the video data, and identifying, according to the image stream and the audio stream, the motion trajectory coordinates in the image stream of the sound source objects corresponding to different audio elements in the audio stream;
obtaining, according to the motion trajectory coordinates of the sound source object and the coordinate conversion relationship, the spatial trajectory coordinates of each audio element's sound source object on the arc grid; and
constructing a stereo video based on the image stream, each audio element in the audio stream, and the spatial trajectory coordinates of the sound source object corresponding to each audio element.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by instructing relevant hardware through computer-readable instructions, which may be stored in a non-volatile computer-readable storage medium; when executed, the computer-readable instructions may include the processes of the embodiments of the above methods. Any reference to memory, storage, database, or other media used in the embodiments provided in the present application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, and the like. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in the combination of these technical features, they should all be considered within the scope of this specification.
The method, apparatus, display device, and storage medium for processing video data provided by the embodiments of the present application have been introduced in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the descriptions of the above embodiments are only intended to help understand the method of the present invention and its core idea. Meanwhile, for those skilled in the art, there will be changes in the specific implementations and the scope of application according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (11)

  1. A method for processing video data, applied to a display device, the method comprising:
    constructing an arc grid matching a display screen of the display device, and obtaining a coordinate conversion relationship between the display screen and the arc grid;
    obtaining an image stream and an audio stream in video data, and identifying, according to the image stream and the audio stream, motion trajectory coordinates, in the image stream, of sound source objects corresponding to different audio elements in the audio stream;
    obtaining, according to the motion trajectory coordinates of the sound source objects and the coordinate conversion relationship, spatial trajectory coordinates, on the arc grid, of the sound source object corresponding to each of the audio elements; and
    constructing a stereo video based on the image stream, each of the audio elements in the audio stream, and the spatial trajectory coordinates of the sound source object corresponding to each of the audio elements.
  2. The method according to claim 1, wherein the step of identifying, according to the image stream and the audio stream, the motion trajectory coordinates, in the image stream, of the sound source objects corresponding to different audio elements in the audio stream comprises:
    performing audio data separation on the audio stream to obtain audio elements;
    for a target audio element in the audio stream, cutting from the image stream a first image stream synchronized with the target audio element;
    inputting the target audio element and each frame image of the first image stream into a sound source localization model to obtain sound source position coordinates of the sound source object corresponding to the target audio element in each of the frame images; and
    determining, according to the sound source position coordinates in each of the frame images of the first image stream, the motion trajectory coordinates of the sound source object corresponding to the target audio element in the first image stream.
  3. The method according to claim 2, wherein the step of inputting the target audio element and each frame image of the first image stream into the sound source localization model to obtain the sound source position coordinates of the sound source object corresponding to the target audio element in each of the frame images comprises:
    obtaining, from the first image stream, a target frame image and historical frame images corresponding to a current prediction step;
    inputting the target audio element and the historical frame images into the sound source localization model to obtain confidences of different prediction regions for the sound source object corresponding to the target audio element in the target frame image;
    if a maximum confidence among the confidences of the prediction regions is greater than a preset confidence threshold, determining the sound source position coordinates of the sound source object corresponding to the target audio element in the target frame image according to position information of the prediction region corresponding to the maximum confidence; and
    if the maximum confidence among the confidences of the prediction regions is less than or equal to the preset confidence threshold, setting the sound source position coordinates of the sound source object corresponding to the target audio element in the target frame image to a null value.
  4. The method according to claim 3, wherein the step of determining, according to the sound source position coordinates in each of the frame images of the first image stream, the motion trajectory coordinates of the sound source object corresponding to the target audio element in the first image stream comprises:
    obtaining invalid frame images in which the sound source position coordinates of the sound source object corresponding to the target audio element are null; and
    if the invalid frame images comprise consecutive invalid frame images whose number is smaller than a preset value, obtaining the sound source position coordinates in the invalid frame images according to the sound source position coordinates of the sound source object corresponding to the target audio element in a preceding frame image and in a succeeding frame image.
  5. The method according to claim 1, wherein the step of identifying, according to the image stream and the audio stream, the motion trajectory coordinates, in the image stream, of the sound source objects corresponding to different audio elements in the audio stream comprises:
    performing audio element separation on the audio stream to obtain multiple audio elements, and identifying a sound source object type of the sound source object corresponding to each of the audio elements;
    identifying plane coordinates and an image element type of each image element in each frame image of the image stream, and obtaining trajectory information of each of the image elements in the image stream according to the plane coordinates of each of the image elements in each of the frame images;
    for a target audio element in the audio stream, determining, from the image elements, a target image element matching the sound source object corresponding to the target audio element according to the sound source object type of the sound source object corresponding to the target audio element and the image element type of each of the image elements; and
    if the sound source object corresponding to the target audio element is matched to the target image element, generating the motion trajectory coordinates of the sound source object corresponding to the target audio element in the image stream according to the trajectory information of the target image element.
  6. The method according to claim 1, wherein the step of constructing the arc grid matching the display screen of the display device comprises:
    enlarging an equivalent plane corresponding to the display screen based on a preset enlargement parameter to obtain a reference two-dimensional plane, and determining a reference origin of the reference two-dimensional plane based on a screen center of the display screen; and
    constructing a spherical grid from the reference origin of the reference two-dimensional plane and a preset center distance, and determining the spherical grid corresponding to the hemisphere on which the reference two-dimensional plane lies as the arc grid.
  7. The method according to claim 6, wherein the step of obtaining, according to the motion trajectory coordinates of the sound source objects and the coordinate conversion relationship, the spatial trajectory coordinates, on the arc grid, of the sound source object corresponding to each of the audio elements comprises:
    scaling the motion trajectory coordinates according to the enlargement parameter to obtain target trajectory coordinates of the sound source object corresponding to each of the audio elements on the reference two-dimensional plane; and
    calculating, according to the target trajectory coordinates, the spatial trajectory coordinates of the sound source object corresponding to the audio element on the arc grid.
  8. The method according to claim 1, wherein the step of constructing the stereo video based on the image stream, each of the audio elements in the audio stream, and the spatial trajectory coordinates of the sound source object corresponding to each of the audio elements comprises:
    performing audio rendering on each of the audio elements based on the spatial trajectory coordinates of the sound source object corresponding to each of the audio elements to obtain stereo audio data; and
    generating the stereo video by combining the stereo audio data with the image stream.
  9. An apparatus for processing video data, applied to a display device, the apparatus comprising:
    an arc grid construction module, configured to construct an arc grid matching a display screen of the display device and obtain a coordinate conversion relationship between the display screen and the arc grid;
    a motion trajectory acquisition module, configured to obtain an image stream and an audio stream in video data, and identify, according to the image stream and the audio stream, motion trajectory coordinates, in the image stream, of sound source objects corresponding to different audio elements in the audio stream;
    a spatial trajectory acquisition module, configured to obtain, according to the motion trajectory coordinates of the sound source objects and the coordinate conversion relationship, spatial trajectory coordinates, on the arc grid, of the sound source object corresponding to each of the audio elements; and
    a stereo video construction module, configured to construct a stereo video based on the image stream, each of the audio elements in the audio stream, and the spatial trajectory coordinates of the sound source object corresponding to each of the audio elements.
  10. A display device, comprising:
    one or more processors;
    a memory; and
    one or more computer-readable instructions, wherein the one or more computer-readable instructions are stored in the memory and configured to be executed by the processors to implement the following steps:
    constructing an arc grid matching a display screen of the display device, and obtaining a coordinate conversion relationship between the display screen and the arc grid;
    obtaining an image stream and an audio stream in video data, and identifying, according to the image stream and the audio stream, motion trajectory coordinates, in the image stream, of sound source objects corresponding to different audio elements in the audio stream;
    obtaining, according to the motion trajectory coordinates of the sound source objects and the coordinate conversion relationship, spatial trajectory coordinates, on the arc grid, of the sound source object corresponding to each of the audio elements; and
    constructing a stereo video based on the image stream, each of the audio elements in the audio stream, and the spatial trajectory coordinates of the sound source object corresponding to each of the audio elements.
  11. A non-volatile computer-readable storage medium, storing computer-readable instructions, wherein the computer-readable instructions are loaded by a processor to perform the following steps:
    constructing an arc grid matching a display screen of a display device, and obtaining a coordinate conversion relationship between the display screen and the arc grid;
    obtaining an image stream and an audio stream in video data, and identifying, according to the image stream and the audio stream, motion trajectory coordinates, in the image stream, of sound source objects corresponding to different audio elements in the audio stream;
    obtaining, according to the motion trajectory coordinates of the sound source objects and the coordinate conversion relationship, spatial trajectory coordinates, on the arc grid, of the sound source object corresponding to each of the audio elements; and
    constructing a stereo video based on the image stream, each of the audio elements in the audio stream, and the spatial trajectory coordinates of the sound source object corresponding to each of the audio elements.