CN116939242A - Video image processing method and device, electronic equipment and storage medium - Google Patents

Video image processing method and device, electronic equipment and storage medium

Info

Publication number
CN116939242A
Authority
CN
China
Prior art keywords
video image
target user
video
dimensional
speaking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310900219.6A
Other languages
Chinese (zh)
Inventor
李文宇
陈丽莉
苗京花
李治富
郑超
马思研
李言
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Beijing BOE Technology Development Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Beijing BOE Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd and Beijing BOE Technology Development Co Ltd
Priority to CN202310900219.6A
Publication of CN116939242A
Legal status: Pending


Classifications

    • H ELECTRICITY
        • H04 ELECTRIC COMMUNICATION TECHNIQUE
            • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
                • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
                    • H04L65/1066 Session management
                        • H04L65/1083 In-session procedures
                    • H04L65/40 Support for services or applications
                        • H04L65/403 Arrangements for multi-party communication, e.g. for conferences
            • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
                • H04N7/00 Television systems
                    • H04N7/14 Systems for two-way working
                        • H04N7/15 Conference systems
                • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
                    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
                        • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
                            • H04N21/233 Processing of audio elementary streams
                            • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
                                • H04N21/23418 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
                                • H04N21/23424 Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
                    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
                        • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
                            • H04N21/439 Processing of audio elementary streams
                                • H04N21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
                            • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
                                • H04N21/44016 Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application provides a video image processing method and device, electronic equipment and a storage medium. The method may include: determining the target user who is speaking among a plurality of candidate users; capturing video images of the target user; and encoding the video images together with the collected audio data of the target user before transmitting them over the network. Because the captured video images are focused on the target user, the user who is speaking can be highlighted during the video call and redundant information in the call can be removed. Accordingly, the picture played at the video display end shows the target user who is speaking, so a viewer at the display end focuses on the speaking target user without seeing the other candidate users around them. This prevents important information from being missed and improves the efficiency of the video call.

Description

Video image processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method and apparatus for processing a video image, an electronic device, and a storage medium.
Background
With the development of the mobile internet, people increasingly communicate by video over the network, for example through video conferences and video calls. Existing video communication simply rebroadcasts the multiparty conference scene in real time, so the video picture usually contains redundant information such as the environmental background and non-speakers. Displaying this redundant information tends to make the participants in the video communication lose focus and eventually miss important information.
Disclosure of Invention
The embodiment of the application provides a video image processing method, a video image processing device, electronic equipment and a storage medium.
In a first aspect, an embodiment of the present application provides a method for processing a video image, where the method may include:
the target user who is speaking is determined among a plurality of candidate users.
Video images of a target user are acquired.
And encoding the video image and the collected audio data of the target user, and then carrying out network transmission.
In a second aspect, an embodiment of the present application provides a method for processing a video image, where the method may include:
and decoding the received audio and video data to obtain audio data and video images.
And processing the video image by utilizing a pre-trained three-dimensional reconstruction model to obtain a two-dimensional video image.
And carrying out three-dimensional fusion processing on the two-dimensional video image to obtain a three-dimensional video image.
And combining the three-dimensional video image with the audio data for playing.
In a third aspect, an embodiment of the present application provides a processing apparatus for a video image, where the apparatus may include:
the target user determining module is used for determining a target user which is speaking from a plurality of candidate users;
the video image acquisition module is used for acquiring video images of target users;
and the transmission module is used for carrying out network transmission after encoding the video image and the collected audio data of the target user.
In a fourth aspect, an embodiment of the present application provides a processing apparatus for a video image, which may include:
the decoding module is used for decoding the received audio and video data to obtain audio data and video images;
the two-dimensional video image generation module is used for processing the video image by utilizing a pre-trained three-dimensional reconstruction model to obtain a two-dimensional video image;
the three-dimensional image generation module is used for carrying out three-dimensional fusion processing on the two-dimensional video image to obtain a three-dimensional video image;
and the video playing module is used for combining the three-dimensional video image with the audio data to play the video.
In a fifth aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory; the processor implements any one of the methods described above when executing the computer program.
In a sixth aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored therein, which when executed by a processor, implements a method as in any of the above.
Compared with the prior art, the application has the following advantages:
according to the embodiment of the application, the collected video images are aimed at the target user, so that the user who is speaking can be highlighted during the video call. Finally, the redundant information in the conversation process can be removed. Correspondingly, the picture played on the video display end is the target user who is speaking. Therefore, the focus of the observer at the display end can be aimed at the target user who is speaking and other candidate users around the target user can not be seen, important information is prevented from being missed, and the efficiency of video call is improved.
The foregoing is only an overview of the present application. In order that the technical means of the present application may be more clearly understood and implemented in accordance with the contents of the specification, and in order to make the above and other objects, features and advantages of the present application more apparent, specific embodiments of the application are set forth below.
Drawings
In the drawings, the same reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily drawn to scale. It is appreciated that these drawings depict only some embodiments according to the application and are not therefore to be considered limiting of its scope.
FIG. 1 is a flowchart of a video image processing method according to the present application;
FIG. 2 is a schematic view of a video image processing method according to an embodiment of the application;
FIG. 3 is a schematic diagram of a video image including a target user in accordance with an embodiment of the present application;
FIG. 4 is a schematic view of a mouth keypoint of an embodiment of the present application;
FIG. 5 is a second flowchart of a video image processing method according to the present application;
FIG. 6 is a schematic diagram of a video processing end and a video display end provided by the present application;
FIG. 7 is a first block diagram of a video image processing apparatus according to an embodiment of the present application;
FIG. 8 is a second block diagram of a video image processing apparatus according to an embodiment of the present application; and
FIG. 9 is a block diagram of an electronic device used to implement an embodiment of the application.
Detailed Description
Hereinafter, only certain exemplary embodiments are briefly described. As will be recognized by those skilled in the pertinent art, the described embodiments may be modified in numerous different ways without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
In order to facilitate understanding of the technical solutions of the embodiments of the present application, the following describes related technologies of the embodiments of the present application. The following related technologies may be optionally combined with the technical solutions of the embodiments of the present application, which all belong to the protection scope of the embodiments of the present application.
An embodiment of the present application provides a method for processing a video image, corresponding to the first embodiment; fig. 1 is a flowchart of the method. The scenario of the first embodiment is video capture and transmission, and the method may include:
step S101: the target user who is speaking is determined among a plurality of candidate users.
Step S102: video images of a target user are acquired.
Step S103: and encoding the video image and the collected audio data of the target user, and then carrying out network transmission.
The video images of the present application may include images of a video conference or a video call, etc. The execution subject of the application can be an intelligent terminal with video image acquisition and playback functions, such as a mobile phone or a handheld computer, or a server in remote communication with image acquisition equipment such as an intelligent terminal or a camera. For convenience of distinction, the execution subject of the first embodiment may be referred to as the video processing end, and the execution subject of the subsequent second embodiment as the video display end. The internal structures of the video processing end and the video display end may be the same.
For the video processing end, as shown in fig. 2, the video image capturing apparatus 201 that performs the video image capturing function may be a camera. The example shown in fig. 2 includes six video image capturing apparatuses 201, each of which may have a unique number. Fig. 2 also includes a video image display device 202 for displaying video pictures containing the video call counterpart.
In the example shown in fig. 2, a plurality of candidate users 203 may be included. A plurality of video image capturing apparatuses 201 are provided so that a complete face image of each candidate user 203 can be captured by at least one video image capturing apparatus 201.
Take a video conference scenario as an example. Typically, only one user at a meeting place speaks at a time, so the user who is speaking can be regarded as the target user. The target user may be determined by facial recognition techniques: for example, by detecting the face of each candidate user in the video image, the candidate user whose mouth moves frequently may be determined to be the target user who is speaking. The present example uses a single video frame for illustration; in a real scene, the video image refers to an image sequence, that is, the video processing needs to perform the same processing on multiple frames of video images.
After the target user is determined, the target user may be marked with an identification. The identification may include the location, clothing, etc. of the target user, so that the target user can be queried or tracked, based on the identification, in the video images captured by the plurality of video image capturing apparatuses 201. The purpose of the query or tracking is to determine whether an unoccluded view of the target user has been captured. If a certain video image capturing apparatus 201 can capture the complete, unoccluded face of the target user, that apparatus can be used to capture the video image of the target user.
The captured video image may be processed. The processing may include removing the background and removing the other candidate users in the video image, and so forth. As shown in fig. 3, the dashed box in fig. 3 may represent the processed video image containing the target user. If the dashed box meets the display requirement of the video display end, the image in the dashed box can be used directly as the processed video image. If it does not, the dashed box can be expanded according to the resolution corresponding to the display requirement of the video display end until that resolution is met. The expansion may include uniformly padding monochromatic pixels outward from the dashed box until the resolution requirement is reached.
To keep audio and video synchronized, the video images and the collected audio data of the target user are encoded in time order and then transmitted over the network. The other party of the video call can then decode the received audio and video data to obtain the audio data and the video images, completing the video call between the two parties.
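For illustration only, the following minimal Python sketch shows the kind of time-ordered interleaving that this synchronization step implies; the MediaUnit structure and the function name are assumptions for the sketch and not part of the disclosure, and a real system would hand the merged stream to an audio-video encoder.

from dataclasses import dataclass, field
from typing import List
import heapq

@dataclass(order=True)
class MediaUnit:
    timestamp: float                  # capture time in seconds
    kind: str = field(compare=False)  # "video" or "audio"
    payload: bytes = field(compare=False, default=b"")

def interleave_for_encoding(video_units: List[MediaUnit],
                            audio_units: List[MediaUnit]) -> List[MediaUnit]:
    # Merge the two streams by capture timestamp so that the encoder receives
    # video frames and audio chunks in time order, which is what keeps
    # playback at the display end lip-synced.
    return list(heapq.merge(video_units, audio_units))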
Since the captured video images in the application are focused on the target user, the user who is speaking can be highlighted during the video call and redundant information in the call can be removed. Accordingly, the picture played at the video display end shows the target user who is speaking, so a viewer at the video display end focuses on the speaking target user without seeing the other candidate users around them. This prevents important information from being missed and improves the efficiency of the video call.
In one embodiment, the determining, in step S101, the target user who is speaking among the plurality of candidate users may include:
step S1011: the detected sound source position is determined.
Step S1012: a target acquisition device is selected from a plurality of video image acquisition devices according to the sound source position.
Step S1013: from the video image acquired by the target acquisition device, the target user who is speaking is determined.
The video image capturing apparatus 201 may capture audio data in addition to video images. For example, the audio data acquired by the video image acquisition device 201 located at the intermediate position may be used as a positioning reference.
If the video image capturing apparatus 201 is provided with two receivers, the position of the sound source can be determined from the calculated time difference of arrival of the sound at the two receivers. If the video image capturing apparatus 201 is provided with only one receiver, the position of the sound source can be determined from the measured intensity differences of the sound waves.
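As a hedged illustration of the two-receiver case, the direction of the sound source can be estimated from the time difference of arrival under a far-field assumption; the microphone spacing, the speed-of-sound constant and the function name below are assumptions for the sketch.

import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature (assumption)

def bearing_from_tdoa(delta_t: float, mic_spacing: float) -> float:
    # Far-field approximation: sin(theta) = c * delta_t / d, where theta is the
    # angle of the source relative to the perpendicular of the microphone baseline.
    s = max(-1.0, min(1.0, SPEED_OF_SOUND * delta_t / mic_spacing))
    return math.degrees(math.asin(s))

# Example: sound arrives 0.3 ms later at one microphone of a pair 0.2 m apart.
angle = bearing_from_tdoa(0.0003, 0.2)  # roughly 31 degrees off-centre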
After the sound source position is determined, a video image acquisition device close to the sound source position can be used as the target acquisition device. As shown in fig. 2, there are three candidate users, located respectively at the left, middle and right of fig. 2, and six video image capturing apparatuses 201, located respectively at the upper left, lower left, upper middle, lower middle, upper right and lower right of fig. 2. If the target user is the user on the left side of fig. 2, the target acquisition device can be selected from the two video image capturing apparatuses 201 at the upper left and lower left. Further, if the target user is close to the video image capturing apparatuses, the lower-left video image capturing apparatus 201 may be selected as the target acquisition device; if the target user is far from them, the upper-left video image capturing apparatus 201 may be selected. Whether the target user is close or far may be decided by comparing the distance with a predetermined distance threshold.
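A minimal sketch of this selection rule might look as follows; the camera numbering, the zone layout and the 1.5 m threshold are illustrative assumptions rather than values from the disclosure.

from typing import Dict

# Hypothetical layout matching the description: two cameras per zone,
# one mounted low ("lower") and one mounted high ("upper").
CAMERAS_BY_ZONE: Dict[str, Dict[str, int]] = {
    "left":   {"lower": 2, "upper": 1},
    "middle": {"lower": 4, "upper": 3},
    "right":  {"lower": 6, "upper": 5},
}

def select_target_camera(zone: str, speaker_distance_m: float,
                         near_threshold_m: float = 1.5) -> int:
    # A nearby speaker is better framed by the lower camera,
    # a distant speaker by the upper one.
    cameras = CAMERAS_BY_ZONE[zone]
    return cameras["lower"] if speaker_distance_m <= near_threshold_m else cameras["upper"]

camera_id = select_target_camera("left", speaker_distance_m=1.2)  # -> 2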
After the target acquisition device is selected, the target user who is speaking can be determined from the video image acquired by the target acquisition device.
The purpose of selecting the target acquisition device is to reduce the effort of subsequently determining the target user who is speaking. If no target acquisition device were selected, the speaking target user would have to be identified in the video images acquired by all of the video image acquisition devices, a process that is time consuming and computationally demanding. Selecting the target acquisition device according to the position of the speaker means that only a single video image needs to be analysed when identifying the target user, which clearly saves resources.
In one embodiment, the determining the target user who is speaking from the video image acquired by the target acquisition device in step S1013 may include:
step S10131: and carrying out face detection in the video image acquired by the target acquisition device, and determining the face of at least one user.
Step S10132: and positioning key points of the faces of each user, and determining mouth key points.
Step S10133: and determining the target user who is speaking according to the position change condition of the mouth key points.
Face detection refers to detecting face regions in the image sequence. It may be implemented with image-feature-based algorithms such as Haar cascade classifiers, histograms of oriented gradients (HOG), and HOG + SVM feature-classification pipelines.
The detected face may be marked with a minimum bounding rectangle and then tracked, that is, the tracking is performed on the image sequence using an object detection algorithm (such as the YOLO algorithm or the Faster R-CNN algorithm).
For a recognized face, a shape-based algorithm may further be used to determine the locations of the facial keypoints by computing shape features in the image, such as edges, curvatures and convex hulls. For example, on the basis of the HOG + SVM detection, a face keypoint localization algorithm based on image gradients can be used to find the facial keypoints. An advantage of the HOG + SVM approach is that it may be more robust to rotation and scale variations.
As shown in fig. 4, the determined keypoints may be identified by numerals for ease of description. The keypoints identified as 49 through 68 in fig. 4 represent the mouth keypoints. The target user who is speaking can be determined according to the position changes of these mouth keypoints: if a user is speaking, the frequency or magnitude of the changes of the mouth keypoints will typically exceed a corresponding threshold. On this basis, the target user who is speaking can be determined from the position changes of the mouth keypoints.
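For illustration, the face detection and mouth keypoint localization described above could be sketched with the dlib library, whose frontal face detector is HOG + SVM based and whose 68-point landmark predictor indexes the mouth keypoints 48-67 (zero-based), matching keypoints 49-68 of fig. 4; the model file path is an assumption, and this is a sketch rather than the patented implementation.

import dlib

detector = dlib.get_frontal_face_detector()   # HOG + SVM based face detector
# The 68-point landmark model must be downloaded separately; the path is an assumption.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def mouth_keypoints(gray_image):
    # Returns one list of (x, y) mouth keypoints per detected face.
    # Keypoints 49-68 in fig. 4 correspond to indices 48-67 in dlib's zero-based scheme.
    results = []
    for face_rect in detector(gray_image):
        shape = predictor(gray_image, face_rect)
        results.append([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    return results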
In one embodiment, step S10133 may specifically include the following procedure:
and determining the target user who is speaking according to at least one of the displacement change amplitude of the mouth key point and the position change frequency of the mouth key point.
The displacement variation amplitude of the mouth keypoints may include a lateral variation amplitude and a longitudinal variation amplitude. For simplicity of the algorithm, the lateral variation amplitude may be referenced to the location of two keypoints identified as 49, 55. The keypoint identified as 49 is the leftmost mouth keypoint and the keypoint identified as 55 is the rightmost mouth keypoint. An initial distance w of the two keypoints identified as 49, 55 can be calculated. If the rate of change of w exceeds the corresponding threshold, it may be determined that a lateral change occurs. Similarly, the longitudinal variation amplitude may be referenced to the location of two keypoints identified as 52, 58. The keypoint identified as 52 is the uppermost mouth keypoint and the keypoint identified as 58 is the lowermost mouth keypoint. An initial distance h of the two keypoints identified as 52, 58 may be calculated. If the rate of change of h exceeds the corresponding threshold, it may be determined that a longitudinal change occurs.
Thus, at least one of the lateral change and the longitudinal change can be used as the basis for determining the target user who is speaking. The ratio of the two can also be used. For example, the ratio of the lateral initial distance to the longitudinal initial distance in the initial state is w/h, and a corresponding ratio threshold can be set. If a lateral change or a longitudinal change occurs, the changed distance ratio is compared with the ratio threshold, and if the threshold is exceeded, the target user who is speaking can be determined accordingly.
In addition, the frequency of position changes of the mouth keypoints can be used to determine the target user who is speaking. When a candidate user starts speaking, the mouth opens and closes rapidly and frequently within a period of time, whereas the mouth of a candidate user who is not speaking opens and closes at a low frequency; thus, counting the number of opening and closing actions within a certain period of time can determine whether the candidate user is speaking. For the frequency of position changes, the positions of all the mouth keypoints may be considered, or only changes in which more than a certain proportion of the mouth keypoints move (for example, 80% or 90%) may be counted.
Based on this, it is possible to accurately detect whether or not the candidate user is speaking, that is, to detect the target user who is speaking.
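A minimal sketch combining the two criteria above (relative change of the w/h ratio, and open/close frequency over a time window) might look as follows; the thresholds, the 0.35 open/closed cut-off and the frame rate are illustrative assumptions.

def mouth_width_height(mouth):
    # mouth: 20 (x, y) points ordered as keypoints 49-68 of fig. 4.
    left, right = mouth[0], mouth[6]    # keypoints 49 and 55
    top, bottom = mouth[3], mouth[9]    # keypoints 52 and 58
    return abs(right[0] - left[0]), abs(bottom[1] - top[1])

def is_speaking(mouth_sequence, ratio_change_threshold=0.3,
                open_close_per_second=1.0, fps=25):
    # mouth_sequence: per-frame mouth keypoints of one candidate user.
    sizes = [mouth_width_height(m) for m in mouth_sequence]
    ratios = [w / max(h, 1e-6) for w, h in sizes]
    base = max(ratios[0], 1e-6)
    # Amplitude criterion: relative change of the w/h ratio versus the first frame.
    amplitude_hit = any(abs(r - base) / base > ratio_change_threshold for r in ratios)
    # Frequency criterion: count open/closed transitions over the window.
    open_flags = [h > 0.35 * w for w, h in sizes]
    transitions = sum(1 for a, b in zip(open_flags, open_flags[1:]) if a != b)
    seconds = len(mouth_sequence) / fps
    frequency_hit = transitions / seconds >= 2 * open_close_per_second
    return amplitude_hit or frequency_hit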
In one embodiment, in a case that a speaking target user present in a video image acquired by the target acquisition device is occluded, the method may further include:
detecting the video images acquired by the plurality of video image acquisition devices, and selecting a video image in which the speaking target user is not occluded; and taking the video image acquisition device corresponding to that video image as the video image acquisition device for acquiring the video image of the target user.
Because the target acquisition device for acquiring the video image of the target user is determined from the sound source position, the target user can be captured, but the captured target user may be occluded. If the target user is occluded, the display effect at the video display end is affected. On this basis, the video images acquired by the other video image acquisition devices can be examined, and a video image in which the speaking target user is not occluded can be selected.
The examination may apply face tracking, based on the identification of the target user, to the video images acquired by the other video image acquisition devices, so as to select a video image in which the speaking target user is not occluded. Finally, the video image acquisition device corresponding to that video image can be used as the video image acquisition device for acquiring the video image of the target user.
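As a hedged sketch of this fallback, the camera whose latest frame shows the target user's face unoccluded can be chosen; face_fully_visible stands in for whatever face-tracking check is used and is an assumption here.

def choose_unoccluded_camera(frames_by_camera, target_camera_id, face_fully_visible):
    # frames_by_camera: {camera_id: latest frame}.
    # face_fully_visible(frame) -> bool is whatever face-tracking check confirms
    # that the target user's whole face is in view (an assumption here).
    if face_fully_visible(frames_by_camera[target_camera_id]):
        return target_camera_id
    for camera_id, frame in frames_by_camera.items():
        if camera_id != target_camera_id and face_fully_visible(frame):
            return camera_id
    return target_camera_id  # nothing better found; keep the original camera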
In one embodiment, the capturing the video image of the target user in step S102 may specifically include:
in the case that other candidate users are also included in the video image, the other candidate users in the video image are processed so that the target user is highlighted in the processed video image.
Once the target user is determined, the candidate users other than the target user in the video image can be regarded as redundant information. On this basis, the body contour of the target user can be determined from the face recognition result of the target user. The determination may include: preprocessing the video image, which may include image enhancement, graying, binarization, and the like; and then detecting the human body contours in the preprocessed video image using computer vision techniques such as edge detection, feature extraction and morphological processing. Common human body contour detection methods include Sobel operator detection, Prewitt operator detection, and the like. The detected human body contours are extracted with a region extraction algorithm, marked and stored, where the marks establish a one-to-one correspondence between the detected body contours and the detected faces. The video picture is then cropped according to the body contour corresponding to the detected face of the target user. The crop may retain the information about the target user, for example the minimum bounding rectangle of the target user. If the cropped region of the target user has too few pixels, pixel expansion can be performed according to the display requirements of the video display end; for example, monochrome pixels can be padded outward from the minimum bounding rectangle until the pixel requirement of the video display end is met.
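For illustration, cropping to the target user's minimum bounding rectangle and padding with monochrome pixels out to the display end's resolution could be sketched with OpenCV and NumPy as follows; the grey padding value and the fallback resize are assumptions.

import cv2
import numpy as np

def crop_and_pad(frame, bbox, out_w, out_h, pad_value=128):
    # bbox = (x, y, w, h): minimum bounding rectangle of the target user.
    x, y, w, h = bbox
    crop = frame[y:y + h, x:x + w]
    if w > out_w or h > out_h:
        # The crop already exceeds the display resolution: scale it down instead.
        return cv2.resize(crop, (out_w, out_h))
    # Centre the crop on a monochrome canvas of the display resolution.
    canvas = np.full((out_h, out_w, 3), pad_value, dtype=frame.dtype)
    top, left = (out_h - h) // 2, (out_w - w) // 2
    canvas[top:top + h, left:left + w] = crop
    return canvas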
Alternatively, blurring processing may be performed on candidate users other than the target user. Through the processing procedure, the target user can be highlighted in the processed video image.
An embodiment of the present application provides a method for processing a video image, corresponding to the second embodiment; fig. 5 is a flowchart of the method. The scenario of the second embodiment is video display, and the method may include:
step S501: and decoding the received audio and video data to obtain audio data and video images.
Step S502: and processing the video image by utilizing a pre-trained three-dimensional reconstruction model to obtain a two-dimensional video image.
Step S503: and carrying out three-dimensional fusion processing on the two-dimensional video image to obtain a three-dimensional video image.
Step S504: and combining the three-dimensional video image with the audio data for playing.
The video images of the present application may include images of a video conference or a video call, etc. The execution subject can be an intelligent terminal with image acquisition and playback functions, such as a mobile phone or a handheld computer, or a server in remote communication with image acquisition equipment such as an intelligent terminal or a camera. As shown in fig. 6, the first embodiment corresponds to video image acquisition, intelligent data processing and audio-video encoding. The second embodiment corresponds to audio-video decoding, three-dimensional reconstruction, three-dimensional video image generation and 3D display. The intelligent data processing may correspond to determining the target user and deleting or blurring the other candidate users in the video image in the first embodiment. The three-dimensional reconstruction may correspond to processing the video image with the pre-trained three-dimensional reconstruction model in the second embodiment to obtain two-dimensional video images. The three-dimensional video image generation correspondingly performs three-dimensional fusion processing on the two-dimensional video images to obtain a three-dimensional video image.
After the encoded video images and audio data of the target user are transmitted over the network by the sending end, the video display end decodes the received audio and video data to obtain the audio data and the video images. As in the first embodiment, the video image refers to an image sequence. The resulting video image may be the image of fig. 3 that contains only the target user. The video image can then be input into the pre-trained three-dimensional reconstruction model, so that at least two corresponding two-dimensional video images are obtained for each frame input into the model. The at least two corresponding two-dimensional video images may correspond to different viewing angles, different illumination, and so on. The purpose of the three-dimensional reconstruction is to reconstruct, at the video display end, the three-dimensional scene shot by the video image capturing apparatus 201 at the video processing end. The training process of the three-dimensional reconstruction model is described later.
After at least two two-dimensional video images are obtained, a three-dimensional fusion technique can be used to fuse the two-dimensional video images into a three-dimensional video image. Preferably, the three-dimensional video image may be suitable for naked-eye viewing. The three-dimensional video image is then combined with the audio data to complete the three-dimensional display at the video display end.
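The fusion step is not specified in detail; as a hedged stand-in, the sketch below simply composes two reconstructed views into a side-by-side stereo frame, whereas a real naked-eye (autostereoscopic) display would interleave the views according to its own panel layout.

import numpy as np

def fuse_stereo_pair(left_view: np.ndarray, right_view: np.ndarray) -> np.ndarray:
    # Compose two reconstructed views into one side-by-side stereo frame.
    # An autostereoscopic panel would instead interleave the two views
    # column by column according to its lenticular layout.
    assert left_view.shape == right_view.shape
    return np.concatenate([left_view, right_view], axis=1)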
Because the content displayed at the video display end is three-dimensional, the three-dimensional scene shot at the video processing end can be restored, so that the viewer at the video display end has an immersive experience. In addition, if the video image sent from the video processing end is a processed video image that only highlights the speaking target user, the picture played at the video display end is a three-dimensional rendering of the speaking target user. The focus of a viewer at the video display end is therefore directed at the speaking target user, without the other candidate users around the target user being visible, so important information is not missed and the efficiency of the video call is improved.
In one embodiment, the method may further include:
and detecting the position of the target user in the three-dimensional video image, and moving the target user to be within the specified position range when the detection result is that the target user is not within the specified position range.
The three-dimensional video image can be detected to determine whether the target user is within a specified position range of the image, for example the middle region of the three-dimensional video image. If more than a predetermined proportion of the target user lies within the specified position range, the position may be judged compliant; the predetermined proportion may be 80% or 90%, or the criterion may be that 100% of the target user's head lies within the specified position range. Otherwise, the detection result is that the target user is not within the specified position range, and the target user is moved into the specified position range.
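A minimal sketch of this position check, treating the middle third of the frame as the specified position range and 80% as the required proportion (both assumptions consistent with the examples above), might look as follows.

def recenter_if_needed(frame_w, frame_h, bbox, min_overlap=0.8):
    # bbox = (x, y, w, h) of the target user in the rendered frame.
    # Returns the (dx, dy) translation that moves the user into the middle
    # third of the frame when less than min_overlap of the box lies there.
    x, y, w, h = bbox
    cx0, cx1 = frame_w / 3, 2 * frame_w / 3
    cy0, cy1 = frame_h / 3, 2 * frame_h / 3
    ox = max(0.0, min(x + w, cx1) - max(x, cx0))
    oy = max(0.0, min(y + h, cy1) - max(y, cy0))
    if (ox * oy) / (w * h) >= min_overlap:
        return (0.0, 0.0)                  # already within the specified range
    dx = frame_w / 2 - (x + w / 2)         # translate the box centre
    dy = frame_h / 2 - (y + h / 2)         # onto the frame centre
    return (dx, dy)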
In one embodiment, a method for training a three-dimensional reconstruction model may include:
step S505: and utilizing the video image samples containing the users as a training data set of the three-dimensional reconstruction model to be trained, wherein the users in each image sample are users with different ages, wearing and/or actions.
Step S506: and calculating parameters of the three-dimensional reconstruction model to be trained by using a back propagation algorithm according to the training data set and the labeling result of the input data, and adjusting the weight and the bias of the three-dimensional reconstruction model to be trained to obtain the preliminarily trained three-dimensional reconstruction model.
Step S507: and optimizing the preliminarily trained three-dimensional reconstruction model by using the test data set and the labeling result of the test data set to obtain the finally trained three-dimensional reconstruction model.
The training data set may contain video images of users of different ages, with different clothing and different actions. At a finer granularity it may also include users of different nationalities, skin colors, hair colors and facial appearances. Differences in facial appearance may include whether glasses or ornaments are worn, different hairstyles, and the like. The users may be virtual users generated with a model, or real users who have authorized the recording.
The training data set is input into the three-dimensional reconstruction model to be trained, which outputs, for the input training data, at least two two-dimensional images corresponding to that data. The parameters of the three-dimensional reconstruction model are adjusted using the error between the labeling result of the input data (the two-dimensional image ground truth) and the output result. The error can be represented by a loss function, whose effect can be understood as follows: when the output result obtained by forward propagation through the three-dimensional reconstruction model to be trained is close to the labeling result, the loss function takes a smaller value; otherwise, the value of the loss function increases. The loss function is a function that takes all the parameters, weights and biases in the three-dimensional reconstruction model as arguments.
All parameters, weights and biases in the three-dimensional reconstruction model to be trained are adjusted using a back-propagation algorithm: the error is propagated backwards through each layer of the model, and the parameters of each layer are adjusted according to the error, until the output of the model converges or reaches the expected effect. A preliminarily trained three-dimensional reconstruction model is thus obtained.
Then, the test data set is input into the preliminarily trained three-dimensional reconstruction model, which outputs, for the input test data, at least two two-dimensional images corresponding to that data. The preliminarily trained model is verified with the test data set to check whether it generalizes: if its output on the test data set is good, the model training has succeeded; if the effect is not good, the parameters in the three-dimensional reconstruction model need to be optimized and adjusted. The principle of the optimization and adjustment is the same as the training principle and is not described in detail here. The test data set may be the same data as the training data set, or may be newly recorded video images containing real persons.
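For illustration, a PyTorch-style sketch of the training loop described above is given below; the optimizer, the pixel-wise L1 loss and the data loader interfaces are assumptions, since the patent does not fix a particular architecture or loss.

import torch

def train_reconstruction_model(model, train_loader, test_loader,
                               epochs=10, lr=1e-4, device="cuda"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.L1Loss()  # pixel-wise loss against the labelled 2D views
    for epoch in range(epochs):
        model.train()
        for frames, target_views in train_loader:
            frames, target_views = frames.to(device), target_views.to(device)
            predicted_views = model(frames)   # at least two 2D views per input frame
            loss = loss_fn(predicted_views, target_views)
            optimizer.zero_grad()
            loss.backward()                   # back-propagate the error
            optimizer.step()                  # adjust weights and biases
        # Check generalisation on the held-out test set.
        model.eval()
        with torch.no_grad():
            test_loss = sum(loss_fn(model(f.to(device)), t.to(device)).item()
                            for f, t in test_loader) / max(len(test_loader), 1)
        print(f"epoch {epoch}: average test loss {test_loss:.4f}")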
Corresponding to the application scenarios and methods provided by the above embodiments of the application, an embodiment of the application further provides a video image processing apparatus. Fig. 7 is a block diagram of a video image processing apparatus according to an embodiment of the present application, where the apparatus may include:
a target user determining module 701, configured to determine a target user who is speaking from a plurality of candidate users;
A video image acquisition module 702, configured to acquire a video image of a target user;
and the transmission module 703 is used for encoding the video image and the collected audio data of the target user and then carrying out network transmission.
In one embodiment, the target user determination module 701 may include:
a sound source position determination sub-module for determining a detected sound source position;
the target acquisition device selection submodule is used for selecting a target acquisition device from a plurality of video image acquisition devices according to the sound source position;
the target user determination execution sub-module is used for determining the target user who is speaking from the video image acquired by the target acquisition device.
In one embodiment, the target user determination execution sub-module may include:
the face detection unit is used for carrying out face detection in the video image acquired by the target acquisition device and determining the face of at least one user;
the key point positioning unit is used for positioning key points of the faces of each user and determining mouth key points;
and the target user determining unit is used for determining the target user who is speaking according to the position change condition of the mouth key point.
In one embodiment, the target user determining unit is specifically configured to:
And determining the target user who is speaking according to at least one of the displacement change amplitude of the mouth key point and the position change frequency of the mouth key point.
In one embodiment, in a case that the speaking target user appearing in the video image acquired by the target acquisition device is occluded, the apparatus further includes:
the occlusion detection unit is used for detecting the video images acquired by the plurality of video image acquisition devices and selecting a video image in which the speaking target user is not occluded;
the video image acquisition device determining module is used for taking the video image acquisition device corresponding to the video image in which the speaking target user is not occluded as the video image acquisition device for acquiring the video image of the target user.
In one embodiment, the video image acquisition module 702 is specifically configured to:
in the case that other candidate users are also included in the video image, the other candidate users in the video image are processed so that the target user is highlighted in the processed video image.
Corresponding to the application scenarios and methods provided by the above embodiments of the application, an embodiment of the application further provides a video image processing apparatus. Fig. 8 is a block diagram of a video image processing apparatus according to an embodiment of the present application, which may include:
A decoding module 801, configured to decode received audio and video data to obtain audio data and a video image;
a two-dimensional video image generating module 802, configured to process a video image by using a pre-trained three-dimensional reconstruction model to obtain a two-dimensional video image;
the three-dimensional image generation module 803 is configured to perform three-dimensional fusion processing on the two-dimensional video image to obtain a three-dimensional video image;
the video playing module 804 is configured to combine the three-dimensional video image with the audio data for video playing.
In one embodiment, the position correction module is further included, and the position correction module can be used for detecting the position of the target user in the three-dimensional video image, and if the detection result is that the target user is not in the specified position range, the target user is moved to be within the specified position range.
In one embodiment, further comprising a model training module, the model training module may comprise:
the training data set construction submodule is used for utilizing video image samples containing users as the training data set of the three-dimensional reconstruction model to be trained, wherein the users in the image samples are of different ages, clothing and/or actions;
the primary training sub-module is used for calculating parameters of the three-dimensional reconstruction model to be trained by using a back propagation algorithm according to the training data set and the labeling result of the input data, and adjusting the weight and bias of the three-dimensional reconstruction model to be trained to obtain a primary trained three-dimensional reconstruction model;
And the model optimization sub-module is used for optimizing the preliminarily trained three-dimensional reconstruction model by using the test data set and the labeling result of the test data set to obtain the finally trained three-dimensional reconstruction model.
The functions of each module in each device of the embodiment of the present application may be referred to the corresponding descriptions in the above methods, and have corresponding beneficial effects, which are not described herein. It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
Fig. 9 is a block diagram of an electronic device used to implement an embodiment of the application. As shown in fig. 9, the electronic device includes: memory 910 and processor 920, memory 910 stores a computer program executable on processor 920. The processor 920 implements the method in the above-described embodiments when executing the computer program. The number of memories 910 and processors 920 may be one or more.
The electronic device further includes:
and the communication interface 930 is used for communicating with external equipment and carrying out data interaction transmission.
If the memory 910, the processor 920, and the communication interface 930 are implemented independently, the memory 910, the processor 920, and the communication interface 930 may be connected to each other and communicate with each other through a bus. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 9, but this does not mean that there is only one bus or one type of bus.
Alternatively, in a specific implementation, if the memory 910, the processor 920, and the communication interface 930 are integrated on a chip, the memory 910, the processor 920, and the communication interface 930 may communicate with each other through internal interfaces.
The embodiment of the application provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the method provided in the embodiment of the application.
The embodiment of the application also provides a chip comprising a processor, configured to call and run instructions stored in a memory, so that a communication device provided with the chip executes the method provided by the embodiment of the application.
The embodiment of the application also provides a chip, comprising an input interface, an output interface, a processor and a memory connected through an internal connection path; the processor is configured to execute code in the memory, and when the code is executed, the processor executes the method provided by the embodiment of the application.
It should be appreciated that the processor may be a Central Processing Unit (CPU), or another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or any conventional processor. It is noted that the processor may be a processor supporting the Advanced RISC Machines (ARM) architecture.
Further alternatively, the memory may include a read-only memory and a random access memory. The memory may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may include read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM).
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions in accordance with the present application are fully or partially produced. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. Computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
Any process or method described in flow charts or otherwise herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. And the scope of the preferred embodiments of the present application includes additional implementations in which functions may be performed in a substantially simultaneous manner or in an opposite order from that shown or discussed, including in accordance with the functions that are involved.
Logic and/or steps described in the flowcharts or otherwise described herein, e.g., may be considered a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. All or part of the steps of the methods of the embodiments described above may be performed by a program that, when executed, comprises one or a combination of the steps of the method embodiments, instructs the associated hardware to perform the method.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules described above, if implemented in the form of software functional modules and sold or used as a stand-alone product, may also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is merely an exemplary embodiment of the present application, but the scope of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art could readily conceive of within the technical scope disclosed by the present application shall fall within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. A method of processing a video image, comprising:
determining a target user who is speaking from a plurality of candidate users;
collecting video images of the target user;
and encoding the video image and the acquired audio data of the target user, and then carrying out network transmission.
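Purely as an illustration of the sender-side flow recited in claim 1, and not as part of the claim language, the three steps might be organized as below; find_speaking_user, capture_frame, encode_av, and send_packet are hypothetical stand-ins for the speaker-detection, image-collection, encoding, and transmission stages, since the application does not fix any particular implementation.

```python
# Illustrative sketch of the sender-side flow in claim 1 (all helpers are hypothetical stubs).
import json

def find_speaking_user(candidates):
    # Hypothetical stand-in: pick the candidate currently detected as speaking.
    return candidates[0] if candidates else None

def capture_frame(user_id):
    # Hypothetical stand-in: return a raw video frame for the given user.
    return b"\x00" * 640 * 480  # placeholder luma plane

def encode_av(frame, audio):
    # Hypothetical stand-in for a real audio/video encoder and muxer.
    return json.dumps({"video_len": len(frame), "audio_len": len(audio)}).encode()

def send_packet(packet):
    # Hypothetical stand-in for network transmission.
    print(f"sending {len(packet)} bytes")

def sender_loop_once(candidates, audio_chunk):
    target = find_speaking_user(candidates)      # step 1: determine the speaking target user
    if target is None:
        return
    frame = capture_frame(target)                # step 2: collect the target user's video image
    send_packet(encode_av(frame, audio_chunk))   # step 3: encode video + audio, then transmit

sender_loop_once(["user_a", "user_b"], audio_chunk=b"\x00" * 960)
```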
2. The method of claim 1, wherein the determining the target user who is speaking from the plurality of candidate users comprises:
determining the detected sound source position;
selecting a target acquisition device from a plurality of video image acquisition devices according to the sound source position;
and determining the target user who is speaking from the video image acquired by the target acquisition device.
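As a minimal sketch of the camera-selection step in claim 2, assuming each video image acquisition device has a known pointing azimuth and that a microphone array supplies the sound source azimuth, the target acquisition device could simply be the one whose orientation is angularly closest to the source; the angles and the nearest-angle rule are illustrative assumptions, not requirements of the claim.

```python
def select_target_camera(source_azimuth_deg, camera_azimuths_deg):
    """Pick the camera whose pointing direction is closest to the sound source azimuth."""
    def angular_distance(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)
    return min(range(len(camera_azimuths_deg)),
               key=lambda i: angular_distance(source_azimuth_deg, camera_azimuths_deg[i]))

# Example: the microphone array reports a source at 95 degrees; cameras face 0, 90 and 180 degrees.
print(select_target_camera(95.0, [0.0, 90.0, 180.0]))  # -> 1
```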
3. The method of claim 2, wherein the determining the target user who is speaking from the video image acquired by the target acquisition device comprises:
performing face detection in the video image acquired by the target acquisition device, and determining the face of at least one user;
locating key points on the face of each user, and determining mouth key points;
and determining the target user who is speaking according to the position change condition of the mouth key point.
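The following sketch illustrates the face-detection and mouth-keypoint steps of claim 3; detect_faces and locate_landmarks are hypothetical stand-ins for a real face detector and landmark model (for example, in the common 68-point annotation scheme the mouth region is covered by points 48-67), and the returned coordinates are dummy values.

```python
# Illustrative only: extract per-face mouth keypoints from a captured frame.

def detect_faces(frame):
    # Hypothetical stand-in: return bounding boxes (x, y, w, h) of detected faces.
    return [(100, 80, 60, 60)]

def locate_landmarks(frame, box):
    # Hypothetical stand-in: return a dict of named mouth keypoints for one face box.
    x, y, w, h = box
    return {"mouth_left": (x + w // 4, y + 3 * h // 4),
            "mouth_right": (x + 3 * w // 4, y + 3 * h // 4),
            "mouth_top": (x + w // 2, y + 2 * h // 3),
            "mouth_bottom": (x + w // 2, y + 5 * h // 6)}

def mouth_keypoints_per_face(frame):
    return [locate_landmarks(frame, box) for box in detect_faces(frame)]

print(mouth_keypoints_per_face(frame=None))
```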
4. The method of claim 3, wherein the determining the target user who is speaking according to the position change condition of the mouth key point comprises:
and determining the speaking target user according to at least one of the displacement variation amplitude of the mouth key point and the position variation frequency of the mouth key point.
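One way to read claim 4, shown only as a sketch with arbitrary example thresholds, is to track the mouth-opening distance over a short window of frames and treat either a large displacement amplitude or frequent changes of direction as evidence of speech.

```python
# Illustrative sketch: decide whether a user is speaking from how much and how often
# the mouth-opening distance changes over a short window of frames. The thresholds
# are arbitrary example values, not taken from the application.

def mouth_opening(kps):
    (_, ty), (_, by) = kps["mouth_top"], kps["mouth_bottom"]
    return abs(by - ty)

def is_speaking(keypoint_history, amp_threshold=3.0, freq_threshold=2):
    openings = [mouth_opening(k) for k in keypoint_history]
    amplitude = max(openings) - min(openings)     # displacement variation amplitude
    direction_changes = sum(                      # position variation frequency
        1 for a, b, c in zip(openings, openings[1:], openings[2:])
        if (b - a) * (c - b) < 0)
    return amplitude >= amp_threshold or direction_changes >= freq_threshold

history = [{"mouth_top": (0, 40), "mouth_bottom": (0, 40 + d)} for d in (4, 9, 5, 10, 4)]
print(is_speaking(history))  # -> True
```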
5. The method of claim 2, further comprising, in a case where the speaking target user is occluded in the video image acquired by the target acquisition device:
detecting the video images acquired by the plurality of video image acquisition devices, and selecting a video image in which the speaking target user is not occluded;
and taking the video image acquisition device corresponding to the video image in which the speaking target user is not occluded as the video image acquisition device for collecting the video image of the target user.
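A minimal sketch of the occlusion fallback in claim 5 follows; speaker_visible is a hypothetical stand-in for an occlusion check (for instance, whether the speaking target user's face is still detected with enough keypoints), and the dictionary-based frames are placeholders.

```python
# Illustrative only: if the speaker is occluded in the current camera's image, scan
# the other cameras and switch to one whose view of the speaker is not occluded.

def speaker_visible(frame):
    # Hypothetical stand-in: True if the speaking target user is unoccluded in `frame`.
    return frame.get("occluded") is False

def pick_unoccluded_camera(current_cam, frames_by_camera):
    if speaker_visible(frames_by_camera[current_cam]):
        return current_cam
    for cam, frame in frames_by_camera.items():
        if cam != current_cam and speaker_visible(frame):
            return cam          # switch acquisition to this camera
    return current_cam          # no better view found; keep the current camera

frames = {"cam0": {"occluded": True}, "cam1": {"occluded": False}}
print(pick_unoccluded_camera("cam0", frames))  # -> "cam1"
```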
6. The method of claim 1, wherein the capturing the video image of the target user comprises:
in a case where the video image further comprises other candidate users, processing the other candidate users so that the target user is highlighted in the processed video image.
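Claim 6 leaves the exact processing of the other candidate users open; the sketch below uses a simple darkening of their bounding-box regions as one illustrative choice (blurring or masking would fit the same structure), with example image sizes and boxes.

```python
# Illustrative only: attenuate the regions of the non-target users so that the
# speaking target user stands out in the processed image.
import numpy as np

def highlight_target(frame, other_user_boxes, factor=0.3):
    out = frame.astype(np.float32)
    for x, y, w, h in other_user_boxes:       # darken every non-target user's region
        out[y:y + h, x:x + w] *= factor
    return out.astype(frame.dtype)

frame = np.full((480, 640, 3), 200, dtype=np.uint8)
processed = highlight_target(frame, other_user_boxes=[(10, 10, 50, 80)])
print(processed[20, 20], processed[300, 300])  # darkened region vs. untouched pixel
```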
7. A method of processing a video image, comprising:
decoding the received audio and video data to obtain audio data and video images;
processing the video image by utilizing a pre-trained three-dimensional reconstruction model to obtain a two-dimensional video image;
performing three-dimensional fusion processing on the two-dimensional video image to obtain a three-dimensional video image;
and combining the three-dimensional video image with the audio data for playing.
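By way of illustration only, the receiver-side flow of claim 7 could be organized as below; every helper is a hypothetical stand-in for the real decoder, the pre-trained three-dimensional reconstruction model, the three-dimensional fusion step, and the audio/video player.

```python
# Illustrative sketch of the receiver-side flow in claim 7 (all helpers are hypothetical stubs).

def decode_av(packet):
    # Hypothetical stand-in for decoding the received audio/video data.
    return {"audio": b"...", "video": "decoded 2-D frame"}

def run_reconstruction_model(video_frame):
    # Hypothetical stand-in for the pre-trained 3-D reconstruction model's per-frame output.
    return {"processed_2d": video_frame, "depth": "per-pixel depth"}

def fuse_to_3d(model_output):
    # Hypothetical stand-in for three-dimensional fusion of the processed 2-D image.
    return {"volume": model_output}

def play(video_3d, audio):
    print("playing 3-D video with synchronized audio")

def receiver_loop_once(packet):
    data = decode_av(packet)                             # step 1: decode audio + video
    processed = run_reconstruction_model(data["video"])  # step 2: obtain the 2-D video image
    video_3d = fuse_to_3d(processed)                     # step 3: 3-D fusion
    play(video_3d, data["audio"])                        # step 4: combined playback

receiver_loop_once(packet=b"\x00")
```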
8. The method of claim 7, further comprising:
detecting the position of a target user in the three-dimensional video image, and moving the target user into a specified position range when the detection result indicates that the target user is not within the specified position range.
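A minimal sketch of claim 8, assuming the target user's position is reduced to a two-dimensional centre point and the specified position range is an axis-aligned rectangle; the clamping rule and the coordinates are illustrative assumptions.

```python
# Illustrative only: check whether the detected target-user centre lies inside a
# specified region of the image and, if not, shift the user back inside it.

def clamp_into_range(center, region):
    (x, y), (xmin, ymin, xmax, ymax) = center, region
    return (min(max(x, xmin), xmax), min(max(y, ymin), ymax))

def recenter_if_needed(user_center, allowed_region):
    xmin, ymin, xmax, ymax = allowed_region
    inside = xmin <= user_center[0] <= xmax and ymin <= user_center[1] <= ymax
    return user_center if inside else clamp_into_range(user_center, allowed_region)

print(recenter_if_needed((900, 300), allowed_region=(200, 100, 800, 500)))  # -> (800, 300)
```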
9. The method of claim 7, wherein the training method of the three-dimensional reconstruction model comprises:
using video image samples containing users as the training data set of the three-dimensional reconstruction model to be trained, wherein the users in the image samples differ in age, attire, and/or action;
calculating parameters of the three-dimensional reconstruction model to be trained by using a back-propagation algorithm according to the training data set and the labeling result of the input data, and adjusting the weights and biases of the three-dimensional reconstruction model to be trained to obtain a preliminarily trained three-dimensional reconstruction model;
and optimizing the preliminarily trained three-dimensional reconstruction model by using a test data set and the labeling result of the test data set to obtain a finally trained three-dimensional reconstruction model.
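The training procedure of claim 9 can be sketched with PyTorch-style primitives as below; the tiny stand-in network, the random tensors that replace the labelled video image samples, and the mean-squared-error loss are all assumptions for illustration, since the claim does not specify the model architecture, labels, or loss.

```python
# Minimal training-loop sketch for claim 9 (stand-in model and data; illustration only).
import torch
from torch import nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(8 * 64 * 64, 32))   # stand-in reconstruction head
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

train_images = torch.rand(16, 3, 64, 64)     # stand-in for the labelled video image samples
train_labels = torch.rand(16, 32)            # stand-in for the annotation results

for epoch in range(3):                       # preliminary training
    optimizer.zero_grad()
    loss = loss_fn(model(train_images), train_labels)
    loss.backward()                          # back-propagation adjusts weights and biases
    optimizer.step()

test_images, test_labels = torch.rand(4, 3, 64, 64), torch.rand(4, 32)
with torch.no_grad():                        # evaluation on a test set guides further tuning
    print("test loss:", loss_fn(model(test_images), test_labels).item())
```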
10. A video image processing apparatus, comprising:
the target user determining module is used for determining a target user who is speaking from a plurality of candidate users;
the video image acquisition module is used for acquiring video images of the target user;
and the transmission module is used for carrying out network transmission after encoding the video image and the collected audio data of the target user.
11. A video image processing apparatus, comprising:
the decoding module is used for decoding the received audio and video data to obtain audio data and video images;
the two-dimensional video image generation module is used for processing the video image by utilizing a pre-trained three-dimensional reconstruction model to obtain a two-dimensional video image;
the three-dimensional image generation module is used for carrying out three-dimensional fusion processing on the two-dimensional video image to obtain a three-dimensional video image;
and the video playing module is used for combining the three-dimensional video image with the audio data to play the video.
12. An electronic device comprising a memory, a processor, and a computer program stored on the memory, wherein the processor implements the method of any one of claims 1-9 when executing the computer program.
13. A computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the method of any of claims 1-9.
CN202310900219.6A 2023-07-20 2023-07-20 Video image processing method and device, electronic equipment and storage medium Pending CN116939242A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310900219.6A CN116939242A (en) 2023-07-20 2023-07-20 Video image processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310900219.6A CN116939242A (en) 2023-07-20 2023-07-20 Video image processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116939242A true CN116939242A (en) 2023-10-24

Family

ID=88382207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310900219.6A Pending CN116939242A (en) 2023-07-20 2023-07-20 Video image processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116939242A (en)

Similar Documents

Publication Publication Date Title
CN108764091B (en) Living body detection method and apparatus, electronic device, and storage medium
US10896518B2 (en) Image processing method, image processing apparatus and computer readable storage medium
US10387724B2 (en) Iris recognition via plenoptic imaging
US10872420B2 (en) Electronic device and method for automatic human segmentation in image
US11074436B1 (en) Method and apparatus for face recognition
CN108197586B (en) Face recognition method and device
EP3236391B1 (en) Object detection and recognition under out of focus conditions
US10284817B2 (en) Device for and method of corneal imaging
US8391645B2 (en) Detecting orientation of digital images using face detection information
US8081844B2 (en) Detecting orientation of digital images using face detection information
WO2020018359A1 (en) Three-dimensional living-body face detection method, face authentication recognition method, and apparatuses
CN107945135B (en) Image processing method, image processing apparatus, storage medium, and electronic device
CN108810406B (en) Portrait light effect processing method, device, terminal and computer readable storage medium
CN109002796B (en) Image acquisition method, device and system and electronic equipment
CN110781770B (en) Living body detection method, device and equipment based on face recognition
CN111325107A (en) Detection model training method and device, electronic equipment and readable storage medium
US20100014760A1 (en) Information Extracting Method, Registration Device, Verification Device, and Program
CN116939242A (en) Video image processing method and device, electronic equipment and storage medium
CN107578006B (en) Photo processing method and mobile terminal
CN113837019B (en) Cosmetic progress detection method, device, equipment and storage medium
US10282633B2 (en) Cross-asset media analysis and processing
CN114299569A (en) Safe face authentication method based on eyeball motion
KR102151851B1 (en) Face recognition method based on infrared image and learning method for the same
Ma et al. Totems: Physical objects for verifying visual integrity
CN113691731B (en) Processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination