WO2019041992A1 - Image processing method, apparatus and terminal device - Google Patents

Image processing method, apparatus and terminal device

Info

Publication number
WO2019041992A1
WO2019041992A1 (PCT/CN2018/092887, CN2018092887W)
Authority
WO
WIPO (PCT)
Prior art keywords
image
image data
face
area
virtual reality
Prior art date
Application number
PCT/CN2018/092887
Other languages
English (en)
French (fr)
Inventor
戴天荣
朱育革
赵大川
陈翔
Original Assignee
歌尔股份有限公司
Priority date: 2017-08-30
Filing date: 2018-06-26
Publication date
Application filed by 歌尔股份有限公司
Priority to US16/461,718 (granted as US11295550B2)
Publication of WO2019041992A1

Classifications

    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06V 10/764 Image or video recognition or understanding using pattern recognition or machine learning; using classification, e.g. of video objects
    • G06V 10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/171 Local features and components; facial parts; occluding parts, e.g. glasses; geometrical relationships
    • G06V 40/174 Facial expression recognition
    • G06T 2207/10016 Video; image sequence
    • G06T 2207/20081 Training; learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/20221 Image fusion; image merging
    • G06T 2207/30201 Face

Definitions

  • The present invention relates to the field of computer technologies, and in particular to an image processing method, apparatus, and terminal device.
  • An important application field of virtual reality (VR) technology is social networking. In VR live video broadcasting, for example, the host side is equipped with a 360-degree camera that captures full-view video of the broadcast location and shares it over the network with VR head-mounted display devices (HMDs) at the access side; a visitor wearing a VR HMD experiences the host-side scene video and can view scenes from different perspectives by rotating the head. The characteristic of this application is that the VR video data stream is transmitted in only one direction.
  • VR social networking, by contrast, requires a two-way flow of VR video data between two points: both parties must be equipped with a 360-degree camera and a VR HMD at the same time, each collecting local full-view video and sending it to the other party, who watches it through the VR HMD.
  • However, because both social parties wear a VR HMD, the face captured by each local camera has its eyes and the surrounding area occluded by the VR HMD. The area around the eyes carries very rich expression information, and its loss severely limits the application of VR technology in the social field. In view of these problems, an image processing method, apparatus and terminal device of the present invention are proposed in order to solve, or at least partially solve, the above problems.
  • According to one aspect of the present invention, an image processing method is provided, comprising: acquiring an actual image of a specified target from a video stream collected by a camera, where the specified target is wearing a virtual reality head-mounted device; identifying, from the actual image, the region of the specified target's face that is not occluded by the virtual reality head-mounted display device and the region that is occluded by it, and acquiring first facial image data corresponding to the unoccluded region; obtaining, according to the first facial image data and a preset facial expression model, second facial image data matching the first facial image data, the second facial image data corresponding to the occluded region; and fusing the first facial image data and the second facial image data to generate a composite image.
  • According to another aspect of the present invention, an image processing apparatus is provided, comprising:
  • a first acquiring unit configured to acquire an actual image of a specified target from the video stream collected by a camera, where the specified target is wearing a virtual reality head-mounted device;
  • a recognition unit configured to identify, from the actual image, the region of the specified target's face that is not occluded by the virtual reality head-mounted display device and the region that is occluded by it, and to acquire first facial image data corresponding to the unoccluded region;
  • a second acquiring unit configured to obtain, according to the first facial image data and a preset facial expression model, second facial image data matching the first facial image data, the second facial image data corresponding to the region occluded by the virtual reality head-mounted display device;
  • a generating unit configured to fuse the first facial image data and the second facial image data to generate a composite image.
  • According to a further aspect of the present invention, a terminal device is provided, comprising the image processing apparatus described above.
  • The beneficial effect of the technical solution of the present invention is as follows: after the actual image of the specified target wearing the virtual reality head-mounted device is acquired, the region of the target's face occluded by the virtual reality head-mounted display device and the region not occluded by it are first identified from the actual image; the first facial image data corresponding to the unoccluded region is input into a preset facial expression model to obtain second facial image data that matches it; the first facial image data and the second facial image data are then fused to generate a composite image.
  • Because the second facial image data corresponds to the occluded region and carries expression information, the composite image is a complete image with expression information. Compared with using a static picture, the composite image is more realistic and accurate, which helps both social parties obtain each other's facial expressions in time, improves social quality, ensures that the interaction proceeds smoothly, and enhances the user experience.
  • FIG. 1 is a schematic flowchart of an image processing method according to an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of the functional structure of an image processing apparatus according to an embodiment of the present invention;
  • FIG. 3 is a schematic structural diagram of an image processing apparatus according to another embodiment of the present invention;
  • FIG. 4 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
  • The design idea of the present invention is as follows. Covering the portion occluded by the VR HMD with a static picture of the eyes still loses expression information, and the static picture cannot blend well with the rest of the face, so the result looks unnatural. At the same time, there is a strong correlation between the image of the eyes and the surrounding facial area occluded by the virtual reality head-mounted display device and the image information of the facial area that is not occluded.
  • The technical solution therefore introduces a facial expression model: the facial image of the occluded region that matches the unoccluded facial image information is obtained through the facial expression model, yielding a composite image with complete expression information.
  • FIG. 1 is a schematic flowchart of an image processing method according to an embodiment of the present invention. As shown in FIG. 1, the image processing method includes:
  • Step S110: acquire an actual image of the specified target from the video stream collected by the camera, where the specified target is wearing a virtual reality head-mounted display device (VR HMD).
  • In this embodiment, the camera is placed at a position from which the specified target can be captured. It may be a standalone camera or a camera built into the terminal device, as long as the device implementing this method can obtain the video stream the camera collects.
  • In a social application, the participating users include Party A and Party B. Both wear a VR HMD and both are equipped with a camera, so each camera can collect a video stream containing its own user: the camera configured at Party A collects a video stream containing Party A, and the camera configured at Party B collects a video stream containing Party B.
  • This embodiment is described from the viewpoint of one participating side. For example, on Party A's side the camera collects the video stream of the specified target (Party A) and transmits it to the other social party (Party B).
  • The specified target may be a user wearing a VR HMD to socialize. Because the target wears the VR HMD, the eyes and the area around the eyes of the target's face are occluded in the actual image, so complete expression information cannot be obtained and the social interaction is affected. To process the images collected by the camera, an actual image of the specified target therefore needs to be acquired from the video stream.
  • Step S120: identify, from the actual image, the region of the specified target's face that is not occluded by the virtual reality head-mounted display device and the region that is occluded by it, and acquire the first facial image data corresponding to the unoccluded region.
  • In this embodiment, the face of the specified target is located in the actual image by an image recognition method, and the region not occluded by the VR HMD and the region occluded by it are distinguished. Because the image data matching the occluded region must be derived from the unoccluded region, the first facial image data of the unoccluded region has to be extracted from the actual image.
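  • As a concrete illustration of this step, the sketch below shows one possible way to separate the occluded and unoccluded facial regions. It is only a minimal example under stated assumptions: it uses OpenCV's stock Haar face detector and simply treats dark pixels inside the detected face box as the HMD region; the patent does not prescribe a specific detection algorithm, and the function and parameter names are hypothetical.

```python
# Minimal sketch. Assumptions: the Haar detector still finds the face despite the
# HMD, and the VR HMD appears as a large dark region inside the face box.
import cv2
import numpy as np

_face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def split_face_regions(actual_image, dark_threshold=60):
    """Return (face_box, hmd_mask, unoccluded_mask) for the specified target.

    face_box        -- (x, y, w, h) of the detected face
    hmd_mask        -- uint8 mask, 255 where the VR HMD occludes the face
    unoccluded_mask -- uint8 mask, 255 for the visible (first) facial image data
    """
    gray = cv2.cvtColor(actual_image, cv2.COLOR_BGR2GRAY)
    faces = _face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None, None, None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest face

    hmd_mask = np.zeros(gray.shape, dtype=np.uint8)
    face_roi = gray[y:y + h, x:x + w]
    # Assumption: the HMD shows up as large dark pixels inside the face region.
    hmd_mask[y:y + h, x:x + w] = np.where(face_roi < dark_threshold, 255, 0)

    unoccluded_mask = np.zeros_like(hmd_mask)
    unoccluded_mask[y:y + h, x:x + w] = 255 - hmd_mask[y:y + h, x:x + w]
    return (x, y, w, h), hmd_mask, unoccluded_mask
```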
  • Step S130: according to the first facial image data and the preset facial expression model, obtain second facial image data matching the first facial image data, the second facial image data corresponding to the region occluded by the virtual reality head-mounted display device.
  • In this embodiment, the preset facial expression model is obtained by training on samples of the specified target (for example, by machine learning with a neural network). Training captures the relationship between the image data not occluded by the VR HMD and the image data of the occluded region. Therefore, from the first facial image data acquired from the actual image and the preset facial expression model, the second facial image data that matches it, i.e. the image data matching the region occluded by the VR HMD, can be obtained.
  • Step S140: fuse the first facial image data and the second facial image data to generate a composite image.
  • The first facial image data and the second facial image data are combined by an image fusion method to generate the composite image. Because the second facial image data is expression-bearing image data matched to the VR HMD occluded region, the composite image carries the complete expression of the specified target. Once the composite image is obtained, it can be sent from the local user participating in the social interaction to the other participating user.
  • Because the second facial image data corresponds to the occluded region and carries expression information, the composite image is a complete image with expression information. Compared with a composite obtained by fusing a static picture with no expression information, the composite image of this embodiment is more realistic and accurate, helps both parties obtain each other's expressions in time, improves social quality, ensures the interaction proceeds smoothly, and enhances the user experience.
  • In one embodiment, obtaining the second facial image data matching the first facial image data in step S130 comprises: inputting the first facial image data into the preset facial expression model, so that the facial expression model recognizes the first facial image data and outputs the second facial image data that matches it.
  • As explained above, the preset facial expression model encodes the relationship between the image data not occluded by the VR HMD and the image data of the occluded region. When the first facial image data is input into the preset facial expression model, the model recognizes it and outputs the matching second facial image data. In other words, the preset facial expression model automatically analyses the first facial image data and directly generates the matching second facial image data from it, which improves the efficiency of image processing and further enhances the user experience.
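  • A minimal inference sketch of this step is given below, assuming the preset facial expression model has already been trained as a PyTorch module that maps a crop of the unoccluded face to a same-layout image of the occluded eye region; `expression_model`, the tensor layout and the value range are illustrative assumptions, not details fixed by the patent.

```python
import torch

@torch.no_grad()
def predict_second_face_data(expression_model, first_face_data):
    """first_face_data: float32 array of shape (H, W, 3) with values in [0, 1],
    holding the unoccluded facial image data; returns the generated image data
    for the occluded region in the same layout."""
    expression_model.eval()
    x = torch.as_tensor(first_face_data, dtype=torch.float32)
    x = x.permute(2, 0, 1).unsqueeze(0)            # (1, 3, H, W)
    y = expression_model(x)                        # matching second facial data
    return y.squeeze(0).permute(1, 2, 0).clamp(0, 1).numpy()
```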
  • Further, the preset facial expression model is obtained with a deep neural network. Obtaining the preset facial expression model with a deep neural network includes the following steps.
  • (1) Acquire a plurality of first sample images of the specified target collected by the camera in a first scene, and a plurality of second sample images of the specified target collected in a second scene. In the first scene the specified target wears the virtual reality head-mounted device; in the second scene the specified target does not wear it, and each second sample image contains a facial expression of the specified user.
  • The purpose of acquiring the first sample images is to make it possible to extract from the second sample images the portion corresponding to the VR HMD occluded region; for example, if the occluded region is the eye region, the eye region must be extracted from the second sample images. The second sample images should contain the user's various expressions so that more accurate second image data can be matched when the actual image is processed.
  • (2) Identify the first occluded region in the first sample images and acquire the first occluded region information, for example the coordinates of the region boundary.
  • (3) According to the first occluded region information, mark the area of the specified target's face in each second sample image that corresponds to the first occluded region, obtaining a marked area. The marked area is the same area as the occluded area in the first sample image; it corresponds to the image element of the occluded region in its unoccluded state and contains the expression information of the specified target. For example, if the first occluded region is the eye region, then the eye region of the specified target's face in the second sample image is marked.
  • (4) Put the images of the marked areas of the second sample images into a first specified set and use the first specified set as the output set for deep neural network training; put the images of the unmarked areas of the specified target's face in the second sample images into a second specified set and use the second specified set as the input set. Images placed in the first specified set serve as image elements of the output set, and images placed in the second specified set serve as image elements of the input set.
  • The image elements of the second specified set and of the first specified set have a one-to-one input-output correspondence; that is, two corresponding image elements in the first and second specified sets come from the same second sample image. For example, if the second specified set contains the non-eye-region image element of sample image 1, then the element of the first specified set corresponding to it is the eye-region image element of sample image 1.
  • (5) Input each pair of image elements with an input-output correspondence from the input set and the output set into the preset deep neural network for training. Because the input set holds the image elements of the unmarked areas of the second sample images (i.e. of the unoccluded region) and the output set holds the marked-area images corresponding to them (i.e. the occluded region in its unoccluded state), training the preset deep neural network yields a functional relationship between the unoccluded-region image and the image of the occluded region in its unoccluded state.
  • In a specific example, the occluded region in the first sample images is the eye region. The image elements in the input set are then the non-eye-region image elements of the second sample images, and the output set holds the image elements of the eye region of the second sample images in its unoccluded state. After training in the preset deep neural network, the functional relationship between the non-eye-region image elements and the unoccluded eye-region image elements is obtained.
  • The functional relationship obtained above is the relationship between the unoccluded-region image and the generated occluded-region image that matches it; once the unoccluded-region image is determined, the matching occluded-region image can be generated from this relationship.
  • When the video stream collected by the camera is obtained, the actual image of the specified target in the stream is determined, the unoccluded region of the target's face is identified from the actual image, and, from the functional relationship obtained above, the image data of the occluded region matching the unoccluded region is generated. Fusing the image of the unoccluded region with the generated image data of the occluded region produces the composite image, which is the complete face image of the specified target, i.e. an unoccluded facial image.
  • In this embodiment, a deep neural network is designed whose type, number of layers and number of nodes per layer are set according to the image resolution and the required generation quality. A machine-learning method based on the deep neural network obtains the facial expression model of the specified target by performing machine learning on sample images of that target.
  • Moreover, the image elements of the second specified set and of the first specified set have a one-to-one input-output correspondence; that is, this embodiment performs supervised training with the deep neural network. The pairs of corresponding image elements are fed into the network to train the model parameters, and because the input and output image elements correspond, training yields the functional relationship between the unoccluded-region image and the generated occluded-region image: output = f(input), where input is the image of the unoccluded part of the face and output is the generated image of the eyes and surrounding area corresponding to the occluded region.
  • This embodiment thus introduces a deep-neural-network machine-learning method that trains on sample images of the specified target and uses artificial intelligence, in a train-then-predict manner, to generate the image data of the region occluded by the VR HMD. The composite image therefore matches the specified target more closely, looks more natural, and enhances the user experience.
  • The loss function is a vital part of machine-learning optimization: it measures the predictive power of the model from its predictions. In practice the choice of loss function is constrained by many factors, such as the presence of outliers, the machine-learning algorithm chosen, the time complexity of gradient descent, the difficulty of differentiation, and the confidence of the predicted values, so different types of data suit different loss functions. In one embodiment of the invention, during training of the preset deep neural network the loss function is the mean squared error between the images in the output set and the generated images that match the images in the input set.
  • In this embodiment, the image elements in the input set correspond one to one with the image elements in the output set. Once the functional relationship is determined, images matching the input-set elements are generated through it, and the loss function is the mean squared error between the output-set elements and these generated images. For example, if input-set image elements 1, 2 and 3 correspond one to one with output-set image elements 4, 5 and 6, and image elements 7, 8 and 9 are generated from elements 1, 2 and 3 through the determined functional relationship, then the loss is the mean squared error between elements 4 and 7, 5 and 8, and 6 and 9.
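  • Written out explicitly, the loss described here is just the pixel-wise mean squared error between each output-set element and the image generated from its paired input-set element; a small sketch with hypothetical array names is:

```python
import numpy as np

def mse_loss(output_elements, generated_elements):
    """output_elements, generated_elements: lists of equally shaped float arrays,
    e.g. output elements 4, 5, 6 paired with generated elements 7, 8, 9."""
    per_pair = [np.mean((out.astype(np.float64) - gen.astype(np.float64)) ** 2)
                for out, gen in zip(output_elements, generated_elements)]
    return float(np.mean(per_pair))
```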
  • In practical applications the VR HMD is larger than the face of the specified target, so besides part of the face it also occludes part of the non-face area in the image. If only the face is processed, the generated composite image differs noticeably from the real appearance, so the non-face area occluded by the VR HMD also needs de-occlusion processing, which can be performed by either of the following methods.
  • (1) In one embodiment of the invention, the method shown in FIG. 1 further includes: identifying, from the actual image, the non-face area occluded by the virtual reality head-mounted device; obtaining from the video stream a plurality of third images preceding the actual image; extracting a background image from the third images; and de-occluding the non-face area occluded by the device using the image data in the background image that matches that area.
  • The number of third images is not limited here. Because the position of the camera relative to the environment is essentially fixed while the video stream is collected, the de-occlusion can be performed from the background information in several image frames preceding the actual image.
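  • One plausible reading of this background-based de-occlusion, sketched below under the assumption of a static camera, is to take a per-pixel temporal median of the preceding frames as the background image and copy it into the occluded non-face pixels; the mask and frame names are illustrative only.

```python
import numpy as np

def deocclude_non_face(actual_image, previous_frames, non_face_hmd_mask):
    """previous_frames: list of BGR frames (the 'third images') preceding the
    actual image; non_face_hmd_mask: uint8 mask, 255 where the VR HMD hides
    non-face area."""
    # Per-pixel temporal median approximates the static background.
    background = np.median(np.stack(previous_frames, axis=0), axis=0).astype(np.uint8)
    restored = actual_image.copy()
    restored[non_face_hmd_mask == 255] = background[non_face_hmd_mask == 255]
    return restored
```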
  • (2) In another embodiment of the invention, the method shown in FIG. 1 further includes: identifying, from the actual image, the non-face image data occluded by the virtual reality head-mounted device; inputting the non-face image data into a preset non-face model so that the model recognizes it and outputs fourth image data matching the occluded non-face area; and de-occluding the occluded non-face area according to the fourth image data.
  • The preset non-face model in this embodiment can be generated by training a neural network in an unsupervised manner.
  • The de-occlusion processing described above may use an image fusion method to fuse the acquired image data (or the fourth image data) matching the non-face area occluded by the VR HMD with the image data in the actual image that is not occluded by the VR HMD.
  • Handling the occluded non-face area with (1) or (2) prevents the seam between the fused first and second facial image data and the non-face area from being too obvious, so the generated composite image is more realistic and complete rather than merely reproducing the expression of the specified target; the whole composite image is more pleasant to view and the user experience is enhanced.
  • In practical applications, generating the composite image therefore fuses the first facial image data, the second facial image data, the image data of the non-face part not occluded by the VR HMD, and the acquired image data (or fourth image data) matching the non-face area occluded by the VR HMD, producing a complete composite image.
  • The non-face image data occluded by the VR HMD in this embodiment may, for example, be the hair or ears of the specified target; method (1) or (2) restores the occluded hair or ears so that the resulting composite image is more lifelike.
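  • Putting the pieces together, a minimal compositing sketch might look as follows; it assumes the masks and generated patches from the earlier sketches and uses OpenCV's seamless cloning for the fusion step, which is only one of many possible image fusion methods and is not mandated by the patent.

```python
import cv2
import numpy as np

def generate_composite(actual_image, second_face_patch, face_box, restored_non_face):
    """Fuse the generated eye-region patch into an image whose non-face area has
    already been de-occluded (e.g. by deocclude_non_face above).

    second_face_patch -- uint8 BGR image produced by the expression model
    face_box          -- (x, y, w, h) of the detected face
    """
    x, y, w, h = face_box
    composite = restored_non_face.copy()        # non-face area already restored

    # Assumption: the occluded face region is roughly the upper half of the box.
    eye_h = h // 2
    patch = cv2.resize(second_face_patch, (w, eye_h))

    # Seamlessly clone the generated patch onto the face so the seam with the
    # first facial image data is not obvious.
    mask = np.full((eye_h, w), 255, dtype=np.uint8)
    center = (x + w // 2, y + eye_h // 2)
    composite = cv2.seamlessClone(patch, composite, mask, center, cv2.NORMAL_CLONE)
    return composite
```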
  • FIG. 2 is a schematic diagram of the functional structure of an image processing apparatus according to an embodiment of the present invention. As shown in FIG. 2, the image processing apparatus 200 includes:
  • a first acquiring unit 210, configured to acquire an actual image of the specified target from the video stream collected by the camera, where the specified target is wearing the virtual reality head-mounted device;
  • a recognition unit 220, configured to identify, from the actual image, the region of the specified target's face that is not occluded by the virtual reality head-mounted display device and the region that is occluded by it, and to acquire the first facial image data corresponding to the unoccluded region;
  • a second acquiring unit 230, configured to obtain, according to the first facial image data and the preset facial expression model, the second facial image data matching the first facial image data, the second facial image data corresponding to the region occluded by the virtual reality head-mounted display device;
  • a generating unit 240, configured to fuse the first facial image data and the second facial image data to generate a composite image.
  • In one embodiment, the second acquiring unit 230 is configured to input the first facial image data into the preset facial expression model, so that the facial expression model recognizes the first facial image data and outputs the second facial image data that matches it.
  • In one embodiment, the second acquiring unit 230 further includes a training module configured to obtain the preset facial expression model with a deep neural network, specifically to: acquire a plurality of first sample images of the specified target collected by the camera in a first scene and a plurality of second sample images of the specified target collected in a second scene, where in the first scene the specified target wears the virtual reality head-mounted device, in the second scene the specified target does not wear it, and each second sample image contains a facial expression of the specified user; identify the first occluded region from the first sample images and acquire the first occluded region information; mark, according to that information, the area of the specified target's face in each second sample image that corresponds to the first occluded region; put the marked-area images into a first specified set used as the output set of deep neural network training, and put the images of the unmarked areas of the specified target's face into a second specified set used as the input set, the image elements of the two sets having a one-to-one input-output correspondence; and input each corresponding pair of image elements into the preset deep neural network for training to determine the functional relationship between the unoccluded-region image and the generated matching occluded-region image, so that when the second acquiring unit inputs the first facial image data into the preset facial expression model, the model outputs the matching second facial image data from that input and the functional relationship.
  • Further, during training of the preset deep neural network the loss function is the mean squared error between the images in the output set and the generated images that match the images in the input set.
  • In one embodiment, the image processing apparatus 200 shown in FIG. 2 further includes a processing unit configured to: identify, from the actual image, the non-face area occluded by the virtual reality head-mounted device; obtain from the video stream a plurality of third images preceding the actual image; extract a background image from the third images; and de-occlude the non-face area occluded by the device using the image data in the background image corresponding to that area.
  • In another embodiment, the image processing apparatus 200 shown in FIG. 2 further includes a processing unit configured to: identify, from the actual image, the non-face image data occluded by the virtual reality head-mounted device; input the non-face image data into a preset non-face model so that the model recognizes it and outputs fourth image data matching the occluded non-face area; and de-occlude the non-face area according to the fourth image data.
  • Corresponding to the foregoing method embodiments, the present invention also provides an embodiment of an image processing apparatus.
  • FIG. 3 is a schematic structural diagram of an image processing apparatus according to another embodiment of the present invention. As shown in FIG. 3, the image processing apparatus 300 includes a memory 310 and a processor 320 that communicate over an internal bus 330. The memory 310 stores an image-processing computer program 311 executable by the processor 320, and the computer program 311, when executed by the processor 320, implements the method steps described above.
  • In different embodiments, the memory 310 may be internal memory or non-volatile storage. The non-volatile storage may be a storage drive (such as a hard disk drive), a solid-state drive, any type of storage disc (such as a CD or DVD), a similar storage medium, or a combination thereof. The internal memory may be RAM (Random Access Memory), volatile memory, non-volatile memory, or flash memory. Both the non-volatile storage and the internal memory serve as machine-readable storage media on which the image-processing computer program 311 executed by the processor 320 can be stored.
  • FIG. 4 is a schematic structural diagram of a terminal device according to an embodiment of the present invention. As shown in FIG. 4, the terminal device 400 includes an image processing apparatus 410 as shown in FIG. 2 or FIG. 3.
  • In one embodiment, the terminal device 400 is a virtual reality head-mounted display device. Alternatively, the terminal device 400 is a computer or server connected to the virtual reality head-mounted display device during the social interaction; the composite image of the local participating user can then be sent through the computer or server to the other participating user.
  • In summary, the beneficial effect of the technical solution of the present invention is as follows: after the actual image of the specified target wearing the virtual reality head-mounted device is acquired, the regions of the target's face occluded and not occluded by the virtual reality head-mounted display device are identified from the actual image; the first facial image data corresponding to the unoccluded region is input into the preset facial expression model to obtain the matching second facial image data; and the first and second facial image data are fused to generate a composite image.
  • Because the second facial image data corresponds to the occluded region and carries expression information, the composite image is a complete image with expression information, which helps both social parties obtain each other's expression information in time, improves social quality, ensures that the interaction proceeds smoothly, and enhances the user experience.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)
  • Image Processing (AREA)

Abstract

The present invention discloses an image processing method, apparatus, and terminal device. The method includes: acquiring an actual image of a specified target from a video stream collected by a camera; identifying, from the actual image, the region of the specified target's face that is not occluded by a virtual reality head-mounted device and the region that is occluded by it, and acquiring first facial image data corresponding to the unoccluded region; obtaining, according to the first facial image data and a preset facial expression model, second facial image data matching the first facial image data, the second facial image data corresponding to the occluded region; and fusing the first facial image data and the second facial image data to generate a composite image. The image processing apparatus includes a first acquiring unit, a recognition unit, a second acquiring unit and a generating unit for performing the above method steps. This solution helps both social parties obtain each other's expression information in time, ensures that the social interaction proceeds smoothly, and enhances the user experience.

Description

一种图像处理方法、装置和终端设备 技术领域
本发明涉及计算机技术领域,特别涉及一种图像处理方法、装置和终端设备。
背景技术
虚拟现实技术(Virtual Reality,简称VR)的一个重要应用领域是社交领域。例如,VR视频直播的应用中,主持人侧配置360度摄像头,采集直播地点的全视角场景视频,经由网络共享给接入端的VR头戴显示设备(Head Mounted Device,简称HMD),访客通过佩戴VR HMD体验主持人侧的场景视频,并可以通过转动头部来观看不同视角的场景。该应用的特点是VR视频数据流为单向传输。随着VR社交需求的不断提升,VR社交需要两点之间实现VR视频数据流的双向流动,即社交双方都需要同时配置360度摄像头和VR HMD,同时采集本地全视角视频并发送给对方,由对方从VR HMD中观看。
但是,因为社交双方均佩戴VR HMD,这将导致本地摄像头拍摄到的人脸都会被VR HMD遮挡住眼睛及周围部分。因为眼部周围图像带有非常丰富的表情信息,表情信息的缺失严重影响VR技术在社交领域的应用。所以,急需一种图像处理方案,对被VR HMD遮挡住的眼睛及周围部分进行重建,以保证社交过程中表情信息的完整。
发明内容
鉴于上述问题,提出了本发明的一种图像处理方法、装置和终端设备,以便解决或至少部分地解决上述问题。
根据本发明的一个方面,提供了一种图像处理方法,该方法包括:
从摄像头采集的视频流中获取指定目标的实际图像,其中,指定目标佩戴有虚拟现实头戴设备;
从实际图像中识别出指定目标脸部的未被虚拟现实头戴显示设备遮挡区域和被虚拟现实头戴显示设备遮挡区域,获取与未被虚拟现实头戴显示设备遮挡区域对应的第一脸部图像数据;
根据第一脸部图像数据和预设的脸部表情模型,得到与第一脸部图像数据匹配的第 二脸部图像数据,第二脸部图像数据与被虚拟现实头戴显示设备遮挡区域相对应;
将第一脸部图像数据和第二脸部图像数据相融合,生成合成图像。
根据本发明的另一个方面,提供了一种图像处理装置,该装置包括:
第一获取单元,用于从摄像头采集的视频流中获取指定目标的实际图像,其中,指定目标佩戴有虚拟现实头戴设备;
识别单元,用于从实际图像中识别出指定目标脸部的未被虚拟现实头戴显示设备遮挡区域和被虚拟现实头戴显示设备遮挡区域,获取与未被虚拟现实头戴显示设备遮挡区域对应的第一脸部图像数据;
第二获取单元,用于根据第一脸部图像数据和预设的脸部表情模型,得到与第一脸部图像数据匹配的第二脸部图像数据,第二脸部图像数据与被虚拟现实头戴显示设备遮挡区域相对应;
生成单元,用于将第一脸部图像数据和第二脸部图像数据相融合,生成合成图像。
根据本发明的又一个方面,提供了一种终端设备,该终端设备包括:如前所述的图像处理装置。
综上所述,本发明技术方案的有益效果是:当获取到戴有虚拟现实头戴设备的指定目标的实际图像后,先从实际图像中识别出指定目标脸部的未被虚拟现实头戴显示设备遮挡区域和被虚拟现实头戴显示设备遮挡区域,将未被虚拟现实头戴显示设备遮挡区域对应的第一脸部图像数据输入到预设的脸部表情模型中,就可以得到与第一脸部图像数据匹配的第二脸部图像数据;然后将第一脸部图像数据和第二脸部图像数据相融合,生成合成图像。因为第二脸部图像数据与被虚拟现实头戴显示设备遮挡区域相对应,且带有表情信息,所以合成图像则是完整的带有表情信息的图像,相比较使用静态图片来说,合成图像更加逼真、准确,有利于社交双方及时获得对方的表情信息,提高社交质量,保证社交的顺利进行,提升用户体验。
附图说明
图1为本发明一个实施例提供的一种图像处理方法的流程示意图;
图2为本发明一个实施例提供的一种图像处理装置的功能结构示意图;
图3为本发明另一个实施例提供的一种图像处理装置的功能结构示意图;
图4为本发明一个实施例提供的一种终端设备的功能结构示意图。
具体实施方式
本发明的设计思路是:鉴于使用眼部的静态图片覆盖被VR HMD遮挡的部分的技术方案,仍然会导致表情信息的缺失,且静态图片与脸部其余部分不能很好的融合,会很不自然。又考虑到,人脸被虚拟现实头戴显示设备遮挡的眼部及周边部分的图像,与未被虚拟现实头戴显示设备遮挡的脸部图像信息之间有着强相关的关系。本技术方案引入脸部表情模型,通过脸部表情模型得到与未被虚拟现实头戴显示设备遮挡的脸部图像信息匹配的遮挡区域的脸部图像,进而获得具有完整表情信息的合成图像。为使本发明的目的、技术方案和优点更加清楚,下面将结合附图对本发明实施方式作进一步地详细描述。
图1为本发明一个实施例提供的一种图像处理方法的流程示意图。如图1所示,该图像处理方法包括:
步骤S110,从摄像头采集的视频流中获取指定目标的实际图像,其中,指定目标佩戴有虚拟现实头戴设备(VR HMD)。
本实施例中,摄像头设置在可以采集到指定目标的位置,该摄像头可以是摄像头单品,也可以是终端设备上设置的摄像头,只要满足实施本方法的装置可以获取到摄像头采集的视频流即可。在社交应用中,参与社交的包括甲方用户和乙方用户,甲方用户和乙方用户均佩戴VR HMD,且均配置有摄像头,该摄像头可以分别采集到包括甲方用户和乙方用户的视频流,例如,甲方用户处配置的摄像头可以采集到包括甲方用户的视频流,乙方用户处配置的摄像头可以采集到包括乙方用户的视频流。本实施例是从参与社交的其中一侧用户来说的,例如,从甲方用户侧来说,摄像头通过采集指定目标(甲方用户)的视频流传输给社交对方(乙方用户)。在本实施例中,该指定目标可以是佩戴VR HMD进行社交的用户,指定目标佩戴着VR HMD,所以实际图像中,指定目标的人脸的眼睛以及眼睛周围部分是被VR HMD遮挡的,无法获取到完整的表情信息,影响社交过程。为了对摄像头采集的图像进行处理,需要从摄像头采集的视频流中获取一指定目标的实际图像。
步骤S120,从实际图像中识别出指定目标脸部的未被虚拟现实头戴显示设备遮挡区域和被虚拟现实头戴显示设备遮挡区域,获取与未被虚拟现实头戴显示设备遮挡区域对应的第一脸部图像数据。
本实施例中,通过图像识别方法识别出实际图像中指定目标的脸部,并识别出脸部未被VR HMD遮挡的区域和被VR HMD遮挡的区域,因为需要通过未被VR HMD遮挡区域,得到与被VR HMD遮挡区域相匹配的图像数据,所以需要从实际图像中获取识别出的未被虚拟现实头戴显示设备遮挡区域的第一脸部图像数据。
步骤S130,根据第一脸部图像数据和预设的脸部表情模型,得到与第一脸部图像数据匹配的第二脸部图像数据,第二脸部图像数据与被虚拟现实头戴显示设备遮挡区域相对应。
本实施例中,预设的脸部表情模型是通过指定目标样本训练得到的(例如,使用神经网络进行机器学习),在样本训练中可以获得未被VR HMD遮挡的图像数据与被VR HMD遮挡区域的图像数据之间的关系,因此,根据从实际图像中获取的第一脸部图像数据和预设的脸部表情模型,就可以得到与第一脸部图像数据匹配的第二脸部图像数据,即得到与被VR HMD遮挡区域相匹配的图像数据。
针对一个用户来说,只需要进行一次样本训练就可以,但是当用户更换VR HMD时,因为会存在更换前和更换后的VR HMD的大小不一致的情况,需要进行重新训练,防止根据原预设的脸部表情模型生成的第二脸部图像与第一脸部图像数据不能进行完美的融合。
步骤S140,将第一脸部图像数据和第二脸部图像数据相融合,生成合成图像。
通过图像融合方法,将第一脸部图像数据和和第二脸部图像数据相融合,生成合成图像。因为,第二脸部图像数据是与被VR HMD遮挡区域相匹配的带有表情信息的图像,因此,合成图像中带有指定目标的完整表情,获得合成图像后,就可以将该合成图像从参与社交的本侧用户发送参与该社交的另一侧用户。
因为第二脸部图像数据与被虚拟现实头戴显示设备遮挡区域相对应,且带有表情信息,所以合成图像则是完整的带有表情信息的图像,相比较使用没有表情信息的静态图片融合的合成图像来说,本实施例的合成图像更加逼真、准确,有利于社交双方及时获得对方的表情信息,提高社交质量,保证社交的顺利进行,提升用户体验。
在本发明的一个实施例中,步骤S130中的根据第一脸部图像数据和预设的脸部表情模型,得到与第一脸部图像数据匹配的第二脸部图像数据包括:将第一脸部图像数据输入到预设的脸部表情模型中,以使脸部表情模型识别第一脸部图像数据,输出与第一 脸部图像数据相匹配的第二脸部图像数据。
如上文说明,在预设的脸部表情模型中有未被VR HMD遮挡的图像数据与被VR HMD遮挡区域的图像数据之间的关系,当将第一脸部图像数据输入到预设的脸部表情模型后,脸部表情模型识别第一脸部图像数据,就会输出与第一脸部图像数据相匹配的第二脸部图像数据。也就是说,预设的脸部表情模型会自动分析第一脸部图像数据,然后根据第一脸部图像数据直接生成与第一脸部数据匹配的第二脸部图像数据,有利于提高图像处理的效率,进一步增加用户体验。
进一步地,上述的预设的脸部表情模型是通过深度神经网络得到的,通过深度神经网络得到预设的脸部表情模型包括:
(1)获取摄像头在第一场景下采集的指定目标的多个第一样本图像,以及在第二场景下采集的指定目标的多个第二样本图像;其中,在第一场景下,指定目标佩戴有虚拟现实头戴设备;在第二场景下,指定目标未佩戴虚拟现实头戴显示设备,且各第二样本图像中包含指定用户的脸部表情。
在本实施例中,获取多个第一样本图像的目的是为了可以将第二样本图像中与被VR HMD遮挡区域对应的部分提取出来,例如,被VR HMD遮挡区域是眼部区域,则需要将第二样本图像中的眼部区域提取出来。多个第二样本图像中应该包含用户各种表情信息,以便在对实际图像进行处理时,可以匹配到更加准确的第二图像数据。
(2)从第一样本图像中识别出第一被遮挡区域,获取第一被遮挡区域信息。
如上文说明,为了将第二样本图像中与被VR HMD遮挡区域对应的部分提取出来,需要识别出第一样本图像中的第一被遮挡区域,然后获取到第一被遮挡区域信息,例如,区域边界的坐标信息。
(3)根据第一被遮挡区域信息,对第二样本图像的指定目标脸部的与第一被遮挡区域对应的区域进行标记,获得标记区域。
这里获得的标记区域是第一样本图像中的被遮挡区域相同的区域,该标记区域相当于遮挡区域未被遮挡状态下的图像元素,该标记区域中包括指定目标的表情信息。例如,第一被遮挡区域是眼部区域,那么在对第二样本图像的指定目标脸部进行标记时,则对第二样本图像的指定目标脸部的眼部区域进行标记。
(4)将第二样本图像中标记区域的图像放入第一指定集合中,将该第一指定集合 作为深度神经网络训练时的输出集合;将第二样本图像中的指定目标脸部的未被标记区域的图像放入第二指定集合中,将该第二指定集合作为深度神经网络训练时的输入集合,放入第一指定集合的图像作为输出集合中的图像元素,放入第二指定集合中的图像作为输入集合中的图像元素。其中,第二指定集合与第一指定集合中的图像元素有一一对应的输入输出对应关系,也就是说,第一指定集合中和第二指定集合中的具有一一对应关系的两个图像元素来自同一个第二样本图像。例如,第二指定集合中的是样本图像1的眼部区域的图像元素,则第一指定集合的与其具有一一对应关系的是样本图像1的非眼部区域的图像元素。
(5)将输入集合和输出集合中的每一对具有输入输出对应关系的图像元素输入到预设的深度神经网络中进行训练,确定未遮挡区域图像和生成的与其匹配的遮挡区域图像之间的函数关系,以使在第一脸部图像数据输入到预设的脸部表情模型时,预设的脸部表情模型根据识别的第一脸部图像数据和函数关系输出与第一脸部图像数据匹配的第二脸部图像数据。
在本实施例中,将输入集合和输出集合中的每一对具有输入输出对应关系的图像元素输入到预设的深度神经网络中进行训练,因为输入集合中的是第二样本中的未被标记区域的图像元素(相当于未被遮挡区域的图像元素),输出集合中的图像元素是与输入集合中的各图像元素一一对应的标记区域图像(相当于被遮挡区域在未遮挡状态下的图像元素)。所以通过预设的深度神经网络中进行训练后,就可以得到遮挡区域图像和与该遮挡区域在不遮挡状态下的该区域图像之间的函数关系。
在一个具体的例子中,第一样本图像中被遮挡区域是眼部区域,则输入集合中的图像元素是第二样本图像中的非眼部区域的图像元素,输出集合则是第二样本图像中的眼部区域未被遮挡状态下的眼部区域的图像元素,通过预设的深度神经网络中进行训练后,就可以得到非眼部区域的图像元素和眼部区域未被遮挡状态下的眼部区域的图像元素之间的函数关系。
上述得到的函数关系即是未遮挡区域图像和生成的与其匹配的遮挡区域图像之间的函数关系,当确定了未遮挡区域图像后,就可以根据该函数关系,生成与未遮挡区域图像匹配的遮挡区域图像。当获取到摄像头采集的视频流时,确定该视频流中的指定目标的实际图像,从实际图像中识别出指定目标脸部的未被遮挡区域,根据上述得到的函 数关系,就可以生成与该未被遮挡区域匹配的遮挡区域的图像数据,将未被遮挡区域的图像与获得的遮挡区域的图像数据相融合,就可以生成合成图像。该合成图像则是指定目标完整的脸部图像,该脸部图像是未被遮挡的脸部图像。
本实施例,设计一个深度神经网络,其类型、层数以及每一层的节点数量,根据图像分辨率和所需生成效果设定。采用深度神经网络的机器学习方法,通过对指定目标的样本图像进行机器学习,获得指定目标的脸部表情模型。且,本实施例第二指定集合与第一指定集合中的图像元素有一一对应的输入输出对应关系,也就是说,本实施例通过深度神经网络进行有监督式的训练,将具有输入输出对应关系的图像元素输入到深度神经网络中进行训练生成神经网络模型参数,因为输入的图像元素和输出的图像元素有对应关系,通过训练就可以生成未遮挡区域图像和生成的与其匹配的遮挡区域图像之间的函数关系:output=f(input),input为脸部未遮挡区域的图像,output则为生成的眼部及周围对应于遮挡区域的脸部图像。
可见,本实施例引入深度神经网络的机器学习方法,对指定目标的样本图像进行训练,利用人工智能通过对指定目标的样本图像训练-预测的方式来生成被VR HMD遮挡区域的图像数据,可以使得合成图像与指定目标更加匹配,生成的合成图像更加自然,增强用户体验。
损失函数是机器学习优化中至关重要的一部分。它能根据预测结果,衡量出模型预测能力的好坏。在实际应用中,选取损失函数会受到诸多因素的制约,比如是否有异常值、机器学习算法的选择、梯度下降的时间复杂度、求导的难易程度以及预测值的置信度等等。因此,不同类型的数据适合的损失函数也是不同的。在本发明的一个本实施例中,在预设的深度神经网络训练过程中,预设的深度神经网络训练的损失函数是输出集合中的图像和生成的与输入集合中的图像相匹配的图像之间的均方差。
在本实施例中,输入集合中的图像元素和输出集合中的图像元素有一一对应关系。当确定函数关系后,通过输入集合中的图像元素和确定的函数关系,生成与输入集合中的图像元素相匹配的图像,则该损失函数是该输出集合中的图像元素和实际生成的与输入集合中的图像元素相匹配的图像之间的均方差。例如,输入集合中的图像元素1、2、3,分别与输出集合中的图像元素4、5、6具有一一对应关系,根据确定的函数关系和图形元素1、2、3,实际生成与图像元素1、2、3匹配的图像元素7、8、9,则损失函 数是图像元素4和图像元素7、图像元素5和图像元素8、图像元素6和图像元素9之间的均方差。
在实际应用中,VR HMD比指定目标的脸部要大,图像中除了指定目标的脸部区域的部分,VR HMD还会遮挡一部分非脸部区域,如果仅对脸部进行图像处理,生成的合成图像与真实效果的差距较大,需要对被VR HMD遮挡的非脸部图像进行去遮挡处理,可以通过下述的方法进行:
(1)在本发明的一个实施例中,图1所示的方法还包括:从实际图像中识别出被虚拟现实头戴设备遮挡的非脸部区域;从视频流中获取实际图像之前的多个第三图像,从第三图像中提取背景图像,使用背景图像中与被虚拟现实头戴设备遮挡的非脸部区域相匹配的图像数据,对被虚拟现实头戴设备遮挡的非脸部区域进行去遮挡处理。
这里第三图像的个数不具体限定。因为摄像头采集视频流是与环境的位置是相对固定的,可以根据实际图像之前的多个图像帧中的背景图像信息进行去遮挡处理。
(2)在本发明的另一个实施例中,图1所示的方法还包括:从实际图像中识别出被虚拟现实头戴设备遮挡的非脸部图像数据,将非脸部图像数据输入到预设的非脸部模型中,以使预设的非脸部模型识别非脸部图像数据,输出与被虚拟现实头戴设备遮挡的非脸部区域匹配的第四图像数据,根据第四图像数据对被虚拟现实头戴设备遮挡的非脸部区域进行去遮挡处理。
本实施例中预设的非脸部模型可以通过无监督训练的神经网络生成。上述的去遮挡处理可以采用图像融合方法,将获取的与被VR HMD遮挡的非脸部区域相匹配的图像数据或者第四图像数据与实际图像中未被VR HMD遮挡的图像数据进行融合。
通过上述的(1)和(2)对被虚拟现实头戴设备遮挡的非脸部区域,避免第一脸部图像数据和第二脸部图像数据融合后,与非脸部区域的衔接处过于明显,保证生成的合成图像更加真实、完整,而非仅仅体现指定目标的表情信息,整个合成图像更具有观赏性,增强用户体验。
在本发明的一个实施例中,该图像处理方法在实际应用中,生成合成图像是将第一脸部图像数据、第二脸部图像数据、非人脸部分中未被VR HMD遮挡的图像数据,以及获取的与被VR HMD遮挡的非脸部区域相匹配的图像数据或者第四图像数据进行融合,以生成完整的合成图像。
例如,本实施例中被VR HMD遮挡的非脸部图像数据可以是指定目标的头发或耳朵等区域,通过上述的(1)或(2)就可以将被遮挡的头发或耳朵展现出来,使得生成的合成图像更加逼真。
图2为本发明一个实施例提供的一种图像处理装置的功能结构示意图。如图2所示,该图像处理装置200包括:
第一获取单元210,用于从摄像头采集的视频流中获取指定目标的实际图像,其中,指定目标佩戴有虚拟现实头戴设备。
识别单元220,用于从实际图像中识别出指定目标脸部的未被虚拟现实头戴显示设备遮挡区域和被虚拟现实头戴显示设备遮挡区域,获取与未被虚拟现实头戴显示设备遮挡区域对应的第一脸部图像数据。
第二获取单元230,用于根据第一脸部图像数据和预设的脸部表情模型,得到与第一脸部图像数据匹配的第二脸部图像数据,第二脸部图像数据与被虚拟现实头戴显示设备遮挡区域相对应。
生成单元240,用于将第一脸部图像数据和第二脸部图像数据相融合,生成合成图像。
在本发明的一个实施例中,第二获取单元230,用于将第一脸部图像数据输入到预设的脸部表情模型中,以使脸部表情模型识别第一脸部图像数据,输出与第一脸部图像数据相匹配的第二脸部图像数据。
在本发明的一个实施例中,第二获取单元230还包括:
训练模块,用于通过深度神经网络得到预设的脸部表情模型,具体用于:获取摄像头在第一场景下采集的指定目标的多个第一样本图像,以及在第二场景下采集的指定目标的多个第二样本图像;其中,在第一场景下,指定目标佩戴有虚拟现实头戴设备;在第二场景下,指定目标未佩戴虚拟现实头戴显示设备,且各第二样本图像中包含指定用户的脸部表情;从第一样本图像中识别出第一被遮挡区域,获取第一被遮挡区域信息;根据第一被遮挡区域信息,对第二样本图像的指定目标脸部的与第一被遮挡区域对应的区域进行标记;将第二样本图像中标记区域的图像放入第一指定集合中,将该第一指定集合作为深度神经网络训练时的输出集合;将第二样本图像中的指定目标脸部的未被标记区域的图像放入第二指定集合中,将该第二指定集合作为深度神经网络训练时的输入 集合;第二指定集合与第一指定集合中的图像元素有一一对应的输入输出对应关系;将输入集合和输出集合中的每一具有对输入输出对应关系的图像元素输入到预设的深度神经网络中进行训练,确定未遮挡区域图像和生成的与其匹配的遮挡区域图像之间的函数关系,以使第二获取单元将第一脸部图像数据输入到预设的脸部表情模型,预设的脸部表情模型根据输入的第一脸部图像数据和函数关系输出与其匹配的第二脸部图像数据。
进一步地,在预设的深度神经网络训练过程中,预设的深度神经网络训练的损失函数是输出集合中的图像和生成的与输入集合中的图像相匹配的图像之间的均方差。
在本发明的一个实施例中,图2所示的图像处理装置200还包括:
处理单元,用于从实际图像中识别出被虚拟现实头戴设备遮挡的非脸部区域;从视频流中获取实际图像之前的多个第三图像,从第三图像中提取背景图像,使用背景图像中与被虚拟现实头戴设备遮挡的非脸部区域对应的图像数据,对被虚拟现实头戴设备遮挡的非脸部区域进行去遮挡处理。
在本发明的一个实施例中,图2所示的图像处理装置200还包括:
处理单元,用于从实际图像中识别出被所述虚拟现实头戴设备遮挡的非脸部图像数据,将非脸部图像数据输入到预设的非脸部模型中,以使预设的非脸部模型识别非脸部图像数据,输出与被非脸部区域匹配的第四图像数据,根据第四图像数据对非脸部区域进行去遮挡处理。
与前述图像数据的处理方法实施例相对应的,本发明还提供了一种图像数据的处理装置实施例。
图3为本发明另一个实施例提供的一种图像处理装置的结构示意图。如图3所示,图像处理装置300包括存储器310和处理器320,存储器310和处理器320之间通过内部总线330通讯连接,存储器310存储有能够被处理器320执行的图像处理的计算机程序311,该图像处理的计算机程序311被处理器320执行时能够实现上述方法步骤。
在不同的实施例中,存储器310可以是内存或者非易失性存储器。其中非易失性存储器可以是:存储驱动器(如硬盘驱动器)、固态硬盘、任何类型的存储盘(如光盘、DVD等),或者类似的存储介质,或者它们的组合。内存可以是:RAM(Radom Access Memory,随机存取存储器)、易失存储器、非易失性存储器、闪存。进一步,非易失性 存储器和内存作为机器可读存储介质,其上可存储由处理器320执行的图像处理的计算机程序311。
图4为本发明一个实施例提供的一种终端设备的功能结构示意图。如图4所示,该终端设备400包括:如图2或图3所示的图像处理装置410。
在本发明的一个实施例中,该终端设备400是虚拟现实头戴显示设备。或者,该终端设备400是在社交过程中与虚拟现实头戴显示设备进行连接的计算机或服务器,可以通过计算机或者服务器将参与社交的本侧用户的合成图像发送给参与社交的另一侧用户。
需要说明的是,图2、图3所示的装置和图4所示的终端设备的各实施例与图1所示的方法的各实施例对应相同,上文已有详细说明,在此不再赘述。
综上所述,本发明技术方案的有益效果是:当获取到戴有虚拟现实头戴设备的指定目标的实际图像后,先从实际图像中识别出指定目标脸部的未被虚拟现实头戴显示设备遮挡区域和被虚拟现实头戴显示设备遮挡区域,将未被虚拟现实头戴显示设备遮挡区域对应的第一脸部图像数据输入到预设的脸部表情模型中,就可以得到与第一脸部图像数据匹配的第二脸部图像数据;然后将第一脸部图像数据和第二脸部图像数据相融合,生成合成图像。因为第二脸部图像数据与被虚拟现实头戴显示设备遮挡区域相对应,且带有表情信息,所以合成图像则是完整的带有表情信息的图像,有利于社交双方及时获得对方的表情信息,提高社交质量,保证社交的顺利进行,提升用户体验。
以上所述,仅为本发明的具体实施方式,在本发明的上述教导下,本领域技术人员可以在上述实施例的基础上进行其他的改进或变形。本领域技术人员应该明白,上述的具体描述只是更好的解释本发明的目的,本发明的保护范围应以权利要求的保护范围为准。

Claims (13)

  1. 一种图像处理方法,其中,所述方法包括:
    从摄像头采集的视频流中获取指定目标的实际图像,其中,所述指定目标佩戴有虚拟现实头戴设备;
    从所述实际图像中识别出所述指定目标脸部的未被虚拟现实头戴显示设备遮挡区域和被虚拟现实头戴显示设备遮挡区域,获取与所述未被虚拟现实头戴显示设备遮挡区域对应的第一脸部图像数据;
    根据所述第一脸部图像数据和预设的脸部表情模型,得到与所述第一脸部图像数据匹配的第二脸部图像数据,所述第二脸部图像数据与所述被虚拟现实头戴显示设备遮挡区域相对应;
    将所述第一脸部图像数据和所述第二脸部图像数据相融合,生成合成图像。
  2. 如权利要求1所述的图像处理方法,其中,所述根据所述第一脸部图像数据和预设的脸部表情模型,得到与所述第一脸部图像数据匹配的第二脸部图像数据包括:
    将所述第一脸部图像数据输入到所述预设的脸部表情模型中,以使所述脸部表情模型识别所述第一脸部图像数据,输出与所述第一脸部图像数据相匹配的第二脸部图像数据。
  3. 如权利要求2所述的图像处理方法,其中,所述预设的脸部表情模型是通过深度神经网络得到的,所述通过深度神经网络得到预设的脸部表情模型包括:
    获取摄像头在第一场景下采集的所述指定目标的多个第一样本图像,以及在第二场景下采集的所述指定目标的多个第二样本图像;其中,在所述第一场景下,所述指定目标佩戴有所述虚拟现实头戴设备;在所述第二场景下,所述指定目标未佩戴所述虚拟现实头戴显示设备,且各第二样本图像中包含所述指定用户的脸部表情;
    从所述第一样本图像中识别出第一被遮挡区域,获取所述第一被遮挡区域信息;
    根据所述第一被遮挡区域信息,对所述第二样本图像的所述指定目标脸部的与所述第一被遮挡区域对应的区域进行标记;
    将所述第二样本图像中标记区域的图像放入第一指定集合中,将该第一指定集合作为深度神经网络训练时的输出集合;将所述第二样本图像中的所述指定目标脸部的未被标记区域的图像放入第二指定集合中,将该第二指定集合作为深度神经网络训练时的输 入集合;所述第二指定集合与所述第一指定集合中的图像元素有一一对应的输入输出对应关系;
    将所述输入集合和所述输出集合中的每一对具有输入输出对应关系的图像元素输入到预设的深度神经网络中进行训练,确定未遮挡区域图像和生成的与其匹配的遮挡区域图像之间的函数关系,以使在所述第一脸部图像数据输入到所述预设的脸部表情模型时,所述预设的脸部表情模型根据输入的所述第一脸部图像数据和所述函数关系输出与其匹配的第二脸部图像数据。
  4. 如权利要求3所述的图像处理方法,其中,
    在所述预设的深度神经网络训练过程中,所述预设的深度神经网络训练的损失函数是所述输出集合中的图像和生成的与所述输入集合中的图像相匹配的图像之间的均方差。
  5. 如权利要求1所述的图像处理方法,其中,所述方法还包括:
    从所述实际图像中识别出被所述虚拟现实头戴设备遮挡的非脸部区域;
    从所述视频流中获取所述实际图像之前的多个第三图像,从所述第三图像中提取背景图像,使用所述背景图像中与被所述虚拟现实头戴设备遮挡的非脸部区域对应的图像数据,对所述被所述虚拟现实头戴设备遮挡的非脸部区域进行去遮挡处理。
  6. 如权利要求1所述的图像处理方法,其中,所述方法还包括:
    从所述实际图像中识别出被所述虚拟现实头戴设备遮挡的非脸部图像数据,将所述非脸部图像数据输入到预设的非脸部模型中,以使所述预设的非脸部模型识别所述非脸部图像数据,输出与被所述虚拟现实头戴设备遮挡的非脸部区域匹配的第四图像数据,根据所述第四图像数据对所述被所述虚拟现实头戴设备遮挡的非脸部区域进行去遮挡处理。
  7. 一种图像处理装置,所述装置包括:
    第一获取单元,用于从摄像头采集的视频流中获取指定目标的实际图像,其中,所述指定目标佩戴有虚拟现实头戴设备;
    识别单元,用于从所述实际图像中识别出所述指定目标脸部的未被虚拟现实头戴显示设备遮挡区域和被虚拟现实头戴显示设备遮挡区域,获取与所述未被虚拟现实头戴显示设备遮挡区域对应的第一脸部图像数据;
    第二获取单元,用于根据所述第一脸部图像数据和预设的脸部表情模型,得到与所述第一脸部图像数据匹配的第二脸部图像数据,所述第二脸部图像数据与所述被虚拟现实头戴显示设备遮挡区域相对应;
    生成单元,用于将所述第一脸部图像数据和所述第二脸部图像数据相融合,生成合成图像。
  8. 如权利要求7所述的图像处理装置,其中,所述第二获取单元,用于将第一脸部图像数据输入到预设的脸部表情模型中,以使脸部表情模型识别第一脸部图像数据,输出与第一脸部图像数据相匹配的第二脸部图像数据。
  9. 如权利要求8所述的图像处理装置,其中,所述第二获取单元还包括:
    训练模块,用于通过深度神经网络得到所述预设的脸部表情模型,具体用于:
    获取摄像头在第一场景下采集的所述指定目标的多个第一样本图像,以及在第二场景下采集的所述指定目标的多个第二样本图像;其中,在所述第一场景下,所述指定目标佩戴有所述虚拟现实头戴设备;在所述第二场景下,所述指定目标未佩戴所述虚拟现实头戴显示设备,且各第二样本图像中包含所述指定用户的脸部表情;
    从所述第一样本图像中识别出第一被遮挡区域,获取所述第一被遮挡区域信息;
    根据所述第一被遮挡区域信息,对所述第二样本图像的所述指定目标脸部的与所述第一被遮挡区域对应的区域进行标记;
    将所述第二样本图像中标记区域的图像放入第一指定集合中,将该第一指定集合作为深度神经网络训练时的输出集合;将所述第二样本图像中的所述指定目标脸部的未被标记区域的图像放入第二指定集合中,将该第二指定集合作为深度神经网络训练时的输入集合;所述第二指定集合与所述第一指定集合中的图像元素有一一对应的输入输出对应关系;
    将所述输入集合和所述输出集合中的每一对具有输入输出对应关系的图像元素输入到预设的深度神经网络中进行训练,确定未遮挡区域图像和生成的与其匹配的遮挡区域图像之间的函数关系,以使所述第二获取单元将所述第一脸部图像数据输入到所述预设的脸部表情模型,所述预设的脸部表情模型根据输入的所述第一脸部图像数据和所述函数关系输出与其匹配的第二脸部图像数据。
  10. 如权利要求9所述的图像处理装置,其中,在所述训练模块得到预设的深度神 经网络训练过程中,预设的深度神经网络训练的损失函数是输出集合中的图像和生成的与输入集合中的图像相匹配的图像之间的均方差。
  11. 如权利要求7所述的图像处理装置,其中,所述装置还包括:
    处理单元,用于从所述实际图像中识别出被所述虚拟现实头戴设备遮挡的非脸部区域;从所述视频流中获取所述实际图像之前的多个第三图像,从所述第三图像中提取背景图像,使用所述背景图像中与被所述虚拟现实头戴设备遮挡的非脸部区域对应的图像数据,对被所述虚拟现实头戴设备遮挡的非脸部区域进行去遮挡处理。
  12. 如权利要求7所述的图像处理装置,其中,所述装置还包括:
    处理单元,用于从所述实际图像中识别出被所述虚拟现实头戴设备遮挡的非脸部图像数据,将所述非脸部图像数据输入到预设的非脸部模型中,以使所述预设的非脸部模型识别所述非脸部图像数据,输出与被所述非脸部区域匹配的第四图像数据,根据所述第四图像数据对所述非脸部区域进行去遮挡处理。
  13. 一种终端设备,其中,所述终端设备包括:如权利要求7-12任一项所述的图像处理装置。
PCT/CN2018/092887 · Priority date 2017-08-30 · Filing date 2018-06-26 · 一种图像处理方法、装置和终端设备 (Image processing method, apparatus and terminal device) · WO2019041992A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/461,718 US11295550B2 (en) 2017-08-30 2018-06-26 Image processing method and apparatus, and terminal device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710766169.1 2017-08-30
CN201710766169.1A CN107680069B (zh) 2017-08-30 2017-08-30 一种图像处理方法、装置和终端设备

Publications (1)

Publication Number Publication Date
WO2019041992A1 (zh)

Family

ID=61135055

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/092887 WO2019041992A1 (zh) 2017-08-30 2018-06-26 一种图像处理方法、装置和终端设备

Country Status (3)

Country Link
US (1) US11295550B2 (zh)
CN (1) CN107680069B (zh)
WO (1) WO2019041992A1 (zh)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11145124B2 (en) 2017-08-30 2021-10-12 Ronald H. Winston System and method for rendering virtual reality interactions
CN107680069B (zh) 2017-08-30 2020-09-11 歌尔股份有限公司 一种图像处理方法、装置和终端设备
CN108256505A (zh) * 2018-02-12 2018-07-06 腾讯科技(深圳)有限公司 图像处理方法及装置
JP7250809B2 (ja) * 2018-03-13 2023-04-03 ロナルド ウィンストン 仮想現実システムおよび方法
CN108551552B (zh) * 2018-05-14 2020-09-01 Oppo广东移动通信有限公司 图像处理方法、装置、存储介质及移动终端
CN108764135B (zh) * 2018-05-28 2022-02-08 北京微播视界科技有限公司 图像生成方法、装置,及电子设备
CN110647780A (zh) * 2018-06-07 2020-01-03 东方联合动画有限公司 一种数据处理方法、系统
CN110147805B (zh) 2018-07-23 2023-04-07 腾讯科技(深圳)有限公司 图像处理方法、装置、终端及存储介质
CN109215007B (zh) * 2018-09-21 2022-04-12 维沃移动通信有限公司 一种图像生成方法及终端设备
CN111045618A (zh) * 2018-10-15 2020-04-21 广东美的白色家电技术创新中心有限公司 产品展示方法、装置及系统
CN109948525A (zh) * 2019-03-18 2019-06-28 Oppo广东移动通信有限公司 拍照处理方法、装置、移动终端以及存储介质
WO2020214897A1 (en) 2019-04-18 2020-10-22 Beckman Coulter, Inc. Securing data of objects in a laboratory environment
CN111860380A (zh) * 2020-07-27 2020-10-30 平安科技(深圳)有限公司 人脸图像生成方法、装置、服务器及存储介质
CN112257552B (zh) * 2020-10-19 2023-09-05 腾讯科技(深圳)有限公司 图像处理方法、装置、设备及存储介质
CN114594851A (zh) * 2020-11-30 2022-06-07 华为技术有限公司 图像处理方法、服务器和虚拟现实设备

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1196366A (ja) * 1997-09-19 1999-04-09 Nippon Telegr & Teleph Corp <Ntt> ヘッドマウントディスプレイを装着した人物の顔画像合成方法およびその装置
US20160217621A1 (en) * 2015-01-28 2016-07-28 Sony Computer Entertainment Europe Limited Image processing
CN106170083A (zh) * 2015-05-18 2016-11-30 三星电子株式会社 用于头戴式显示器设备的图像处理
CN107305621A (zh) * 2016-04-17 2017-10-31 张翔宇 一种虚拟现实眼镜的图像捕获设备及图像合成系统
CN107491165A (zh) * 2016-06-12 2017-12-19 张翔宇 一种vr眼镜面部3d图像、平面图像捕获与手势捕获系统
CN107680069A (zh) * 2017-08-30 2018-02-09 歌尔股份有限公司 一种图像处理方法、装置和终端设备

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8442295B2 (en) * 2010-06-29 2013-05-14 Analogic Corporation Anti-counterfeiting / authentication
US10445863B2 (en) * 2014-08-04 2019-10-15 Facebook Technologies, Llc Method and system for reconstructing obstructed face portions for virtual reality environment
US20160140761A1 (en) 2014-11-19 2016-05-19 Microsoft Technology Licensing, Llc. Using depth information for drawing in augmented reality scenes
CN104539868B (zh) * 2014-11-24 2018-06-01 联想(北京)有限公司 一种信息处理方法及电子设备
US9904054B2 (en) * 2015-01-23 2018-02-27 Oculus Vr, Llc Headset with strain gauge expression recognition system
US10217261B2 (en) * 2016-02-18 2019-02-26 Pinscreen, Inc. Deep learning-based facial animation for head-mounted display
US10684674B2 (en) * 2016-04-01 2020-06-16 Facebook Technologies, Llc Tracking portions of a user's face uncovered by a head mounted display worn by the user
US20180101989A1 (en) * 2016-10-06 2018-04-12 Google Inc. Headset removal in virtual, augmented, and mixed reality using an eye gaze database
US20180158246A1 (en) * 2016-12-07 2018-06-07 Intel IP Corporation Method and system of providing user facial displays in virtual or augmented reality for face occluding head mounted displays

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1196366A (ja) * 1997-09-19 1999-04-09 Nippon Telegr & Teleph Corp <Ntt> ヘッドマウントディスプレイを装着した人物の顔画像合成方法およびその装置
US20160217621A1 (en) * 2015-01-28 2016-07-28 Sony Computer Entertainment Europe Limited Image processing
CN106170083A (zh) * 2015-05-18 2016-11-30 三星电子株式会社 用于头戴式显示器设备的图像处理
CN107305621A (zh) * 2016-04-17 2017-10-31 张翔宇 一种虚拟现实眼镜的图像捕获设备及图像合成系统
CN107491165A (zh) * 2016-06-12 2017-12-19 张翔宇 一种vr眼镜面部3d图像、平面图像捕获与手势捕获系统
CN107680069A (zh) * 2017-08-30 2018-02-09 歌尔股份有限公司 一种图像处理方法、装置和终端设备

Also Published As

Publication number Publication date
CN107680069B (zh) 2020-09-11
US11295550B2 (en) 2022-04-05
US20210374390A1 (en) 2021-12-02
CN107680069A (zh) 2018-02-09

Similar Documents

Publication Publication Date Title
WO2019041992A1 (zh) 一种图像处理方法、装置和终端设备
US11238568B2 (en) Method and system for reconstructing obstructed face portions for virtual reality environment
Chen et al. What comprises a good talking-head video generation?: A survey and benchmark
KR102390781B1 (ko) 얼굴 표정 추적
US9030486B2 (en) System and method for low bandwidth image transmission
WO2015116388A2 (en) Self-initiated change of appearance for subjects in video and images
CN109952759A (zh) 用于具有hmd的视频会议的改进的方法和系统
JP2016537922A (ja) 擬似ビデオ通話方法及び端末
CN109242940B (zh) 三维动态图像的生成方法和装置
CN111710036A (zh) 三维人脸模型的构建方法、装置、设备及存储介质
WO2018095317A1 (zh) 视频数据处理方法、装置及设备
Liang et al. Head reconstruction from internet photos
CN113192132B (zh) 眼神捕捉方法及装置、存储介质、终端
CN109150690B (zh) 交互数据处理方法、装置、计算机设备和存储介质
WO2018121699A1 (zh) 视频通信方法、设备和终端
US20220398816A1 (en) Systems And Methods For Providing Real-Time Composite Video From Multiple Source Devices Featuring Augmented Reality Elements
Zheng et al. Learning view-invariant features for person identification in temporally synchronized videos taken by wearable cameras
Elgharib et al. Egocentric videoconferencing
Michibata et al. Cooking activity recognition in egocentric videos with a hand mask image branch in the multi-stream cnn
Wang et al. Digital twin: Acquiring high-fidelity 3D avatar from a single image
CN113570689B (zh) 人像卡通化方法、装置、介质和计算设备
US20230386147A1 (en) Systems and Methods for Providing Real-Time Composite Video from Multiple Source Devices Featuring Augmented Reality Elements
CN105979331A (zh) 一种智能电视的数据推荐方法和装置
KR102558806B1 (ko) 멀티카메라를 이용한 대상 추적 장치
CN111881807A (zh) 基于人脸建模及表情追踪的vr会议控制系统及方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18851743

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18851743

Country of ref document: EP

Kind code of ref document: A1