CN107707837B - Image processing method and apparatus, electronic apparatus, and computer-readable storage medium


Info

Publication number
CN107707837B
Authority
CN
China
Prior art keywords
image
dimensional background
sound
predetermined
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710812799.8A
Other languages
Chinese (zh)
Other versions
CN107707837A (en)
Inventor
张学勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201710812799.8A priority Critical patent/CN107707837B/en
Publication of CN107707837A publication Critical patent/CN107707837A/en
Application granted granted Critical
Publication of CN107707837B publication Critical patent/CN107707837B/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265Mixing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/272Means for inserting a foreground image in a background image, i.e. inlay, outlay
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses an image processing method for processing a merged image. The merged image is formed by fusing a predetermined three-dimensional background image with a person region image extracted from a scene image of the current user in a real scene. The image processing method comprises the following steps: acquiring sound information of the current user; processing the sound information to obtain a sound characteristic; and switching the predetermined three-dimensional background image according to the sound characteristic. The invention also discloses an image processing apparatus, an electronic apparatus, and a computer-readable storage medium. When the image processing method, the image processing apparatus, the electronic apparatus, and the computer-readable storage medium of the embodiments of the invention process a merged image of a real person and a predetermined three-dimensional background image, the predetermined three-dimensional background image in the merged image is switched according to the recognized sound characteristic of the current user, so that the merged image is richer, the user experience is improved, and the interest is enhanced.

Description

Image processing method and apparatus, electronic apparatus, and computer-readable storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image processing method and apparatus, an electronic apparatus, and a computer-readable storage medium.
Background
In existing techniques, when a person image from a real scene is fused with a virtual background image, the background image is usually fixed, which results in a poor user experience.
Disclosure of Invention
Embodiments of the present invention provide an image processing method, an image processing apparatus, an electronic apparatus, and a computer-readable storage medium.
The image processing method of the embodiments of the invention is used for processing a merged image, where the merged image is formed by fusing a predetermined three-dimensional background image with a person region image extracted from a scene image of the current user in a real scene. The image processing method comprises the following steps:
acquiring the sound information of the current user;
processing the sound information to obtain sound characteristics; and
switching the predetermined three-dimensional background image according to the sound characteristic.
The image processing apparatus of the embodiments of the invention is used for processing a merged image, where the merged image is formed by fusing a predetermined three-dimensional background image with a person region image extracted from a scene image of the current user in a real scene. The image processing apparatus includes an acousto-electric element and a processor. The acousto-electric element is used to acquire sound information of the current user, and the processor is used to process the sound information to obtain a sound characteristic and to switch the predetermined three-dimensional background image according to the sound characteristic.
The electronic device of an embodiment of the present invention includes one or more processors, memory, and one or more programs. Wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs including instructions for performing the image processing method described above.
The computer-readable storage medium of an embodiment of the present invention includes a computer program for use in conjunction with an electronic device capable of image capture, the computer program being executable by a processor to perform the image processing method described above.
According to the image processing method, the image processing apparatus, the electronic apparatus, and the computer-readable storage medium of the embodiments of the invention, when the merged image of a real person and a predetermined three-dimensional background image is processed, the predetermined three-dimensional background image in the merged image is switched according to the recognized sound characteristic of the current user, so that the merged image is richer, the interest is enhanced, and the user experience is improved.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow diagram illustrating an image processing method according to some embodiments of the invention.
FIG. 2 is a schematic diagram of an image processing apparatus according to some embodiments of the invention.
Fig. 3 is a schematic structural diagram of an electronic device according to some embodiments of the invention.
FIG. 4 is a flow diagram illustrating an image processing method according to some embodiments of the invention.
FIG. 5 is a flow chart illustrating an image processing method according to some embodiments of the present invention.
FIG. 6 is a flow chart illustrating an image processing method according to some embodiments of the invention.
FIG. 7 is a flow chart illustrating an image processing method according to some embodiments of the invention.
FIG. 8 is a flow chart illustrating an image processing method according to some embodiments of the present invention.
Fig. 9(a) to 9(e) are schematic views of a scene of structured light measurement according to an embodiment of the present invention.
FIGS. 10(a) and 10(b) are schematic views of a scene for structured light measurement according to one embodiment of the present invention.
FIG. 11 is a flow chart illustrating an image processing method according to some embodiments of the invention.
FIG. 12 is a flow chart illustrating an image processing method according to some embodiments of the invention.
FIG. 13 is a flow chart illustrating an image processing method according to some embodiments of the invention.
FIG. 14 is a schematic view of an electronic device according to some embodiments of the invention.
FIG. 15 is a schematic diagram of an image processing apparatus according to some embodiments of the invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Referring to fig. 1, an image processing method according to an embodiment of the invention is used for processing a merged image. The combined image is formed by fusing a preset three-dimensional background image and a character area image in a scene image of a current user in a real scene. The image processing method includes the steps of:
02: acquiring sound information of a current user;
04: processing the sound information to obtain sound characteristics; and
06: the predetermined three-dimensional background image is switched according to the sound characteristics.
Referring to fig. 2 and fig. 3 together, the image processing apparatus 100 according to the embodiment of the invention is used for processing a merged image. The merged image is formed by fusing a preset three-dimensional background image and a character area image in a scene image of a current user in a real scene. The image processing method according to the embodiment of the present invention can be implemented by the image processing apparatus 100 according to the embodiment of the present invention, and is used in the electronic apparatus 1000. The image processing apparatus 100 includes a processor 20 and an acousto-electric element 70. Step 02 may be implemented by an acousto-electric element 70. Steps 04 and 06 may be implemented by processor 20.
That is, the acoustoelectric element 70 is used to acquire the voice information of the current user. The processor 20 is configured to process the sound information to obtain sound characteristics, and to switch the predetermined three-dimensional background image according to the sound characteristics.
The image processing apparatus 100 according to the embodiment of the present invention can be applied to the electronic apparatus 1000 according to the embodiment of the present invention. That is, the electronic apparatus 1000 according to the embodiment of the present invention includes the image processing apparatus 100 according to the embodiment of the present invention.
In some embodiments, the electronic device 1000 includes a mobile phone, a tablet computer, a notebook computer, a smart band, a smart watch, a smart helmet, smart glasses, and the like.
In some embodiments, the initial predetermined three-dimensional background image may be a predetermined three-dimensional background image obtained by modeling an actual scene, or may be an animated predetermined three-dimensional background image. The predetermined three-dimensional background image may be randomly selected by the processor 20 or may be selected by the current user.
In some application scenarios, such as a video conference or a video chat, for reasons of security, privacy, and added interest, both participants may use a virtual image instead of the real scene as the background, fuse it with the person region image extracted from the scene image of the current user in the real scene to form a merged image, and output the merged image to the other party. Generally, the virtual image is single and fixed and lacks variation, so that the same virtual image becomes dull after long use and the user experience is poor.
According to the image processing method of the embodiments of the invention, the sound information of the current user is collected, and the processor 20 processes the collected sound information to extract a sound characteristic, so that the predetermined three-dimensional background image is switched according to a certain relationship between the sound characteristic and the predetermined three-dimensional background image. In this way more and richer merged images are presented, the interest is enhanced, and the user experience is improved.
In some embodiments, the sound characteristic includes at least one of loudness, pitch, and timbre. That is, the sound characteristic may include only loudness, pitch, or timbre; it may include any two of them (loudness and pitch, loudness and timbre, or pitch and timbre); and it may also include loudness, pitch, and timbre together.
Specifically, loudness refers to the volume level of the sound. In some cases the volume level has a certain relationship with the real environment in which the current user is located. For example, when the current user speaks softly, the real environment is likely quiet, and the predetermined three-dimensional background image in the merged image may be switched to one containing elements such as night, the moon, stars, or a library. When the current user speaks loudly, the real environment is likely noisy, and the predetermined three-dimensional background image in the merged image may be switched to one containing elements such as crowds, roads, or rain. Because the sound information collected by the acousto-electric element 70 is represented as a set of waveforms, the processor 20 can determine the loudness of the sound information from the amplitudes of those waveforms.
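As a purely illustrative aside (not part of the original disclosure), the following Python sketch shows one way the amplitude-based loudness judgment described above could be realized; the RMS measure, the dB conversion, and the threshold value are assumptions of this sketch.

```python
import numpy as np

def loudness_level(samples, quiet_threshold_db=-30.0):
    """Estimate loudness of a mono waveform (float samples in [-1, 1])
    from its amplitude and classify the environment.

    The RMS-to-dB conversion and the threshold are illustrative
    assumptions, not values taken from the patent.
    """
    rms = np.sqrt(np.mean(np.square(samples)) + 1e-12)
    db = 20.0 * np.log10(rms + 1e-12)  # level relative to full scale
    return "quiet" if db < quiet_threshold_db else "noisy"

# A "quiet" result could trigger a night/library style background,
# a "noisy" result a crowd/street style background.
```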
Pitch essentially corresponds to the frequency of vocal cord vibration: the higher the vibration frequency, the crisper and sharper the current user's voice; the lower the vibration frequency, the deeper and rougher the voice. Generally, male voices are deep and female voices are crisp. Therefore, after the processor 20 processes the sound information collected by the acousto-electric element 70, if the voice is judged to be crisp, the predetermined three-dimensional background image in the merged image may be switched to one containing elements such as cartoons or flowers; if the voice is judged to be deep, it may be switched to one containing elements such as a court, cars, or game characters. The processor 20 may extract the lowest frequency value of the sound information as the pitch feature and determine the pitch characteristic from that value.
The sound produced by each user is different; therefore, each user has a different timbre. The predetermined three-dimensional background image may also be switched according to the timbre of the sound. In this case the sound may include environmental sounds in addition to the current user's own voice. For example, the sound information collected by the acousto-electric element 70 may include the sound of a violin, in which case the predetermined three-dimensional background image in the merged image may be switched to one containing violin elements; alternatively, the sound information may include a cat's meow, in which case it may be switched to one containing cat elements. The processor 20 may determine the timbre by analyzing the spectral features of the sound information.
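The patent leaves open how the lowest frequency value and the spectral features are computed; a hedged sketch using a short-time Fourier spectrum is shown below, where the windowing, the 50-500 Hz search band, and the peak-picking are illustrative assumptions.

```python
import numpy as np

def spectral_features(samples, sample_rate=16000):
    """Return a rough fundamental-frequency estimate (pitch cue) and the
    magnitude spectrum (timbre cue) of a short mono frame.

    Using the strongest low-frequency spectral peak as the "lowest
    frequency value" is an assumption of this sketch; the patent does
    not fix a particular method.
    """
    window = np.hanning(len(samples))
    spectrum = np.abs(np.fft.rfft(samples * window))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)

    voiced = (freqs > 50) & (freqs < 500)   # plausible vocal range
    fundamental = freqs[voiced][np.argmax(spectrum[voiced])]
    return fundamental, spectrum

# A higher fundamental (crisper voice) and the spectral envelope could
# then be matched against stored profiles (violin, piano, cat, ...) to
# pick the corresponding predetermined three-dimensional background.
```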
Referring to fig. 4, in some embodiments, the sound characteristics include a plurality of sound characteristics, the predetermined three-dimensional background images include a plurality of predetermined three-dimensional background images, and each sound characteristic corresponds to one predetermined three-dimensional background image. The step 06 of switching the predetermined three-dimensional background image according to the sound characteristic includes:
061: a predetermined three-dimensional background image corresponding to the sound characteristic is switched according to the sound characteristic.
Referring back to fig. 2, in some embodiments, step 061 may be implemented by processor 20. That is, the processor 20 is also configured to switch the predetermined three-dimensional background image corresponding to the sound characteristic according to the sound characteristic.
Specifically, for example, the loudness characteristic may be divided into a plurality of levels according to volume, each volume level corresponding to one predetermined three-dimensional background image. After the processor 20 calculates the volume of the sound information, the corresponding predetermined three-dimensional background image is switched in according to that volume. As another example, for the timbre characteristic, different timbres correspond to different predetermined three-dimensional background images (for example, a violin corresponds to a predetermined three-dimensional background image containing violin elements, and a piano corresponds to one containing piano elements). After the processor 20 processes the sound information to determine the timbre type, the corresponding predetermined three-dimensional background image is switched in according to the determined type.
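To make the one-to-one correspondence concrete, a minimal sketch of such a lookup is given below; the level divisions, timbre labels, and background asset names are illustrative assumptions, not values from the patent.

```python
# Illustrative mapping; the actual levels and background assets are not
# specified by the patent.
LOUDNESS_BACKGROUNDS = {
    0: "bg_night_library.model",     # very quiet
    1: "bg_indoor_day.model",
    2: "bg_crowded_street.model",    # very loud
}

TIMBRE_BACKGROUNDS = {
    "violin": "bg_concert_hall_violin.model",
    "piano": "bg_concert_hall_piano.model",
    "cat": "bg_living_room_cat.model",
}

def select_background(loudness_level=None, timbre=None, current=None):
    """Return the predetermined background corresponding to the
    recognized sound characteristic, keeping the current one otherwise."""
    if timbre in TIMBRE_BACKGROUNDS:
        return TIMBRE_BACKGROUNDS[timbre]
    if loudness_level in LOUDNESS_BACKGROUNDS:
        return LOUDNESS_BACKGROUNDS[loudness_level]
    return current
```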
Referring to fig. 5, in some embodiments, there are a plurality of predetermined three-dimensional background images, and they are stored in a predetermined order. The step 06 of switching the predetermined three-dimensional background image according to the sound characteristic includes:
062: a plurality of predetermined three-dimensional background images are switched in a predetermined manner according to sound characteristics.
Referring back to FIG. 2, in some embodiments, step 062 may be performed by processor 20. That is, the processor 20 is also configured to switch a plurality of predetermined three-dimensional background images in a predetermined manner according to the sound characteristics.
Specifically, one sound characteristic may correspond to a plurality of predetermined three-dimensional background images. The predetermined three-dimensional background images may be stored in the electronic device 1000 (shown in fig. 3) in a predetermined sequence, where the predetermined sequence may be the storage order of the images, or a custom order set by the user according to personal preference or image style. After storage, the sound characteristic serves as the trigger for switching among the predetermined three-dimensional background images: once the processor 20 obtains the sound characteristic, the switching is triggered.
The predetermined manner may include, but is not limited to, cycling through all of the images in order, random switching, and the like.
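A minimal sketch of the two predetermined manners just mentioned (cycling through the stored order, or switching at random) is given below; the class and method names are illustrative assumptions.

```python
import random

class BackgroundSwitcher:
    """Cycle through, or randomly pick from, backgrounds stored in a
    predetermined order. This class is an illustrative sketch only."""

    def __init__(self, backgrounds, mode="cycle"):
        self.backgrounds = list(backgrounds)  # stored in predetermined order
        self.mode = mode
        self.index = 0

    def on_sound_characteristic(self, _characteristic):
        # Any recognized sound characteristic acts as the switch trigger.
        if self.mode == "cycle":
            self.index = (self.index + 1) % len(self.backgrounds)
        else:  # "random"
            self.index = random.randrange(len(self.backgrounds))
        return self.backgrounds[self.index]
```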
Referring to fig. 6, in some embodiments, the image processing method according to the embodiments of the present invention further includes:
011: acquiring a scene image of a current user;
012: acquiring a depth image of a current user;
013: processing the scene image and the depth image to extract a person region of the current user in the scene image to obtain a person region image; and
014: fusing the person region image with the predetermined three-dimensional background image to obtain a merged image.
Referring to fig. 2, in some embodiments, the image processing apparatus 100 further includes a visible light camera 11 and a depth image capturing component 12. Step 011 can be implemented by the visible light camera 11, step 012 can be implemented by the depth image acquisition component 12, and both steps 013 and 014 can be implemented by the processor 20.
That is, the visible light camera 11 may be used to acquire a scene image of the current user; the depth image acquisition component 12 may be used to acquire a depth image of a current user; the processor 20 may be configured to process the scene image and the depth image to extract a person region of a current user in the scene image to obtain a person region image, and to fuse the person region image with a predetermined three-dimensional background image to obtain a merged image.
The scene image may be a grayscale image or a color image, and the depth image represents the depth information of each person or object in the scene where the current user is located. The scene range of the scene image is consistent with that of the depth image, and each pixel in the scene image has corresponding depth information in the depth image.
Existing methods of segmenting a person from the background mainly rely on the similarity and discontinuity of adjacent pixel values, but such segmentation is easily affected by environmental factors such as external illumination. The image processing method of the embodiments of the invention extracts the person region in the scene image by means of the depth image of the current user. Because the acquisition of the depth image is not easily affected by factors such as illumination or the color distribution in the scene, the person region extracted from the depth image is more accurate, and in particular the boundary of the person region can be accurately calibrated. Furthermore, a merged image obtained by fusing the more accurate person region image with the predetermined three-dimensional background image has a better effect.
Referring to fig. 7, in some embodiments, the step 012 of acquiring the depth image of the current user includes:
0121: projecting structured light to a current user;
0122: shooting a structured light image modulated by a current user; and
0123: demodulating the phase information corresponding to each pixel of the structured light image to obtain a depth image.
Referring back to fig. 2, in some embodiments, the depth image capture assembly 12 includes a structured light projector 121 and a structured light camera 122. Step 0121 may be implemented by the structured light projector 121, and both steps 0122 and 0123 may be implemented by the structured light camera 122.
That is, the structured light projector 121 may be used to transmit structured light to a current user; the structured light camera 122 may be configured to capture a structured light image modulated by a current user, and demodulate phase information corresponding to each pixel of the structured light image to obtain a depth image.
Specifically, after the structured light projector 121 projects a certain pattern of structured light onto the face and the body of the current user, a structured light image modulated by the current user is formed on the surface of the face and the body of the current user. The structured light camera 122 captures a modulated structured light image, and demodulates the structured light image to obtain a depth image. The pattern of the structured light may be laser stripes, gray codes, sinusoidal stripes, non-uniform speckles, etc.
Referring to fig. 8, in some embodiments, the step 0123 of demodulating phase information corresponding to each pixel of the structured light image to obtain the depth image includes:
01231: demodulating phase information corresponding to each pixel in the structured light image;
01232: converting the phase information into depth information; and
01233: generating a depth image according to the depth information.
Referring back to FIG. 2, in some embodiments, steps 01231, 01232, and 01233 can all be implemented by structured light camera 122.
That is, the structured light camera 122 may be further configured to demodulate phase information corresponding to each pixel in the structured light image, convert the phase information into depth information, and generate a depth image according to the depth information.
Specifically, the phase information of the modulated structured light differs from that of the unmodulated structured light, so the structured light shown in the structured light image is distorted, and the change in phase information represents the depth information of the object. Therefore, the structured light camera 122 first demodulates the phase information corresponding to each pixel in the structured light image and then calculates the depth information from the phase information, thereby obtaining the final depth image.
In order to make the process of acquiring the depth image of the face and body of the current user from structured light clearer to those skilled in the art, the widely used grating projection technique (fringe projection technique) is taken as an example to illustrate its specific principle. The grating projection technique belongs, in a broad sense, to the field of surface structured light.
As shown in fig. 9(a), when surface structured light is used for projection, sinusoidal stripes are first generated by computer programming and projected onto the measured object through the structured light projector 121; the structured light camera 122 then photographs the degree to which the stripes are bent after modulation by the object, and the bent stripes are demodulated to obtain the phase, which is in turn converted into depth information to obtain the depth image. To avoid errors or error coupling, the depth image capturing assembly 12 needs to be calibrated before structured light is used to capture depth information; the calibration includes calibration of geometric parameters (e.g., the relative position of the structured light camera 122 and the structured light projector 121), of the internal parameters of the structured light camera 122, of the internal parameters of the structured light projector 121, and so on.
Specifically, in the first step, the computer is programmed to generate sinusoidal stripes. Since the phase is to be recovered from the distorted stripes, for example with a four-step phase-shift method, four stripe patterns with a phase difference of pi/2 are generated; the structured light projector 121 projects the four patterns onto the measured object (the mask shown in fig. 9(a)) in a time-sharing manner, and the structured light camera 122 acquires the image on the left side of fig. 9(b) and reads the stripes on the reference plane shown on the right side of fig. 9(b).
In the second step, phase recovery is carried out. The structured light camera 122 calculates the modulated phase from the four acquired modulated fringe patterns (i.e., structured light images); the resulting phase map is a truncated (wrapped) phase map. Because the result of the four-step phase-shift algorithm is calculated with the arctangent function, the phase of the modulated structured light is confined to the range [-pi, pi], i.e., it wraps around whenever the modulated phase exceeds that range. The resulting principal phase value is shown in fig. 9(c).
In the phase recovery process, de-jump (unwrapping) processing is required, that is, the truncated phase is restored to a continuous phase. As shown in fig. 9(d), the modulated continuous phase map is on the left and the reference continuous phase map is on the right.
In the third step, the phase difference (i.e., the phase information) between the modulated continuous phase and the reference continuous phase is calculated. The phase difference represents the depth information of the measured object relative to the reference plane, and substituting it into the calibrated phase-to-depth conversion formula yields the three-dimensional model of the measured object shown in fig. 9(e).
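For illustration only (the patent itself gives no code), the arithmetic of the four-step phase-shift procedure described above can be sketched as follows; the placeholder phase-to-depth factor stands in for the calibrated conversion formula, whose true form depends on the system geometry.

```python
import numpy as np

def four_step_phase(i1, i2, i3, i4):
    """Wrapped (truncated) phase from four fringe images shifted by pi/2.
    The result lies in (-pi, pi], as described in the text."""
    return np.arctan2(i4 - i2, i1 - i3)

def depth_from_phase(wrapped_obj, wrapped_ref, phase_to_depth=1.0):
    """Unwrap object and reference phases, take their difference, and
    scale it to depth. `phase_to_depth` is a placeholder for the
    calibrated phase-height conversion."""
    cont_obj = np.unwrap(np.unwrap(wrapped_obj, axis=0), axis=1)
    cont_ref = np.unwrap(np.unwrap(wrapped_ref, axis=0), axis=1)
    return (cont_obj - cont_ref) * phase_to_depth
```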
It should be understood that, in practical applications, the structured light used in the embodiments of the present invention may be any pattern other than the grating, according to different application scenarios.
As a possible implementation, the present invention can also use speckle structured light to collect the depth information of the current user.
Specifically, the method of acquiring depth information with speckle structured light uses a substantially flat diffraction element having a relief diffraction structure with a specific phase distribution; its cross section has a stepped relief structure with two or more concave-convex levels. The thickness of the substrate in the diffraction element is approximately 1 micron, and the heights of the steps are non-uniform, ranging from 0.7 to 0.9 micron. The structure shown in fig. 10(a) is a partial diffraction structure of the collimating beam-splitting element of this embodiment, and fig. 10(b) is a cross-sectional side view along section A-A, with both abscissa and ordinate in micrometers. The speckle pattern generated by speckle structured light is highly random, and the pattern changes with distance. Therefore, before depth information can be obtained using speckle structured light, the speckle patterns in space must first be calibrated; for example, a reference plane is taken every 1 cm within a range of 0-4 m from the structured light camera 122, so 400 speckle images are saved after calibration, and the smaller the calibration interval, the higher the accuracy of the obtained depth information. Then the structured light projector 121 projects the speckle structured light onto the measured object (i.e., the current user), and the height differences of the object surface change the speckle pattern projected onto it. After the structured light camera 122 captures the speckle pattern (i.e., the structured light image) projected onto the measured object, a cross-correlation operation is performed between this pattern and each of the 400 speckle images saved during calibration, yielding 400 correlation images. The position of the measured object in space produces a peak in the correlation images, and superimposing these peaks and performing an interpolation operation yields the depth information of the measured object.
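As a hedged illustration of the calibration-and-correlation idea described above, the sketch below compares a captured speckle image against the pre-calibrated reference planes and returns the distance of the best match; using OpenCV's normalized cross-correlation and collapsing the per-pixel peak interpolation into a single best-plane choice are simplifying assumptions of this sketch.

```python
import cv2
import numpy as np

def speckle_depth(captured, reference_planes, plane_spacing_cm=1.0):
    """Very simplified speckle depth lookup.

    `reference_planes` is the stack of speckle images calibrated at
    known distances (e.g., every 1 cm). Each plane gets a normalized
    cross-correlation score; the best-matching plane gives the depth.
    A real implementation would work per pixel block and interpolate
    between planes, which is omitted here.
    """
    scores = []
    for ref in reference_planes:
        corr = cv2.matchTemplate(captured, ref, cv2.TM_CCOEFF_NORMED)
        scores.append(float(corr.max()))
    best = int(np.argmax(scores))
    return best * plane_spacing_cm
```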
Because an ordinary diffraction element diffracts a beam into several beams whose intensities differ greatly, the risk of injury to human eyes is large, and even with secondary diffraction the uniformity of the obtained beams is low, so projecting onto the measured object with light diffracted by an ordinary diffraction element gives poor results. In this embodiment, a collimating beam-splitting element is used instead; this element not only collimates the non-collimated beam but also splits it, so that the non-collimated light reflected by the mirror exits as multiple collimated beams at different angles after passing through the element. The cross-sectional areas of the emitted collimated beams are approximately equal and their energy fluxes are approximately equal, so the effect of projecting with the light diffracted from these beams is better. Meanwhile, since the laser output is dispersed over multiple beams, the risk of injury to human eyes is further reduced, and compared with other uniformly arranged structured light, speckle structured light consumes less power while achieving the same collection effect.
Referring to fig. 11, in some embodiments, step 013 of processing the scene image and the depth image to extract the person region of the current user in the scene image to obtain the person region image includes:
0131: identifying a face region in a scene image;
0132: acquiring depth information corresponding to a face region from a depth image;
0133: determining the depth range of the character region according to the depth information of the face region; and
0134: determining, according to the depth range of the person region, the person region that is connected to the face region and falls within the depth range, to obtain the person region image.
Referring back to fig. 2, in some embodiments, step 0131, step 0132, step 0133, and step 0134 may be implemented by processor 20.
That is, the processor 20 may be further configured to identify a face region in the scene image, obtain depth information corresponding to the face region from the depth image, determine a depth range of the person region according to the depth information of the face region, and determine a person region connected to the face region and falling within the depth range according to the depth range of the person region to obtain a person region image.
Specifically, a trained deep learning model may be used to identify the face region in the scene image, and the depth information of the face region can then be determined from the correspondence between the scene image and the depth image. Because the face region includes features such as the nose, eyes, ears, and lips, the depth data corresponding to each feature differs in the depth image; for example, when the face directly faces the depth image capturing assembly 12, the depth data corresponding to the nose may be smaller and the depth data corresponding to the ears may be larger in the captured depth image. Therefore, the depth information of the face region may be a single value or a range of values. When it is a single value, the value may be obtained by averaging the depth data of the face region, or by taking the median of that depth data.
Since the person region contains the face region, that is, the person region and the face region lie within a certain depth range, after the processor 20 determines the depth information of the face region, it can set the depth range of the person region according to that depth information, and then extract the person region that falls within this depth range and is connected to the face region, thereby obtaining the person region image.
In this way, the person region image can be extracted from the scene image based on the depth information. Because the depth information is not affected by environmental factors such as illumination or color temperature, the extracted person region image is more accurate.
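A hedged NumPy/OpenCV sketch of this depth-range segmentation is given below; the use of the median face depth, the fixed margin, and the connected-component step are illustrative choices rather than details fixed by the patent.

```python
import cv2
import numpy as np

def extract_person_region(scene, depth, face_box, margin=0.5):
    """Segment the person by depth, as described above.

    `face_box` is (x, y, w, h) from a face detector; `margin` (in the
    same units as the depth map) widens the face depth into a person
    depth range and is an assumed value, not one given by the patent.
    """
    x, y, w, h = face_box
    face_depth = np.median(depth[y:y + h, x:x + w])   # or the mean
    mask = ((depth > face_depth - margin) &
            (depth < face_depth + margin)).astype(np.uint8)

    # Keep only the connected component that contains the face region.
    num, labels = cv2.connectedComponents(mask)
    face_label = labels[y + h // 2, x + w // 2]
    person_mask = (labels == face_label).astype(np.uint8) * 255
    return cv2.bitwise_and(scene, scene, mask=person_mask), person_mask
```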
Referring back to fig. 11, in some embodiments, step 013 of processing the scene image and the depth image to extract the person region of the current user in the scene image to obtain the person region image further includes:
0135: processing the scene image to obtain a full-field edge image of the scene image; and
0136: correcting the person region image according to the full-field edge image of the scene image.
Referring back to fig. 2, in some embodiments, step 0135 and step 0136 may both be implemented by processor 20. That is, the processor 20 may be further configured to process the scene image to obtain a full-field edge image of the scene image, and modify the person region image according to the full-field edge image of the scene image.
The processor 20 first performs edge extraction on the scene image to obtain the full-field edge image of the scene image, where the edge lines in the full-field edge image include the edge lines of the current user and of the background objects in the scene where the current user is located. Specifically, the edge extraction may be performed with the Canny operator. The core of the Canny edge extraction algorithm mainly comprises the following steps: first, the scene image is convolved with a 2D Gaussian filter template to eliminate noise; then, the gradient value of each pixel's gray level is obtained with a differential operator, the gradient direction of each pixel is calculated from the gradient values, and the neighboring pixels of each pixel along its gradient direction are found; then, each pixel is traversed, and if the gradient value of a pixel is not the maximum compared with those of the two adjacent pixels before and after it along the gradient direction, the pixel is not considered an edge point. In this way, the pixels at edge positions in the scene image can be determined, yielding the full-field edge image of the scene image after edge extraction.
After the processor 20 obtains the full-field edge image of the scene image, it corrects the person region image according to that edge image. It can be understood that the person region image is obtained by merging all pixels in the scene image that are connected to the face region and fall within the set depth range; in some scenes, other objects may also be connected to the face region and fall within the depth range. Therefore, to make the extracted person region image more accurate, the full-field edge image of the scene image can be used to correct the person region image.
Further, the processor 20 may perform a second correction on the corrected person region image, for example applying a dilation process to expand the person region image and thereby retain its edge details.
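One possible (assumed, not prescribed) way to realize the edge-based correction and the subsequent dilation with the Canny operator named above is sketched here; the Canny thresholds, kernel size, and iteration count are illustrative.

```python
import cv2
import numpy as np

def refine_person_mask(scene_gray, person_mask, low=50, high=150):
    """Correct a depth-derived person mask (8-bit, 0/255) with a
    full-field edge image and then dilate it slightly to preserve edge
    detail. Thresholds and kernel size are illustrative assumptions."""
    edges = cv2.Canny(scene_gray, low, high)   # full-field edge image

    # Trim the mask where it crosses strong scene edges, so objects
    # that merely share the person's depth range are cut away.
    trimmed = cv2.bitwise_and(person_mask, cv2.bitwise_not(edges))

    kernel = np.ones((3, 3), np.uint8)
    return cv2.dilate(trimmed, kernel, iterations=1)
```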
Referring to fig. 12, in some embodiments, the step 014 of fusing the human figure region image with the predetermined three-dimensional background image to obtain the merged image includes:
01411: acquiring a preset fusion area in a preset three-dimensional background image;
01412: determining a pixel area to be replaced of a preset fusion area according to the person area image; and
01413: replacing the pixel region to be replaced in the predetermined fusion region with the person region image to obtain a merged image.
Referring back to fig. 2, in some embodiments, steps 01411, 01412, and 01413 may all be implemented by processor 20.
That is, the processor 20 may be further configured to obtain a predetermined fusion region in the predetermined three-dimensional background image, determine a pixel region to be replaced of the predetermined fusion region according to the human figure region image, and replace the pixel region to be replaced of the predetermined fusion region with the human figure region image to obtain a merged image.
It can be understood that when the predetermined three-dimensional background image is obtained by modeling an actual scene, the depth data corresponding to each pixel can be obtained directly during modeling; when it is produced by animation, the depth data corresponding to each pixel can be set by the producer. In addition, since every object present in the predetermined three-dimensional background image is also known, the fusion position of the person region image, that is, the predetermined fusion region, can be specified from the depth data and the objects present in the image before the image fusion processing is performed. Because the size of the person region image captured by the visible light camera 11 is affected by the capture distance (the person region image is large when the capture distance is short and small when it is long), the processor 20 needs to determine the pixel region to be replaced in the predetermined fusion region according to the size of the person region image actually captured by the visible light camera 11. The pixel region to be replaced in the predetermined fusion region is then replaced with the person region image to obtain the merged image. In this way, the fusion of the person region image with the predetermined three-dimensional background image is realized.
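Assuming the predetermined three-dimensional background has already been rendered to a 2D frame, the replacement of the pixel region to be replaced reduces to a masked, resized copy; the following sketch is illustrative only, and its function and parameter names are assumptions.

```python
import cv2
import numpy as np

def merge_into_background(background, person_img, person_mask, fusion_box):
    """Paste the person region image into the predetermined fusion region.

    `fusion_box` is (x, y, w, h) of the predetermined fusion region in
    the rendered background; the person image and its mask are resized
    to that region, standing in for determining the pixel area to be
    replaced from the captured person size.
    """
    x, y, w, h = fusion_box
    person = cv2.resize(person_img, (w, h))
    mask = cv2.resize(person_mask, (w, h), interpolation=cv2.INTER_NEAREST)

    merged = background.copy()
    roi = merged[y:y + h, x:x + w]
    roi[mask > 0] = person[mask > 0]   # replace only the person pixels
    return merged
```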
Referring to fig. 13, in some embodiments, the step 014 of fusing the human figure region image with the predetermined three-dimensional background image to obtain the merged image includes:
01421: processing the predetermined three-dimensional background image to obtain a full-field edge image of the predetermined three-dimensional background image;
01422: acquiring depth data of a preset three-dimensional background image;
01423: determining a calculation fusion area of a preset three-dimensional background image according to a full-field edge image and depth data of the preset three-dimensional background image;
01424: determining the pixel region to be replaced in the calculated fusion region according to the person region image; and
01425: replacing the pixel region to be replaced in the calculated fusion region with the person region image to obtain a merged image.
Referring back to FIG. 2, in some embodiments, steps 01421, 01422, 01423, 01424 and 01425 may all be implemented by processor 20.
That is, the processor 20 may be further configured to process the predetermined three-dimensional background image to obtain a full-field edge image of the predetermined three-dimensional background image, obtain depth data of the predetermined three-dimensional background image, determine a calculation fusion region of the predetermined three-dimensional background image according to the full-field edge image and the depth data of the predetermined three-dimensional background image, determine a pixel region to be replaced of the calculation fusion region according to the person region image, and replace the pixel region to be replaced of the calculation fusion region with the person region image to obtain a merged image.
It can be understood that if the fusion position of the person region image is not specified in advance when the predetermined three-dimensional background image is fused with the person region image, the processor 20 first needs to determine the fusion position of the person region image in the predetermined three-dimensional background image. Specifically, the processor 20 first performs edge extraction on the predetermined three-dimensional background image to obtain its full-field edge image and obtains the depth data of the predetermined three-dimensional background image, where the depth data is obtained during the modeling or animation of that image. The processor 20 then determines the calculated fusion region in the predetermined three-dimensional background image according to the full-field edge image and the depth data. Since the size of the person region image is affected by the capture distance of the visible light camera 11, the size of the person region image needs to be calculated, and the pixel region to be replaced in the calculated fusion region is determined according to that size. Finally, the pixel region to be replaced in the calculated fusion region is replaced with the person region image to obtain the merged image. In this way, the fusion of the person region image with the predetermined three-dimensional background image is realized.
The fused merged image may be displayed on the display 50 (shown in fig. 14) of the electronic device 1000.
In some embodiments, the person region image may be a two-dimensional person region image or a three-dimensional person region image. The processor 20 may extract a two-dimensional person region image from the scene image according to the depth information in the depth image, or it may build a three-dimensional image of the person region from that depth information and fill the three-dimensional person region with color according to the color information in the scene image to obtain a three-dimensional colored person region image.
In some embodiments, there may be one or more predetermined fusion regions or calculated fusion regions in the predetermined three-dimensional background image. When there is one predetermined fusion region, the fusion position of the two-dimensional or three-dimensional person region image in the predetermined three-dimensional background image is that unique predetermined fusion region; when there is one calculated fusion region, the fusion position is that unique calculated fusion region. When there are several predetermined fusion regions, the fusion position of the two-dimensional or three-dimensional person region image may be any one of them; further, because the three-dimensional person region image carries depth information, the predetermined fusion region whose depth matches the depth information of the three-dimensional person region image can be chosen as the fusion position to obtain a better fusion effect. Likewise, when there are several calculated fusion regions, the fusion position may be any one of them, and the calculated fusion region matching the depth information of the three-dimensional person region image can be chosen as the fusion position for a better fusion effect.
In some application scenarios, for example when the current user wishes to hide the current background during a video call with another person, the image processing method of the embodiments of the invention may be used to fuse the person region image corresponding to the current user with the predetermined three-dimensional background image, and the fused merged image is then displayed to the other party. Because the current user is in a video call, the visible light camera 11 needs to capture the scene image of the current user in real time, the depth image collecting assembly 12 needs to collect the corresponding depth image in real time, and the processor 20 processes the scene images and depth images collected in real time promptly so that the other party sees a smooth video picture composed of many frames of merged images. After the processor 20 obtains the sound characteristic, it processes the multiple frames of scene images and depth images to obtain multiple frames of person region images, and fuses each of them with the switched predetermined three-dimensional background image to obtain multiple frames of merged images, which together form the video picture.
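Tying the pieces together for the video-call scenario, the per-frame flow could be organized as in the sketch below; every injected callable corresponds to a step discussed earlier, and all of the names are assumptions of this illustration rather than interfaces defined by the patent.

```python
def video_call_loop(read_scene, read_depth, read_sound, classify_sound,
                    segment_person, merge_frame, send_frame, switcher,
                    background):
    """Per-frame processing for the video-call scenario described above.
    All callables are injected so the sketch stays self-contained."""
    while True:
        scene = read_scene()    # visible-light scene image
        depth = read_depth()    # depth image aligned with the scene
        characteristic = classify_sound(read_sound())
        if characteristic is not None:
            # A recognized sound characteristic triggers a background switch.
            background = switcher.on_sound_characteristic(characteristic)
        person_img, person_mask, fusion_box = segment_person(scene, depth)
        send_frame(merge_frame(background, person_img, person_mask,
                               fusion_box))
```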
Referring to fig. 3 and fig. 14, an electronic device 1000 is further provided in an embodiment of the invention. The electronic device 1000 includes the image processing device 100. The image processing apparatus 100 may be implemented using hardware and/or software. The image processing apparatus 100 includes an imaging device 10 and a processor 20.
The imaging device 10 includes a visible light camera 11 and a depth image acquisition assembly 12.
Specifically, the visible light camera 11 includes an image sensor 111 and a lens 112, and the visible light camera 11 can be used to capture color information of a current user to obtain an image of a scene, wherein the image sensor 111 includes a color filter array (e.g., a Bayer filter array), and the number of the lens 112 can be one or more. In the process of acquiring a scene image by the visible light camera 11, each imaging pixel in the image sensor 111 senses light intensity and wavelength information from a shooting scene to generate a group of original image data; the image sensor 111 sends the group of raw image data to the processor 20, and the processor 20 performs operations such as denoising and interpolation on the raw image data to obtain a colorful scene image. Processor 20 may process each image pixel in the raw image data one-by-one in a variety of formats, for example, each image pixel may have a bit depth of 8, 10, 12, or 14 bits, and processor 20 may process each image pixel at the same or different bit depth.
The depth image acquisition assembly 12 includes a structured light projector 121 and a structured light camera 122, and the depth image acquisition assembly 12 is operable to capture depth information of a current user to obtain a depth image. The structured light projector 121 is used to project structured light to the current user, wherein the structured light pattern may be a laser stripe, a gray code, a sinusoidal stripe, or a randomly arranged speckle pattern, etc. The structured light camera 122 includes an image sensor 1221 and lenses 1222, and the number of the lenses 1222 may be one or more. The image sensor 1221 is used to capture a structured light image projected onto a current user by the structured light projector 121. The structured light image may be sent by the depth acquisition component 12 to the processor 20 for demodulation, phase recovery, phase information calculation, and the like to obtain the depth information of the current user.
In some embodiments, the functions of the visible light camera 11 and the structured light camera 122 can be implemented by one camera, that is, the imaging device 10 includes only one camera and one structured light projector 121, and the camera can capture not only the scene image but also the structured light image.
In addition to acquiring a depth image by using structured light, a depth image of a current user can be acquired by a binocular vision method, a Time of Flight (TOF) based depth image acquisition method, and the like.
The processor 20 is further configured to fuse the image of the person region extracted from the image of the scene and the depth image with a predetermined three-dimensional background image. When the human figure region image and the predetermined three-dimensional background image are subjected to fusion processing, a two-dimensional human figure region image and the predetermined three-dimensional background image may be fused to obtain a combined image, or a three-dimensional colorful human figure region image and the predetermined three-dimensional background image may be fused to obtain a combined image.
Further, the image processing apparatus 100 includes a memory 30. The Memory 30 may be embedded in the electronic device 1000, or may be a Memory independent from the electronic device 1000, and may include a Direct Memory Access (DMA) feature. The raw image data collected by the visible light camera 11 or the structured light image related data collected by the depth image collecting assembly 12 can be transmitted to the memory 30 for storage or buffering. Processor 20 may read raw image data from memory 30 for processing to obtain an image of a scene and may read structured light image-related data from memory 30 for processing to obtain a depth image. In addition, the scene image and the depth image may be stored in the memory 30 for the processor 20 to call up processing at any time, for example, the processor 20 calls up the scene image and the depth image to perform the person region extraction, and performs the fusion processing on the extracted person region image and the initial predetermined three-dimensional background image or the switched predetermined three-dimensional background image to obtain the merged image. Wherein the predetermined three-dimensional background image and the merged image may also be stored in the memory 30.
The image processing apparatus 100 may also include a display 50. The display 50 may obtain the merged image directly from the processor 20 or from the memory 30. The display 50 displays the merged image for viewing by the user or for further processing by a graphics processing unit (GPU). The image processing apparatus 100 further includes an encoder/decoder 60, which can encode and decode image data such as the scene image, the depth image, the predetermined three-dimensional background image, and the merged image; the encoded image data can be stored in the memory 30 and decompressed by the decoder before being shown on the display 50. The encoder/decoder 60 may be implemented by a central processing unit (CPU), a GPU, or a coprocessor; in other words, the encoder/decoder 60 may be any one or more of a CPU, a GPU, and a coprocessor.
The image processing apparatus 100 further comprises control logic 40. When the imaging device 10 is imaging, the processor 20 may analyze the data acquired by the imaging device to determine image statistics used for one or more control parameters (e.g., exposure time) of the imaging device 10. The processor 20 sends the image statistics to the control logic 40, and the control logic 40 controls the imaging device 10 to determine the control parameters for imaging. The control logic 40 may include a processor and/or microcontroller that executes one or more routines (e.g., firmware), and the one or more routines may determine the control parameters of the imaging device 10 based on the received image statistics.
The image processing apparatus 100 further includes an acoustoelectric element 70. The acoustoelectric element 70 converts sound into a current output using the principle of electromagnetic induction. When the current user speaks, the vibrating air inside the acoustoelectric element 70 produces a slight displacement between the internal coil and the magnetic core, so that the coil cuts the magnetic flux lines and generates a current. The acoustoelectric element 70 delivers this current to the processor 20, and the processor 20 processes the current to generate the sound information. The sound information may be further processed by the processor 20 to obtain the sound characteristics, and may also be sent to the memory 30 for storage. The acoustoelectric element 70 may be a microphone.
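To illustrate how the processor 20 might turn the digitized sound information into the sound characteristics used later, here is a minimal sketch; the RMS loudness proxy, the FFT-based estimate of the lowest strong frequency, the 50 Hz floor, and all names are assumptions, not the patented signal chain:

```python
import numpy as np

def extract_sound_features(samples: np.ndarray, sample_rate: int = 16000) -> dict:
    """Estimate loudness and pitch from a mono PCM buffer normalized to [-1, 1]."""
    rms = float(np.sqrt(np.mean(samples ** 2)))            # loudness proxy (waveform amplitude)
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    # Take the lowest bin that is strong relative to the peak, ignoring very low frequencies.
    strong = np.where(spectrum > 0.3 * spectrum.max())[0] if spectrum.max() > 0 else np.array([], int)
    strong = strong[freqs[strong] > 50.0]
    pitch_hz = float(freqs[strong[0]]) if strong.size else 0.0
    return {"loudness": rms, "pitch_hz": pitch_hz, "spectrum": spectrum}
```

The spectrum itself could feed a timbre model, for example by comparing its envelope against per-user templates.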
Referring to fig. 15, an electronic device 1000 according to an embodiment of the invention includes one or more processors 20, a memory 30, and one or more programs 31. The one or more programs 31 are stored in the memory 30 and configured to be executed by the one or more processors 20. The programs 31 include instructions for executing the image processing method of any one of the above embodiments.
For example, the program 31 includes image processing method instructions for performing the following steps (an illustrative sketch of these steps appears after the list):
02: acquiring sound information of a current user;
04: processing the sound information to obtain sound characteristics; and
06: the predetermined three-dimensional background image is switched according to the sound characteristics.
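Taken together, steps 02 to 06 amount to mapping the extracted features onto one of several stored backgrounds. A purely illustrative sketch is given below; the loudness-level quantisation, the full-scale value, and the helper extract_sound_features from the earlier sketch are assumptions, not the claimed mapping:

```python
def choose_background(features: dict, backgrounds: list, full_scale_rms: float = 0.5):
    """Pick one predetermined 3D background image per loudness level.

    backgrounds is assumed to be ordered from 'quiet' scenes to 'noisy' scenes;
    full_scale_rms is the RMS amplitude treated as the loudest expected input.
    """
    levels = len(backgrounds)
    norm = min(features["loudness"] / full_scale_rms, 0.999)  # normalise to [0, 1)
    return backgrounds[int(norm * levels)]
```

An analogous lookup keyed on pitch or timbre would implement the other switching criteria described above.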
As another example, the program 31 includes instructions for an image processing method that performs the following steps (a simplified sketch follows the list):
0131: identifying a face region in a scene image;
0132: acquiring depth information corresponding to a face region from a depth image;
0133: determining a depth range of the person region according to the depth information of the face region; and
0134: determining, according to the depth range of the person region, a person region that is connected with the face region and falls within the depth range, to obtain a person region image.
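A simplified sketch of steps 0131 to 0134 follows; the face box input, the depth margin, and the omission of the connectivity (flood-fill) test are simplifying assumptions made only for illustration:

```python
import numpy as np

def extract_person_region(scene_rgb: np.ndarray, depth: np.ndarray,
                          face_box: tuple, margin_m: float = 0.5):
    """Grow a person mask from the face region using the depth image.

    face_box : (x0, y0, x1, y1) rectangle from any face detector    # step 0131
    depth    : HxW depth map in metres, aligned with scene_rgb
    """
    x0, y0, x1, y1 = face_box
    face_depth = float(np.nanmedian(depth[y0:y1, x0:x1]))           # step 0132
    near, far = face_depth - margin_m, face_depth + margin_m        # step 0133
    in_range = (depth >= near) & (depth <= far)                     # step 0134 (without connectivity test)
    person_rgb = np.where(in_range[..., None], scene_rgb, 0)
    return in_range, person_rgb
```

In practice the mask would additionally be restricted to the connected component containing the face region, as step 0134 requires.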
The computer-readable storage medium according to an embodiment of the present invention includes a computer program used in conjunction with the electronic device 1000 capable of capturing images. The computer program may be executed by the processor 20 to perform the image processing method of any one of the above embodiments.
For example, the computer program may be executable by the processor 20 to perform the image processing method described in the following steps:
02: acquiring sound information of a current user;
04: processing the sound information to obtain sound characteristics; and
06: the predetermined three-dimensional background image is switched according to the sound characteristics.
As another example, the computer program may be executable by the processor 20 to perform an image processing method as described in the following steps:
0131: identifying a face region in a scene image;
0132: acquiring depth information corresponding to a face region from a depth image;
0133: determining a depth range of the person region according to the depth information of the face region; and
0134: determining, according to the depth range of the person region, a person region that is connected with the face region and falls within the depth range, to obtain a person region image.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps of the process. Alternate implementations are included within the scope of the preferred embodiments of the present invention, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art to which the present invention pertains.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; variations, modifications, substitutions, and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (18)

1. An image processing method for processing a merged image during a video conference or a video chat of a user, wherein the merged image is formed by fusing a preset three-dimensional background image with a person region image in a scene image of a current user in a real scene, the image processing method comprising:
acquiring sound information of the current user and environmental sound;
processing the sound information and the environmental sound to obtain sound characteristics, wherein the sound characteristics comprise one or more of loudness, tone, and timbre; the loudness is associated with the real environment where the current user is located and is used for representing whether the current real environment is quiet or noisy, the loudness is judged according to the amplitude of the waveform of the collected sound information and is divided into a plurality of levels according to the volume, and the volume of each level corresponds to a preset three-dimensional background image; the tone is obtained according to the lowest frequency of the sound information and is the frequency of vocal cord vibration, wherein the higher the vocal cord vibration frequency, the crisper and sharper the sound emitted by the current user, and the lower the vocal cord vibration frequency, the deeper and huskier the sound, and the tone judged according to the frequency corresponds to the preset three-dimensional background image; the timbre is analyzed and detected according to the frequency spectrum characteristics of the sound information, each user has a different timbre, and different timbres correspond to different predetermined three-dimensional background images; and
switching the preset three-dimensional background image according to the sound characteristics, wherein the preset three-dimensional background image comprises elements corresponding to the sound characteristics;
the preset three-dimensional background image is a preset three-dimensional background image obtained by modeling an actual scene or a preset three-dimensional background image obtained by animation.
2. The image processing method according to claim 1, characterized in that the image processing method further comprises:
acquiring a scene image of the current user;
acquiring a depth image of the current user;
processing the scene image and the depth image to extract a person region of the current user in the scene image to obtain a person region image; and
fusing the person region image with the preset three-dimensional background image to obtain a merged image.
3. The image processing method according to claim 1, wherein the sound characteristics comprise a plurality of sound characteristics, the predetermined three-dimensional background images comprise a plurality of predetermined three-dimensional background images, each of the sound characteristics corresponds to one of the predetermined three-dimensional background images, and the step of switching the predetermined three-dimensional background image according to the sound characteristic includes:
switching to a predetermined three-dimensional background image corresponding to the sound characteristic according to the sound characteristic.
4. The image processing method according to claim 1, wherein the predetermined three-dimensional background images comprise a plurality of predetermined three-dimensional background images stored in a predetermined order, and the step of switching the predetermined three-dimensional background image according to the sound characteristic includes:
switching among the plurality of predetermined three-dimensional background images in a predetermined manner according to the sound characteristics.
5. The image processing method according to claim 2, wherein the step of obtaining the depth image of the current user comprises:
projecting structured light towards the current user;
shooting a structured light image modulated by the current user; and
demodulating phase information corresponding to each pixel of the structured light image to obtain the depth image.
6. The image processing method according to claim 5, wherein the step of demodulating phase information corresponding to each pixel of the structured light image to obtain the depth image comprises:
demodulating phase information corresponding to each pixel in the structured light image;
converting the phase information into depth information; and
generating the depth image according to the depth information.
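Purely as an illustration of the kind of computation recited in claim 6, the sketch below converts an unwrapped phase map to depth by fringe-projection triangulation; the calibration constants, the linear phase-to-disparity relation, and the function name are assumptions rather than the claimed demodulation:

```python
import numpy as np

def phase_to_depth(phase: np.ndarray, baseline_m: float,
                   focal_px: float, pattern_period_px: float) -> np.ndarray:
    """Convert an unwrapped phase map into a depth image by triangulation.

    The phase shift observed at each pixel is taken to be proportional to the
    lateral displacement (disparity) of the projected fringe pattern.
    """
    disparity_px = phase * pattern_period_px / (2.0 * np.pi)
    with np.errstate(divide="ignore", invalid="ignore"):
        depth = baseline_m * focal_px / disparity_px
    depth[~np.isfinite(depth) | (depth < 0)] = 0.0  # zero out invalid pixels
    return depth
```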
7. The image processing method according to claim 2, wherein the step of fusing the person region image with the predetermined three-dimensional background image to obtain a merged image includes:
acquiring a preset fusion area in the preset three-dimensional background image;
determining a pixel area to be replaced in the preset fusion area according to the person region image; and
replacing the pixel area to be replaced in the preset fusion area with the person region image to obtain the merged image.
8. The image processing method according to claim 2, wherein the step of fusing the person region image with the predetermined three-dimensional background image to obtain a merged image includes:
processing the predetermined three-dimensional background image to obtain a full-field edge image of the predetermined three-dimensional background image;
acquiring depth data of the predetermined three-dimensional background image;
determining a calculated fusion area of the predetermined three-dimensional background image according to the full-field edge image of the predetermined three-dimensional background image and the depth data;
determining a pixel area to be replaced in the calculated fusion area according to the person region image; and
replacing the pixel area to be replaced in the calculated fusion area with the person region image to obtain the merged image.
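One way to read claim 8 is sketched below, strictly as an illustration: a Canny edge detector stands in for whatever edge operator produces the full-field edge image, and the fixed depth threshold used to pick the fusion area is an assumption:

```python
import cv2
import numpy as np

def compute_fusion_area(background_rgb: np.ndarray,
                        background_depth: np.ndarray,
                        max_depth_m: float = 3.0) -> np.ndarray:
    """Combine a full-field edge image with depth data to select the region of
    the predetermined 3D background into which the person region is merged."""
    gray = cv2.cvtColor(background_rgb, cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(gray, 50, 150)                   # full-field edge image
    near_enough = background_depth < max_depth_m       # keep nearby (foreground) space
    fusion_area = near_enough & (edges == 0)           # avoid placing the person across edges
    return fusion_area
```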
9. An image processing apparatus for processing a merged image during a video conference or a video chat performed by a user, the merged image being formed by fusing a predetermined three-dimensional background image with a person region image in a scene image of a current user in a real scene, the image processing apparatus comprising:
an acoustoelectric element, configured to acquire sound information of the current user and environmental sound; and
a processor, configured to:
processing the sound information and the environmental sound to obtain sound characteristics, wherein the sound characteristics comprise one or more of loudness, tone, and timbre; the loudness is associated with the real environment where the current user is located and is used for representing whether the current real environment is quiet or noisy, the loudness is judged according to the amplitude of the waveform of the collected sound information and is divided into a plurality of levels according to the volume, and the volume of each level corresponds to a preset three-dimensional background image; the tone is obtained according to the lowest frequency of the sound information and is the frequency of vocal cord vibration, wherein the higher the vocal cord vibration frequency, the crisper and sharper the sound emitted by the current user, and the lower the vocal cord vibration frequency, the deeper and huskier the sound, and the tone judged according to the frequency corresponds to the preset three-dimensional background image; the timbre is analyzed and detected according to the frequency spectrum characteristics of the sound information, each user has a different timbre, and different timbres correspond to different predetermined three-dimensional background images; and
switching the preset three-dimensional background image according to the sound characteristics, wherein the preset three-dimensional background image comprises elements corresponding to the sound characteristics;
the preset three-dimensional background image is a preset three-dimensional background image obtained by modeling an actual scene or a preset three-dimensional background image obtained by animation.
10. The image processing apparatus according to claim 9, characterized by further comprising:
a visible light camera, configured to acquire a scene image of the current user; and
a depth image acquisition assembly, configured to acquire a depth image of the current user;
wherein the processor is further configured to:
processing the scene image and the depth image to extract a person region of the current user in the scene image to obtain a person region image; and
fusing the person region image with the preset three-dimensional background image to obtain a merged image.
11. The image processing apparatus according to claim 9, wherein the sound characteristics comprise a plurality of sound characteristics, the predetermined three-dimensional background images comprise a plurality of predetermined three-dimensional background images, each of the sound characteristics corresponds to one of the predetermined three-dimensional background images, and the processor is further configured to:
switching to a predetermined three-dimensional background image corresponding to the sound characteristic according to the sound characteristic.
12. The image processing apparatus according to claim 9, wherein the predetermined three-dimensional background images comprise a plurality of predetermined three-dimensional background images stored in a predetermined order, and the processor is further configured to:
switching among the plurality of predetermined three-dimensional background images in a predetermined manner according to the sound characteristic.
13. The image processing apparatus of claim 10, wherein the depth image acquisition assembly comprises a structured light projector and a structured light camera, the structured light projector for projecting structured light to the current user;
the structured light camera is configured to:
shooting a structured light image modulated by the current user; and
demodulating phase information corresponding to each pixel of the structured light image to obtain the depth image.
14. The image processing apparatus of claim 13, wherein the structured light camera is further configured to:
demodulating phase information corresponding to each pixel in the structured light image;
converting the phase information into depth information; and
generating the depth image according to the depth information.
15. The image processing apparatus of claim 10, wherein the processor is further configured to:
acquiring a preset fusion area in the preset three-dimensional background image;
determining a pixel area to be replaced in the preset fusion area according to the person region image; and
replacing the pixel area to be replaced in the preset fusion area with the person region image to obtain the merged image.
16. The image processing apparatus of claim 10, wherein the processor is further configured to:
processing the predetermined three-dimensional background image to obtain a full-field edge image of the predetermined three-dimensional background image;
acquiring depth data of the predetermined three-dimensional background image;
determining a calculated fusion area of the predetermined three-dimensional background image according to the full-field edge image of the predetermined three-dimensional background image and the depth data;
determining a pixel area to be replaced in the calculated fusion area according to the person region image; and
replacing the pixel area to be replaced in the calculated fusion area with the person region image to obtain the merged image.
17. An electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the image processing method of any of claims 1 to 8.
18. A computer-readable storage medium comprising a computer program for use in conjunction with an electronic device capable of capturing images, the computer program being executable by a processor to perform the image processing method of any one of claims 1 to 8.
CN201710812799.8A 2017-09-11 2017-09-11 Image processing method and apparatus, electronic apparatus, and computer-readable storage medium Active CN107707837B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710812799.8A CN107707837B (en) 2017-09-11 2017-09-11 Image processing method and apparatus, electronic apparatus, and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN107707837A CN107707837A (en) 2018-02-16
CN107707837B true CN107707837B (en) 2021-06-29

Family

ID=61171386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710812799.8A Active CN107707837B (en) 2017-09-11 2017-09-11 Image processing method and apparatus, electronic apparatus, and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN107707837B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101369439A (en) * 2008-08-08 2009-02-18 深圳华为通信技术有限公司 Method and device for switching and selecting digital photo frame photograph
CN105681900A (en) * 2015-12-31 2016-06-15 宇龙计算机通信科技(深圳)有限公司 Information processing method and mobile terminal
CN106909911A (en) * 2017-03-09 2017-06-30 广东欧珀移动通信有限公司 Image processing method, image processing apparatus and electronic installation
CN106954034A (en) * 2017-03-28 2017-07-14 宇龙计算机通信科技(深圳)有限公司 A kind of image processing method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100516638B1 (en) * 2001-09-26 2005-09-22 엘지전자 주식회사 Video telecommunication system
CN102663810B (en) * 2012-03-09 2014-07-16 北京航空航天大学 Full-automatic modeling approach of three dimensional faces based on phase deviation scanning
CN104980719A (en) * 2014-04-03 2015-10-14 索尼公司 Image processing method, image processing apparatus and electronic equipment
CN104270501B (en) * 2014-09-10 2016-08-24 广东欧珀移动通信有限公司 The head portrait setting method of a kind of contact person in address list and relevant apparatus
CN105430317B (en) * 2015-10-23 2018-10-26 东莞酷派软件技术有限公司 A kind of video background setting method and terminal device

Also Published As

Publication number Publication date
CN107707837A (en) 2018-02-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Changan town in Guangdong province Dongguan 523860 usha Beach Road No. 18

Applicant after: OPPO Guangdong Mobile Communications Co., Ltd.

Address before: Changan town in Guangdong province Dongguan 523860 usha Beach Road No. 18

Applicant before: Guangdong OPPO Mobile Communications Co., Ltd.

GR01 Patent grant