WO2020118503A1 - Method and apparatus for determining an image focus area - Google Patents

Method and apparatus for determining an image focus area

Info

Publication number
WO2020118503A1
WO2020118503A1 (PCT/CN2018/120200, CN2018120200W)
Authority
WO
WIPO (PCT)
Prior art keywords
image
stream
audio
video image
video
Prior art date
Application number
PCT/CN2018/120200
Other languages
English (en)
Chinese (zh)
Inventor
陈亮
孙凤宇
兰传骏
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2018/120200 priority Critical patent/WO2020118503A1/fr
Priority to CN201880088065.2A priority patent/CN111656275B/zh
Publication of WO2020118503A1 publication Critical patent/WO2020118503A1/fr

Links

Images

Classifications

    • GPHYSICS
    • G03PHOTOGRAPHY; CINEMATOGRAPHY; ANALOGOUS TECHNIQUES USING WAVES OTHER THAN OPTICAL WAVES; ELECTROGRAPHY; HOLOGRAPHY
    • G03BAPPARATUS OR ARRANGEMENTS FOR TAKING PHOTOGRAPHS OR FOR PROJECTING OR VIEWING THEM; APPARATUS OR ARRANGEMENTS EMPLOYING ANALOGOUS TECHNIQUES USING WAVES OTHER THAN OPTICAL WAVES; ACCESSORIES THEREFOR
    • G03B13/00Viewfinders; Focusing aids for cameras; Means for focusing for cameras; Autofocus systems for cameras
    • G03B13/32Means for focusing
    • G03B13/34Power focusing
    • G03B13/36Autofocus systems

Definitions

  • Embodiments of the present application relate to the technical field of image processing, and more specifically, to a method and device for determining an image focus area.
  • In a related image focusing technique, the terminal's camera collects an image and the terminal's touch screen displays the image collected by the camera; the user then taps the image area of interest on the touch screen with a finger; next, the terminal focuses on the image area tapped by the user's finger; after focusing on the tapped image area succeeds, the user can use the terminal to take videos or photos.
  • Although the terminal can achieve image focusing in this way, if the position of the image area of interest on the touch screen changes, the user needs to tap the image area of interest on the touch screen again so that the terminal refocuses on the newly tapped image area.
  • In other words, the above-mentioned related image focusing technique passively waits for the focus instruction input by the user and cannot actively identify the image area of interest to the user.
  • The industry also provides auxiliary focusing methods such as the ranging method and the multi-microphone-array method.
  • In the ranging method, the terminal needs to actively emit infrared light waves or ultrasonic waves, which increases the device cost of the terminal's focusing system.
  • In the multi-microphone-array method, the terminal requires a larger number of microphone units to achieve better performance, which also increases the device cost of the terminal's focusing system.
  • Embodiments of the present application provide a method and apparatus for determining an image focus area, so that, without increasing device cost, the image focus area most likely to emit sound can be determined in an image.
  • In a first aspect, an embodiment of the present application provides a method for determining an image focus area. The method includes: acquiring an original video file, where the original video file includes a video image stream and an audio stream, the video image stream includes multiple image frames and is generated from image data collected by an image collection device, and the audio stream includes multiple sound frames and is generated from sound data collected by a sound collection device;
  • generating a scene analysis heat map based on the video image stream and the audio stream, where the scene analysis heat map is used to indicate the probability that sound exists on each image unit among multiple image units on a target image frame of the video image stream; and
  • determining, according to the scene analysis heat map, at least one image focus area on the target image frame of the video image stream that satisfies a preset condition, where each image area in the at least one image focus area includes at least one image unit.
  • the method provided by the embodiments of the present application may be applied to a terminal having an image collection device and a sound collection device.
  • the terminal can actively obtain the original video file through the image acquisition device and the sound acquisition device, and generate a scene analysis heat map according to the video image stream and audio stream.
  • At least one image focus area that satisfies the preset condition is then determined on the target image frame of the video image stream according to the scene analysis heat map, so the embodiment of the present application can determine the area on the target image frame of the video image stream that is most likely to emit sound.
  • Moreover, no additional devices need to be added to the terminal, thereby reducing device cost.
  • the target image frame is the last frame of multiple image frames in the video image stream.
  • Optionally, the preset condition is that the probability that sound exists on one or more image units in the at least one image focus area reaches a preset probability threshold.
  • Optionally, the method further includes: controlling the image acquisition device to focus on the at least one image focus area.
  • Optionally, generating the scene analysis heat map from the video image stream and the audio stream includes: splitting the original video file into the video image stream and the audio stream; and processing the video image stream and the audio stream by using a neural network model to obtain the scene analysis heat map.
  • Optionally, before the scene analysis heat map is generated from the video image stream and the audio stream, the method further includes: converting the audio stream from a time-domain form to a frequency-domain form.
  • Optionally, processing the video image stream and the audio stream by using the neural network model to obtain the scene analysis heat map includes: processing the audio stream by using an audio network of the neural network model to obtain a three-dimensional audio matrix, where the three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream, and the feature information of the audio stream;
  • processing the video image stream by using an image network of the neural network model to obtain a four-dimensional image matrix, where the four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream (image length), the width of the video image stream (image width), and the feature information of the video image stream; and
  • fusing the three-dimensional audio matrix and the four-dimensional image matrix by using a fusion network of the neural network model to obtain the scene analysis heat map.
  • The audio network of the neural network model is used to analyze the sound information and obtain a sound analysis conclusion, and the image network of the neural network model is used to analyze the image information and obtain an image analysis conclusion.
  • The fusion network of the neural network model is used to analyze and fuse the sound analysis conclusion obtained by the audio network and the image analysis conclusion obtained by the image network, and finally determine the image focus area in the target image frame of the video image stream that is most likely to emit sound; that image focus area is indicated by the scene analysis heat map. Therefore, using the scene analysis heat map generated by the neural network model to identify the image focus area most likely to emit sound in the target image frame provides higher accuracy.
  • In a second aspect, an embodiment of the present application provides an apparatus, including: an acquisition module, configured to acquire an original video file, where the original video file includes a video image stream and an audio stream, the video image stream includes multiple image frames and is generated from image data collected by an image acquisition device, and the audio stream includes multiple sound frames and is generated from sound data collected by a sound collection device; a generation module, configured to generate a scene analysis heat map based on the video image stream and the audio stream, where the scene analysis heat map is used to indicate the probability that sound exists on each of the multiple image units on a target image frame of the video image stream; and a determination module, configured to determine, according to the scene analysis heat map, at least one image focus area on the target image frame of the video image stream that meets a preset condition, where each image area in the at least one image focus area includes at least one image unit.
  • the target image frame is the last frame of multiple image frames in the video image stream.
  • Optionally, the preset condition is that the probability that sound exists on one or more image units in the at least one image focus area reaches a preset probability threshold.
  • the device further includes: a focusing module, configured to control the image acquisition device to focus on at least one image focusing area.
  • Optionally, the generation module includes: a splitting module, configured to split the original video file into the video image stream and the audio stream; and a processing module, configured to process the video image stream and the audio stream by using a neural network model to obtain the scene analysis heat map.
  • the device further includes: a conversion module, configured to convert the audio stream from the time domain form to the frequency domain form.
  • Optionally, the processing module includes: an audio processing module, configured to process the audio stream by using an audio network of the neural network model to obtain a three-dimensional audio matrix, where the three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream, and the feature information of the audio stream; an image processing module, configured to process the video image stream by using an image network of the neural network model to obtain a four-dimensional image matrix, where the four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream (image length), the width of the video image stream (image width), and the feature information of the video image stream; and a fusion processing module, configured to fuse the three-dimensional audio matrix and the four-dimensional image matrix by using a fusion network of the neural network model to obtain the scene analysis heat map.
  • In another aspect, an embodiment of the present application provides an apparatus, including one or more processors, configured to perform the method in the foregoing first aspect or in any possible implementation of the first aspect.
  • Optionally, the apparatus includes a memory for storing software instructions that drive the processors.
  • In another aspect, an embodiment of the present application provides a computer-readable storage medium having instructions stored therein which, when executed on a computer or processor, cause the computer or processor to perform the method in the foregoing first aspect or in any possible implementation of the first aspect.
  • In another aspect, embodiments of the present application provide a computer program product containing instructions which, when run on a computer or processor, cause the computer or processor to perform the method in the foregoing first aspect or in any possible implementation of the first aspect.
  • FIG. 1 is a schematic diagram, provided by an embodiment of the present application, showing the camera software of a smartphone before a focus area has been determined;
  • FIG. 2 is a schematic diagram, provided by an embodiment of the present application, showing that the camera software of a smartphone has determined a focus area;
  • FIG. 3 is a flowchart of a method for determining an image focus area provided by an embodiment of the present application
  • FIG. 4 is a schematic diagram of a video image stream and an audio stream of an original video file provided by an embodiment of this application;
  • FIG. 5 is a schematic diagram of a target image frame of a video image stream provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a scene analysis heat map provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of determining an image focus area on a target image frame provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a video image stream and an audio stream of another original video file provided by an embodiment of this application;
  • FIG. 9 is a schematic diagram of a video image stream and an audio stream of still another original video file provided by an embodiment of the present application.
  • FIG. 10 is a flowchart of another method for determining an image focus area provided by an embodiment of this application.
  • FIG. 11 is a schematic diagram of how to generate a scene analysis heat map provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of an apparatus provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of another device provided by an embodiment of the present application.
  • An embodiment of the present application provides a method for determining an image focus area.
  • the method provided in the embodiment of the present application may be applied to terminals of different types.
  • the method provided in the embodiments of the present application may be applied to terminals such as smart phones, tablet computers, digital cameras, or smart cameras.
  • the technical solution of the present application can also be applied to electronic devices that do not have a communication function. The following uses the application of the technical solution on the terminal as an example for introduction.
  • FIG. 1 is a schematic diagram, provided by an embodiment of the present application, of the camera software of a smartphone in which a focus area has not yet been determined.
  • FIG. 2 is a schematic diagram, provided by an embodiment of the present application, of the camera software of the smartphone in which the focus area has been determined.
  • FIGS. 1 and 2 are used to enable the reader to quickly understand the technical principles of the embodiments of the present application, and are not used to limit the protection scope of the embodiments of the present application.
  • the specific parameter values mentioned in the embodiments shown in FIGS. 1 and 2 can be changed according to the principles of the embodiments of the present application, and the protection scope of the embodiments of the present application is not limited to the specific parameter values already mentioned.
  • the user wants to use the smartphone 1 to take pictures or videos of birds in the zoo.
  • First, the user opens the camera software of smartphone 1 and points the camera of smartphone 1 at the bird; then, smartphone 1 calls the camera to collect an image of the bird and displays the image of the bird on the touch screen of smartphone 1.
  • the smartphone 1 will automatically call the camera and the microphone to collect the original video file for 2 seconds.
  • The smartphone 1 of this embodiment includes a camera function and can be regarded as a camera device when performing this function; the camera device includes the camera and an image signal processor (ISP).
  • The smartphone 1 will then obtain the 2-second original video file and use the neural network model pre-stored in the smartphone 1 to generate a scene analysis heat map based on the video image stream and the audio stream of the 2-second original video file.
  • The scene analysis heat map is used to indicate the image area in the video image stream that is most likely to emit sound.
  • Specifically, the neural network model of smartphone 1 will calculate, based on the 2-second original video file, that the small bird's beak in the image is the image area most likely to emit sound, so the generated scene analysis heat map will indicate that the small bird's beak in area A in FIG. 2 is the image area most likely to emit sound.
  • The smartphone 1 will then focus on the small bird's beak in area A in FIG. 2, thereby realizing automatic focusing on the sounding area in the image.
  • the smartphone 1 can take a picture or record a video.
  • the smartphone 1 can repeat the method for determining the image focus area provided by the embodiments of the present application continuously, so as to continuously generate a scene analysis heat map to indicate the image area in the image most likely to emit sound in real time.
  • FIG. 3 is a flowchart of a method for determining an image focus area provided by an embodiment of the present application.
  • the method shown in FIG. 3 can determine the focus area of the image most likely to emit sound in the image without increasing the cost of the device.
  • the method includes the following steps.
  • Step S11 Obtain the original video file.
  • After the user opens the camera software, the terminal will automatically call the camera and the microphone to collect an original video file for a predetermined time period, and will obtain the original video file of that predetermined time period.
  • the predetermined time period is 2 seconds.
  • The predetermined time period can be preset according to factors such as the terminal's hardware configuration. When the terminal's hardware configuration is high, the predetermined time period can be shortened appropriately, for example, set to 1 second, 0.5 second, or even shorter; when the terminal's hardware configuration is low, the predetermined time period can be extended appropriately, for example, set to 3 seconds, 4 seconds, or even longer.
  • the embodiments of the present application do not limit the specific length of the predetermined time period, and the specific values provided above are only used to explain the principle of adjusting the predetermined time period.
  • the original video file includes a video image stream and an audio stream.
  • the video image stream includes multiple image frames and is generated by image data collected by the image collection device
  • the audio stream includes multiple sound frames and is generated by sound data collected by the sound collection device.
  • the image acquisition device may be the camera of the terminal described in the previous embodiment, or may optionally include the ISP.
  • The sound collection device may be a microphone of the terminal, or may optionally include a voice processing channel or circuit that processes the microphone's signals.
  • FIG. 4 is a schematic diagram of a video image stream and an audio stream of an original video file provided by an embodiment of the present application.
  • the horizontal axis represents the time axis, and the time point T1 is earlier than the time point T2, assuming that the time length of the original video file is T1 to T2.
  • The original video file includes a video image stream and an audio stream, where the video image stream has 10 image frames (image frame P1 to image frame P10) and the audio stream has 30 sound frames (not shown in the figure); the video image stream and the audio stream are used to generate the scene analysis heat map.
  • Step S12 Generate a scene analysis heat map according to the video image stream and the audio stream.
  • the scene analysis heat map is used to indicate the probability of the presence of sound on each of the multiple image units on the target image frame of the video image stream, and the image unit may be a pixel.
  • the probability that there is sound on one image unit corresponds to the probability that the corresponding object on the image unit emits sound.
  • the object can be a person, animal, musical instrument, equipment, or other object.
  • When the terminal generates a scene analysis heat map from the video image stream and the audio stream, the terminal combines the video image stream and the audio stream to calculate the probability that sound exists on each image unit on the target image frame of the video image stream, and generates the scene analysis heat map according to the probability of sound on each image unit on the target image frame.
  • the scene analysis heat map is a frame of image, and the resolution of the scene analysis heat map is the same as the resolution of the target image frame of the video image stream.
  • the target image frame of the video image stream may be the last frame of multiple image frames in the video image stream. For example, referring to FIG. 4, the video image stream in FIG. 4 has 10 image frames, and the target image frame is the last image frame P10 in the video image stream.
  • FIG. 5 is a schematic diagram of a target image frame of a video image stream provided by an embodiment of the present application, and FIG. 6 is a schematic diagram of a scene analysis heat map provided by an embodiment of the present application.
  • For ease of example, in FIGS. 5 and 6 it is assumed that the resolution of the target image frame of the video image stream and the resolution of the scene analysis heat map are both 5 pixels × 3 pixels. In an actual scene, the resolution of the target image frame of the video image stream and the resolution of the scene analysis heat map are the resolution preset by the user; for example, assuming that the user's preset resolution is 1920 pixels × 1080 pixels, the resolution of the target image frame of the video image stream and the resolution of the scene analysis heat map are both 1920 pixels × 1080 pixels.
  • In this example, the terminal can combine the video image stream and the audio stream to calculate the probability of sound at each of the 15 pixels on the target image frame of the video image stream.
  • In FIG. 5, each circle represents 1 pixel, that is, there are 15 pixels in FIG. 5, and each pixel in FIG. 5 has its original color; for example, assume that the original colors of the 15 pixels are all red.
  • After the terminal performs the calculation by combining the video image stream and the audio stream, it is known that the probability of sound on the pixels (11, 12, 21, 22, 31, 32) is greater than or equal to 50%, and the probability of sound on the pixels (13, 14, 15, 23, 24, 25, 33, 34, 35) is less than 50%. Then, the terminal generates the scene analysis heat map based on the probability of sound at each of the 15 pixels on the target image frame.
  • In order to distinguish the differences in the probability of sound between the pixels, based on the target image frame the terminal changes the color of the pixels whose probability of sound is greater than or equal to 50% (11, 12, 21, 22, 31, 32) to black, and changes the color of the pixels whose probability of sound is less than 50% (13, 14, 15, 23, 24, 25, 33, 34, 35) to white, so that the target image frame shown in FIG. 5 is converted into the scene analysis heat map shown in FIG. 6.
  • the scene analysis heat map shown in FIG. 6 is only an example.
  • the color corresponding to each pixel in the scene analysis heat map can be set according to the actual situation.
  • the color distinguishing each pixel is not limited to the use of white and black.
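  • As an illustration of the conversion described above (not part of the patent; the probability values below are made up for the 5 × 3 example), the following sketch thresholds a per-pixel sound-probability map at 50% and renders qualifying pixels black and the rest white:

    import numpy as np

    # Illustrative per-pixel probabilities of sound for the 5 x 3 target image frame
    # (rows 1-3, columns 1-5); the first two columns correspond to pixels 11, 12, 21,
    # 22, 31, 32 in the example above.
    prob = np.array([[0.8, 0.7, 0.2, 0.1, 0.1],
                     [0.9, 0.6, 0.3, 0.2, 0.1],
                     [0.7, 0.6, 0.2, 0.1, 0.1]])

    # Pixels whose probability reaches 50% become black (0), the rest white (255),
    # which yields the two-colour scene analysis heat map of FIG. 6.
    heat_map = np.where(prob >= 0.5, 0, 255).astype(np.uint8)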
  • Optionally, before step S12, that is, before the scene analysis heat map is generated from the video image stream and the audio stream, the audio stream can be converted from the time-domain form to the frequency-domain form, and then the scene analysis heat map is generated based on the video image stream and the audio stream converted into the frequency-domain form.
  • For example, before the scene analysis heat map is generated from the video image stream and the audio stream, the audio stream is first Fourier-transformed to obtain a short-time Fourier spectrum, and then the scene analysis heat map is generated from the video image stream and the short-time Fourier spectrum.
  • Of course, the scene analysis heat map can also be generated directly from the audio stream in the time domain without performing the Fourier transform.
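  • A minimal sketch of this optional time-domain to frequency-domain conversion is shown below; it assumes the audio stream has already been saved as a mono WAV file, and the window and overlap sizes are illustrative choices rather than values taken from the patent:

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import stft

    rate, samples = wavfile.read("audio.wav")           # audio stream in time-domain form
    # Short-time Fourier transform: each column of the result describes the frequency
    # content of one short window of the audio stream.
    freqs, times, zxx = stft(samples.astype(np.float32), fs=rate,
                             nperseg=512, noverlap=256)
    spectrogram = np.abs(zxx)                            # frequency-domain form (freq x time)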
  • Step S13 Determine at least one image focus area on the target image frame of the video image stream that meets the preset condition according to the scene analysis heat map.
  • Each image area in the at least one image focus area includes at least one image unit.
  • The preset condition is that the probability that sound exists on one or more image units in the at least one image focus area reaches the preset probability threshold.
  • The preset probability threshold is a probability threshold that is set in advance.
  • If the probability that sound exists on each image unit in at least one image unit of an area reaches the preset probability threshold, it may be determined that the area is an image focus area.
  • the terminal may control the image acquisition device to focus on at least one image focusing area.
  • FIG. 7 is a schematic diagram of determining an image focus area on a target image frame provided by an embodiment of the present application. Assuming that the preset probability threshold is 50%, the terminal determines, according to the scene analysis heat map, the adjacent pixels on the target image frame of the video image stream that reach the preset probability threshold of 50% (11, 12, 21, 22, 31, 32); these pixels (11, 12, 21, 22, 31, 32) constitute area B in FIG. 7, and area B is an image focus area. Then, the terminal can control the image acquisition device to focus on the image focus area (that is, area B) in FIG. 7.
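  • The following sketch illustrates step S13 under the 50% threshold of the example above; grouping adjacent qualifying pixels with a connected-component labelling step is an illustrative choice, not a requirement of the patent:

    import numpy as np
    from scipy import ndimage

    def select_focus_areas(prob_map, threshold=0.5):
        # Keep the image units whose probability of sound reaches the preset threshold.
        mask = prob_map >= threshold
        # Group adjacent qualifying pixels into candidate image focus areas.
        labels, count = ndimage.label(mask)
        areas = []
        for region in ndimage.find_objects(labels):
            rows, cols = region
            areas.append((rows.start, rows.stop, cols.start, cols.stop))
        return areas  # bounding boxes, e.g. a single box covering area B in FIG. 7

    prob = np.array([[0.8, 0.7, 0.2, 0.1, 0.1],
                     [0.9, 0.6, 0.3, 0.2, 0.1],
                     [0.7, 0.6, 0.2, 0.1, 0.1]])
    print(select_focus_areas(prob))   # [(0, 3, 0, 2)] -> rows 1-3, columns 1-2 (area B)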
  • the method provided by the embodiment of the present application may be applied to a terminal having an image collection device and a sound collection device.
  • the terminal can actively obtain the original video file through the image acquisition device and the sound acquisition device, and generate a scene analysis heat map according to the video image stream and audio stream.
  • At least one image focus area that satisfies the preset condition is then determined on the target image frame of the video image stream according to the scene analysis heat map, so the embodiment of the present application can determine the area on the target image frame of the video image stream that is most likely to emit sound.
  • In addition, no additional devices need to be added to the terminal.
  • The image acquisition device and the sound acquisition device can be devices the terminal already has, such as the camera and microphone of a smartphone, thereby reducing device cost.
  • The foregoing introduces a method for generating a scene analysis heat map from a single original video file and using the scene analysis heat map to determine the image focus area of one target image frame. Based on the above principles, the following describes how to use multiple original video files to generate multiple scene analysis heat maps, and how to use the multiple scene analysis heat maps to determine the image focus areas of multiple target image frames.
  • FIG. 8 is a schematic diagram of a video image stream and an audio stream of another original video file provided by an embodiment of the present application.
  • the horizontal axis represents the time axis
  • T1 to T4 respectively represent four time points, where T1 is earlier than T2, T2 is earlier than T3, and T3 is earlier than T4.
  • In a first manner, the terminal automatically collects the original video file A between T1 and T3, where the original video file A includes the video image stream A and the audio stream A.
  • When the time reaches T3, the terminal acquires the original video file A and generates a scene analysis heat map A based on the video image stream A (image frame P1 to image frame P10) and the audio stream A of the original video file A.
  • The terminal then determines, according to the scene analysis heat map A, at least one image focus area on the image frame P10 of the video image stream A that meets the preset condition; that is, the terminal determines the image focus area on the image frame P10 at time T3.
  • In order to continuously determine the image focus area of the image frame currently displayed by the terminal, when the time reaches T4, the terminal obtains the original video file B automatically collected between T2 and T4, generates a scene analysis heat map B based on the video image stream B (image frame P2 to image frame P11) and the audio stream B of the original video file B, and then determines, according to the scene analysis heat map B, at least one image focus area on the image frame P11 of the video image stream B that meets the preset condition; that is, the terminal determines the image focus area on the image frame P11 at time T4.
  • In this way, the terminal may determine a corresponding scene analysis heat map for each image frame, and use the scene analysis heat map corresponding to each image frame to determine, for that image frame, at least one image focus area satisfying the preset condition, so that the image focus area can be determined more accurately for each image frame.
  • FIG. 9 is a schematic diagram of a video image stream and an audio stream of another original video file provided by an embodiment of the present application.
  • the horizontal axis represents the time axis
  • T1 to T3 respectively represent three time points, where T1 is earlier than T2 and T2 is earlier than T3.
  • In a second manner, the terminal automatically collects the original video file A between T1 and T2, where the original video file A includes the video image stream A (image frame P1 to image frame P10) and the audio stream A.
  • When the time reaches T2, the terminal obtains the original video file A and generates a scene analysis heat map A according to the video image stream A and the audio stream A of the original video file A, and then determines, according to the scene analysis heat map A, at least one image focus area on the image frame P10 of the video image stream A that satisfies the preset condition; that is, the terminal determines the image focus area on the image frame P10 at time T2. Then, from T2 to T3, at least one image focus area satisfying the preset condition is determined for each of the image frames P11 to P18 according to the same scene analysis heat map A.
  • In order to continuously determine the image focus area of the image frame currently displayed by the terminal, the terminal automatically collects the original video file B between T2 and T3, where the original video file B includes the video image stream B (image frame P10 to image frame P19) and the audio stream B.
  • When the time reaches T3, the terminal obtains the original video file B and generates a scene analysis heat map B based on the video image stream B and the audio stream B of the original video file B, and then determines, according to the scene analysis heat map B, at least one image focus area on the image frame P19 of the video image stream B that satisfies the preset condition; that is, the terminal determines the image focus area on the image frame P19 at time T3.
  • In this second manner, the terminal generates a scene analysis heat map at intervals so that multiple image frames can reuse one scene analysis heat map, thereby reducing the number of scene-analysis-heat-map calculations per unit time and occupying fewer processing resources of the terminal.
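  • The two scheduling strategies can be summarized with the small sketch below (an illustration, not part of the patent): with a stride of one frame a new scene analysis heat map is generated for every displayed frame as in FIG. 8, while with a stride close to the window length one heat map is generated per window and reused by the frames in between as in FIG. 9:

    def heat_map_frame_indices(num_frames, window=10, stride=1):
        # Returns the 0-based indices of the frames at which a scene analysis heat map
        # is generated; a heat map always needs a full window of history first.
        return list(range(window - 1, num_frames, stride))

    per_frame = heat_map_frame_indices(20, window=10, stride=1)  # FIG. 8: frames P10, P11, ...
    reused = heat_map_frame_indices(20, window=10, stride=9)     # FIG. 9: frames P10 and P19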
  • FIG. 10 is a flowchart of another method for determining an image focus area provided by an embodiment of the present application.
  • The method shown in FIG. 10 is a specific implementation of step S12 in FIG. 3, that is, of "generating a scene analysis heat map based on the video image stream and the audio stream", and includes the following steps.
  • Step S21 Split the original video file into a video image stream and an audio stream.
  • The terminal can use multimedia video processing tools, such as software tools, to split the original video file into a video image stream and an audio stream.
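  • As one possible implementation of step S21 (the patent does not name a specific tool), the sketch below uses the external ffmpeg program to split an original video file into an image-frame stream and an audio stream; the frame rate, sample rate, and file names are illustrative:

    import subprocess

    def split_original_video(path, fps=5, sample_rate=16000):
        # Extract the video image stream as numbered PNG frames.
        subprocess.run(["ffmpeg", "-y", "-i", path, "-vf", f"fps={fps}",
                        "frame_%04d.png"], check=True)
        # Extract the audio stream as a mono WAV file for later processing.
        subprocess.run(["ffmpeg", "-y", "-i", path, "-vn", "-ac", "1",
                        "-ar", str(sample_rate), "audio.wav"], check=True)

    # Example usage (assumes ffmpeg is installed and the input file exists).
    split_original_video("original_video.mp4")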
  • Step S22 Acquire a pre-stored neural network model.
  • the neural network model can be generated by a server with strong computing power.
  • the neural network model generated by the server includes an audio network, an image network, and a fusion network.
  • the terminal can download the neural network model generated on the server to the local memory in advance.
  • the terminal can obtain the pre-stored neural network model in the local memory.
  • the server will input a large number of video samples into the neural network model, so that the neural network model learns from the large number of video samples.
  • the neural network model can identify the image focus area in the target image frame of the video image stream that is most likely to emit sound.
  • Step S23 Use the audio network of the neural network model to process the audio stream to obtain a three-dimensional audio matrix.
  • the three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream and the characteristic information of the audio stream.
  • For example, the terminal uses the audio network of the neural network model to process the audio stream to obtain a three-dimensional audio matrix, where the duration of the audio stream is 2 seconds, the frequency distribution of the audio stream is 0-8 kHz, and the feature information of the audio stream is, for example, one or more stereo channels.
  • Step S24 Use the image network of the neural network model to process the video image stream to obtain a four-dimensional image matrix.
  • The four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream (image length), the width of the video image stream (image width), and the feature information of the video image stream.
  • For example, the terminal uses the image network of the neural network model to process the video image stream to obtain a four-dimensional image matrix, where the duration of the video image stream is 2 seconds, the length of the video image stream is 1920 pixels, the width of the video image stream is 1080 pixels, and the feature information of the video image stream is, for example, the three RGB channels or other color-gamut channels.
  • the fusion network of the neural network model is used to fuse the three-dimensional audio matrix and the four-dimensional image matrix to obtain a scene analysis heat map.
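  • The patent does not disclose the layer structure of the audio network, image network, or fusion network; the PyTorch sketch below is only a minimal illustration of the data shapes involved (an audio feature matrix over time, frequency, and features; an image feature matrix over time, height, width, and features; and a fused per-pixel heat map), with all layer sizes chosen arbitrarily:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AudioNet(nn.Module):
        # Turns a spectrogram (batch, 1, time, frequency) into an audio feature map,
        # i.e. per sample a matrix over time x frequency x features.
        def __init__(self, feat=32):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, feat, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU())

        def forward(self, spec):
            return self.conv(spec)                         # (B, feat, T', F')

    class VideoNet(nn.Module):
        # Turns a clip (batch, 3, time, height, width) into an image feature map,
        # i.e. per sample a matrix over time x height x width x features.
        def __init__(self, feat=32):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv3d(3, feat, 3, stride=(1, 2, 2), padding=1), nn.ReLU(),
                nn.Conv3d(feat, feat, 3, stride=(1, 2, 2), padding=1), nn.ReLU())

        def forward(self, clip):
            return self.conv(clip)                         # (B, feat, T', H', W')

    class FusionNet(nn.Module):
        # Pools both feature maps over time, broadcasts the audio descriptor over the
        # image grid and predicts a per-pixel probability that sound is present there.
        def __init__(self, feat=32):
            super().__init__()
            self.head = nn.Sequential(
                nn.Conv2d(2 * feat, feat, 3, padding=1), nn.ReLU(),
                nn.Conv2d(feat, 1, 1))

        def forward(self, audio_feat, video_feat, out_size):
            a = audio_feat.mean(dim=(2, 3))                # (B, feat) audio descriptor
            v = video_feat.mean(dim=2)                     # (B, feat, H', W') time pooling
            a = a[:, :, None, None].expand(-1, -1, v.shape[2], v.shape[3])
            heat = self.head(torch.cat([a, v], dim=1))     # (B, 1, H', W') logits
            heat = F.interpolate(heat, size=out_size, mode="bilinear", align_corners=False)
            return torch.sigmoid(heat)                     # scene analysis heat map

    # Toy usage for a 2-second clip: 10 frames (downscaled here) plus one spectrogram.
    audio_net, video_net, fusion_net = AudioNet(), VideoNet(), FusionNet()
    spec = torch.randn(1, 1, 200, 64)                      # (batch, channel, time, frequency)
    clip = torch.randn(1, 3, 10, 270, 480)                 # (batch, RGB, frames, height, width)
    heat_map = fusion_net(audio_net(spec), video_net(clip), out_size=(1080, 1920))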
  • When the terminal uses the neural network model to process the video image stream and the audio stream to obtain the scene analysis heat map, the process resembles the way a human processes sound information and image information.
  • the audio network of the neural network model is similar to the human ear.
  • the audio network of the neural network model is used to analyze the sound information and obtain the sound analysis conclusion.
  • the image network of the neural network model is similar to the human eye.
  • the image network of the neural network model is used to analyze image information and obtain image analysis conclusions.
  • the fusion network of the neural network model is similar to the human brain.
  • The fusion network of the neural network model is used to analyze and fuse the sound analysis conclusions obtained by the audio network and the image analysis conclusions obtained by the image network, and finally determine the image focus area in the target image frame of the video image stream that is most likely to emit sound; this image focus area is indicated by the scene analysis heat map. Therefore, using the scene analysis heat map generated by the neural network model to identify the image focus area most likely to emit sound in the target image frame achieves higher accuracy.
  • FIG. 11 is a schematic diagram of how to generate a scene analysis heat map provided by an embodiment of the present application.
  • T1 and T2 represent two time points, T1 is earlier than T2.
  • the terminal automatically collects the original video file 10 between T1 and T2, where the original video file 10 includes a video image stream 101 and an audio stream 102.
  • When the time reaches T2, the terminal acquires the original video file 10 and splits it into the video image stream 101 (image frame P1 to image frame P10) and the audio stream 102; the terminal then inputs the video image stream 101 into the image network 201 of the neural network model 20 and inputs the audio stream 102 into the audio network 202 of the neural network model 20.
  • the image network 201 and the audio network 202 of the neural network model 20 in the terminal respectively receive the video image stream 101 and the audio stream 102
  • the image network 201 of the neural network model 20 processes the video image stream 101 to obtain a four-dimensional image matrix.
  • the audio network 202 of the neural network model 20 processes the audio stream 102 to obtain a three-dimensional audio matrix
  • the image network 201 and the audio network 202 respectively send the obtained four-dimensional image matrix and three-dimensional audio matrix to the fusion network 203 of the neural network model 20.
  • the fusion network 203 of the neural network model 20 will fuse the three-dimensional audio matrix and the four-dimensional image matrix to obtain a scene analysis heat map 30.
  • FIG. 12 is a schematic diagram of a device provided by an embodiment of the present application.
  • the device may be located in the terminal described above, and is used in combination with an image collection device such as a camera and a sound collection device such as a microphone.
  • the device includes the following modules.
  • the obtaining module 11 is used to obtain an original video file.
  • the original video file includes a video image stream and an audio stream.
  • the video image stream includes multiple image frames and is generated by image data collected by an image acquisition device.
  • The audio stream includes multiple sound frames and is generated from the sound data collected by the sound collection device.
  • the generating module 12 is configured to generate a scene analysis heat map according to the video image stream and the audio stream, and the scene analysis heat map is used to indicate the probability of the presence of sound on each image unit in the multiple image units on the target image frame of the video image stream.
  • the determining module 13 is configured to determine at least one image focus area on the target image frame of the video image stream that satisfies the preset condition according to the scene analysis heat map, and each image area in the at least one image focus area includes at least one image unit.
  • the target image frame is the last frame of multiple image frames in the video image stream.
  • For details, refer to step S12 in the method embodiment shown in FIG. 3 above.
  • Optionally, the preset condition is that the probability that sound exists on one or more image units in the at least one image focus area reaches a preset probability threshold.
  • the device shown in FIG. 12 may further include: a focusing module 14 for controlling the image acquisition device to focus on at least one image focusing area.
  • Optionally, the generating module 12 may further include: a splitting module, which is used to split the original video file into a video image stream and an audio stream; and a processing module, which is used to process the video image stream and the audio stream by using a neural network model to obtain a scene analysis heat map.
  • the apparatus shown in FIG. 12 may further include: a conversion module 15 for converting the audio stream from the time domain form to the frequency domain form.
  • Optionally, the processing module may further include: an audio processing module, configured to process the audio stream by using an audio network of the neural network model to obtain a three-dimensional audio matrix, where the three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream, and the feature information of the audio stream; an image processing module, configured to process the video image stream by using an image network of the neural network model to obtain a four-dimensional image matrix, where the four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream (image length), the width of the video image stream (image width), and the feature information of the video image stream; and a fusion processing module, configured to fuse the three-dimensional audio matrix and the four-dimensional image matrix by using a fusion network of the neural network model to obtain the scene analysis heat map.
  • any one or more of the modules in FIG. 12 above may be implemented by software, hardware, or a combination of software and hardware.
  • the software includes software program instructions and is executed by one or more processors.
  • The hardware may include digital logic circuits, programmable logic gate arrays, processors, dedicated circuits, or algorithm circuits. The above circuits may be located in one or more chips.
  • FIG. 13 is a schematic diagram of yet another device provided by an embodiment of the present application.
  • the device 2 includes an image acquisition device 21, an audio acquisition device 22, a central processor 23, an image processor 24, a memory (random access memory, RAM) 25, a non-volatile memory (NVM) memory 26, and a bus 27.
  • the bus 27 is used to communicate with other components.
  • the device 2 shown in FIG. 13 is equivalent to the circuit board, chip or chipset in the smart phone 1 in FIGS. 1 and 2, and can selectively run various types of software, such as application software, driver software, or operating system software.
  • the central processor 23 is used to control one or more other components, and the image processor 24 is used to perform the method of this embodiment.
  • The image processor 24 may include one or more processors to perform the processing flow of the method in the previous embodiments.
  • S11 in FIG. 3 may be executed by an ISP or a dedicated processor.
  • At least part of the flow of S12 in FIG. 3 and FIG. 10 may be executed by at least one of a neural processing unit (NPU), a data signal processor, or a central processor 23.
  • S13 in FIG. 3 may be executed by the ISP or the central processor 23.
  • the NPU is a device built with the neural network model, and is dedicated to neural network operations.
  • the central processor 23 may also run artificial intelligence software to perform corresponding operations using neural network models.
  • Each processor mentioned above can execute the necessary software to work.
  • Some processors, such as the ISP, may also be pure hardware devices.
  • For further details of the device 2 in FIG. 13, refer to the detailed description of the smartphone 1 in the embodiments corresponding to FIGS. 1 and 2, and to the detailed description of the terminal in the embodiments corresponding to FIGS. 3 to 11.
  • Computer-readable media includes computer storage media and communication media, where communication media includes any medium that facilitates transfer of a computer program from one place to another.
  • A storage medium may be any available medium that can be accessed by a computer. By way of example and not limitation, computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • In addition, any connection can properly become a computer-readable medium.
  • For example, if the software is transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of the medium.
  • As used herein, disks and discs include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), floppy disks, and Blu-ray discs, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.

Abstract

A method and apparatus for determining an image focus area are provided. The method includes: acquiring an original video file, the original video file including a video image stream and an audio stream (S11); generating a scene analysis heat map according to the video image stream and the audio stream (S12), the scene analysis heat map being used to indicate the probability that sound exists on each image unit among the multiple image units on a target image frame of the video image stream; and determining, according to the scene analysis heat map, at least one image focus area satisfying a preset condition on the target image frame of the video image stream (S13), each image area in the at least one image focus area including at least one image unit. The method can determine the area on a target image frame of a video image stream at which sound is most likely to be produced; moreover, apart from an image collection apparatus and a sound collection apparatus, no additional device needs to be added to the terminal, thereby reducing device cost.
PCT/CN2018/120200 2018-12-11 2018-12-11 Method and apparatus for determining an image focus area WO2020118503A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2018/120200 WO2020118503A1 (fr) 2018-12-11 2018-12-11 Procédé et appareil pour déterminer une région de mise au point d'image
CN201880088065.2A CN111656275B (zh) 2018-12-11 2018-12-11 一种确定图像对焦区域的方法及装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/120200 WO2020118503A1 (fr) 2018-12-11 2018-12-11 Procédé et appareil pour déterminer une région de mise au point d'image

Publications (1)

Publication Number Publication Date
WO2020118503A1 true WO2020118503A1 (fr) 2020-06-18

Family

ID=71076197

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/120200 WO2020118503A1 (fr) 2018-12-11 2018-12-11 Procédé et appareil pour déterminer une région de mise au point d'image

Country Status (2)

Country Link
CN (1) CN111656275B (fr)
WO (1) WO2020118503A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113852756A (zh) * 2021-09-03 2021-12-28 维沃移动通信(杭州)有限公司 图像获取方法、装置、设备和存储介质
US11463656B1 (en) * 2021-07-06 2022-10-04 Dell Products, Lp System and method for received video performance optimizations during a video conference session

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112857560B (zh) * 2021-02-06 2022-07-22 河海大学 一种基于声音频率的声学成像方法
CN113255685B (zh) * 2021-07-13 2021-10-01 腾讯科技(深圳)有限公司 一种图像处理方法、装置、计算机设备以及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101068308A (zh) * 2007-05-10 2007-11-07 华为技术有限公司 一种控制图像采集装置进行目标定位的系统及方法
CN103905810A (zh) * 2014-03-17 2014-07-02 北京智谷睿拓技术服务有限公司 多媒体处理方法及多媒体处理装置
CN103957359A (zh) * 2014-05-15 2014-07-30 深圳市中兴移动通信有限公司 摄像装置及其对焦方法
CN104036789A (zh) * 2014-01-03 2014-09-10 北京智谷睿拓技术服务有限公司 多媒体处理方法及多媒体装置
CN104378635A (zh) * 2014-10-28 2015-02-25 西交利物浦大学 基于麦克风阵列辅助的视频感兴趣区域的编码方法
CN108073875A (zh) * 2016-11-14 2018-05-25 广东技术师范学院 一种基于单目摄像头的带噪音语音识别系统及方法

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009031951A (ja) * 2007-07-25 2009-02-12 Sony Corp 情報処理装置、および情報処理方法、並びにコンピュータ・プログラム
JP4735991B2 (ja) * 2008-03-18 2011-07-27 ソニー株式会社 画像処理装置および方法、プログラム並びに記録媒体
US20110311144A1 (en) * 2010-06-17 2011-12-22 Microsoft Corporation Rgb/depth camera for improving speech recognition
US10218954B2 (en) * 2013-08-15 2019-02-26 Cellular South, Inc. Video to data
JP6761230B2 (ja) * 2015-08-21 2020-09-23 キヤノン株式会社 画像処理装置、その制御方法、プログラム及び撮像装置
CN108876672A (zh) * 2018-06-06 2018-11-23 合肥思博特软件开发有限公司 一种远程教育教师自动识别图像优化跟踪方法及系统


Also Published As

Publication number Publication date
CN111656275B (zh) 2021-07-20
CN111656275A (zh) 2020-09-11

Similar Documents

Publication Publication Date Title
US10951833B2 (en) Method and device for switching between cameras, and terminal
WO2020118503A1 (fr) Procédé et appareil pour déterminer une région de mise au point d'image
TW202105244A (zh) 圖像處理方法及裝置、電子設備和電腦可讀儲存介質
JP2019092147A (ja) 情報交換方法、装置、オーディオ端末及びコンピュータ可読記憶媒体
CN105430247A (zh) 一种利用摄像设备拍摄照片的方法与设备
US11336826B2 (en) Method and apparatus having a function of constant automatic focusing when exposure changes
US20180260941A1 (en) Preserving color in image brightness adjustment for exposure fusion
WO2021190625A1 (fr) Procédé et dispositif de capture d'image
US20150319402A1 (en) Providing video recording support in a co-operative group
US11348254B2 (en) Visual search method, computer device, and storage medium
US11405226B1 (en) Methods and apparatus for assessing network presence
CN112463391B (zh) 内存控制方法、内存控制装置、存储介质与电子设备
CN111447360A (zh) 应用程序控制方法及装置、存储介质、电子设备
CN111784567B (zh) 用于转换图像的方法、装置、电子设备和计算机可读介质
CN110347597B (zh) 图片服务器的接口测试方法、装置、存储介质与移动终端
CN113747086A (zh) 数字人视频生成方法、装置、电子设备及存储介质
TWI792444B (zh) 攝像頭的控制方法、裝置、介質和電子設備
US10783616B2 (en) Method and apparatus for sharing and downloading light field image
WO2021149238A1 (fr) Dispositif de traitement d'informations, procédé de traitement d'informations, et programme de traitement d'informations
US11798561B2 (en) Method, apparatus, and non-transitory computer readable medium for processing audio of virtual meeting room
US11631252B1 (en) Visual media management for mobile devices
US11601381B2 (en) Methods and apparatus for establishing network presence
US11451695B2 (en) System and method to configure an image capturing device with a wireless network
CN114298931A (zh) 图像处理方法、装置、电子设备及存储介质
CN111815656A (zh) 视频处理方法、装置、电子设备和计算机可读介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18942954

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18942954

Country of ref document: EP

Kind code of ref document: A1