WO2020118503A1 - Method and apparatus for determining image focusing region - Google Patents

Method and apparatus for determining image focusing region

Info

Publication number: WO2020118503A1
Authority: WIPO (PCT)
Prior art keywords: image, stream, audio, video image, video
Application number: PCT/CN2018/120200
Other languages: French (fr), Chinese (zh)
Inventors: 陈亮, 孙凤宇, 兰传骏
Original assignee: 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Related application: CN201880088065.2A, published as CN111656275B

Classifications

    • G PHYSICS
    • G03 PHOTOGRAPHY; CINEMATOGRAPHY; ANALOGOUS TECHNIQUES USING WAVES OTHER THAN OPTICAL WAVES; ELECTROGRAPHY; HOLOGRAPHY
    • G03B APPARATUS OR ARRANGEMENTS FOR TAKING PHOTOGRAPHS OR FOR PROJECTING OR VIEWING THEM; APPARATUS OR ARRANGEMENTS EMPLOYING ANALOGOUS TECHNIQUES USING WAVES OTHER THAN OPTICAL WAVES; ACCESSORIES THEREFOR
    • G03B13/00 Viewfinders; Focusing aids for cameras; Means for focusing for cameras; Autofocus systems for cameras
    • G03B13/32 Means for focusing
    • G03B13/34 Power focusing
    • G03B13/36 Autofocus systems

Definitions

  • Embodiments of the present application relate to the technical field of image processing, and more specifically, to a method and device for determining an image focus area.
  • At present, in related image focusing technology, after the camera software is opened on a terminal, the terminal's camera collects an image and the terminal's touch screen displays it; the user then taps the image area of interest on the touch screen, and the camera focuses on the tapped image area; after focusing succeeds, the user can use the terminal to take videos or photos.
  • Although the terminal can achieve image focusing in this way, if the position of the image area of interest on the touch screen changes, the user needs to tap the image area of interest again so that the camera refocuses on the newly tapped area.
  • Such focusing technology passively waits for a focus instruction from the user and cannot actively identify the image area the user is interested in.
  • The industry also offers auxiliary focusing methods such as the ranging method and the multi-microphone array method.
  • For the ranging method, the terminal needs to actively emit infrared light or ultrasonic waves, which increases the device cost of the terminal's focusing system.
  • For the multi-microphone array method, the terminal requires a large number of microphone units to achieve good performance, which likewise increases the device cost of the terminal's focusing system.
  • Embodiments of the present application provide a method and apparatus for determining an image focus area that can determine, without increasing device cost, the image focus area in an image most likely to be emitting sound.
  • In a first aspect, an embodiment of the present application provides a method for determining an image focus area. The method includes: acquiring an original video file, where the original video file includes a video image stream and an audio stream, the video image stream includes multiple image frames and is generated from image data collected by an image acquisition device, and the audio stream includes multiple sound frames and is generated from sound data collected by a sound collection device; generating a scene analysis heat map based on the video image stream and the audio stream, where the scene analysis heat map is used to indicate the probability that sound exists on each of multiple image units on a target image frame of the video image stream; and determining, according to the scene analysis heat map, at least one image focus area on the target image frame that satisfies a preset condition, where each image area in the at least one image focus area includes at least one image unit.
  • In the first aspect, the method provided by the embodiments of the present application may be applied to a terminal having an image acquisition device and a sound collection device.
  • The terminal can actively obtain the original video file through the image acquisition device and the sound collection device, generate a scene analysis heat map from the video image stream and the audio stream, and finally determine, according to the heat map, at least one image focus area on the target image frame of the video image stream that satisfies the preset condition; the embodiment of the present application can therefore determine the area on the target image frame most likely to be emitting sound.
  • Moreover, no devices beyond the image acquisition device and the sound collection device need to be added to the terminal, thereby reducing device cost.
  • In a possible implementation, the target image frame is the last frame of the multiple image frames in the video image stream.
  • In a possible implementation, the preset condition is that, for each image focus area, the probability that sound exists on each of one or more image units in the area reaches a preset probability threshold.
  • In a possible implementation, after the at least one image focus area satisfying the preset condition is determined on the target image frame according to the scene analysis heat map, the method further includes: controlling the image acquisition device to focus on the at least one image focus area.
  • In a possible implementation, generating the scene analysis heat map from the video image stream and the audio stream includes: splitting the original video file into the video image stream and the audio stream; and processing the video image stream and the audio stream with a neural network model to obtain the scene analysis heat map.
  • In a possible implementation, before the scene analysis heat map is generated from the video image stream and the audio stream, the method further includes: converting the audio stream from time-domain form to frequency-domain form.
  • In a possible implementation, processing the video image stream and the audio stream with the neural network model to obtain the scene analysis heat map includes: processing the audio stream with the audio network of the neural network model to obtain a three-dimensional audio matrix, where the three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream, and the feature information of the audio stream; processing the video image stream with the image network of the neural network model to obtain a four-dimensional image matrix, where the four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream, the width of the video image stream, and the feature information of the video image stream; and fusing the three-dimensional audio matrix and the four-dimensional image matrix with the fusion network of the neural network model to obtain the scene analysis heat map.
  • In the process in which the terminal uses the neural network model to process the video image stream and the audio stream to obtain the scene analysis heat map, the audio network of the neural network model analyzes the sound information and produces a sound analysis conclusion, and the image network of the neural network model analyzes the image information and produces an image analysis conclusion.
  • The fusion network of the neural network model analyzes and fuses the sound analysis conclusion obtained by the audio network and the image analysis conclusion obtained by the image network, finally determining the image focus area in the target image frame of the video image stream that is most likely to be emitting sound and presenting it in the scene analysis heat map; the scene analysis heat map generated by the neural network model therefore identifies the most likely sound-emitting image focus area in the target image frame with higher accuracy.
  • In a second aspect, an embodiment of the present application provides an apparatus, including: an acquisition module, configured to acquire an original video file, where the original video file includes a video image stream and an audio stream, the video image stream includes multiple image frames and is generated from image data collected by an image acquisition device, and the audio stream includes multiple sound frames and is generated from sound data collected by a sound collection device; a generation module, configured to generate a scene analysis heat map based on the video image stream and the audio stream, where the scene analysis heat map is used to indicate the probability that sound exists on each of multiple image units on a target image frame of the video image stream; and a determination module, configured to determine, according to the scene analysis heat map, at least one image focus area on the target image frame that satisfies a preset condition, where each image area in the at least one image focus area includes at least one image unit.
  • In a possible implementation, the target image frame is the last frame of the multiple image frames in the video image stream.
  • In a possible implementation, the preset condition is that, for each image focus area, the probability that sound exists on each of one or more image units in the area reaches a preset probability threshold.
  • In a possible implementation, the apparatus further includes: a focusing module, configured to control the image acquisition device to focus on the at least one image focus area.
  • In a possible implementation, the generation module includes: a splitting module, configured to split the original video file into a video image stream and an audio stream; and a processing module, configured to process the video image stream and the audio stream with a neural network model to obtain a scene analysis heat map.
  • In a possible implementation, the apparatus further includes: a conversion module, configured to convert the audio stream from time-domain form to frequency-domain form.
  • In a possible implementation, the processing module includes: an audio processing module, configured to process the audio stream with the audio network of the neural network model to obtain a three-dimensional audio matrix, where the three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream, and the feature information of the audio stream; an image processing module, configured to process the video image stream with the image network of the neural network model to obtain a four-dimensional image matrix, where the four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream, the width of the video image stream, and the feature information of the video image stream; and a fusion processing module, configured to fuse the three-dimensional audio matrix and the four-dimensional image matrix with the fusion network of the neural network model to obtain the scene analysis heat map.
  • In a third aspect, an embodiment of the present application provides an apparatus, including one or more processors, configured to perform the method in the foregoing first aspect or any possible implementation of the first aspect.
  • Optionally, the apparatus includes a memory for storing the software instructions that drive the processor.
  • In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having instructions stored therein that, when executed on a computer or processor, cause the computer or processor to perform the method in the foregoing first aspect or any possible implementation of the first aspect.
  • In a fifth aspect, embodiments of the present application provide a computer program product containing instructions that, when run on a computer or processor, cause the computer or processor to perform the method in the foregoing first aspect or any possible implementation of the first aspect.
  • FIG. 1 is a schematic diagram of the camera software of a smartphone before a focus area has been determined, according to an embodiment of the present application;
  • FIG. 2 is a schematic diagram of the camera software of a smartphone after a focus area has been determined, according to an embodiment of the present application;
  • FIG. 3 is a flowchart of a method for determining an image focus area provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of a video image stream and an audio stream of an original video file provided by an embodiment of the present application;
  • FIG. 5 is a schematic diagram of a target image frame of a video image stream provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of a scene analysis heat map provided by an embodiment of the present application;
  • FIG. 7 is a schematic diagram of determining an image focus area on a target image frame provided by an embodiment of the present application;
  • FIG. 8 is a schematic diagram of a video image stream and an audio stream of another original video file provided by an embodiment of the present application;
  • FIG. 9 is a schematic diagram of a video image stream and an audio stream of still another original video file provided by an embodiment of the present application;
  • FIG. 10 is a flowchart of another method for determining an image focus area provided by an embodiment of the present application;
  • FIG. 11 is a schematic diagram of how to generate a scene analysis heat map provided by an embodiment of the present application;
  • FIG. 12 is a schematic diagram of an apparatus provided by an embodiment of the present application;
  • FIG. 13 is a schematic diagram of another apparatus provided by an embodiment of the present application.
  • An embodiment of the present application provides a method for determining an image focus area.
  • the method provided in the embodiment of the present application may be applied to terminals of different types.
  • the method provided in the embodiments of the present application may be applied to terminals such as smart phones, tablet computers, digital cameras, or smart cameras.
  • the technical solution of the present application can also be applied to electronic devices that do not have a communication function. The following uses the application of the technical solution on the terminal as an example for introduction.
  • FIG. 1 is a schematic diagram, provided by an embodiment of the present application, of the camera software of a smartphone before the focus area has been determined, and FIG. 2 is a schematic diagram of the camera software of the smartphone after the focus area has been determined.
  • FIGS. 1 and 2 are used to enable the reader to quickly understand the technical principles of the embodiments of the present application, and are not used to limit the protection scope of the embodiments of the present application.
  • the specific parameter values mentioned in the embodiments shown in FIGS. 1 and 2 can be changed according to the principles of the embodiments of the present application, and the protection scope of the embodiments of the present application is not limited to the specific parameter values already mentioned.
  • Assume that the user wants to use the smartphone 1 to take pictures or videos of a bird at the zoo.
  • First, the user opens the camera software of smartphone 1 and points the camera of smartphone 1 at the bird; then, smartphone 1 calls the camera to collect an image of the bird and displays the image of the bird on the touch screen of smartphone 1.
  • the smartphone 1 will automatically call the camera and the microphone to collect the original video file for 2 seconds.
  • The smartphone 1 of this embodiment includes a camera function; when performing this function it can be regarded as a camera, and its image acquisition device includes the camera and an image signal processor (ISP).
  • Next, the smartphone 1 obtains the 2-second original video file and uses the neural network model pre-stored in the smartphone 1 to generate a scene analysis heat map based on the video image stream and the audio stream of the 2-second original video file.
  • The scene analysis heat map is used to indicate the image area in the video image stream that is most likely to be emitting sound.
  • Based on the 2-second original video file, the neural network model of smartphone 1 calculates that the bird's beak in the image is the image area most likely to be emitting sound, so the generated scene analysis heat map indicates the bird's beak in area A in FIG. 2 as that area.
  • Then, the smartphone 1 focuses on the bird's beak in area A in FIG. 2, thereby realizing automatic focusing on the sound-emitting area in the image.
  • the smartphone 1 can take a picture or record a video.
  • During shooting, the smartphone 1 can repeat the method for determining the image focus area provided by the embodiments of the present application, continuously generating scene analysis heat maps to indicate in real time the image area most likely to be emitting sound.
  • FIG. 3 is a flowchart of a method for determining an image focus area provided by an embodiment of the present application.
  • the method shown in FIG. 3 can determine the focus area of the image most likely to emit sound in the image without increasing the cost of the device.
  • the method includes the following steps.
  • Step S11: Obtain the original video file.
  • After the user opens the camera software of the terminal, the terminal automatically calls the camera and the microphone to collect an original video file of a predetermined duration, and obtains that original video file.
  • For example, the predetermined duration is 2 seconds.
  • The predetermined duration can be preset according to factors such as the terminal's hardware configuration. When the hardware configuration is high, the predetermined duration can be shortened appropriately, for example to 1 second, 0.5 second, or even less; when the hardware configuration is low, it can be extended appropriately, for example to 3 seconds, 4 seconds, or even more.
  • the embodiments of the present application do not limit the specific length of the predetermined time period, and the specific values provided above are only used to explain the principle of adjusting the predetermined time period.
  • the original video file includes a video image stream and an audio stream.
  • the video image stream includes multiple image frames and is generated by image data collected by the image collection device
  • the audio stream includes multiple sound frames and is generated by sound data collected by the sound collection device.
  • the image acquisition device may be the camera of the terminal described in the previous embodiment, or may optionally include the ISP.
  • The sound collection device may be the microphone of the terminal, and may optionally include a voice processing channel or circuit that processes the microphone's signals.
  • FIG. 4 is a schematic diagram of a video image stream and an audio stream of an original video file provided by an embodiment of the present application.
  • In FIG. 4, the horizontal axis represents the time axis, time point T1 is earlier than time point T2, and the duration of the original video file is assumed to span T1 to T2.
  • The original video file includes a video image stream and an audio stream: the video image stream has 10 image frames (image frame P1 to image frame P10), the audio stream has 30 sound frames (not shown in the figure), and the video image stream and the audio stream are used together to generate the scene analysis heat map, as sketched below.
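  • The following is a minimal sketch of the data layout described above; the container name and field names are hypothetical, and the frame counts mirror the FIG. 4 example.

```python
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class OriginalVideoFile:
    """Hypothetical container mirroring FIG. 4: one video image stream and one
    audio stream collected over the same predetermined interval (T1 to T2)."""
    image_frames: List[np.ndarray]  # e.g. 10 frames P1..P10, each H x W x 3
    sound_frames: List[np.ndarray]  # e.g. 30 sound frames (not shown in FIG. 4)
    duration_s: float               # predetermined collection duration, e.g. 2.0
```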
  • Step S12: Generate a scene analysis heat map according to the video image stream and the audio stream.
  • the scene analysis heat map is used to indicate the probability of the presence of sound on each of the multiple image units on the target image frame of the video image stream, and the image unit may be a pixel.
  • the probability that there is sound on one image unit corresponds to the probability that the corresponding object on the image unit emits sound.
  • the object can be a person, animal, musical instrument, equipment, or other object.
  • When the terminal generates a scene analysis heat map from the video image stream and the audio stream, the terminal combines the two to calculate the probability that sound exists on each image unit on the target image frame of the video image stream, and generates the scene analysis heat map from those probabilities.
  • the scene analysis heat map is a frame of image, and the resolution of the scene analysis heat map is the same as the resolution of the target image frame of the video image stream.
  • the target image frame of the video image stream may be the last frame of multiple image frames in the video image stream. For example, referring to FIG. 4, the video image stream in FIG. 4 has 10 image frames, and the target image frame is the last image frame P10 in the video image stream.
  • FIG. 5 is a schematic diagram of a target image frame of a video image stream provided by an embodiment of the present application, and FIG. 6 is a schematic diagram of a scene analysis heat map provided by an embodiment of the present application.
  • For ease of illustration, FIGS. 5 and 6 assume that the resolution of the target image frame of the video image stream and the resolution of the scene analysis heat map are both 5 pixels × 3 pixels. In an actual scene, both resolutions equal the resolution preset by the user; for example, if the user's preset resolution is 1920 pixels × 1080 pixels, then the resolution of the target image frame and the resolution of the scene analysis heat map are both 1920 pixels × 1080 pixels.
  • After acquiring the original video file, the terminal can combine the video image stream and the audio stream to calculate the probability of sound at each of the 15 pixels on the target image frame of the video image stream.
  • In FIG. 5, each circle represents one pixel, so there are 15 pixels, and each pixel in FIG. 5 has its original color; for example, assume the original colors of the 15 pixels are all red.
  • Suppose the terminal's calculation over the video image stream and the audio stream shows that the probability of sound on pixels (11, 12, 21, 22, 31, 32) is greater than or equal to 50%, while the probability of sound on pixels (13, 14, 15, 23, 24, 25, 33, 34, 35) is less than 50%. The terminal then generates a scene analysis heat map from the probability of sound on the 15 pixels of the target image frame.
  • To make the differences in sound probability visible, the terminal, starting from the target image frame, turns the pixels whose sound probability is greater than or equal to 50% (11, 12, 21, 22, 31, 32) black and turns the pixels whose sound probability is less than 50% (13, 14, 15, 23, 24, 25, 33, 34, 35) white, converting the target image frame shown in FIG. 5 into the scene analysis heat map shown in FIG. 6.
  • The scene analysis heat map shown in FIG. 6 is only an example; the color corresponding to each pixel in the scene analysis heat map can be set according to the actual situation, and the colors used to distinguish pixels are not limited to white and black.
  • Before step S12, that is, before the scene analysis heat map is generated from the video image stream and the audio stream, the audio stream may be converted from time-domain form to frequency-domain form, and the scene analysis heat map is then generated from the video image stream and the frequency-domain audio stream.
  • For example, the audio stream is first Fourier transformed to obtain a short-time Fourier spectrum, and the scene analysis heat map is then generated from the video image stream and the short-time Fourier spectrum, as sketched below.
  • Alternatively, the scene analysis heat map can also be generated directly from the time-domain audio stream without performing the Fourier transform.
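  • As one illustrative realization of the conversion above, the following sketch computes a short-time Fourier spectrum with SciPy; the sample rate, window length, and the 2-second duration are assumptions, not values fixed by the embodiment.

```python
# Time-domain -> frequency-domain conversion via a short-time Fourier transform (STFT).
import numpy as np
from scipy.signal import stft

fs = 16_000                          # assumed sample rate (Hz)
audio = np.random.randn(2 * fs)      # stand-in for a 2-second time-domain audio stream

# STFT: rows are frequency bins (0 .. fs/2), columns are time frames.
freqs, times, spectrum = stft(audio, fs=fs, nperseg=512)
magnitude = np.abs(spectrum)         # short-time Fourier spectrum fed to the audio network
print(magnitude.shape)               # (257, n_frames): frequency bins x time frames
```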
  • Step S13: Determine, according to the scene analysis heat map, at least one image focus area on the target image frame of the video image stream that satisfies the preset condition.
  • Each image area in the at least one image focus area includes at least one image unit.
  • The preset condition is that, for each image focus area, the probability that sound exists on each of one or more image units in the area reaches a preset probability threshold; the preset probability threshold is a probability threshold set in advance.
  • In other words, if the probability of sound on each of at least one image unit in an area reaches the preset probability threshold, that area may be determined to be an image focus area.
  • the terminal may control the image acquisition device to focus on at least one image focusing area.
  • FIG. 7 is a schematic diagram of determining an image focus area on a target image frame provided by an embodiment of the present application. Assuming the preset probability threshold is 50%, the terminal determines from the scene analysis heat map the adjacent pixels (11, 12, 21, 22, 31, 32) on the target image frame of the video image stream that reach the 50% threshold; these pixels constitute area B in FIG. 7, and area B is an image focus area. The terminal can then control the image acquisition device to focus on the image focus area (that is, area B) in FIG. 7. A sketch of this step follows.
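  • The following is a minimal sketch of step S13 under the stated assumptions: the heat map is thresholded at the preset probability (50%, as in FIG. 7), and adjacent qualifying pixels are grouped into focus areas with connected-component labeling. scipy.ndimage.label is one standard tool for this; the embodiment does not prescribe a particular grouping method.

```python
import numpy as np
from scipy import ndimage

heat_map = np.array([                # 5 x 3 probability map, as in FIGS. 5-7
    [0.8, 0.7, 0.2, 0.1, 0.1],
    [0.9, 0.6, 0.3, 0.2, 0.1],
    [0.7, 0.5, 0.2, 0.1, 0.1],
])

mask = heat_map >= 0.5               # pixels whose sound probability reaches the threshold
labels, num_areas = ndimage.label(mask)
for i in range(1, num_areas + 1):
    ys, xs = np.nonzero(labels == i)
    # Bounding box of one image focus area (area B in FIG. 7 for this example).
    print(f"focus area {i}: rows {ys.min()}..{ys.max()}, cols {xs.min()}..{xs.max()}")
```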
  • In summary, the method provided by the embodiments of the present application may be applied to a terminal having an image acquisition device and a sound collection device.
  • The terminal can actively obtain the original video file through the image acquisition device and the sound collection device, generate a scene analysis heat map from the video image stream and the audio stream, and determine, according to the heat map, at least one image focus area on the target image frame of the video image stream that satisfies the preset condition; the embodiment of the present application can therefore determine the area on the target image frame most likely to be emitting sound.
  • Moreover, no additional devices need to be added to the terminal: the image acquisition device and the sound collection device can be the terminal's own existing devices, such as a smartphone's camera and microphone, thereby reducing device cost.
  • The foregoing introduced a method that uses a single original video file to generate one scene analysis heat map and uses that heat map to determine the image focus area of one target image frame. Based on the same principles, the following describes how to use multiple original video files to generate multiple scene analysis heat maps, and how to use them to determine the image focus areas of multiple target image frames.
  • FIG. 8 is a schematic diagram of a video image stream and an audio stream of another original video file provided by an embodiment of the present application.
  • the horizontal axis represents the time axis
  • T1 to T4 respectively represent four time points, where T1 is earlier than T2, T2 is earlier than T3, and T3 is earlier than T4.
  • the terminal automatically collects the original video file A between T1 and T3, where the original video file A includes the video image stream A and the audio stream A.
  • When the time reaches T3, the terminal acquires the original video file A and generates a scene analysis heat map A based on the video image stream A (image frame P1 to image frame P10) and the audio stream A of the original video file A; it then determines, according to the scene analysis heat map A, at least one image focus area on image frame P10 of the video image stream A that satisfies the preset condition. That is, at time T3 the terminal determines the image focus area on image frame P10.
  • To continuously determine the image focus area of the image frame currently displayed, when the time reaches T4 the terminal obtains the original video file B automatically collected between T2 and T4, generates a scene analysis heat map B based on the video image stream B (image frame P2 to image frame P11) and the audio stream B of the original video file B, and then determines, according to the scene analysis heat map B, at least one image focus area on image frame P11 of the video image stream B that satisfies the preset condition. That is, at time T4 the terminal determines the image focus area on image frame P11.
  • In this first way, the terminal can determine a corresponding scene analysis heat map for each image frame and use it to determine, for each image frame, at least one image focus area satisfying the preset condition, so that the image focus area can be determined more accurately for every frame.
  • FIG. 9 is a schematic diagram of a video image stream and an audio stream of another original video file provided by an embodiment of the present application.
  • the horizontal axis represents the time axis
  • T1 to T3 respectively represent three time points, where T1 is earlier than T2 and T2 is earlier than T3.
  • the terminal will automatically collect the original video file A between T1 and T2, where the original video file A includes the video image stream A (image frame P1 to image frame P10) and Audio stream A.
  • When the time reaches T2, the terminal acquires the original video file A and generates a scene analysis heat map A based on the video image stream A and the audio stream A of the original video file A; it then determines, according to the scene analysis heat map A, at least one image focus area on image frame P10 of the video image stream A that satisfies the preset condition. That is, at time T2 the terminal determines the image focus area on image frame P10. Then, from T2 to T3, each of image frame P11 to image frame P18 determines at least one image focus area satisfying the preset condition according to the same scene analysis heat map A.
  • To continuously determine the image focus area of the image frame currently displayed, when the time reaches T3 the terminal obtains the original video file B automatically collected between T2 and T3, where the original video file B includes the video image stream B (image frame P10 to image frame P19) and the audio stream B.
  • When the time reaches T3, the terminal acquires the original video file B and generates a scene analysis heat map B based on the video image stream B and the audio stream B of the original video file B; it then determines, according to the scene analysis heat map B, at least one image focus area on image frame P19 of the video image stream B that satisfies the preset condition. That is, at time T3 the terminal determines the image focus area on image frame P19.
  • In this second way, the terminal generates a scene analysis heat map only at intervals, so that multiple image frames reuse one scene analysis heat map; this reduces the number of heat map calculations per unit time and occupies fewer of the terminal's processing resources. Both update strategies are sketched below.
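  • The two update strategies can be contrasted with the following sketch; `run_model` stands in for the heat map generation of step S12, and all names are hypothetical.

```python
def run_model(clip):
    """Stand-in for step S12: a window of frames (plus audio) -> scene analysis heat map."""
    return f"heat map from {clip[0]}..{clip[-1]}"

def per_frame_updates(frames, window=10):
    """First way (FIG. 8): slide the window one frame at a time and recompute
    a fresh heat map for every newly displayed frame."""
    return {frames[end - 1]: run_model(frames[end - window:end])
            for end in range(window, len(frames) + 1)}

def interval_updates(frames, window=10):
    """Second way (FIG. 9): recompute only once per window; the frames in
    between reuse the most recent heat map, saving processing resources."""
    heat_maps, current = {}, None
    for end in range(window, len(frames) + 1):
        if (end - window) % window == 0:      # recompute at P10, P20, ... (approximating FIG. 9's cadence)
            current = run_model(frames[end - window:end])
        heat_maps[frames[end - 1]] = current
    return heat_maps

frames = [f"P{i}" for i in range(1, 20)]      # P1..P19, as in FIG. 9
print(interval_updates(frames)["P15"])        # P11..P18 reuse the heat map computed at P10
```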
  • FIG. 10 is a flowchart of another method for determining an image focus area provided by an embodiment of the present application.
  • The method shown in FIG. 10 is a specific implementation of step S12 in FIG. 3, that is, of "generating a scene analysis heat map based on the video image stream and the audio stream", and includes the following steps.
  • Step S21: Split the original video file into a video image stream and an audio stream.
  • For example, the terminal can use multimedia video processing software tools to split the original video file into a video image stream and an audio stream, as sketched below.
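  • As one concrete possibility for step S21 (the embodiment does not name a specific tool), the following sketch invokes the FFmpeg command-line tool from Python; the file names are illustrative.

```python
import subprocess

def split_video(src: str, video_out: str, audio_out: str) -> None:
    # -an drops the audio track, keeping only the video image stream.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-an", "-c:v", "copy", video_out], check=True)
    # -vn drops the video track, keeping only the audio stream.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vn", audio_out], check=True)

# Example usage (assumes FFmpeg is installed and original.mp4 exists):
# split_video("original.mp4", "video_stream.mp4", "audio_stream.wav")
```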
  • Step S22: Acquire the pre-stored neural network model.
  • the neural network model can be generated by a server with strong computing power.
  • the neural network model generated by the server includes an audio network, an image network, and a fusion network.
  • the terminal can download the neural network model generated on the server to the local memory in advance.
  • the terminal can obtain the pre-stored neural network model in the local memory.
  • During training, the server inputs a large number of video samples into the neural network model so that the neural network model learns from them.
  • After training, the neural network model can identify the image focus area in the target image frame of a video image stream that is most likely to be emitting sound.
  • Step S23: Use the audio network of the neural network model to process the audio stream to obtain a three-dimensional audio matrix.
  • The three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream, and the feature information of the audio stream.
  • For example, the terminal uses the audio network of the neural network model to process the audio stream to obtain a three-dimensional audio matrix in which the duration of the audio stream is 2 seconds, the frequency distribution of the audio stream is 0-8 kHz, and the feature information of the audio stream is, for example, one or more stereo channels.
  • Step S24: Use the image network of the neural network model to process the video image stream to obtain a four-dimensional image matrix.
  • The four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream (image length), the width of the video image stream (image width), and the feature information of the video image stream.
  • For example, the terminal uses the image network of the neural network model to process the video image stream to obtain a four-dimensional image matrix in which the duration of the video image stream is 2 seconds, the length is 1920 pixels, the width is 1080 pixels, and the feature information is, for example, the three RGB channels or the channels of another color gamut.
  • Finally, the fusion network of the neural network model is used to fuse the three-dimensional audio matrix and the four-dimensional image matrix to obtain the scene analysis heat map. A sketch of this three-network structure follows.
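  • The following PyTorch sketch shows one possible shape of the three-network structure described in steps S23 and S24 and the fusion step; all layer choices and sizes are illustrative assumptions, since the embodiment specifies only the inputs and outputs of each network (spectrogram, video clip, 3-D audio matrix, 4-D image matrix, fused heat map), not their internal architecture.

```python
import torch
import torch.nn as nn

class AudioNet(nn.Module):
    """Audio network: spectrogram (batch, 1, freq, time) -> 3-D audio matrix."""
    def __init__(self, feat=16):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(1, feat, 3, padding=1), nn.ReLU())
    def forward(self, spec):
        return self.conv(spec)   # (batch, feat, freq, time): duration x frequency x features

class ImageNet(nn.Module):
    """Image network: video clip (batch, 3, frames, H, W) -> 4-D image matrix."""
    def __init__(self, feat=16):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv3d(3, feat, 3, padding=1), nn.ReLU())
    def forward(self, clip):
        return self.conv(clip)   # (batch, feat, frames, H, W): duration x length x width x features

class FusionNet(nn.Module):
    """Fusion network: combine both matrices into a per-pixel sound-probability heat map."""
    def __init__(self, feat=16):
        super().__init__()
        self.head = nn.Conv2d(2 * feat, 1, 1)
    def forward(self, a, v):
        a = a.mean(dim=(2, 3), keepdim=True)          # pool audio to one feature vector per clip
        v = v.mean(dim=2)                             # pool video over time -> (batch, feat, H, W)
        a = a.expand(-1, -1, v.shape[2], v.shape[3])  # broadcast audio features over pixels
        return torch.sigmoid(self.head(torch.cat([a, v], dim=1)))  # (batch, 1, H, W) probabilities

audio_m = AudioNet()(torch.randn(1, 1, 257, 63))      # e.g. STFT magnitude from the earlier sketch
image_m = ImageNet()(torch.randn(1, 3, 10, 90, 160))  # e.g. 10 frames P1..P10
heat_map = FusionNet()(audio_m, image_m)              # same spatial size as the video frames
print(heat_map.shape)                                 # torch.Size([1, 1, 90, 160])
```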
  • The process in which the terminal uses the neural network model to process the video image stream and the audio stream to obtain the scene analysis heat map resembles the way a human processes sound information and image information.
  • the audio network of the neural network model is similar to the human ear.
  • the audio network of the neural network model is used to analyze the sound information and obtain the sound analysis conclusion.
  • the image network of the neural network model is similar to the human eye.
  • the image network of the neural network model is used to analyze image information and obtain image analysis conclusions.
  • the fusion network of the neural network model is similar to the human brain.
  • The fusion network of the neural network model is used to analyze and fuse the sound analysis conclusions obtained by the audio network and the image analysis conclusions obtained by the image network, finally determining the image focus area in the target image frame of the video image stream that is most likely to be emitting sound and presenting it in the scene analysis heat map; the scene analysis heat map generated by the neural network model therefore identifies the most likely sound-emitting image focus area in the target image frame with higher accuracy.
  • FIG. 11 is a schematic diagram of how to generate a scene analysis heat map provided by an embodiment of the present application.
  • T1 and T2 represent two time points, T1 is earlier than T2.
  • the terminal automatically collects the original video file 10 between T1 and T2, where the original video file 10 includes a video image stream 101 and an audio stream 102.
  • When the time reaches T2, the terminal acquires the original video file 10 and splits it into the video image stream 101 (image frame P1 to image frame P10) and the audio stream 102; it then inputs the video image stream 101 into the image network 201 of the neural network model 20 and the audio stream 102 into the audio network 202 of the neural network model 20.
  • After the image network 201 and the audio network 202 of the neural network model 20 receive the video image stream 101 and the audio stream 102 respectively, the image network 201 processes the video image stream 101 to obtain a four-dimensional image matrix, and the audio network 202 processes the audio stream 102 to obtain a three-dimensional audio matrix.
  • The image network 201 and the audio network 202 then send the four-dimensional image matrix and the three-dimensional audio matrix, respectively, to the fusion network 203 of the neural network model 20, and the fusion network 203 fuses the three-dimensional audio matrix and the four-dimensional image matrix to obtain the scene analysis heat map 30.
  • FIG. 12 is a schematic diagram of a device provided by an embodiment of the present application.
  • the device may be located in the terminal described above, and is used in combination with an image collection device such as a camera and a sound collection device such as a microphone.
  • the device includes the following modules.
  • the obtaining module 11 is used to obtain an original video file.
  • the original video file includes a video image stream and an audio stream.
  • the video image stream includes multiple image frames and is generated by image data collected by an image acquisition device.
  • The audio stream includes multiple sound frames and is generated from sound data collected by the sound collection device.
  • the generating module 12 is configured to generate a scene analysis heat map according to the video image stream and the audio stream, and the scene analysis heat map is used to indicate the probability of the presence of sound on each image unit in the multiple image units on the target image frame of the video image stream.
  • the determining module 13 is configured to determine at least one image focus area on the target image frame of the video image stream that satisfies the preset condition according to the scene analysis heat map, and each image area in the at least one image focus area includes at least one image unit.
  • In a possible implementation, the target image frame is the last frame of the multiple image frames in the video image stream.
  • For details of the generation module 12, refer to step S12 in the method embodiment shown in FIG. 3 above.
  • In a possible implementation, the preset condition is that, for each image focus area, the probability that sound exists on each of one or more image units in the area reaches a preset probability threshold.
  • the device shown in FIG. 12 may further include: a focusing module 14 for controlling the image acquisition device to focus on at least one image focusing area.
  • In a possible implementation, the generating module 12 may further include: a splitting module, configured to split the original video file into a video image stream and an audio stream; and a processing module, configured to process the video image stream and the audio stream with a neural network model to obtain a scene analysis heat map.
  • the apparatus shown in FIG. 12 may further include: a conversion module 15 for converting the audio stream from the time domain form to the frequency domain form.
  • In a possible implementation, the processing module may further include: an audio processing module, configured to process the audio stream with the audio network of the neural network model to obtain a three-dimensional audio matrix, where the three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream, and the feature information of the audio stream; an image processing module, configured to process the video image stream with the image network of the neural network model to obtain a four-dimensional image matrix, where the four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream, the width of the video image stream, and the feature information of the video image stream; and a fusion processing module, configured to fuse the three-dimensional audio matrix and the four-dimensional image matrix with the fusion network of the neural network model to obtain the scene analysis heat map.
  • any one or more of the modules in FIG. 12 above may be implemented by software, hardware, or a combination of software and hardware.
  • the software includes software program instructions and is executed by one or more processors.
  • The hardware may include digital logic circuits, algorithm circuits, programmable gate arrays, processors, or dedicated circuits; the above circuits may be located in one or more chips.
  • FIG. 13 is a schematic diagram of yet another device provided by an embodiment of the present application.
  • The device 2 includes an image acquisition device 21, an audio acquisition device 22, a central processor 23, an image processor 24, a random access memory (RAM) 25, a non-volatile memory (NVM) 26, and a bus 27.
  • the bus 27 is used to communicate with other components.
  • the device 2 shown in FIG. 13 is equivalent to the circuit board, chip or chipset in the smart phone 1 in FIGS. 1 and 2, and can selectively run various types of software, such as application software, driver software, or operating system software.
  • The central processor 23 is used to control one or more other components, and the image processor 24 is used to perform the method of this embodiment.
  • The image processor 24 may include one or more processors to perform the method flows of the previous embodiments.
  • S11 in FIG. 3 may be executed by an ISP or a dedicated processor.
  • At least part of the flow of S12 in FIG. 3 and FIG. 10 may be executed by at least one of a neural processing unit (NPU), a data signal processor, or a central processor 23.
  • S13 in FIG. 3 may be executed by the ISP or the central processor 23.
  • the NPU is a device built with the neural network model, and is dedicated to neural network operations.
  • the central processor 23 may also run artificial intelligence software to perform corresponding operations using neural network models.
  • Each processor mentioned above can execute the necessary software to work.
  • some processors, such as ISP may also be pure hardware devices.
  • For other details of the device 2 in FIG. 13, refer to the detailed description of the smartphone 1 in the embodiments corresponding to FIGS. 1 and 2, and to the detailed description of the terminal in the embodiments corresponding to FIGS. 3 to 11.
  • Computer-readable media includes computer storage media and communication media, where communication media includes any medium that facilitates transfer of a computer program from one place to another.
  • A storage medium may be any available medium that can be accessed by a computer. By way of example and not limitation, computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • Also, any connection may properly be termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium.
  • Disks and discs, as used herein, include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), floppy disks, and Blu-ray discs, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.

Abstract

A method and apparatus for determining an image focusing region. The method comprises: acquiring an original video file, wherein the original video file comprises a video image stream and an audio stream (S11); generating a scene analysis heat map according to the video image stream and the audio stream (S12), wherein the scene analysis heat map is used for indicating the probability of a sound existing on each image unit from among multiple image units on a target image frame of the video image stream; and determining, according to the scene analysis heat map, at least one image focusing region, meeting a preset condition, on the target image frame of the video image stream (S13), wherein each image region in the at least one image focusing region comprises at least one image unit. The present method can determine a region, where a sound is most possibly produced, on a target image frame of a video image stream; moreover, other than an image collection apparatus and a sound collection apparatus, no extra devices need to be added to a terminal, thereby reducing the cost of the devices.

Description

Method and device for determining image focus area

Technical Field

Embodiments of the present application relate to the technical field of image processing, and more specifically, to a method and device for determining an image focus area.

Background

At present, in related image focusing technology, after the camera software is opened on a terminal, the terminal's camera collects an image and the terminal's touch screen displays the image collected by the camera; the user then taps the image area of interest on the touch screen, and the camera focuses on the tapped image area; after focusing succeeds, the user can use the terminal to take videos or photos.

In the above related image focusing technology, although the terminal can achieve image focusing, if the position of the image area of interest on the touch screen changes, the user needs to tap the image area of interest again so that the camera refocuses on the newly tapped area. The above related image focusing technology passively waits for a focus instruction from the user and cannot actively identify the image area the user is interested in.

At present, the industry also offers auxiliary focusing methods such as the ranging method and the multi-microphone array method. For the ranging method, the terminal needs to actively emit infrared light or ultrasonic waves, which increases the device cost of the terminal's focusing system. For the multi-microphone array method, the terminal requires a large number of microphone units to achieve good performance, which likewise increases the device cost of the terminal's focusing system.

Therefore, the various related focusing methods usually suffer from high terminal device cost and an inability to actively focus on the image area the user is interested in.
Summary of the Invention

Embodiments of the present application provide a method and apparatus for determining an image focus area that can determine, without increasing device cost, the image focus area in an image most likely to be emitting sound.

The embodiments of the present application are implemented as follows:

In a first aspect, an embodiment of the present application provides a method for determining an image focus area. The method includes: acquiring an original video file, where the original video file includes a video image stream and an audio stream, the video image stream includes multiple image frames and is generated from image data collected by an image acquisition device, and the audio stream includes multiple sound frames and is generated from sound data collected by a sound collection device; generating a scene analysis heat map based on the video image stream and the audio stream, where the scene analysis heat map is used to indicate the probability that sound exists on each of multiple image units on a target image frame of the video image stream; and determining, according to the scene analysis heat map, at least one image focus area on the target image frame that satisfies a preset condition, where each image area in the at least one image focus area includes at least one image unit.

In the first aspect, the method provided by the embodiments of the present application may be applied to a terminal having an image acquisition device and a sound collection device. The terminal can actively obtain the original video file through these devices, generate a scene analysis heat map from the video image stream and the audio stream, and finally determine, according to the heat map, at least one image focus area on the target image frame of the video image stream that satisfies the preset condition; the embodiment of the present application can therefore determine the area on the target image frame most likely to be emitting sound. Moreover, no devices beyond the image acquisition device and the sound collection device need to be added to the terminal, thereby reducing device cost.
In a possible implementation, the target image frame is the last frame of the multiple image frames in the video image stream.

In a possible implementation, the preset condition is that, for each image focus area, the probability that sound exists on each of one or more image units in the area reaches a preset probability threshold.

In a possible implementation, after the at least one image focus area satisfying the preset condition is determined on the target image frame according to the scene analysis heat map, the method further includes: controlling the image acquisition device to focus on the at least one image focus area.

In a possible implementation, generating the scene analysis heat map from the video image stream and the audio stream includes: splitting the original video file into the video image stream and the audio stream; and processing the video image stream and the audio stream with a neural network model to obtain the scene analysis heat map.

In a possible implementation, before the scene analysis heat map is generated from the video image stream and the audio stream, the method further includes: converting the audio stream from time-domain form to frequency-domain form.

In a possible implementation, processing the video image stream and the audio stream with the neural network model to obtain the scene analysis heat map includes: processing the audio stream with the audio network of the neural network model to obtain a three-dimensional audio matrix, where the three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream, and the feature information of the audio stream; processing the video image stream with the image network of the neural network model to obtain a four-dimensional image matrix, where the four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream, the width of the video image stream, and the feature information of the video image stream; and fusing the three-dimensional audio matrix and the four-dimensional image matrix with the fusion network of the neural network model to obtain the scene analysis heat map.

In the process in which the terminal uses the neural network model to process the video image stream and the audio stream to obtain the scene analysis heat map, the audio network of the neural network model analyzes the sound information and produces a sound analysis conclusion, the image network analyzes the image information and produces an image analysis conclusion, and the fusion network analyzes and fuses the two conclusions, finally determining the image focus area in the target image frame of the video image stream that is most likely to be emitting sound and presenting it in the scene analysis heat map; the scene analysis heat map generated by the neural network model therefore identifies the most likely sound-emitting image focus area in the target image frame with higher accuracy.
According to a second aspect, an embodiment of this application provides an apparatus, including: an acquisition module, configured to acquire an original video file, where the original video file includes a video image stream and an audio stream, the video image stream includes multiple image frames and is generated from image data collected by an image acquisition device, and the audio stream includes multiple sound frames and is generated from sound data collected by a sound acquisition device; a generation module, configured to generate a scene analysis heat map according to the video image stream and the audio stream, where the scene analysis heat map is used to indicate, for each of multiple image units on a target image frame of the video image stream, the probability that sound exists on the image unit; and a determination module, configured to determine, according to the scene analysis heat map, at least one image focus area on the target image frame of the video image stream that satisfies a preset condition, where each image area in the at least one image focus area includes at least one image unit.
In a possible implementation, the target image frame is the last frame of the multiple image frames in the video image stream.
In a possible implementation, the preset condition is that the probability that sound exists on each of one or more image units among the at least one image unit in the at least one image focus area reaches a preset probability threshold.
In a possible implementation, the apparatus further includes: a focusing module, configured to control the image acquisition device to focus on the at least one image focus area.
In a possible implementation, the generation module includes: a splitting module, configured to split the original video file into the video image stream and the audio stream; and a processing module, configured to process the video image stream and the audio stream by using a neural network model to obtain the scene analysis heat map.
In a possible implementation, the apparatus further includes: a conversion module, configured to convert the audio stream from a time-domain form to a frequency-domain form.
In a possible implementation, the processing module includes: an audio processing module, configured to process the audio stream by using an audio network of the neural network model to obtain a three-dimensional audio matrix, where the three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream, and feature information of the audio stream; an image processing module, configured to process the video image stream by using an image network of the neural network model to obtain a four-dimensional image matrix, where the four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream, the width of the video image stream, and feature information of the video image stream; and a fusion processing module, configured to fuse the three-dimensional audio matrix and the four-dimensional image matrix by using a fusion network of the neural network model to obtain the scene analysis heat map.
According to a third aspect, an embodiment of this application provides an apparatus, including one or more processors, configured to perform the method in the first aspect or any possible implementation of the first aspect. Optionally, the apparatus includes a memory, configured to store software instructions that drive the processors.
According to a fourth aspect, an embodiment of this application provides a computer-readable storage medium that stores instructions, where the instructions, when run on a computer or a processor, cause the computer or the processor to perform the method in the first aspect or any possible implementation of the first aspect.
According to a fifth aspect, an embodiment of this application provides a computer program product containing instructions, where the instructions, when run on a computer or a processor, cause the computer or the processor to perform the method in the first aspect or any possible implementation of the first aspect.
BRIEF DESCRIPTION OF DRAWINGS
FIG. 1 is a schematic diagram of camera software of a smartphone in which a focus area has not yet been determined according to an embodiment of this application;
FIG. 2 is a schematic diagram of camera software of a smartphone in which a focus area has been determined according to an embodiment of this application;
FIG. 3 is a flowchart of a method for determining an image focus area according to an embodiment of this application;
FIG. 4 is a schematic diagram of a video image stream and an audio stream of an original video file according to an embodiment of this application;
FIG. 5 is a schematic diagram of a target image frame of a video image stream according to an embodiment of this application;
FIG. 6 is a schematic diagram of a scene analysis heat map according to an embodiment of this application;
FIG. 7 is a schematic diagram of an image focus area determined on a target image frame according to an embodiment of this application;
FIG. 8 is a schematic diagram of a video image stream and an audio stream of another original video file according to an embodiment of this application;
FIG. 9 is a schematic diagram of a video image stream and an audio stream of still another original video file according to an embodiment of this application;
FIG. 10 is a flowchart of another method for determining an image focus area according to an embodiment of this application;
FIG. 11 is a schematic diagram of how a scene analysis heat map is generated according to an embodiment of this application;
FIG. 12 is a schematic diagram of an apparatus according to an embodiment of this application;
FIG. 13 is a schematic diagram of still another apparatus according to an embodiment of this application.
DETAILED DESCRIPTION
An embodiment of this application provides a method for determining an image focus area. The method provided in the embodiments of this application can be applied to different types of terminals, for example, smartphones, tablet computers, digital cameras, or smart cameras. Alternatively, the technical solution of this application can also be applied to electronic devices that do not have a communication function. The following uses the application of the technical solution on a terminal as an example.
The method provided in the embodiments of this application is described below with reference to a specific technical scenario. Refer to FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram of camera software of a smartphone in which a focus area has not yet been determined according to an embodiment of this application, and FIG. 2 is a schematic diagram of the camera software of the smartphone in which the focus area has been determined according to an embodiment of this application. It should be noted that the embodiments shown in FIG. 1 and FIG. 2 are intended to help the reader quickly understand the technical principles of the embodiments of this application, and are not intended to limit the protection scope of the embodiments of this application. The specific parameter values mentioned in the embodiments shown in FIG. 1 and FIG. 2 may be varied according to the principles of the embodiments of this application, and the protection scope of the embodiments of this application is not limited to the specific parameter values mentioned.
As shown in FIG. 1, assume that a user wants to use smartphone 1 to photograph or record a bird in a zoo. First, the user opens the camera software of smartphone 1 and points the camera of smartphone 1 at the bird; smartphone 1 then invokes the camera to collect an image of the bird and displays the collected image on the touchscreen of smartphone 1.
Assuming that a software program implementing the method for determining an image focus area provided in this embodiment of this application is pre-installed in smartphone 1, smartphone 1 automatically invokes the camera and the microphone to collect a 2-second original video file. Smartphone 1 in this embodiment includes a camera function; when performing the camera function, it can be regarded as a camera that includes the camera lens and an image signal processor (ISP). Smartphone 1 then acquires the 2-second original video file and uses a neural network model pre-stored in smartphone 1 to generate a scene analysis heat map according to the video image stream and the audio stream of the 2-second original video file. The scene analysis heat map indicates the image area in the video image stream that is most likely to emit sound.
As shown in FIG. 2, assume that the bird keeps chirping during the 2-second original video file acquired by smartphone 1. The neural network model of smartphone 1 then calculates, from the 2-second original video file, that the bird's beak is the image area most likely to emit sound, so the generated scene analysis heat map indicates that the bird's beak in area A in FIG. 2 is the image area most likely to emit sound. Smartphone 1 then focuses on the bird's beak in area A in FIG. 2, thereby achieving automatic focusing on the sound-emitting area in the image. After smartphone 1 focuses on the bird's beak in area A in FIG. 2, the user can take photos or record video. Moreover, smartphone 1 can continuously repeat the method for determining an image focus area provided in this embodiment of this application, thereby continuously generating scene analysis heat maps that indicate in real time the image area most likely to emit sound.
The technical scenario of the embodiments of this application is briefly introduced above through the application examples shown in FIG. 1 and FIG. 2. The following describes the execution procedure, technical principles, and embodiments of the method for determining an image focus area provided in the embodiments of this application.
Refer to FIG. 3, which is a flowchart of a method for determining an image focus area according to an embodiment of this application. The method shown in FIG. 3 can determine, without increasing device cost, the image focus area in an image that is most likely to emit sound. The method includes the following steps.
Step S11: Acquire an original video file. In step S11, as described in the embodiment shown in FIG. 1, after the camera software is started, the terminal automatically invokes the camera and the microphone to collect an original video file of a predetermined time period and acquires the original video file of the predetermined time period.
In the example of FIG. 1, the predetermined time period is 2 seconds. The predetermined time period may be preset according to factors such as the hardware configuration of the terminal. When the hardware configuration of the terminal is high, the predetermined time period may be shortened appropriately, for example, set to 1 second or 0.5 second, or even shorter; when the hardware configuration of the terminal is low, the predetermined time period may be extended appropriately, for example, set to 3 seconds or 4 seconds, or even longer. The embodiments of this application do not limit the specific length of the predetermined time period, and the specific values given above are intended only to illustrate the principle of adjusting it.
The original video file includes a video image stream and an audio stream. The video image stream includes multiple image frames and is generated from image data collected by an image acquisition device, and the audio stream includes multiple sound frames and is generated from sound data collected by a sound acquisition device. The image acquisition device may be the camera of the terminal described in the foregoing embodiments, and may optionally include the ISP. The sound acquisition device may be the microphone of the terminal, and may optionally include a voice processing channel or circuit that processes the signals collected by the microphone.
For example, refer to FIG. 4, which is a schematic diagram of a video image stream and an audio stream of an original video file according to an embodiment of this application. In FIG. 4, the horizontal axis represents the time axis, and time point T1 is earlier than time point T2; assume that the original video file spans T1 to T2. The original video file includes a video image stream and an audio stream, where the video image stream has 10 image frames (image frames P1 to P10) and the audio stream has 30 sound frames (not shown in the figure). The video image stream and the audio stream are used to generate the scene analysis heat map.
Step S12: Generate a scene analysis heat map according to the video image stream and the audio stream. The scene analysis heat map is used to indicate, for each of multiple image units on a target image frame of the video image stream, the probability that sound exists on the image unit; an image unit may be a pixel. The probability that sound exists on an image unit corresponds to the probability that the photographed object at that image unit emits sound. The object may be a person, an animal, a musical instrument, a device, or another object.
In the process in which the terminal generates the scene analysis heat map according to the video image stream and the audio stream, the terminal combines the video image stream and the audio stream to calculate the probability that sound exists on each image unit on the target image frame of the video image stream, and generates the scene analysis heat map according to these probabilities. The scene analysis heat map is one frame of image, and its resolution is the same as the resolution of the target image frame of the video image stream. The target image frame of the video image stream may be the last frame of the multiple image frames in the video image stream. For example, as shown in FIG. 4, the video image stream in FIG. 4 has 10 image frames, and the target image frame is the last image frame P10.
Refer to FIG. 5 and FIG. 6. FIG. 5 is a schematic diagram of a target image frame of a video image stream according to an embodiment of this application, and FIG. 6 is a schematic diagram of a scene analysis heat map according to an embodiment of this application. For ease of illustration, in FIG. 5 and FIG. 6 it is assumed that the resolution of the target image frame and the resolution of the scene analysis heat map are both 5 pixels × 3 pixels. In an actual scenario, the resolution of the target image frame and the resolution of the scene analysis heat map are the resolution preset by the user; for example, if the user presets a resolution of 1920 pixels × 1080 pixels, both resolutions are 1920 pixels × 1080 pixels.
As shown in FIG. 5 and FIG. 6, assuming that the resolution of the target image frame of the video image stream is 5 pixels × 3 pixels, the terminal can combine the video image stream and the audio stream to calculate the probability that sound exists on each of the 15 pixels on the target image frame.
In FIG. 5, each circle represents one pixel, so FIG. 5 contains 15 pixels in total, and each pixel in FIG. 5 has an original color; for example, assume the original color of all 15 pixels is red. Assume that, after combining the video image stream and the audio stream, the terminal determines that the probability that sound exists on each of pixels (11, 12, 21, 22, 31, 32) is greater than or equal to 50%, while the probability that sound exists on each of pixels (13, 14, 15, 23, 24, 25, 33, 34, 35) is less than 50%. The terminal then generates the scene analysis heat map according to the probabilities of sound on the 15 pixels of the target image frame.
In FIG. 6, to distinguish the different probabilities of sound on the pixels, the terminal, based on the target image frame, changes the color of the pixels whose probability of emitting sound is greater than or equal to 50% (11, 12, 21, 22, 31, 32) to black, and changes the color of the pixels whose probability of emitting sound is less than 50% (13, 14, 15, 23, 24, 25, 33, 34, 35) to white, thereby converting the target image frame shown in FIG. 5 into the scene analysis heat map shown in FIG. 6.
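Merely as an illustrative sketch (the embodiments do not prescribe a particular implementation), the black-and-white rendering described above can be expressed in Python with NumPy. The 3 × 5 probability array mirrors the example of FIG. 5 and FIG. 6, and all function and variable names are hypothetical:

```python
import numpy as np

def render_heat_map(prob_map: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Render a per-pixel sound-probability map as a black/white heat map.

    prob_map: 2-D array of shape (height, width) with values in [0, 1],
    one sound probability per pixel of the target image frame.
    Returns an 8-bit grayscale image: 0 (black) where the probability
    reaches the threshold, 255 (white) elsewhere.
    """
    return np.where(prob_map >= threshold, 0, 255).astype(np.uint8)

# The 5x3 example of FIG. 5/FIG. 6: the two left columns carry sound.
probs = np.array([
    [0.8, 0.7, 0.1, 0.2, 0.1],
    [0.9, 0.6, 0.3, 0.1, 0.2],
    [0.7, 0.8, 0.2, 0.1, 0.1],
])
print(render_heat_map(probs))  # black block on the left, white elsewhere
```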
Of course, the scene analysis heat map shown in FIG. 6 is only an example. The color of each pixel in the scene analysis heat map can be set according to the actual situation, and distinguishing the pixels is not limited to using white and black.
Before step S12, that is, before the scene analysis heat map is generated according to the video image stream and the audio stream, the audio stream may be converted from a time-domain form to a frequency-domain form, and the scene analysis heat map is then generated according to the video image stream and the audio stream converted into the frequency-domain form. For example, before the scene analysis heat map is generated, the audio stream is first Fourier-transformed to obtain a short-time Fourier spectrum, and the scene analysis heat map is then generated according to the video image stream and the short-time Fourier spectrum. Of course, the scene analysis heat map can also be generated directly from the time-domain audio stream without performing the Fourier transform.
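As an illustrative sketch of this optional preprocessing step, the short-time Fourier spectrum can be computed with scipy.signal.stft. The 16 kHz sample rate, window length, and overlap below are assumed values chosen for illustration, not parameters taken from the embodiment:

```python
import numpy as np
from scipy.signal import stft

def audio_to_spectrum(audio: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Convert a time-domain audio stream into a short-time Fourier spectrum.

    audio: 1-D array of time-domain samples for the clip (e.g. 2 seconds).
    Returns the magnitude spectrogram of shape (freq_bins, time_frames),
    which can be fed to the audio network in place of the raw waveform.
    """
    freqs, times, zxx = stft(audio, fs=sample_rate, nperseg=512, noverlap=256)
    return np.abs(zxx)

# Example: a 2-second clip at 16 kHz (random samples as a stand-in).
clip = np.random.randn(2 * 16000).astype(np.float32)
spectrum = audio_to_spectrum(clip)
print(spectrum.shape)  # about (257, 126): frequency bins x time frames
```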
Step S13: Determine, according to the scene analysis heat map, at least one image focus area on the target image frame of the video image stream that satisfies a preset condition. In step S13, each image area in the at least one image focus area includes at least one image unit, and the preset condition is that the probability that sound exists on each of one or more image units among the at least one image unit in the at least one image focus area reaches a preset probability threshold, where the preset probability threshold is set in advance. Optionally, for an area, when the probability that sound exists on every image unit of the at least one image unit in the area reaches the preset probability threshold, the area may be determined to be an image focus area. Optionally, for an area, when more than a predetermined number of the image units in the area have a sound probability reaching the preset probability threshold, or when the proportion of image units whose sound probability reaches the preset probability threshold exceeds a proportion threshold, the area may be determined to be an image focus area. In addition, other judgment methods are possible, as long as the probability that sound exists in the area can be determined from the sound probabilities of the at least one image unit in the area, so as to determine whether the area is an image focus area.
After step S13, the terminal can control the image acquisition device to focus on the at least one image focus area. For example, refer to FIG. 7, which is a schematic diagram of an image focus area determined on a target image frame according to an embodiment of this application. Assuming that the preset probability threshold is 50%, the terminal determines, according to the scene analysis heat map, the adjacent pixels on the target image frame of the video image stream that reach the 50% threshold (11, 12, 21, 22, 31, 32); these pixels form area B in FIG. 7, and area B is one image focus area. The terminal can then control the image acquisition device to focus on the image focus area (area B) in FIG. 7.
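As an illustrative sketch of step S13 (assuming, as one of the optional judgment methods above, that adjacent above-threshold pixels are grouped into regions), connected pixels reaching the threshold can be collected into candidate focus windows with scipy.ndimage; all names are hypothetical:

```python
import numpy as np
from scipy import ndimage

def find_focus_areas(prob_map: np.ndarray, threshold: float = 0.5):
    """Group adjacent above-threshold pixels into candidate focus areas.

    Returns a list of (row_slice, col_slice) bounding boxes, one per
    connected region of pixels whose sound probability reaches the
    threshold. Each box can be handed to the camera driver as a focus window.
    """
    mask = prob_map >= threshold
    labels, num_regions = ndimage.label(mask)   # 4-connected components
    return ndimage.find_objects(labels)         # one bounding box per region

probs = np.array([
    [0.8, 0.7, 0.1, 0.2, 0.1],
    [0.9, 0.6, 0.3, 0.1, 0.2],
    [0.7, 0.8, 0.2, 0.1, 0.1],
])
for box in find_focus_areas(probs):
    print(box)  # the 3x2 block on the left, i.e. area B in FIG. 7
```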
In the embodiment shown in FIG. 3, the method provided in this embodiment of this application can be applied to a terminal having an image acquisition device and a sound acquisition device. The terminal can actively acquire the original video file through the image acquisition device and the sound acquisition device, generate the scene analysis heat map according to the video image stream and the audio stream, and finally determine, according to the scene analysis heat map, at least one image focus area on the target image frame of the video image stream that satisfies the preset condition. Therefore, this embodiment of this application can determine the area on the target image frame of the video image stream that is most likely to emit sound. Moreover, no devices other than the image acquisition device and the sound acquisition device need to be added to the terminal; the image acquisition device and the sound acquisition device can be components the terminal already has, such as the camera or microphone of a smartphone, thereby reducing device cost.
The embodiment shown in FIG. 3 describes a method of generating one scene analysis heat map from one segment of original video file and using that heat map to determine the image focus area of one target image frame. Based on this principle, the following describes how to generate multiple scene analysis heat maps from multiple segments of original video files and use them to determine the image focus areas of multiple target image frames.
In the first manner, refer to FIG. 8, which is a schematic diagram of a video image stream and an audio stream of another original video file according to an embodiment of this application. In FIG. 8, the horizontal axis represents the time axis, and T1 to T4 represent four time points, where T1 is earlier than T2, T2 is earlier than T3, and T3 is earlier than T4. Assuming that the camera software of the terminal is opened by the user at time T1, the terminal automatically collects original video file A between T1 and T3, where original video file A includes video image stream A and audio stream A. When time T3 arrives, the terminal acquires original video file A, generates scene analysis heat map A according to video image stream A (image frames P1 to P10) and audio stream A of original video file A, and then determines, according to scene analysis heat map A, at least one image focus area on image frame P10 of video image stream A that satisfies the preset condition; that is, the terminal determines the image focus area on image frame P10 at time T3.
To continuously determine the image focus area of the image frame currently displayed by the terminal, when time T4 arrives, the terminal acquires original video file B automatically collected between T2 and T4, generates scene analysis heat map B according to video image stream B (image frames P2 to P11) and audio stream B of original video file B, and then determines, according to scene analysis heat map B, at least one image focus area on image frame P11 of video image stream B that satisfies the preset condition; that is, the terminal determines the image focus area on image frame P11 at time T4.
In the first manner, the terminal can determine a corresponding scene analysis heat map for each image frame, and use the scene analysis heat map corresponding to each image frame to determine at least one image focus area satisfying the preset condition for that frame, so that the image focus area can be determined more accurately for each image frame.
In the second manner, refer to FIG. 9, which is a schematic diagram of a video image stream and an audio stream of still another original video file according to an embodiment of this application. In FIG. 9, the horizontal axis represents the time axis, and T1 to T3 represent three time points, where T1 is earlier than T2 and T2 is earlier than T3. Assuming that the camera software of the terminal is opened by the user at time T1, the terminal automatically collects original video file A between T1 and T2, where original video file A includes video image stream A (image frames P1 to P10) and audio stream A. When time T2 arrives, the terminal acquires original video file A, generates scene analysis heat map A according to video image stream A and audio stream A, and then determines, according to scene analysis heat map A, at least one image focus area on image frame P10 of video image stream A that satisfies the preset condition; that is, the terminal determines the image focus area on image frame P10 at time T2. Then, between T2 and T3, image frames P11 to P18 all use scene analysis heat map A to determine at least one image focus area satisfying the preset condition.
To continuously determine the image focus area of the image frame currently displayed by the terminal, when time T3 arrives, the terminal acquires original video file B automatically collected between T2 and T3, where original video file B includes video image stream B (image frames P10 to P19) and audio stream B. The terminal then generates scene analysis heat map B according to video image stream B and audio stream B of original video file B, and determines, according to scene analysis heat map B, at least one image focus area on image frame P19 of video image stream B that satisfies the preset condition; that is, the terminal determines the image focus area on image frame P19 at time T3.
In the second manner, the terminal generates one scene analysis heat map at intervals, so that multiple image frames can reuse a single scene analysis heat map. This reduces the number of times the scene analysis heat map is computed per unit time, and therefore occupies fewer processing resources of the terminal.
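The reuse schedule of the second manner can be sketched as follows; compute_heat_map and focus_on are hypothetical stand-ins for the neural network inference of step S12 and the focusing control of step S13, and the interval of 10 frames is illustrative:

```python
import numpy as np

def compute_heat_map(frames, audio):
    # Stand-in for the neural-network inference of step S12.
    return np.random.rand(3, 5)

def focus_on(prob_map, threshold=0.5):
    # Stand-in for driving the camera toward the above-threshold region.
    rows, cols = np.nonzero(prob_map >= threshold)
    if rows.size:
        print("focus window:", rows.min(), cols.min(), rows.max(), cols.max())

HEAT_MAP_INTERVAL = 10   # one heat map reused across 10 frames (illustrative)

heat_map = None
window_frames, window_audio = [], []
for i in range(30):      # 30 incoming frames, e.g. P1..P30
    frame, chunk = np.zeros((3, 5, 3)), np.zeros(1600)   # placeholder capture
    window_frames.append(frame)
    window_audio.append(chunk)
    if (i + 1) % HEAT_MAP_INTERVAL == 0:
        # A new window is complete: recompute the heat map once (time T2, T3, ...).
        heat_map = compute_heat_map(window_frames, window_audio)
        window_frames, window_audio = [], []
    if heat_map is not None:
        # Reuse the latest heat map for every frame until the next window is ready.
        focus_on(heat_map)
```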
Refer to FIG. 10, which is a flowchart of another method for determining an image focus area according to an embodiment of this application. The method shown in FIG. 10 is a specific implementation of step S12 in FIG. 3, that is, of "generating a scene analysis heat map according to the video image stream and the audio stream". The method includes the following steps.
Step S21: Split the original video file into a video image stream and an audio stream. A multimedia video processing tool, such as a software tool, may be used to split the original video file into the video image stream and the audio stream.
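As one concrete possibility (the embodiment does not name a specific multimedia tool), the split can be performed with the ffmpeg command-line tool invoked from Python, assuming the ffmpeg binary is available; the file names and the 16 kHz mono audio format are assumptions for illustration:

```python
import subprocess

def split_video(src: str, video_out: str, audio_out: str) -> None:
    """Split an original video file into a silent video stream and an audio stream."""
    # Keep the image stream, drop the audio (-an); copy the codec to avoid re-encoding.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-an", "-c:v", "copy", video_out],
                   check=True)
    # Keep the audio stream, drop the video (-vn); decode to 16 kHz mono PCM.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vn", "-ac", "1", "-ar", "16000",
                    audio_out], check=True)

split_video("original.mp4", "video_only.mp4", "audio_only.wav")
```

Splitting with -c:v copy leaves the image stream untouched, so this step adds little latency before the neural network processing.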
Step S22: Acquire a pre-stored neural network model. The neural network model may be generated by a server with strong computing power, and the neural network model generated by the server includes an audio network, an image network, and a fusion network. The terminal may download the neural network model generated on the server to a local memory in advance; when the terminal needs to use the neural network model, it can acquire the pre-stored model from the local memory. In the process of generating the neural network model, the server inputs a large number of video samples into the neural network model so that the model learns from them. After the server generates the neural network model, the model can identify the image focus area in the target image frame of a video image stream that is most likely to emit sound.
Step S23: Process the audio stream by using the audio network of the neural network model to obtain a three-dimensional audio matrix. The three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream, and feature information of the audio stream. For example, assume the terminal uses the audio network of the neural network model to process the audio stream to obtain a three-dimensional audio matrix, where the duration of the audio stream is 2 seconds, the frequency distribution of the audio stream is 0-8 kHz, and the feature information of the audio stream is, for example, at least one stereo channel.
Step S24: Process the video image stream by using the image network of the neural network model to obtain a four-dimensional image matrix. The four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream (image length), the width of the video image stream (image width), and feature information of the video image stream. For example, assume the terminal uses the image network of the neural network model to process the video image stream to obtain a four-dimensional image matrix, where the duration of the video image stream is 2 seconds, the length of the video image stream is 1920 pixels, the width of the video image stream is 1080 pixels, and the feature information of the video image stream is, for example, the three RGB channels or the channels of another color gamut.
Step S25: Fuse the three-dimensional audio matrix and the four-dimensional image matrix by using the fusion network of the neural network model to obtain the scene analysis heat map. In the embodiment shown in FIG. 10, the process in which the terminal uses the neural network model to process the video image stream and the audio stream to obtain the scene analysis heat map resembles the way humans process sound and image information. The audio network of the neural network model is similar to the human ear: it analyzes the sound information and draws a sound analysis conclusion. The image network of the neural network model is similar to the human eye: it analyzes the image information and draws an image analysis conclusion. The fusion network of the neural network model is similar to the human brain: it analyzes and fuses the sound analysis conclusion obtained by the audio network and the image analysis conclusion obtained by the image network, finally determining the image focus area most likely to emit sound in the target image frame of the video image stream and presenting that area in the scene analysis heat map. Therefore, identifying the image focus area most likely to emit sound in the target image frame by using the scene analysis heat map generated by the neural network model achieves higher accuracy.
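The embodiment does not disclose the internal architecture of the audio network, the image network, or the fusion network. The following PyTorch sketch only illustrates the data flow of steps S23 to S25 under assumed layer sizes and tensor shapes: the audio is summarized into a clip embedding that is tiled over the image grid and fused with the image features into a per-pixel sound-probability map:

```python
import torch
import torch.nn as nn

class AudioNet(nn.Module):
    """Maps a spectrogram to audio features, pooled into one clip embedding."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, embed_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, spec):            # spec: (B, 1, freq, time)
        return self.conv(spec).mean(dim=(2, 3))   # (B, embed_dim)

class ImageNet(nn.Module):
    """Maps the frame sequence (B, 3, T, H, W) to spatial features of the clip."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(32, embed_dim, kernel_size=3, padding=1), nn.ReLU(),
        )
    def forward(self, frames):          # frames: (B, 3, T, H, W)
        return self.conv(frames).mean(dim=2)      # (B, embed_dim, H, W)

class FusionNet(nn.Module):
    """Tiles the audio embedding over the image grid and predicts per-pixel
    sound probabilities."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.head = nn.Conv2d(2 * embed_dim, 1, kernel_size=1)
    def forward(self, audio_vec, image_map):
        b, c, h, w = image_map.shape
        audio_map = audio_vec.view(b, -1, 1, 1).expand(b, c, h, w)
        fused = torch.cat([image_map, audio_map], dim=1)
        return torch.sigmoid(self.head(fused))    # (B, 1, H, W) heat map

audio_net, image_net, fusion_net = AudioNet(), ImageNet(), FusionNet()
spec = torch.randn(1, 1, 257, 126)     # STFT of a 2-second clip (assumed shape)
frames = torch.randn(1, 3, 10, 96, 160)  # 10 RGB frames, downscaled for the sketch
heat_map = fusion_net(audio_net(spec), image_net(frames))
print(heat_map.shape)                  # torch.Size([1, 1, 96, 160])
```

In this sketch the heat map has the spatial resolution of the downscaled frames; in the embodiment it matches the resolution of the target image frame.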
Refer to FIG. 11, which is a schematic diagram of how a scene analysis heat map is generated according to an embodiment of this application. In FIG. 11, T1 and T2 represent two time points, where T1 is earlier than T2. Assuming that the camera software of the terminal is opened by the user at time T1, the terminal automatically collects original video file 10 between T1 and T2, where original video file 10 includes video image stream 101 and audio stream 102.
When time T2 arrives, the terminal acquires original video file 10, splits original video file 10 into video image stream 101 (image frames P1 to P10) and audio stream 102, then inputs video image stream 101 into image network 201 of neural network model 20 and inputs audio stream 102 into audio network 202 of neural network model 20.
After image network 201 and audio network 202 of neural network model 20 in the terminal receive video image stream 101 and audio stream 102 respectively, image network 201 processes video image stream 101 to obtain a four-dimensional image matrix, and audio network 202 processes audio stream 102 to obtain a three-dimensional audio matrix. Image network 201 and audio network 202 then send the four-dimensional image matrix and the three-dimensional audio matrix, respectively, to fusion network 203 of neural network model 20. Fusion network 203 of neural network model 20 fuses the three-dimensional audio matrix and the four-dimensional image matrix to obtain scene analysis heat map 30.
To facilitate understanding of the embodiments, refer to FIG. 12, which is a schematic diagram of an apparatus according to an embodiment of this application. The apparatus may be located in the terminal described above and is used in combination with an image acquisition device, such as a camera, and a sound acquisition device, such as a microphone. The apparatus includes the following modules.
An acquisition module 11, configured to acquire an original video file, where the original video file includes a video image stream and an audio stream, the video image stream includes multiple image frames and is generated from image data collected by the image acquisition device, and the audio stream includes multiple sound frames and is generated from sound data collected by the sound acquisition device. For a detailed implementation, refer to the detailed description of step S11 in the method embodiment shown in FIG. 3.
A generation module 12, configured to generate a scene analysis heat map according to the video image stream and the audio stream, where the scene analysis heat map is used to indicate, for each of multiple image units on a target image frame of the video image stream, the probability that sound exists on the image unit. For a detailed implementation, refer to the detailed description of step S12 in the method embodiment shown in FIG. 3.
A determination module 13, configured to determine, according to the scene analysis heat map, at least one image focus area on the target image frame of the video image stream that satisfies a preset condition, where each image area in the at least one image focus area includes at least one image unit. For a detailed implementation, refer to the detailed description of step S13 in the method embodiment shown in FIG. 3.
In an implementable embodiment, the target image frame is the last frame of the multiple image frames in the video image stream. For a detailed implementation, refer to the detailed description of step S12 in the method embodiment shown in FIG. 3.
In an implementable embodiment, the preset condition is that the probability that sound exists on each of one or more image units among the at least one image unit in the at least one image focus area reaches a preset probability threshold. For a detailed implementation, refer to the detailed description of step S13 in the method embodiment shown in FIG. 3.
In an implementable embodiment, the apparatus shown in FIG. 12 may further include: a focusing module 14, configured to control the image acquisition device to focus on the at least one image focus area. For a detailed implementation, refer to the detailed description of step S13 in the method embodiment shown in FIG. 3.
In an implementable embodiment, the generation module 12 may further include: a splitting module, configured to split the original video file into the video image stream and the audio stream; and a processing module, configured to process the video image stream and the audio stream by using a neural network model to obtain the scene analysis heat map. For a detailed implementation, refer to the detailed description of steps S21 to S25 in the method embodiment shown in FIG. 10.
In an implementable embodiment, the apparatus shown in FIG. 12 may further include: a conversion module 15, configured to convert the audio stream from a time-domain form to a frequency-domain form. For a detailed implementation, refer to the detailed description of step S12 in the method embodiment shown in FIG. 3.
In an implementable embodiment, the processing module may further include: an audio processing module, configured to process the audio stream by using an audio network of the neural network model to obtain a three-dimensional audio matrix, where the three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream, and feature information of the audio stream; an image processing module, configured to process the video image stream by using an image network of the neural network model to obtain a four-dimensional image matrix, where the four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream, the width of the video image stream, and feature information of the video image stream; and a fusion processing module, configured to fuse the three-dimensional audio matrix and the four-dimensional image matrix by using a fusion network of the neural network model to obtain the scene analysis heat map. For a detailed implementation, refer to the detailed description of steps S23 to S25 in the method embodiment shown in FIG. 10.
For example, any one or more of the modules in FIG. 12 above may be implemented by software, hardware, or a combination of software and hardware. The software includes software program instructions and is executed by one or more processors. The hardware may include digital logic circuits, programmable logic gate arrays, processors, dedicated circuits, or algorithm circuits. The foregoing circuits may be located in one or more chips.
To describe a specific implementation of this application more clearly, refer to FIG. 13, which is a schematic diagram of still another apparatus according to an embodiment of this application. Apparatus 2 includes image acquisition device 21, audio acquisition device 22, central processing unit 23, image processor 24, random access memory (RAM) 25, non-volatile memory (NVM) 26, and bus 27. Bus 27 is used to interconnect the other components. Apparatus 2 shown in FIG. 13 is equivalent to the circuit board, chip, or chipset inside smartphone 1 in FIG. 1 and FIG. 2, and can selectively run various types of software, such as application software, driver software, or operating system software. Central processing unit 23 is configured to control one or more other components, and image processor 24 is configured to perform the method of this embodiment; image processor 24 may include one or more processors to perform the method procedures of the foregoing embodiments. Merely as an example and not a limitation, S11 in FIG. 3 may be executed by the ISP or by a dedicated processor. S12 in FIG. 3 and at least part of the procedure in FIG. 10 may be executed by at least one of a neural processing unit (NPU), a digital signal processor, or central processing unit 23. S13 in FIG. 3 may be executed by the ISP or central processing unit 23. The NPU is a device with the neural network model built in, dedicated to neural network operations. Optionally, central processing unit 23 may also run artificial intelligence software to perform the corresponding operations using the neural network model. Each processor mentioned above may execute the necessary software to work. Optionally, some processors, such as the ISP, may also be pure hardware devices. For apparatus 2 in FIG. 13, refer to the detailed description of smartphone 1 in the embodiments corresponding to FIG. 1 and FIG. 2, and to the detailed description of the terminal in the embodiments corresponding to FIG. 3 to FIG. 11.
It should be noted that, when the foregoing embodiments involve functions implemented by software, the related software or the modules in the software may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media and communication media, where a communication medium includes any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium accessible to a computer. By way of example and not limitation, computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disc storage, magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. In addition, any connection may appropriately be regarded as a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source by using a coaxial cable, an optical fiber cable, a twisted pair, a digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, optical fiber cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of the medium. As used in this application, disks and discs include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), floppy disks, and Blu-ray discs, where disks usually reproduce data magnetically, and discs reproduce data optically with lasers. Combinations of the above should also be included within the protection scope of computer-readable media.
In addition, the foregoing embodiments are merely intended to describe the technical solutions of this application, not to limit them. Although this application is described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some of the technical features; such modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of this application.

Claims (16)

1. A method for determining an image focus area, wherein the method comprises:
    acquiring an original video file, wherein the original video file comprises a video image stream and an audio stream, the video image stream comprises a plurality of image frames and is generated from image data collected by an image acquisition device, and the audio stream comprises a plurality of sound frames and is generated from sound data collected by a sound acquisition device;
    generating a scene analysis heat map according to the video image stream and the audio stream, wherein the scene analysis heat map is used to indicate, for each image unit of a plurality of image units on a target image frame of the video image stream, a probability that sound exists on the image unit; and
    determining, according to the scene analysis heat map, at least one image focus area on the target image frame of the video image stream that satisfies a preset condition, wherein each image area in the at least one image focus area comprises at least one image unit.
2. The method for determining an image focus area according to claim 1, wherein the target image frame is the last frame of the plurality of image frames in the video image stream.
3. The method for determining an image focus area according to claim 1 or 2, wherein the preset condition is that a probability that sound exists on each of one or more image units among the at least one image unit in the at least one image focus area reaches a preset probability threshold.
4. The method for determining an image focus area according to any one of claims 1 to 3, wherein after the determining, according to the scene analysis heat map, at least one image focus area on the target image frame of the video image stream that satisfies a preset condition, the method further comprises:
    controlling the image acquisition device to focus on the at least one image focus area.
5. The method for determining an image focus area according to any one of claims 1 to 4, wherein the generating a scene analysis heat map according to the video image stream and the audio stream comprises:
    splitting the original video file into the video image stream and the audio stream; and
    processing the video image stream and the audio stream by using a neural network model to obtain the scene analysis heat map.
6. The method for determining an image focus area according to any one of claims 1 to 5, wherein before the generating a scene analysis heat map according to the video image stream and the audio stream, the method further comprises:
    converting the audio stream from a time-domain form to a frequency-domain form.
7. The method for determining an image focus area according to claim 5, wherein the processing the video image stream and the audio stream by using a neural network model to obtain the scene analysis heat map comprises:
    processing the audio stream by using an audio network of the neural network model to obtain a three-dimensional audio matrix, wherein the three-dimensional audio matrix comprises a duration of the audio stream, a frequency distribution of the audio stream, and feature information of the audio stream;
    processing the video image stream by using an image network of the neural network model to obtain a four-dimensional image matrix, wherein the four-dimensional image matrix comprises a duration of the video image stream, a length of the video image stream, a width of the video image stream, and feature information of the video image stream; and
    fusing the three-dimensional audio matrix and the four-dimensional image matrix by using a fusion network of the neural network model to obtain the scene analysis heat map.
  8. An apparatus, comprising:
    an obtaining module, configured to obtain an original video file, wherein the original video file comprises a video image stream and an audio stream, the video image stream comprises a plurality of image frames and is generated from image data collected by an image collection apparatus, and the audio stream comprises a plurality of sound frames and is generated from sound data collected by a sound collection apparatus;
    a generation module, configured to generate a scene analysis heat map according to the video image stream and the audio stream, wherein the scene analysis heat map is used to indicate a probability that sound exists on each of a plurality of image units on a target image frame of the video image stream; and
    a determining module, configured to determine, according to the scene analysis heat map, at least one image focus area that satisfies a preset condition on the target image frame of the video image stream, wherein each image area in the at least one image focus area comprises at least one image unit.
  9. The apparatus according to claim 8, wherein the target image frame is the last frame of the plurality of image frames in the video image stream.
  10. The apparatus according to claim 8 or 9, wherein the preset condition is that the probability that sound exists on each of one or more image units among the at least one image unit in the at least one image focus area reaches a preset probability threshold.
  11. The apparatus according to any one of claims 8 to 10, wherein the apparatus further comprises:
    a focusing module, configured to control the image collection apparatus to focus on the at least one image focus area.
  12. The apparatus according to any one of claims 8 to 11, wherein the generation module comprises:
    a splitting module, configured to split the original video file into the video image stream and the audio stream; and
    a processing module, configured to process the video image stream and the audio stream by using a neural network model to obtain the scene analysis heat map.
  13. The apparatus according to any one of claims 8 to 12, wherein the apparatus further comprises:
    a conversion module, configured to convert the audio stream from a time domain form to a frequency domain form.
  14. The apparatus according to claim 12, wherein the processing module comprises:
    an audio processing module, configured to process the audio stream by using an audio network of the neural network model to obtain a three-dimensional audio matrix, wherein the three-dimensional audio matrix comprises a duration of the audio stream, a frequency distribution of the audio stream, and feature information of the audio stream;
    an image processing module, configured to process the video image stream by using an image network of the neural network model to obtain a four-dimensional image matrix, wherein the four-dimensional image matrix comprises a duration of the video image stream, a length of the video image stream, a width of the video image stream, and feature information of the video image stream; and
    a fusion processing module, configured to fuse the three-dimensional audio matrix and the four-dimensional image matrix by using a fusion network of the neural network model to obtain a scene analysis heat map.
  15. An apparatus, comprising one or more processors and a memory, wherein the one or more processors are configured to read software code stored in the memory and perform the method according to any one of claims 1 to 7.
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores software code, and the software code comprises code that, after being read by one or more processors, is capable of performing the method according to any one of claims 1 to 7.
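
The three sketches below are editorial illustrations of the claimed processing steps, not part of the application. First, claims 5 and 6 call for splitting the original video file into a video image stream and an audio stream, and converting the audio stream from a time domain form to a frequency domain form. A minimal Python sketch, assuming the ffmpeg command-line tool is available and choosing an illustrative 16 kHz mono PCM format and a Hann-windowed short-time Fourier transform (the file names, sample rate, window, and hop sizes are assumptions, not values from the application):

import subprocess
import numpy as np

def split_streams(video_path: str) -> None:
    """Demux the original video file into a video-only file and a raw audio file."""
    # -an drops audio, -vn drops video; stream copy avoids re-encoding the video.
    subprocess.run(["ffmpeg", "-y", "-i", video_path,
                    "-an", "-c:v", "copy", "video_only.mp4"], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", video_path,
                    "-vn", "-ac", "1", "-ar", "16000", "-f", "s16le", "audio.pcm"],
                   check=True)

def audio_to_spectrogram(pcm_path: str, win: int = 512, hop: int = 256) -> np.ndarray:
    """Convert the time-domain audio stream to a frequency-domain form (claim 6)."""
    samples = np.fromfile(pcm_path, dtype=np.int16).astype(np.float32) / 32768.0
    n_frames = 1 + (len(samples) - win) // hop
    window = np.hanning(win)
    frames = np.stack([samples[i * hop: i * hop + win] * window
                       for i in range(n_frames)])
    # Magnitude spectrogram of shape (time, frequency).
    return np.abs(np.fft.rfft(frames, axis=1))

A magnitude spectrogram of shape (time, frequency) is one plausible frequency domain form for the audio network input described next.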
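
Second, claim 7 names an audio network, an image network, and a fusion network, but the claims fix no layer structure. The PyTorch sketch below is a hypothetical arrangement that only reproduces the claimed data flow: the audio network consumes a spectrogram, the image network consumes a stack of frames, and the fusion network combines the two feature sets into a per-image-unit sound probability map over the target (last) frame. All channel counts and kernel sizes are invented for illustration.

import torch
import torch.nn as nn

class SceneHeatmapNet(nn.Module):
    """Minimal sketch of the audio/image/fusion structure named in claim 7."""
    def __init__(self, audio_feat: int = 64, image_feat: int = 64):
        super().__init__()
        # Audio network: spectrogram (B, 1, T_a, F) -> audio feature maps.
        self.audio_net = nn.Sequential(
            nn.Conv2d(1, audio_feat, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(audio_feat, audio_feat, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Image network: frame stack (B, 3, T_v, H, W) -> image feature maps.
        self.image_net = nn.Sequential(
            nn.Conv3d(3, image_feat, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(image_feat, image_feat, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Fusion network: concatenated features -> one-channel heat map.
        self.fusion = nn.Conv2d(audio_feat + image_feat, 1, kernel_size=1)

    def forward(self, spectrogram: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
        a = self.audio_net(spectrogram)        # (B, C_a, T_a, F)
        a = a.mean(dim=(2, 3))                 # global audio descriptor (B, C_a)
        v = self.image_net(frames)             # (B, C_v, T_v, H, W)
        v = v[:, :, -1]                        # target (last) frame features (B, C_v, H, W)
        # Broadcast the audio descriptor over every image unit before fusing.
        a_map = a[:, :, None, None].expand(-1, -1, v.size(2), v.size(3))
        heat = self.fusion(torch.cat([a_map, v], dim=1))   # (B, 1, H, W)
        return torch.sigmoid(heat)             # per-unit probability that sound exists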
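
Finally, claims 1 and 3 reduce the heat map to focus areas: image units whose sound probability reaches a preset probability threshold are kept, and each focus area comprises at least one image unit. A minimal sketch, assuming a 2-D NumPy heat map of per-unit probabilities and 4-connected grouping of above-threshold units (the claims do not specify how units are grouped into areas):

import numpy as np

def focus_regions(heatmap: np.ndarray, threshold: float = 0.5):
    """Keep image units whose probability reaches the preset threshold (claim 3)
    and group 4-connected units into candidate focus areas."""
    mask = heatmap >= threshold
    regions, seen = [], np.zeros_like(mask, dtype=bool)
    for r0, c0 in zip(*np.nonzero(mask)):
        if seen[r0, c0]:
            continue
        stack, units = [(r0, c0)], []
        seen[r0, c0] = True
        while stack:                           # simple iterative flood fill
            r, c = stack.pop()
            units.append((r, c))
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if (0 <= nr < mask.shape[0] and 0 <= nc < mask.shape[1]
                        and mask[nr, nc] and not seen[nr, nc]):
                    seen[nr, nc] = True
                    stack.append((nr, nc))
        regions.append(units)                  # each region holds >= 1 image unit
    return regions

For example, focus_regions(heatmap, threshold=0.5) returns a list of candidate areas, each a list of (row, column) image units, from which the focus area to pass to the image collection apparatus can be chosen.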
PCT/CN2018/120200 2018-12-11 2018-12-11 Method and apparatus for determining image focusing region WO2020118503A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2018/120200 WO2020118503A1 (en) 2018-12-11 2018-12-11 Method and apparatus for determining image focusing region
CN201880088065.2A CN111656275B (en) 2018-12-11 2018-12-11 Method and device for determining image focusing area

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/120200 WO2020118503A1 (en) 2018-12-11 2018-12-11 Method and apparatus for determining image focusing region

Publications (1)

Publication Number Publication Date
WO2020118503A1

Family

ID=71076197

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/120200 WO2020118503A1 (en) 2018-12-11 2018-12-11 Method and apparatus for determining image focusing region

Country Status (2)

Country Link
CN (1) CN111656275B (en)
WO (1) WO2020118503A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112857560B * 2021-02-06 2022-07-22 Hohai University Acoustic imaging method based on sound frequency
CN113255685B * 2021-07-13 2021-10-01 Tencent Technology (Shenzhen) Co., Ltd. Image processing method and device, computer equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009031951A (en) * 2007-07-25 2009-02-12 Sony Corp Information processor, information processing method, and computer program
JP4735991B2 (en) * 2008-03-18 2011-07-27 ソニー株式会社 Image processing apparatus and method, program, and recording medium
US20110311144A1 (en) * 2010-06-17 2011-12-22 Microsoft Corporation Rgb/depth camera for improving speech recognition
US10218954B2 (en) * 2013-08-15 2019-02-26 Cellular South, Inc. Video to data
JP6761230B2 (en) * 2015-08-21 2020-09-23 キヤノン株式会社 Image processing device, its control method, program and imaging device
CN108876672A * 2018-06-06 2018-11-23 Hefei Sibote Software Development Co., Ltd. Automatic teacher identification and image optimization tracking method and system for distance education

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101068308A * 2007-05-10 2007-11-07 Huawei Technologies Co., Ltd. System and method for controlling an image collector to perform target positioning
CN104036789A * 2014-01-03 2014-09-10 Beijing Zhigu Ruituo Tech Co., Ltd. Multimedia processing method and multimedia device
CN103905810A * 2014-03-17 2014-07-02 Beijing Zhigu Ruituo Tech Co., Ltd. Multimedia processing method and multimedia processing device
CN103957359A * 2014-05-15 2014-07-30 Shenzhen ZTE Mobile Telecom Co., Ltd. Camera shooting device and focusing method thereof
CN104378635A * 2014-10-28 2015-02-25 Xi'an Jiaotong-Liverpool University Video region-of-interest (ROI) encoding method based on microphone array assistance
CN108073875A * 2016-11-14 2018-05-25 Guangdong Polytechnic Normal University Noisy speech recognition system and method based on a monocular camera

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11463656B1 (en) * 2021-07-06 2022-10-04 Dell Products, Lp System and method for received video performance optimizations during a video conference session
CN113852756A * 2021-09-03 2021-12-28 Vivo Mobile Communication (Hangzhou) Co., Ltd. Image acquisition method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111656275B (en) 2021-07-20
CN111656275A (en) 2020-09-11

Similar Documents

Publication Publication Date Title
US10951833B2 (en) Method and device for switching between cameras, and terminal
WO2020118503A1 (en) Method and apparatus for determining image focusing region
TW202105244A (en) Image processing method and device, electronic equipment and storage medium
TW202030623A (en) Cross-modal information retrieval method and device, and storage medium
JP2019092147A (en) Information exchanging method and device, audio terminal, and computer-readable storage medium
CN105430247A (en) Method and device for taking photograph by using image pickup device
US11336826B2 (en) Method and apparatus having a function of constant automatic focusing when exposure changes
US20180260941A1 (en) Preserving color in image brightness adjustment for exposure fusion
US11348254B2 (en) Visual search method, computer device, and storage medium
WO2021190625A1 (en) Image capture method and device
US11405226B1 (en) Methods and apparatus for assessing network presence
CN111447360A (en) Application program control method and device, storage medium and electronic equipment
CN110780955A (en) Method and equipment for processing emoticon message
CN112381709B (en) Image processing method, model training method, device, equipment and medium
CN111784567B (en) Method, apparatus, electronic device, and computer-readable medium for converting image
CN110347597B (en) Interface testing method and device of picture server, storage medium and mobile terminal
CN113747086A (en) Digital human video generation method and device, electronic equipment and storage medium
US10783616B2 (en) Method and apparatus for sharing and downloading light field image
WO2021149238A1 (en) Information processing device, information processing method, and information processing program
US11798561B2 (en) Method, apparatus, and non-transitory computer readable medium for processing audio of virtual meeting room
US11631252B1 (en) Visual media management for mobile devices
US11601381B2 (en) Methods and apparatus for establishing network presence
CN114298931A (en) Image processing method, image processing device, electronic equipment and storage medium
TW202243461A (en) Method and apparatus for controlling camera, and medium and electronic device
CN111815656A (en) Video processing method, video processing device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 18942954
    Country of ref document: EP
    Kind code of ref document: A1

NENP Non-entry into the national phase
    Ref country code: DE

122 Ep: PCT application non-entry in European phase
    Ref document number: 18942954
    Country of ref document: EP
    Kind code of ref document: A1