WO2020118503A1 - Method and apparatus for determining image focusing region - Google Patents

Method and apparatus for determining image focusing region

Info

Publication number: WO2020118503A1
Authority: WIPO (PCT)
Prior art keywords: image, stream, audio, video image, video
Application number: PCT/CN2018/120200
Other languages: French (fr), Chinese (zh)
Inventors: 陈亮, 孙凤宇, 兰传骏
Original assignee: 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Related application: CN201880088065.2A, published as CN111656275B

Classifications

    • G PHYSICS
    • G03 PHOTOGRAPHY; CINEMATOGRAPHY; ANALOGOUS TECHNIQUES USING WAVES OTHER THAN OPTICAL WAVES; ELECTROGRAPHY; HOLOGRAPHY
    • G03B APPARATUS OR ARRANGEMENTS FOR TAKING PHOTOGRAPHS OR FOR PROJECTING OR VIEWING THEM; APPARATUS OR ARRANGEMENTS EMPLOYING ANALOGOUS TECHNIQUES USING WAVES OTHER THAN OPTICAL WAVES; ACCESSORIES THEREFOR
    • G03B13/00 Viewfinders; Focusing aids for cameras; Means for focusing for cameras; Autofocus systems for cameras
    • G03B13/32 Means for focusing
    • G03B13/34 Power focusing
    • G03B13/36 Autofocus systems

Definitions

  • Embodiments of the present application relate to the technical field of image processing, and more specifically, to a method and device for determining an image focus area.
  • At present, in related image focusing technology, after the camera software is opened on a terminal, the terminal's camera collects an image and the terminal's touch screen displays it; the user then taps the image area of interest on the touch screen, and the camera focuses on the tapped image area; after focusing succeeds, the user can use the terminal to take videos or photos.
  • Although the terminal can achieve image focusing in this way, if the position of the image area of interest on the touch screen changes, the user needs to tap the image area of interest again so that the camera refocuses on the newly tapped area.
  • Such focusing technology passively waits for a focus instruction from the user and cannot actively identify the image area the user is interested in.
  • The industry also offers auxiliary focusing methods such as the ranging method and the multi-microphone array method.
  • For the ranging method, the terminal needs to actively emit infrared light or ultrasonic waves, which increases the device cost of the terminal's focusing system.
  • For the multi-microphone array method, the terminal requires a large number of microphone units to achieve good performance, which likewise increases the device cost of the terminal's focusing system.
  • Embodiments of the present application provide a method and apparatus for determining an image focus area that can determine, without increasing device cost, the image focus area in an image most likely to be emitting sound.
  • In a first aspect, an embodiment of the present application provides a method for determining an image focus area. The method includes: acquiring an original video file, where the original video file includes a video image stream and an audio stream, the video image stream includes multiple image frames and is generated from image data collected by an image acquisition device, and the audio stream includes multiple sound frames and is generated from sound data collected by a sound collection device; generating a scene analysis heat map based on the video image stream and the audio stream, where the scene analysis heat map is used to indicate the probability that sound exists on each of multiple image units on a target image frame of the video image stream; and determining, according to the scene analysis heat map, at least one image focus area on the target image frame that satisfies a preset condition, where each image area in the at least one image focus area includes at least one image unit.
  • In the first aspect, the method provided by the embodiments of the present application may be applied to a terminal having an image acquisition device and a sound collection device.
  • The terminal can actively obtain the original video file through the image acquisition device and the sound collection device, generate a scene analysis heat map from the video image stream and the audio stream, and finally determine, according to the heat map, at least one image focus area on the target image frame of the video image stream that satisfies the preset condition; the embodiment of the present application can therefore determine the area on the target image frame most likely to be emitting sound.
  • Moreover, no devices beyond the image acquisition device and the sound collection device need to be added to the terminal, thereby reducing device cost.
  • In a possible implementation, the target image frame is the last frame of the multiple image frames in the video image stream.
  • In a possible implementation, the preset condition is that, for each image focus area, the probability that sound exists on each of one or more image units in the area reaches a preset probability threshold.
  • In a possible implementation, after the at least one image focus area satisfying the preset condition is determined on the target image frame according to the scene analysis heat map, the method further includes: controlling the image acquisition device to focus on the at least one image focus area.
  • In a possible implementation, generating the scene analysis heat map from the video image stream and the audio stream includes: splitting the original video file into the video image stream and the audio stream; and processing the video image stream and the audio stream with a neural network model to obtain the scene analysis heat map.
  • In a possible implementation, before the scene analysis heat map is generated from the video image stream and the audio stream, the method further includes: converting the audio stream from time-domain form to frequency-domain form.
  • In a possible implementation, processing the video image stream and the audio stream with the neural network model to obtain the scene analysis heat map includes: processing the audio stream with the audio network of the neural network model to obtain a three-dimensional audio matrix, where the three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream, and the feature information of the audio stream; processing the video image stream with the image network of the neural network model to obtain a four-dimensional image matrix, where the four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream, the width of the video image stream, and the feature information of the video image stream; and fusing the three-dimensional audio matrix and the four-dimensional image matrix with the fusion network of the neural network model to obtain the scene analysis heat map.
  • In the process in which the terminal uses the neural network model to process the video image stream and the audio stream to obtain the scene analysis heat map, the audio network of the neural network model analyzes the sound information and produces a sound analysis conclusion, and the image network of the neural network model analyzes the image information and produces an image analysis conclusion.
  • The fusion network of the neural network model analyzes and fuses the sound analysis conclusion obtained by the audio network and the image analysis conclusion obtained by the image network, finally determining the image focus area in the target image frame of the video image stream that is most likely to be emitting sound and presenting it in the scene analysis heat map; the scene analysis heat map generated by the neural network model therefore identifies the most likely sound-emitting image focus area in the target image frame with higher accuracy.
  • In a second aspect, an embodiment of the present application provides an apparatus, including: an acquisition module, configured to acquire an original video file, where the original video file includes a video image stream and an audio stream, the video image stream includes multiple image frames and is generated from image data collected by an image acquisition device, and the audio stream includes multiple sound frames and is generated from sound data collected by a sound collection device; a generation module, configured to generate a scene analysis heat map based on the video image stream and the audio stream, where the scene analysis heat map is used to indicate the probability that sound exists on each of multiple image units on a target image frame of the video image stream; and a determination module, configured to determine, according to the scene analysis heat map, at least one image focus area on the target image frame that satisfies a preset condition, where each image area in the at least one image focus area includes at least one image unit.
  • In a possible implementation, the target image frame is the last frame of the multiple image frames in the video image stream.
  • In a possible implementation, the preset condition is that, for each image focus area, the probability that sound exists on each of one or more image units in the area reaches a preset probability threshold.
  • In a possible implementation, the apparatus further includes: a focusing module, configured to control the image acquisition device to focus on the at least one image focus area.
  • In a possible implementation, the generation module includes: a splitting module, configured to split the original video file into a video image stream and an audio stream; and a processing module, configured to process the video image stream and the audio stream with a neural network model to obtain a scene analysis heat map.
  • In a possible implementation, the apparatus further includes: a conversion module, configured to convert the audio stream from time-domain form to frequency-domain form.
  • In a possible implementation, the processing module includes: an audio processing module, configured to process the audio stream with the audio network of the neural network model to obtain a three-dimensional audio matrix, where the three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream, and the feature information of the audio stream; an image processing module, configured to process the video image stream with the image network of the neural network model to obtain a four-dimensional image matrix, where the four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream, the width of the video image stream, and the feature information of the video image stream; and a fusion processing module, configured to fuse the three-dimensional audio matrix and the four-dimensional image matrix with the fusion network of the neural network model to obtain the scene analysis heat map.
  • In a third aspect, an embodiment of the present application provides an apparatus, including one or more processors, configured to perform the method in the foregoing first aspect or any possible implementation of the first aspect.
  • Optionally, the apparatus includes a memory for storing the software instructions that drive the processor.
  • In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having instructions stored therein that, when executed on a computer or processor, cause the computer or processor to perform the method in the foregoing first aspect or any possible implementation of the first aspect.
  • In a fifth aspect, embodiments of the present application provide a computer program product containing instructions that, when run on a computer or processor, cause the computer or processor to perform the method in the foregoing first aspect or any possible implementation of the first aspect.
  • FIG. 1 is a schematic diagram of the camera software of a smartphone before a focus area has been determined, according to an embodiment of the present application;
  • FIG. 2 is a schematic diagram of the camera software of a smartphone after a focus area has been determined, according to an embodiment of the present application;
  • FIG. 3 is a flowchart of a method for determining an image focus area provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of a video image stream and an audio stream of an original video file provided by an embodiment of the present application;
  • FIG. 5 is a schematic diagram of a target image frame of a video image stream provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of a scene analysis heat map provided by an embodiment of the present application;
  • FIG. 7 is a schematic diagram of determining an image focus area on a target image frame provided by an embodiment of the present application;
  • FIG. 8 is a schematic diagram of a video image stream and an audio stream of another original video file provided by an embodiment of the present application;
  • FIG. 9 is a schematic diagram of a video image stream and an audio stream of still another original video file provided by an embodiment of the present application;
  • FIG. 10 is a flowchart of another method for determining an image focus area provided by an embodiment of the present application;
  • FIG. 11 is a schematic diagram of how to generate a scene analysis heat map provided by an embodiment of the present application;
  • FIG. 12 is a schematic diagram of an apparatus provided by an embodiment of the present application;
  • FIG. 13 is a schematic diagram of another apparatus provided by an embodiment of the present application.
  • An embodiment of the present application provides a method for determining an image focus area.
  • the method provided in the embodiment of the present application may be applied to terminals of different types.
  • the method provided in the embodiments of the present application may be applied to terminals such as smart phones, tablet computers, digital cameras, or smart cameras.
  • the technical solution of the present application can also be applied to electronic devices that do not have a communication function. The following uses the application of the technical solution on the terminal as an example for introduction.
  • FIG. 1 is a schematic diagram, provided by an embodiment of the present application, of the camera software of a smartphone before the focus area has been determined, and FIG. 2 is a schematic diagram of the camera software of the smartphone after the focus area has been determined.
  • FIGS. 1 and 2 are used to enable the reader to quickly understand the technical principles of the embodiments of the present application, and are not used to limit the protection scope of the embodiments of the present application.
  • the specific parameter values mentioned in the embodiments shown in FIGS. 1 and 2 can be changed according to the principles of the embodiments of the present application, and the protection scope of the embodiments of the present application is not limited to the specific parameter values already mentioned.
  • Assume that the user wants to use the smartphone 1 to take pictures or videos of a bird at the zoo.
  • First, the user opens the camera software of smartphone 1 and points the camera of smartphone 1 at the bird; then, smartphone 1 calls the camera to collect an image of the bird and displays the image of the bird on the touch screen of smartphone 1.
  • the smartphone 1 will automatically call the camera and the microphone to collect the original video file for 2 seconds.
  • The smartphone 1 of this embodiment includes a camera function; when performing this function it can be regarded as a camera, and its image acquisition device includes the camera and an image signal processor (ISP).
  • Next, the smartphone 1 obtains the 2-second original video file and uses the neural network model pre-stored in the smartphone 1 to generate a scene analysis heat map based on the video image stream and the audio stream of the 2-second original video file.
  • The scene analysis heat map is used to indicate the image area in the video image stream that is most likely to be emitting sound.
  • Based on the 2-second original video file, the neural network model of smartphone 1 calculates that the bird's beak in the image is the image area most likely to be emitting sound, so the generated scene analysis heat map indicates the bird's beak in area A in FIG. 2 as that area.
  • Then, the smartphone 1 focuses on the bird's beak in area A in FIG. 2, thereby realizing automatic focusing on the sound-emitting area in the image.
  • the smartphone 1 can take a picture or record a video.
  • During shooting, the smartphone 1 can repeat the method for determining the image focus area provided by the embodiments of the present application, continuously generating scene analysis heat maps to indicate in real time the image area most likely to be emitting sound.
  • FIG. 3 is a flowchart of a method for determining an image focus area provided by an embodiment of the present application.
  • the method shown in FIG. 3 can determine the focus area of the image most likely to emit sound in the image without increasing the cost of the device.
  • the method includes the following steps.
  • Step S11: Obtain the original video file.
  • After the user opens the camera software of the terminal, the terminal automatically calls the camera and the microphone to collect an original video file of a predetermined duration, and obtains that original video file.
  • For example, the predetermined duration is 2 seconds.
  • The predetermined duration can be preset according to factors such as the terminal's hardware configuration. When the hardware configuration is high, the predetermined duration can be shortened appropriately, for example to 1 second, 0.5 second, or even less; when the hardware configuration is low, it can be extended appropriately, for example to 3 seconds, 4 seconds, or even more.
  • the embodiments of the present application do not limit the specific length of the predetermined time period, and the specific values provided above are only used to explain the principle of adjusting the predetermined time period.
  • the original video file includes a video image stream and an audio stream.
  • the video image stream includes multiple image frames and is generated by image data collected by the image collection device
  • the audio stream includes multiple sound frames and is generated by sound data collected by the sound collection device.
  • the image acquisition device may be the camera of the terminal described in the previous embodiment, or may optionally include the ISP.
  • The sound collection device may be the microphone of the terminal, and may optionally include a voice processing channel or circuit that processes the microphone's signals.
  • FIG. 4 is a schematic diagram of a video image stream and an audio stream of an original video file provided by an embodiment of the present application.
  • In FIG. 4, the horizontal axis represents the time axis, time point T1 is earlier than time point T2, and the duration of the original video file is assumed to span T1 to T2.
  • The original video file includes a video image stream and an audio stream: the video image stream has 10 image frames (image frame P1 to image frame P10), the audio stream has 30 sound frames (not shown in the figure), and the video image stream and the audio stream are used together to generate the scene analysis heat map, as sketched below.
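  • The following is a minimal sketch of the data layout described above; the container name and field names are hypothetical, and the frame counts mirror the FIG. 4 example.

```python
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class OriginalVideoFile:
    """Hypothetical container mirroring FIG. 4: one video image stream and one
    audio stream collected over the same predetermined interval (T1 to T2)."""
    image_frames: List[np.ndarray]  # e.g. 10 frames P1..P10, each H x W x 3
    sound_frames: List[np.ndarray]  # e.g. 30 sound frames (not shown in FIG. 4)
    duration_s: float               # predetermined collection duration, e.g. 2.0
```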
  • Step S12: Generate a scene analysis heat map according to the video image stream and the audio stream.
  • the scene analysis heat map is used to indicate the probability of the presence of sound on each of the multiple image units on the target image frame of the video image stream, and the image unit may be a pixel.
  • the probability that there is sound on one image unit corresponds to the probability that the corresponding object on the image unit emits sound.
  • the object can be a person, animal, musical instrument, equipment, or other object.
  • When the terminal generates a scene analysis heat map from the video image stream and the audio stream, the terminal combines the two to calculate the probability that sound exists on each image unit on the target image frame of the video image stream, and generates the scene analysis heat map from those probabilities.
  • the scene analysis heat map is a frame of image, and the resolution of the scene analysis heat map is the same as the resolution of the target image frame of the video image stream.
  • the target image frame of the video image stream may be the last frame of multiple image frames in the video image stream. For example, referring to FIG. 4, the video image stream in FIG. 4 has 10 image frames, and the target image frame is the last image frame P10 in the video image stream.
  • FIG. 5 is a schematic diagram of a target image frame of a video image stream provided by an embodiment of the present application, and FIG. 6 is a schematic diagram of a scene analysis heat map provided by an embodiment of the present application.
  • For ease of illustration, FIGS. 5 and 6 assume that the resolution of the target image frame of the video image stream and the resolution of the scene analysis heat map are both 5 pixels × 3 pixels. In an actual scene, both resolutions equal the resolution preset by the user; for example, if the user's preset resolution is 1920 pixels × 1080 pixels, then the resolution of the target image frame and the resolution of the scene analysis heat map are both 1920 pixels × 1080 pixels.
  • After acquiring the original video file, the terminal can combine the video image stream and the audio stream to calculate the probability of sound at each of the 15 pixels on the target image frame of the video image stream.
  • In FIG. 5, each circle represents one pixel, so there are 15 pixels, and each pixel in FIG. 5 has its original color; for example, assume the original colors of the 15 pixels are all red.
  • Suppose the terminal's calculation over the video image stream and the audio stream shows that the probability of sound on pixels (11, 12, 21, 22, 31, 32) is greater than or equal to 50%, while the probability of sound on pixels (13, 14, 15, 23, 24, 25, 33, 34, 35) is less than 50%. The terminal then generates a scene analysis heat map from the probability of sound on the 15 pixels of the target image frame.
  • To make the differences in sound probability visible, the terminal, starting from the target image frame, turns the pixels whose sound probability is greater than or equal to 50% (11, 12, 21, 22, 31, 32) black and turns the pixels whose sound probability is less than 50% (13, 14, 15, 23, 24, 25, 33, 34, 35) white, converting the target image frame shown in FIG. 5 into the scene analysis heat map shown in FIG. 6.
  • The scene analysis heat map shown in FIG. 6 is only an example; the color corresponding to each pixel in the scene analysis heat map can be set according to the actual situation, and the colors used to distinguish pixels are not limited to white and black.
  • Before step S12, that is, before the scene analysis heat map is generated from the video image stream and the audio stream, the audio stream may be converted from time-domain form to frequency-domain form, and the scene analysis heat map is then generated from the video image stream and the frequency-domain audio stream.
  • For example, the audio stream is first Fourier transformed to obtain a short-time Fourier spectrum, and the scene analysis heat map is then generated from the video image stream and the short-time Fourier spectrum, as sketched below.
  • Alternatively, the scene analysis heat map can also be generated directly from the time-domain audio stream without performing the Fourier transform.
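  • As one illustrative realization of the conversion above, the following sketch computes a short-time Fourier spectrum with SciPy; the sample rate, window length, and the 2-second duration are assumptions, not values fixed by the embodiment.

```python
# Time-domain -> frequency-domain conversion via a short-time Fourier transform (STFT).
import numpy as np
from scipy.signal import stft

fs = 16_000                          # assumed sample rate (Hz)
audio = np.random.randn(2 * fs)      # stand-in for a 2-second time-domain audio stream

# STFT: rows are frequency bins (0 .. fs/2), columns are time frames.
freqs, times, spectrum = stft(audio, fs=fs, nperseg=512)
magnitude = np.abs(spectrum)         # short-time Fourier spectrum fed to the audio network
print(magnitude.shape)               # (257, n_frames): frequency bins x time frames
```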
  • Step S13: Determine, according to the scene analysis heat map, at least one image focus area on the target image frame of the video image stream that satisfies the preset condition.
  • Each image area in the at least one image focus area includes at least one image unit.
  • The preset condition is that, for each image focus area, the probability that sound exists on each of one or more image units in the area reaches a preset probability threshold; the preset probability threshold is a probability threshold set in advance.
  • In other words, if the probability of sound on each of at least one image unit in an area reaches the preset probability threshold, that area may be determined to be an image focus area.
  • the terminal may control the image acquisition device to focus on at least one image focusing area.
  • FIG. 7 is a schematic diagram of determining an image focus area on a target image frame provided by an embodiment of the present application. Assuming the preset probability threshold is 50%, the terminal determines from the scene analysis heat map the adjacent pixels (11, 12, 21, 22, 31, 32) on the target image frame of the video image stream that reach the 50% threshold; these pixels constitute area B in FIG. 7, and area B is an image focus area. The terminal can then control the image acquisition device to focus on the image focus area (that is, area B) in FIG. 7. A sketch of this step follows.
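  • The following is a minimal sketch of step S13 under the stated assumptions: the heat map is thresholded at the preset probability (50%, as in FIG. 7), and adjacent qualifying pixels are grouped into focus areas with connected-component labeling. scipy.ndimage.label is one standard tool for this; the embodiment does not prescribe a particular grouping method.

```python
import numpy as np
from scipy import ndimage

heat_map = np.array([                # 5 x 3 probability map, as in FIGS. 5-7
    [0.8, 0.7, 0.2, 0.1, 0.1],
    [0.9, 0.6, 0.3, 0.2, 0.1],
    [0.7, 0.5, 0.2, 0.1, 0.1],
])

mask = heat_map >= 0.5               # pixels whose sound probability reaches the threshold
labels, num_areas = ndimage.label(mask)
for i in range(1, num_areas + 1):
    ys, xs = np.nonzero(labels == i)
    # Bounding box of one image focus area (area B in FIG. 7 for this example).
    print(f"focus area {i}: rows {ys.min()}..{ys.max()}, cols {xs.min()}..{xs.max()}")
```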
  • In summary, the method provided by the embodiments of the present application may be applied to a terminal having an image acquisition device and a sound collection device.
  • The terminal can actively obtain the original video file through the image acquisition device and the sound collection device, generate a scene analysis heat map from the video image stream and the audio stream, and determine, according to the heat map, at least one image focus area on the target image frame of the video image stream that satisfies the preset condition; the embodiment of the present application can therefore determine the area on the target image frame most likely to be emitting sound.
  • Moreover, no additional devices need to be added to the terminal: the image acquisition device and the sound collection device can be the terminal's own existing devices, such as a smartphone's camera and microphone, thereby reducing device cost.
  • The foregoing introduced a method that uses a single original video file to generate one scene analysis heat map and uses that heat map to determine the image focus area of one target image frame. Based on the same principles, the following describes how to use multiple original video files to generate multiple scene analysis heat maps, and how to use them to determine the image focus areas of multiple target image frames.
  • FIG. 8 is a schematic diagram of a video image stream and an audio stream of another original video file provided by an embodiment of the present application.
  • the horizontal axis represents the time axis
  • T1 to T4 respectively represent four time points, where T1 is earlier than T2, T2 is earlier than T3, and T3 is earlier than T4.
  • the terminal automatically collects the original video file A between T1 and T3, where the original video file A includes the video image stream A and the audio stream A.
  • When the time reaches T3, the terminal acquires the original video file A and generates a scene analysis heat map A based on the video image stream A (image frame P1 to image frame P10) and the audio stream A of the original video file A; it then determines, according to the scene analysis heat map A, at least one image focus area on image frame P10 of the video image stream A that satisfies the preset condition. That is, at time T3 the terminal determines the image focus area on image frame P10.
  • To continuously determine the image focus area of the image frame currently displayed, when the time reaches T4 the terminal obtains the original video file B automatically collected between T2 and T4, generates a scene analysis heat map B based on the video image stream B (image frame P2 to image frame P11) and the audio stream B of the original video file B, and then determines, according to the scene analysis heat map B, at least one image focus area on image frame P11 of the video image stream B that satisfies the preset condition. That is, at time T4 the terminal determines the image focus area on image frame P11.
  • In this first way, the terminal can determine a corresponding scene analysis heat map for each image frame and use it to determine, for each image frame, at least one image focus area satisfying the preset condition, so that the image focus area can be determined more accurately for every frame.
  • FIG. 9 is a schematic diagram of a video image stream and an audio stream of another original video file provided by an embodiment of the present application.
  • the horizontal axis represents the time axis
  • T1 to T3 respectively represent three time points, where T1 is earlier than T2 and T2 is earlier than T3.
  • the terminal will automatically collect the original video file A between T1 and T2, where the original video file A includes the video image stream A (image frame P1 to image frame P10) and Audio stream A.
  • When the time reaches T2, the terminal acquires the original video file A and generates a scene analysis heat map A based on the video image stream A and the audio stream A of the original video file A; it then determines, according to the scene analysis heat map A, at least one image focus area on image frame P10 of the video image stream A that satisfies the preset condition. That is, at time T2 the terminal determines the image focus area on image frame P10. Then, from T2 to T3, each of image frame P11 to image frame P18 determines at least one image focus area satisfying the preset condition according to the same scene analysis heat map A.
  • To continuously determine the image focus area of the image frame currently displayed, when the time reaches T3 the terminal obtains the original video file B automatically collected between T2 and T3, where the original video file B includes the video image stream B (image frame P10 to image frame P19) and the audio stream B.
  • When the time reaches T3, the terminal acquires the original video file B and generates a scene analysis heat map B based on the video image stream B and the audio stream B of the original video file B; it then determines, according to the scene analysis heat map B, at least one image focus area on image frame P19 of the video image stream B that satisfies the preset condition. That is, at time T3 the terminal determines the image focus area on image frame P19.
  • In this second way, the terminal generates a scene analysis heat map only at intervals, so that multiple image frames reuse one scene analysis heat map; this reduces the number of heat map calculations per unit time and occupies fewer of the terminal's processing resources. Both update strategies are sketched below.
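  • The two update strategies can be contrasted with the following sketch; `run_model` stands in for the heat map generation of step S12, and all names are hypothetical.

```python
def run_model(clip):
    """Stand-in for step S12: a window of frames (plus audio) -> scene analysis heat map."""
    return f"heat map from {clip[0]}..{clip[-1]}"

def per_frame_updates(frames, window=10):
    """First way (FIG. 8): slide the window one frame at a time and recompute
    a fresh heat map for every newly displayed frame."""
    return {frames[end - 1]: run_model(frames[end - window:end])
            for end in range(window, len(frames) + 1)}

def interval_updates(frames, window=10):
    """Second way (FIG. 9): recompute only once per window; the frames in
    between reuse the most recent heat map, saving processing resources."""
    heat_maps, current = {}, None
    for end in range(window, len(frames) + 1):
        if (end - window) % window == 0:      # recompute at P10, P20, ... (approximating FIG. 9's cadence)
            current = run_model(frames[end - window:end])
        heat_maps[frames[end - 1]] = current
    return heat_maps

frames = [f"P{i}" for i in range(1, 20)]      # P1..P19, as in FIG. 9
print(interval_updates(frames)["P15"])        # P11..P18 reuse the heat map computed at P10
```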
  • FIG. 10 is a flowchart of another method for determining an image focus area provided by an embodiment of the present application.
  • The method shown in FIG. 10 is a specific implementation of step S12 in FIG. 3, that is, of "generating a scene analysis heat map based on the video image stream and the audio stream", and includes the following steps.
  • Step S21: Split the original video file into a video image stream and an audio stream.
  • For example, the terminal can use multimedia video processing software tools to split the original video file into a video image stream and an audio stream, as sketched below.
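  • As one concrete possibility for step S21 (the embodiment does not name a specific tool), the following sketch invokes the FFmpeg command-line tool from Python; the file names are illustrative.

```python
import subprocess

def split_video(src: str, video_out: str, audio_out: str) -> None:
    # -an drops the audio track, keeping only the video image stream.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-an", "-c:v", "copy", video_out], check=True)
    # -vn drops the video track, keeping only the audio stream.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vn", audio_out], check=True)

# Example usage (assumes FFmpeg is installed and original.mp4 exists):
# split_video("original.mp4", "video_stream.mp4", "audio_stream.wav")
```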
  • Step S22: Acquire the pre-stored neural network model.
  • the neural network model can be generated by a server with strong computing power.
  • the neural network model generated by the server includes an audio network, an image network, and a fusion network.
  • the terminal can download the neural network model generated on the server to the local memory in advance.
  • the terminal can obtain the pre-stored neural network model in the local memory.
  • During training, the server inputs a large number of video samples into the neural network model so that the neural network model learns from them.
  • After training, the neural network model can identify the image focus area in the target image frame of a video image stream that is most likely to be emitting sound.
  • Step S23: Use the audio network of the neural network model to process the audio stream to obtain a three-dimensional audio matrix.
  • The three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream, and the feature information of the audio stream.
  • For example, the terminal uses the audio network of the neural network model to process the audio stream to obtain a three-dimensional audio matrix in which the duration of the audio stream is 2 seconds, the frequency distribution of the audio stream is 0-8 kHz, and the feature information of the audio stream is, for example, one or more stereo channels.
  • Step S24: Use the image network of the neural network model to process the video image stream to obtain a four-dimensional image matrix.
  • The four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream (image length), the width of the video image stream (image width), and the feature information of the video image stream.
  • For example, the terminal uses the image network of the neural network model to process the video image stream to obtain a four-dimensional image matrix in which the duration of the video image stream is 2 seconds, the length is 1920 pixels, the width is 1080 pixels, and the feature information is, for example, the three RGB channels or the channels of another color gamut.
  • Finally, the fusion network of the neural network model is used to fuse the three-dimensional audio matrix and the four-dimensional image matrix to obtain the scene analysis heat map. A sketch of this three-network structure follows.
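  • The following PyTorch sketch shows one possible shape of the three-network structure described in steps S23 and S24 and the fusion step; all layer choices and sizes are illustrative assumptions, since the embodiment specifies only the inputs and outputs of each network (spectrogram, video clip, 3-D audio matrix, 4-D image matrix, fused heat map), not their internal architecture.

```python
import torch
import torch.nn as nn

class AudioNet(nn.Module):
    """Audio network: spectrogram (batch, 1, freq, time) -> 3-D audio matrix."""
    def __init__(self, feat=16):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(1, feat, 3, padding=1), nn.ReLU())
    def forward(self, spec):
        return self.conv(spec)   # (batch, feat, freq, time): duration x frequency x features

class ImageNet(nn.Module):
    """Image network: video clip (batch, 3, frames, H, W) -> 4-D image matrix."""
    def __init__(self, feat=16):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv3d(3, feat, 3, padding=1), nn.ReLU())
    def forward(self, clip):
        return self.conv(clip)   # (batch, feat, frames, H, W): duration x length x width x features

class FusionNet(nn.Module):
    """Fusion network: combine both matrices into a per-pixel sound-probability heat map."""
    def __init__(self, feat=16):
        super().__init__()
        self.head = nn.Conv2d(2 * feat, 1, 1)
    def forward(self, a, v):
        a = a.mean(dim=(2, 3), keepdim=True)          # pool audio to one feature vector per clip
        v = v.mean(dim=2)                             # pool video over time -> (batch, feat, H, W)
        a = a.expand(-1, -1, v.shape[2], v.shape[3])  # broadcast audio features over pixels
        return torch.sigmoid(self.head(torch.cat([a, v], dim=1)))  # (batch, 1, H, W) probabilities

audio_m = AudioNet()(torch.randn(1, 1, 257, 63))      # e.g. STFT magnitude from the earlier sketch
image_m = ImageNet()(torch.randn(1, 3, 10, 90, 160))  # e.g. 10 frames P1..P10
heat_map = FusionNet()(audio_m, image_m)              # same spatial size as the video frames
print(heat_map.shape)                                 # torch.Size([1, 1, 90, 160])
```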
  • The process in which the terminal uses the neural network model to process the video image stream and the audio stream to obtain the scene analysis heat map resembles the way a human processes sound information and image information.
  • the audio network of the neural network model is similar to the human ear.
  • the audio network of the neural network model is used to analyze the sound information and obtain the sound analysis conclusion.
  • the image network of the neural network model is similar to the human eye.
  • the image network of the neural network model is used to analyze image information and obtain image analysis conclusions.
  • the fusion network of the neural network model is similar to the human brain.
  • The fusion network of the neural network model is used to analyze and fuse the sound analysis conclusions obtained by the audio network and the image analysis conclusions obtained by the image network, finally determining the image focus area in the target image frame of the video image stream that is most likely to be emitting sound and presenting it in the scene analysis heat map; the scene analysis heat map generated by the neural network model therefore identifies the most likely sound-emitting image focus area in the target image frame with higher accuracy.
  • FIG. 11 is a schematic diagram of how to generate a scene analysis heat map provided by an embodiment of the present application.
  • T1 and T2 represent two time points, T1 is earlier than T2.
  • the terminal automatically collects the original video file 10 between T1 and T2, where the original video file 10 includes a video image stream 101 and an audio stream 102.
  • When the time reaches T2, the terminal acquires the original video file 10 and splits it into the video image stream 101 (image frame P1 to image frame P10) and the audio stream 102; it then inputs the video image stream 101 into the image network 201 of the neural network model 20 and the audio stream 102 into the audio network 202 of the neural network model 20.
  • After the image network 201 and the audio network 202 of the neural network model 20 receive the video image stream 101 and the audio stream 102 respectively, the image network 201 processes the video image stream 101 to obtain a four-dimensional image matrix, and the audio network 202 processes the audio stream 102 to obtain a three-dimensional audio matrix.
  • The image network 201 and the audio network 202 then send the four-dimensional image matrix and the three-dimensional audio matrix, respectively, to the fusion network 203 of the neural network model 20, and the fusion network 203 fuses the three-dimensional audio matrix and the four-dimensional image matrix to obtain the scene analysis heat map 30.
  • FIG. 12 is a schematic diagram of a device provided by an embodiment of the present application.
  • the device may be located in the terminal described above, and is used in combination with an image collection device such as a camera and a sound collection device such as a microphone.
  • the device includes the following modules.
  • the obtaining module 11 is used to obtain an original video file.
  • the original video file includes a video image stream and an audio stream.
  • the video image stream includes multiple image frames and is generated by image data collected by an image acquisition device.
  • The audio stream includes multiple sound frames and is generated from sound data collected by the sound collection device.
  • the generating module 12 is configured to generate a scene analysis heat map according to the video image stream and the audio stream, and the scene analysis heat map is used to indicate the probability of the presence of sound on each image unit in the multiple image units on the target image frame of the video image stream.
  • the determining module 13 is configured to determine at least one image focus area on the target image frame of the video image stream that satisfies the preset condition according to the scene analysis heat map, and each image area in the at least one image focus area includes at least one image unit.
  • In a possible implementation, the target image frame is the last frame of the multiple image frames in the video image stream.
  • For details of the generation module 12, refer to step S12 in the method embodiment shown in FIG. 3 above.
  • In a possible implementation, the preset condition is that, for each image focus area, the probability that sound exists on each of one or more image units in the area reaches a preset probability threshold.
  • the device shown in FIG. 12 may further include: a focusing module 14 for controlling the image acquisition device to focus on at least one image focusing area.
  • In a possible implementation, the generating module 12 may further include: a splitting module, configured to split the original video file into a video image stream and an audio stream; and a processing module, configured to process the video image stream and the audio stream with a neural network model to obtain a scene analysis heat map.
  • the apparatus shown in FIG. 12 may further include: a conversion module 15 for converting the audio stream from the time domain form to the frequency domain form.
  • In a possible implementation, the processing module may further include: an audio processing module, configured to process the audio stream with the audio network of the neural network model to obtain a three-dimensional audio matrix, where the three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream, and the feature information of the audio stream; an image processing module, configured to process the video image stream with the image network of the neural network model to obtain a four-dimensional image matrix, where the four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream, the width of the video image stream, and the feature information of the video image stream; and a fusion processing module, configured to fuse the three-dimensional audio matrix and the four-dimensional image matrix with the fusion network of the neural network model to obtain the scene analysis heat map.
  • any one or more of the modules in FIG. 12 above may be implemented by software, hardware, or a combination of software and hardware.
  • the software includes software program instructions and is executed by one or more processors.
  • The hardware may include digital logic circuits, algorithm circuits, programmable gate arrays, processors, or dedicated circuits; the above circuits may be located in one or more chips.
  • FIG. 13 is a schematic diagram of yet another device provided by an embodiment of the present application.
  • The device 2 includes an image acquisition device 21, an audio acquisition device 22, a central processor 23, an image processor 24, a random access memory (RAM) 25, a non-volatile memory (NVM) 26, and a bus 27.
  • the bus 27 is used to communicate with other components.
  • the device 2 shown in FIG. 13 is equivalent to the circuit board, chip or chipset in the smart phone 1 in FIGS. 1 and 2, and can selectively run various types of software, such as application software, driver software, or operating system software.
  • The central processor 23 is used to control one or more other components, and the image processor 24 is used to perform the method of this embodiment.
  • The image processor 24 may include one or more processors to perform the method flows of the previous embodiments.
  • S11 in FIG. 3 may be executed by an ISP or a dedicated processor.
  • At least part of the flow of S12 in FIG. 3 and FIG. 10 may be executed by at least one of a neural processing unit (NPU), a data signal processor, or a central processor 23.
  • S13 in FIG. 3 may be executed by the ISP or the central processor 23.
  • the NPU is a device built with the neural network model, and is dedicated to neural network operations.
  • the central processor 23 may also run artificial intelligence software to perform corresponding operations using neural network models.
  • Each processor mentioned above can execute the necessary software to work.
  • some processors, such as ISP may also be pure hardware devices.
  • For other details of the device 2 in FIG. 13, refer to the detailed description of the smartphone 1 in the embodiments corresponding to FIGS. 1 and 2, and to the detailed description of the terminal in the embodiments corresponding to FIGS. 3 to 11.
  • Computer-readable media includes computer storage media and communication media, where communication media includes any medium that facilitates transfer of a computer program from one place to another.
  • A storage medium may be any available medium that can be accessed by a computer. By way of example and not limitation, computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • Also, any connection may properly be termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium.
  • Disks and discs, as used herein, include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), floppy disks, and Blu-ray discs, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.

Abstract

A method and apparatus for determining an image focusing region. The method comprises: acquiring an original video file, wherein the original video file comprises a video image stream and an audio stream (S11); generating a scene analysis heat map according to the video image stream and the audio stream (S12), wherein the scene analysis heat map is used for indicating the probability of a sound existing on each image unit from among multiple image units on a target image frame of the video image stream; and determining, according to the scene analysis heat map, at least one image focusing region, meeting a preset condition, on the target image frame of the video image stream (S13), wherein each image region in the at least one image focusing region comprises at least one image unit. The present method can determine a region, where a sound is most possibly produced, on a target image frame of a video image stream; moreover, other than an image collection apparatus and a sound collection apparatus, no extra devices need to be added to a terminal, thereby reducing the cost of the devices.

Description

Method and device for determining image focus area

Technical Field

Embodiments of the present application relate to the technical field of image processing, and more specifically, to a method and device for determining an image focus area.

Background

At present, in related image focusing technology, after the camera software is opened on a terminal, the terminal's camera collects an image and the terminal's touch screen displays the image collected by the camera; the user then taps the image area of interest on the touch screen, and the camera focuses on the tapped image area; after focusing succeeds, the user can use the terminal to take videos or photos.

In the above related image focusing technology, although the terminal can achieve image focusing, if the position of the image area of interest on the touch screen changes, the user needs to tap the image area of interest again so that the camera refocuses on the newly tapped area. The above related image focusing technology passively waits for a focus instruction from the user and cannot actively identify the image area the user is interested in.

At present, the industry also offers auxiliary focusing methods such as the ranging method and the multi-microphone array method. For the ranging method, the terminal needs to actively emit infrared light or ultrasonic waves, which increases the device cost of the terminal's focusing system. For the multi-microphone array method, the terminal requires a large number of microphone units to achieve good performance, which likewise increases the device cost of the terminal's focusing system.

Therefore, the various related focusing methods usually suffer from high terminal device cost and an inability to actively focus on the image area the user is interested in.
Summary of the Invention

Embodiments of the present application provide a method and apparatus for determining an image focus area that can determine, without increasing device cost, the image focus area in an image most likely to be emitting sound.

The embodiments of the present application are implemented as follows:

In a first aspect, an embodiment of the present application provides a method for determining an image focus area. The method includes: acquiring an original video file, where the original video file includes a video image stream and an audio stream, the video image stream includes multiple image frames and is generated from image data collected by an image acquisition device, and the audio stream includes multiple sound frames and is generated from sound data collected by a sound collection device; generating a scene analysis heat map based on the video image stream and the audio stream, where the scene analysis heat map is used to indicate the probability that sound exists on each of multiple image units on a target image frame of the video image stream; and determining, according to the scene analysis heat map, at least one image focus area on the target image frame that satisfies a preset condition, where each image area in the at least one image focus area includes at least one image unit.

In the first aspect, the method provided by the embodiments of the present application may be applied to a terminal having an image acquisition device and a sound collection device. The terminal can actively obtain the original video file through these devices, generate a scene analysis heat map from the video image stream and the audio stream, and finally determine, according to the heat map, at least one image focus area on the target image frame of the video image stream that satisfies the preset condition; the embodiment of the present application can therefore determine the area on the target image frame most likely to be emitting sound. Moreover, no devices beyond the image acquisition device and the sound collection device need to be added to the terminal, thereby reducing device cost.
In a possible implementation, the target image frame is the last frame of the multiple image frames in the video image stream.

In a possible implementation, the preset condition is that, for each image focus area, the probability that sound exists on each of one or more image units in the area reaches a preset probability threshold.

In a possible implementation, after the at least one image focus area satisfying the preset condition is determined on the target image frame according to the scene analysis heat map, the method further includes: controlling the image acquisition device to focus on the at least one image focus area.

In a possible implementation, generating the scene analysis heat map from the video image stream and the audio stream includes: splitting the original video file into the video image stream and the audio stream; and processing the video image stream and the audio stream with a neural network model to obtain the scene analysis heat map.

In a possible implementation, before the scene analysis heat map is generated from the video image stream and the audio stream, the method further includes: converting the audio stream from time-domain form to frequency-domain form.

In a possible implementation, processing the video image stream and the audio stream with the neural network model to obtain the scene analysis heat map includes: processing the audio stream with the audio network of the neural network model to obtain a three-dimensional audio matrix, where the three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream, and the feature information of the audio stream; processing the video image stream with the image network of the neural network model to obtain a four-dimensional image matrix, where the four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream, the width of the video image stream, and the feature information of the video image stream; and fusing the three-dimensional audio matrix and the four-dimensional image matrix with the fusion network of the neural network model to obtain the scene analysis heat map.

In the process in which the terminal uses the neural network model to process the video image stream and the audio stream to obtain the scene analysis heat map, the audio network of the neural network model analyzes the sound information and produces a sound analysis conclusion, the image network analyzes the image information and produces an image analysis conclusion, and the fusion network analyzes and fuses the two conclusions, finally determining the image focus area in the target image frame of the video image stream that is most likely to be emitting sound and presenting it in the scene analysis heat map; the scene analysis heat map generated by the neural network model therefore identifies the most likely sound-emitting image focus area in the target image frame with higher accuracy.
According to a second aspect, an embodiment of this application provides an apparatus, including: an acquisition module, configured to acquire an original video file, where the original video file includes a video image stream and an audio stream, the video image stream includes multiple image frames and is generated from image data collected by an image acquisition device, and the audio stream includes multiple sound frames and is generated from sound data collected by a sound acquisition device; a generation module, configured to generate a scene analysis heat map according to the video image stream and the audio stream, where the scene analysis heat map is used to indicate, for each of multiple image units on a target image frame of the video image stream, the probability that sound exists on the image unit; and a determination module, configured to determine, according to the scene analysis heat map, at least one image focus area on the target image frame of the video image stream that satisfies a preset condition, where each image area in the at least one image focus area includes at least one image unit.
In a possible implementation, the target image frame is the last frame of the multiple image frames in the video image stream.
In a possible implementation, the preset condition is that the probability that sound exists on each of one or more image units among the at least one image unit in the at least one image focus area reaches a preset probability threshold.
In a possible implementation, the apparatus further includes: a focusing module, configured to control the image acquisition device to focus on the at least one image focus area.
In a possible implementation, the generation module includes: a splitting module, configured to split the original video file into the video image stream and the audio stream; and a processing module, configured to process the video image stream and the audio stream by using a neural network model to obtain the scene analysis heat map.
In a possible implementation, the apparatus further includes: a conversion module, configured to convert the audio stream from a time-domain form to a frequency-domain form.
In a possible implementation, the processing module includes: an audio processing module, configured to process the audio stream by using an audio network of the neural network model to obtain a three-dimensional audio matrix, where the three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream, and feature information of the audio stream; an image processing module, configured to process the video image stream by using an image network of the neural network model to obtain a four-dimensional image matrix, where the four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream, the width of the video image stream, and feature information of the video image stream; and a fusion processing module, configured to fuse the three-dimensional audio matrix and the four-dimensional image matrix by using a fusion network of the neural network model to obtain the scene analysis heat map.
According to a third aspect, an embodiment of this application provides an apparatus, including one or more processors, configured to perform the method in the first aspect or any possible implementation of the first aspect. Optionally, the apparatus includes a memory, configured to store software instructions that drive the processors.
According to a fourth aspect, an embodiment of this application provides a computer-readable storage medium that stores instructions, where the instructions, when run on a computer or a processor, cause the computer or the processor to perform the method in the first aspect or any possible implementation of the first aspect.
According to a fifth aspect, an embodiment of this application provides a computer program product containing instructions, where the instructions, when run on a computer or a processor, cause the computer or the processor to perform the method in the first aspect or any possible implementation of the first aspect.
BRIEF DESCRIPTION OF DRAWINGS
FIG. 1 is a schematic diagram of camera software of a smartphone in which a focus area has not yet been determined according to an embodiment of this application;
FIG. 2 is a schematic diagram of camera software of a smartphone in which a focus area has been determined according to an embodiment of this application;
FIG. 3 is a flowchart of a method for determining an image focus area according to an embodiment of this application;
FIG. 4 is a schematic diagram of a video image stream and an audio stream of an original video file according to an embodiment of this application;
FIG. 5 is a schematic diagram of a target image frame of a video image stream according to an embodiment of this application;
FIG. 6 is a schematic diagram of a scene analysis heat map according to an embodiment of this application;
FIG. 7 is a schematic diagram of an image focus area determined on a target image frame according to an embodiment of this application;
FIG. 8 is a schematic diagram of a video image stream and an audio stream of another original video file according to an embodiment of this application;
FIG. 9 is a schematic diagram of a video image stream and an audio stream of still another original video file according to an embodiment of this application;
FIG. 10 is a flowchart of another method for determining an image focus area according to an embodiment of this application;
FIG. 11 is a schematic diagram of how a scene analysis heat map is generated according to an embodiment of this application;
FIG. 12 is a schematic diagram of an apparatus according to an embodiment of this application;
FIG. 13 is a schematic diagram of still another apparatus according to an embodiment of this application.
DETAILED DESCRIPTION
An embodiment of this application provides a method for determining an image focus area. The method provided in the embodiments of this application can be applied to different types of terminals, for example, smartphones, tablet computers, digital cameras, or smart cameras. Alternatively, the technical solution of this application can also be applied to electronic devices that do not have a communication function. The following uses the application of the technical solution on a terminal as an example.
The method provided in the embodiments of this application is described below with reference to a specific technical scenario. Refer to FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram of camera software of a smartphone in which a focus area has not yet been determined according to an embodiment of this application, and FIG. 2 is a schematic diagram of the camera software of the smartphone in which the focus area has been determined according to an embodiment of this application. It should be noted that the embodiments shown in FIG. 1 and FIG. 2 are intended to help the reader quickly understand the technical principles of the embodiments of this application, and are not intended to limit the protection scope of the embodiments of this application. The specific parameter values mentioned in the embodiments shown in FIG. 1 and FIG. 2 may be varied according to the principles of the embodiments of this application, and the protection scope of the embodiments of this application is not limited to the specific parameter values mentioned.
As shown in FIG. 1, assume that a user wants to use smartphone 1 to photograph or record a bird in a zoo. First, the user opens the camera software of smartphone 1 and points the camera of smartphone 1 at the bird; smartphone 1 then invokes the camera to collect an image of the bird and displays the collected image on the touchscreen of smartphone 1.
Assuming that a software program implementing the method for determining an image focus area provided in this embodiment of this application is pre-installed in smartphone 1, smartphone 1 automatically invokes the camera and the microphone to collect a 2-second original video file. Smartphone 1 in this embodiment includes a camera function; when performing the camera function, it can be regarded as a camera that includes the camera lens and an image signal processor (ISP). Smartphone 1 then acquires the 2-second original video file and uses a neural network model pre-stored in smartphone 1 to generate a scene analysis heat map according to the video image stream and the audio stream of the 2-second original video file. The scene analysis heat map indicates the image area in the video image stream that is most likely to emit sound.
As shown in FIG. 2, assume that the bird keeps chirping during the 2-second original video file acquired by smartphone 1. The neural network model of smartphone 1 then calculates, from the 2-second original video file, that the bird's beak is the image area most likely to emit sound, so the generated scene analysis heat map indicates that the bird's beak in area A in FIG. 2 is the image area most likely to emit sound. Smartphone 1 then focuses on the bird's beak in area A in FIG. 2, thereby achieving automatic focusing on the sound-emitting area in the image. After smartphone 1 focuses on the bird's beak in area A in FIG. 2, the user can take photos or record video. Moreover, smartphone 1 can continuously repeat the method for determining an image focus area provided in this embodiment of this application, thereby continuously generating scene analysis heat maps that indicate in real time the image area most likely to emit sound.
The technical scenario of the embodiments of this application is briefly introduced above through the application examples shown in FIG. 1 and FIG. 2. The following describes the execution procedure, technical principles, and embodiments of the method for determining an image focus area provided in the embodiments of this application.
Refer to FIG. 3, which is a flowchart of a method for determining an image focus area according to an embodiment of this application. The method shown in FIG. 3 can determine, without increasing device cost, the image focus area in an image that is most likely to emit sound. The method includes the following steps.
Step S11: Acquire an original video file. In step S11, as described in the embodiment shown in FIG. 1, after the camera software is started, the terminal automatically invokes the camera and the microphone to collect an original video file of a predetermined time period and acquires the original video file of the predetermined time period.
In the example of FIG. 1, the predetermined time period is 2 seconds. The predetermined time period may be preset according to factors such as the hardware configuration of the terminal. When the hardware configuration of the terminal is high, the predetermined time period may be shortened appropriately, for example, set to 1 second or 0.5 second, or even shorter; when the hardware configuration of the terminal is low, the predetermined time period may be extended appropriately, for example, set to 3 seconds or 4 seconds, or even longer. The embodiments of this application do not limit the specific length of the predetermined time period, and the specific values given above are intended only to illustrate the principle of adjusting it.
The original video file includes a video image stream and an audio stream. The video image stream includes multiple image frames and is generated from image data collected by an image acquisition device, and the audio stream includes multiple sound frames and is generated from sound data collected by a sound acquisition device. The image acquisition device may be the camera of the terminal described in the foregoing embodiments, and may optionally include the ISP. The sound acquisition device may be the microphone of the terminal, and may optionally include a voice processing channel or circuit that processes the signals collected by the microphone.
For example, refer to FIG. 4, which is a schematic diagram of a video image stream and an audio stream of an original video file according to an embodiment of this application. In FIG. 4, the horizontal axis represents the time axis, and time point T1 is earlier than time point T2; assume that the original video file spans T1 to T2. The original video file includes a video image stream and an audio stream, where the video image stream has 10 image frames (image frames P1 to P10) and the audio stream has 30 sound frames (not shown in the figure). The video image stream and the audio stream are used to generate the scene analysis heat map.
Step S12: Generate a scene analysis heat map according to the video image stream and the audio stream. The scene analysis heat map is used to indicate, for each of multiple image units on a target image frame of the video image stream, the probability that sound exists on the image unit; an image unit may be a pixel. The probability that sound exists on an image unit corresponds to the probability that the photographed object at that image unit emits sound. The object may be a person, an animal, a musical instrument, a device, or another object.
In the process in which the terminal generates the scene analysis heat map according to the video image stream and the audio stream, the terminal combines the video image stream and the audio stream to calculate the probability that sound exists on each image unit on the target image frame of the video image stream, and generates the scene analysis heat map according to these probabilities. The scene analysis heat map is one frame of image, and its resolution is the same as the resolution of the target image frame of the video image stream. The target image frame of the video image stream may be the last frame of the multiple image frames in the video image stream. For example, as shown in FIG. 4, the video image stream in FIG. 4 has 10 image frames, and the target image frame is the last image frame P10.
Refer to FIG. 5 and FIG. 6. FIG. 5 is a schematic diagram of a target image frame of a video image stream according to an embodiment of this application, and FIG. 6 is a schematic diagram of a scene analysis heat map according to an embodiment of this application. For ease of illustration, in FIG. 5 and FIG. 6 it is assumed that the resolution of the target image frame and the resolution of the scene analysis heat map are both 5 pixels × 3 pixels. In an actual scenario, the resolution of the target image frame and the resolution of the scene analysis heat map are the resolution preset by the user; for example, if the user presets a resolution of 1920 pixels × 1080 pixels, both resolutions are 1920 pixels × 1080 pixels.
As shown in FIG. 5 and FIG. 6, assuming that the resolution of the target image frame of the video image stream is 5 pixels × 3 pixels, the terminal can combine the video image stream and the audio stream to calculate the probability that sound exists on each of the 15 pixels on the target image frame.
In FIG. 5, each circle represents one pixel, so FIG. 5 contains 15 pixels in total, and each pixel in FIG. 5 has an original color; for example, assume the original color of all 15 pixels is red. Assume that, after combining the video image stream and the audio stream, the terminal determines that the probability that sound exists on each of pixels (11, 12, 21, 22, 31, 32) is greater than or equal to 50%, while the probability that sound exists on each of pixels (13, 14, 15, 23, 24, 25, 33, 34, 35) is less than 50%. The terminal then generates the scene analysis heat map according to the probabilities of sound on the 15 pixels of the target image frame.
In FIG. 6, to distinguish the different probabilities of sound on the pixels, the terminal, based on the target image frame, changes the color of the pixels whose probability of emitting sound is greater than or equal to 50% (11, 12, 21, 22, 31, 32) to black, and changes the color of the pixels whose probability of emitting sound is less than 50% (13, 14, 15, 23, 24, 25, 33, 34, 35) to white, thereby converting the target image frame shown in FIG. 5 into the scene analysis heat map shown in FIG. 6.
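Merely as an illustrative sketch (the embodiments do not prescribe a particular implementation), the black-and-white rendering described above can be expressed in Python with NumPy. The 3 × 5 probability array mirrors the example of FIG. 5 and FIG. 6, and all function and variable names are hypothetical:

```python
import numpy as np

def render_heat_map(prob_map: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Render a per-pixel sound-probability map as a black/white heat map.

    prob_map: 2-D array of shape (height, width) with values in [0, 1],
    one sound probability per pixel of the target image frame.
    Returns an 8-bit grayscale image: 0 (black) where the probability
    reaches the threshold, 255 (white) elsewhere.
    """
    return np.where(prob_map >= threshold, 0, 255).astype(np.uint8)

# The 5x3 example of FIG. 5/FIG. 6: the two left columns carry sound.
probs = np.array([
    [0.8, 0.7, 0.1, 0.2, 0.1],
    [0.9, 0.6, 0.3, 0.1, 0.2],
    [0.7, 0.8, 0.2, 0.1, 0.1],
])
print(render_heat_map(probs))  # black block on the left, white elsewhere
```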
Of course, the scene analysis heat map shown in FIG. 6 is only an example. The color of each pixel in the scene analysis heat map can be set according to the actual situation, and distinguishing the pixels is not limited to using white and black.
Before step S12, that is, before the scene analysis heat map is generated according to the video image stream and the audio stream, the audio stream may be converted from a time-domain form to a frequency-domain form, and the scene analysis heat map is then generated according to the video image stream and the audio stream converted into the frequency-domain form. For example, before the scene analysis heat map is generated, the audio stream is first Fourier-transformed to obtain a short-time Fourier spectrum, and the scene analysis heat map is then generated according to the video image stream and the short-time Fourier spectrum. Of course, the scene analysis heat map can also be generated directly from the time-domain audio stream without performing the Fourier transform.
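As an illustrative sketch of this optional preprocessing step, the short-time Fourier spectrum can be computed with scipy.signal.stft. The 16 kHz sample rate, window length, and overlap below are assumed values chosen for illustration, not parameters taken from the embodiment:

```python
import numpy as np
from scipy.signal import stft

def audio_to_spectrum(audio: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Convert a time-domain audio stream into a short-time Fourier spectrum.

    audio: 1-D array of time-domain samples for the clip (e.g. 2 seconds).
    Returns the magnitude spectrogram of shape (freq_bins, time_frames),
    which can be fed to the audio network in place of the raw waveform.
    """
    freqs, times, zxx = stft(audio, fs=sample_rate, nperseg=512, noverlap=256)
    return np.abs(zxx)

# Example: a 2-second clip at 16 kHz (random samples as a stand-in).
clip = np.random.randn(2 * 16000).astype(np.float32)
spectrum = audio_to_spectrum(clip)
print(spectrum.shape)  # about (257, 126): frequency bins x time frames
```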
Step S13: Determine, according to the scene analysis heat map, at least one image focus area on the target image frame of the video image stream that satisfies a preset condition. In step S13, each image area in the at least one image focus area includes at least one image unit, and the preset condition is that the probability that sound exists on each of one or more image units among the at least one image unit in the at least one image focus area reaches a preset probability threshold, where the preset probability threshold is set in advance. Optionally, for an area, when the probability that sound exists on every image unit of the at least one image unit in the area reaches the preset probability threshold, the area may be determined to be an image focus area. Optionally, for an area, when more than a predetermined number of the image units in the area have a sound probability reaching the preset probability threshold, or when the proportion of image units whose sound probability reaches the preset probability threshold exceeds a proportion threshold, the area may be determined to be an image focus area. In addition, other judgment methods are possible, as long as the probability that sound exists in the area can be determined from the sound probabilities of the at least one image unit in the area, so as to determine whether the area is an image focus area.
After step S13, the terminal can control the image acquisition device to focus on the at least one image focus area. For example, refer to FIG. 7, which is a schematic diagram of an image focus area determined on a target image frame according to an embodiment of this application. Assuming that the preset probability threshold is 50%, the terminal determines, according to the scene analysis heat map, the adjacent pixels on the target image frame of the video image stream that reach the 50% threshold (11, 12, 21, 22, 31, 32); these pixels form area B in FIG. 7, and area B is one image focus area. The terminal can then control the image acquisition device to focus on the image focus area (area B) in FIG. 7.
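As an illustrative sketch of step S13 (assuming, as one of the optional judgment methods above, that adjacent above-threshold pixels are grouped into regions), connected pixels reaching the threshold can be collected into candidate focus windows with scipy.ndimage; all names are hypothetical:

```python
import numpy as np
from scipy import ndimage

def find_focus_areas(prob_map: np.ndarray, threshold: float = 0.5):
    """Group adjacent above-threshold pixels into candidate focus areas.

    Returns a list of (row_slice, col_slice) bounding boxes, one per
    connected region of pixels whose sound probability reaches the
    threshold. Each box can be handed to the camera driver as a focus window.
    """
    mask = prob_map >= threshold
    labels, num_regions = ndimage.label(mask)   # 4-connected components
    return ndimage.find_objects(labels)         # one bounding box per region

probs = np.array([
    [0.8, 0.7, 0.1, 0.2, 0.1],
    [0.9, 0.6, 0.3, 0.1, 0.2],
    [0.7, 0.8, 0.2, 0.1, 0.1],
])
for box in find_focus_areas(probs):
    print(box)  # the 3x2 block on the left, i.e. area B in FIG. 7
```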
In the embodiment shown in FIG. 3, the method provided in this embodiment of this application can be applied to a terminal having an image acquisition device and a sound acquisition device. The terminal can actively acquire the original video file through the image acquisition device and the sound acquisition device, generate the scene analysis heat map according to the video image stream and the audio stream, and finally determine, according to the scene analysis heat map, at least one image focus area on the target image frame of the video image stream that satisfies the preset condition. Therefore, this embodiment of this application can determine the area on the target image frame of the video image stream that is most likely to emit sound. Moreover, no devices other than the image acquisition device and the sound acquisition device need to be added to the terminal; the image acquisition device and the sound acquisition device can be components the terminal already has, such as the camera or microphone of a smartphone, thereby reducing device cost.
The embodiment shown in FIG. 3 describes a method of generating one scene analysis heat map from one segment of original video file and using that heat map to determine the image focus area of one target image frame. Based on this principle, the following describes how to generate multiple scene analysis heat maps from multiple segments of original video files and use them to determine the image focus areas of multiple target image frames.
In the first manner, refer to FIG. 8, which is a schematic diagram of a video image stream and an audio stream of another original video file according to an embodiment of this application. In FIG. 8, the horizontal axis represents the time axis, and T1 to T4 represent four time points, where T1 is earlier than T2, T2 is earlier than T3, and T3 is earlier than T4. Assuming that the camera software of the terminal is opened by the user at time T1, the terminal automatically collects original video file A between T1 and T3, where original video file A includes video image stream A and audio stream A. When time T3 arrives, the terminal acquires original video file A, generates scene analysis heat map A according to video image stream A (image frames P1 to P10) and audio stream A of original video file A, and then determines, according to scene analysis heat map A, at least one image focus area on image frame P10 of video image stream A that satisfies the preset condition; that is, the terminal determines the image focus area on image frame P10 at time T3.
To continuously determine the image focus area of the image frame currently displayed by the terminal, when time T4 arrives, the terminal acquires original video file B automatically collected between T2 and T4, generates scene analysis heat map B according to video image stream B (image frames P2 to P11) and audio stream B of original video file B, and then determines, according to scene analysis heat map B, at least one image focus area on image frame P11 of video image stream B that satisfies the preset condition; that is, the terminal determines the image focus area on image frame P11 at time T4.
In the first manner, the terminal can determine a corresponding scene analysis heat map for each image frame, and use the scene analysis heat map corresponding to each image frame to determine at least one image focus area satisfying the preset condition for that frame, so that the image focus area can be determined more accurately for each image frame.
In the second manner, refer to FIG. 9, which is a schematic diagram of a video image stream and an audio stream of still another original video file according to an embodiment of this application. In FIG. 9, the horizontal axis represents the time axis, and T1 to T3 represent three time points, where T1 is earlier than T2 and T2 is earlier than T3. Assuming that the camera software of the terminal is opened by the user at time T1, the terminal automatically collects original video file A between T1 and T2, where original video file A includes video image stream A (image frames P1 to P10) and audio stream A. When time T2 arrives, the terminal acquires original video file A, generates scene analysis heat map A according to video image stream A and audio stream A, and then determines, according to scene analysis heat map A, at least one image focus area on image frame P10 of video image stream A that satisfies the preset condition; that is, the terminal determines the image focus area on image frame P10 at time T2. Then, between T2 and T3, image frames P11 to P18 all use scene analysis heat map A to determine at least one image focus area satisfying the preset condition.
To continuously determine the image focus area of the image frame currently displayed by the terminal, when time T3 arrives, the terminal acquires original video file B automatically collected between T2 and T3, where original video file B includes video image stream B (image frames P10 to P19) and audio stream B. The terminal then generates scene analysis heat map B according to video image stream B and audio stream B of original video file B, and determines, according to scene analysis heat map B, at least one image focus area on image frame P19 of video image stream B that satisfies the preset condition; that is, the terminal determines the image focus area on image frame P19 at time T3.
In the second manner, the terminal generates one scene analysis heat map at intervals, so that multiple image frames can reuse a single scene analysis heat map. This reduces the number of times the scene analysis heat map is computed per unit time, and therefore occupies fewer processing resources of the terminal.
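The reuse schedule of the second manner can be sketched as follows; compute_heat_map and focus_on are hypothetical stand-ins for the neural network inference of step S12 and the focusing control of step S13, and the interval of 10 frames is illustrative:

```python
import numpy as np

def compute_heat_map(frames, audio):
    # Stand-in for the neural-network inference of step S12.
    return np.random.rand(3, 5)

def focus_on(prob_map, threshold=0.5):
    # Stand-in for driving the camera toward the above-threshold region.
    rows, cols = np.nonzero(prob_map >= threshold)
    if rows.size:
        print("focus window:", rows.min(), cols.min(), rows.max(), cols.max())

HEAT_MAP_INTERVAL = 10   # one heat map reused across 10 frames (illustrative)

heat_map = None
window_frames, window_audio = [], []
for i in range(30):      # 30 incoming frames, e.g. P1..P30
    frame, chunk = np.zeros((3, 5, 3)), np.zeros(1600)   # placeholder capture
    window_frames.append(frame)
    window_audio.append(chunk)
    if (i + 1) % HEAT_MAP_INTERVAL == 0:
        # A new window is complete: recompute the heat map once (time T2, T3, ...).
        heat_map = compute_heat_map(window_frames, window_audio)
        window_frames, window_audio = [], []
    if heat_map is not None:
        # Reuse the latest heat map for every frame until the next window is ready.
        focus_on(heat_map)
```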
Refer to FIG. 10, which is a flowchart of another method for determining an image focus area according to an embodiment of this application. The method shown in FIG. 10 is a specific implementation of step S12 in FIG. 3, that is, of "generating a scene analysis heat map according to the video image stream and the audio stream". The method includes the following steps.
Step S21: Split the original video file into a video image stream and an audio stream. A multimedia video processing tool, such as a software tool, may be used to split the original video file into the video image stream and the audio stream.
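As one concrete possibility (the embodiment does not name a specific multimedia tool), the split can be performed with the ffmpeg command-line tool invoked from Python, assuming the ffmpeg binary is available; the file names and the 16 kHz mono audio format are assumptions for illustration:

```python
import subprocess

def split_video(src: str, video_out: str, audio_out: str) -> None:
    """Split an original video file into a silent video stream and an audio stream."""
    # Keep the image stream, drop the audio (-an); copy the codec to avoid re-encoding.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-an", "-c:v", "copy", video_out],
                   check=True)
    # Keep the audio stream, drop the video (-vn); decode to 16 kHz mono PCM.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vn", "-ac", "1", "-ar", "16000",
                    audio_out], check=True)

split_video("original.mp4", "video_only.mp4", "audio_only.wav")
```

Splitting with -c:v copy leaves the image stream untouched, so this step adds little latency before the neural network processing.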
Step S22: Acquire a pre-stored neural network model. The neural network model may be generated by a server with strong computing power, and the neural network model generated by the server includes an audio network, an image network, and a fusion network. The terminal may download the neural network model generated on the server to a local memory in advance; when the terminal needs to use the neural network model, it can acquire the pre-stored model from the local memory. In the process of generating the neural network model, the server inputs a large number of video samples into the neural network model so that the model learns from them. After the server generates the neural network model, the model can identify the image focus area in the target image frame of a video image stream that is most likely to emit sound.
Step S23: Process the audio stream by using the audio network of the neural network model to obtain a three-dimensional audio matrix. The three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream, and feature information of the audio stream. For example, assume the terminal uses the audio network of the neural network model to process the audio stream to obtain a three-dimensional audio matrix, where the duration of the audio stream is 2 seconds, the frequency distribution of the audio stream is 0-8 kHz, and the feature information of the audio stream is, for example, at least one stereo channel.
Step S24: Process the video image stream by using the image network of the neural network model to obtain a four-dimensional image matrix. The four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream (image length), the width of the video image stream (image width), and feature information of the video image stream. For example, assume the terminal uses the image network of the neural network model to process the video image stream to obtain a four-dimensional image matrix, where the duration of the video image stream is 2 seconds, the length of the video image stream is 1920 pixels, the width of the video image stream is 1080 pixels, and the feature information of the video image stream is, for example, the three RGB channels or the channels of another color gamut.
Step S25: Fuse the three-dimensional audio matrix and the four-dimensional image matrix by using the fusion network of the neural network model to obtain the scene analysis heat map. In the embodiment shown in FIG. 10, the process in which the terminal uses the neural network model to process the video image stream and the audio stream to obtain the scene analysis heat map resembles the way humans process sound and image information. The audio network of the neural network model is similar to the human ear: it analyzes the sound information and draws a sound analysis conclusion. The image network of the neural network model is similar to the human eye: it analyzes the image information and draws an image analysis conclusion. The fusion network of the neural network model is similar to the human brain: it analyzes and fuses the sound analysis conclusion obtained by the audio network and the image analysis conclusion obtained by the image network, finally determining the image focus area most likely to emit sound in the target image frame of the video image stream and presenting that area in the scene analysis heat map. Therefore, identifying the image focus area most likely to emit sound in the target image frame by using the scene analysis heat map generated by the neural network model achieves higher accuracy.
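The embodiment does not disclose the internal architecture of the audio network, the image network, or the fusion network. The following PyTorch sketch only illustrates the data flow of steps S23 to S25 under assumed layer sizes and tensor shapes: the audio is summarized into a clip embedding that is tiled over the image grid and fused with the image features into a per-pixel sound-probability map:

```python
import torch
import torch.nn as nn

class AudioNet(nn.Module):
    """Maps a spectrogram to audio features, pooled into one clip embedding."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, embed_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, spec):            # spec: (B, 1, freq, time)
        return self.conv(spec).mean(dim=(2, 3))   # (B, embed_dim)

class ImageNet(nn.Module):
    """Maps the frame sequence (B, 3, T, H, W) to spatial features of the clip."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(32, embed_dim, kernel_size=3, padding=1), nn.ReLU(),
        )
    def forward(self, frames):          # frames: (B, 3, T, H, W)
        return self.conv(frames).mean(dim=2)      # (B, embed_dim, H, W)

class FusionNet(nn.Module):
    """Tiles the audio embedding over the image grid and predicts per-pixel
    sound probabilities."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.head = nn.Conv2d(2 * embed_dim, 1, kernel_size=1)
    def forward(self, audio_vec, image_map):
        b, c, h, w = image_map.shape
        audio_map = audio_vec.view(b, -1, 1, 1).expand(b, c, h, w)
        fused = torch.cat([image_map, audio_map], dim=1)
        return torch.sigmoid(self.head(fused))    # (B, 1, H, W) heat map

audio_net, image_net, fusion_net = AudioNet(), ImageNet(), FusionNet()
spec = torch.randn(1, 1, 257, 126)     # STFT of a 2-second clip (assumed shape)
frames = torch.randn(1, 3, 10, 96, 160)  # 10 RGB frames, downscaled for the sketch
heat_map = fusion_net(audio_net(spec), image_net(frames))
print(heat_map.shape)                  # torch.Size([1, 1, 96, 160])
```

In this sketch the heat map has the spatial resolution of the downscaled frames; in the embodiment it matches the resolution of the target image frame.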
Refer to FIG. 11, which is a schematic diagram of how a scene analysis heat map is generated according to an embodiment of this application. In FIG. 11, T1 and T2 represent two time points, where T1 is earlier than T2. Assuming that the camera software of the terminal is opened by the user at time T1, the terminal automatically collects original video file 10 between T1 and T2, where original video file 10 includes video image stream 101 and audio stream 102.
When time T2 arrives, the terminal acquires original video file 10, splits original video file 10 into video image stream 101 (image frames P1 to P10) and audio stream 102, then inputs video image stream 101 into image network 201 of neural network model 20 and inputs audio stream 102 into audio network 202 of neural network model 20.
After image network 201 and audio network 202 of neural network model 20 in the terminal receive video image stream 101 and audio stream 102 respectively, image network 201 processes video image stream 101 to obtain a four-dimensional image matrix, and audio network 202 processes audio stream 102 to obtain a three-dimensional audio matrix. Image network 201 and audio network 202 then send the four-dimensional image matrix and the three-dimensional audio matrix, respectively, to fusion network 203 of neural network model 20. Fusion network 203 of neural network model 20 fuses the three-dimensional audio matrix and the four-dimensional image matrix to obtain scene analysis heat map 30.
To facilitate understanding of the embodiments, refer to FIG. 12, which is a schematic diagram of an apparatus according to an embodiment of this application. The apparatus may be located in the terminal described above and is used in combination with an image acquisition device, such as a camera, and a sound acquisition device, such as a microphone. The apparatus includes the following modules.
An acquisition module 11, configured to acquire an original video file, where the original video file includes a video image stream and an audio stream, the video image stream includes multiple image frames and is generated from image data collected by the image acquisition device, and the audio stream includes multiple sound frames and is generated from sound data collected by the sound acquisition device. For a detailed implementation, refer to the detailed description of step S11 in the method embodiment shown in FIG. 3.
A generation module 12, configured to generate a scene analysis heat map according to the video image stream and the audio stream, where the scene analysis heat map is used to indicate, for each of multiple image units on a target image frame of the video image stream, the probability that sound exists on the image unit. For a detailed implementation, refer to the detailed description of step S12 in the method embodiment shown in FIG. 3.
A determination module 13, configured to determine, according to the scene analysis heat map, at least one image focus area on the target image frame of the video image stream that satisfies a preset condition, where each image area in the at least one image focus area includes at least one image unit. For a detailed implementation, refer to the detailed description of step S13 in the method embodiment shown in FIG. 3.
In an implementable embodiment, the target image frame is the last frame of the multiple image frames in the video image stream. For a detailed implementation, refer to the detailed description of step S12 in the method embodiment shown in FIG. 3.
In an implementable embodiment, the preset condition is that the probability that sound exists on each of one or more image units among the at least one image unit in the at least one image focus area reaches a preset probability threshold. For a detailed implementation, refer to the detailed description of step S13 in the method embodiment shown in FIG. 3.
In an implementable embodiment, the apparatus shown in FIG. 12 may further include: a focusing module 14, configured to control the image acquisition device to focus on the at least one image focus area. For a detailed implementation, refer to the detailed description of step S13 in the method embodiment shown in FIG. 3.
In an implementable embodiment, the generation module 12 may further include: a splitting module, configured to split the original video file into the video image stream and the audio stream; and a processing module, configured to process the video image stream and the audio stream by using a neural network model to obtain the scene analysis heat map. For a detailed implementation, refer to the detailed description of steps S21 to S25 in the method embodiment shown in FIG. 10.
In an implementable embodiment, the apparatus shown in FIG. 12 may further include: a conversion module 15, configured to convert the audio stream from a time-domain form to a frequency-domain form. For a detailed implementation, refer to the detailed description of step S12 in the method embodiment shown in FIG. 3.
In an implementable embodiment, the processing module may further include: an audio processing module, configured to process the audio stream by using an audio network of the neural network model to obtain a three-dimensional audio matrix, where the three-dimensional audio matrix includes the duration of the audio stream, the frequency distribution of the audio stream, and feature information of the audio stream; an image processing module, configured to process the video image stream by using an image network of the neural network model to obtain a four-dimensional image matrix, where the four-dimensional image matrix includes the duration of the video image stream, the length of the video image stream, the width of the video image stream, and feature information of the video image stream; and a fusion processing module, configured to fuse the three-dimensional audio matrix and the four-dimensional image matrix by using a fusion network of the neural network model to obtain the scene analysis heat map. For a detailed implementation, refer to the detailed description of steps S23 to S25 in the method embodiment shown in FIG. 10.
For example, any one or more of the modules in FIG. 12 above may be implemented by software, hardware, or a combination of software and hardware. The software includes software program instructions and is executed by one or more processors. The hardware may include digital logic circuits, programmable logic gate arrays, processors, dedicated circuits, or algorithm circuits. The foregoing circuits may be located in one or more chips.
To describe a specific implementation of this application more clearly, refer to FIG. 13, which is a schematic diagram of still another apparatus according to an embodiment of this application. Apparatus 2 includes image acquisition device 21, audio acquisition device 22, central processing unit 23, image processor 24, random access memory (RAM) 25, non-volatile memory (NVM) 26, and bus 27. Bus 27 is used to interconnect the other components. Apparatus 2 shown in FIG. 13 is equivalent to the circuit board, chip, or chipset inside smartphone 1 in FIG. 1 and FIG. 2, and can selectively run various types of software, such as application software, driver software, or operating system software. Central processing unit 23 is configured to control one or more other components, and image processor 24 is configured to perform the method of this embodiment; image processor 24 may include one or more processors to perform the method procedures of the foregoing embodiments. Merely as an example and not a limitation, S11 in FIG. 3 may be executed by the ISP or by a dedicated processor. S12 in FIG. 3 and at least part of the procedure in FIG. 10 may be executed by at least one of a neural processing unit (NPU), a digital signal processor, or central processing unit 23. S13 in FIG. 3 may be executed by the ISP or central processing unit 23. The NPU is a device with the neural network model built in, dedicated to neural network operations. Optionally, central processing unit 23 may also run artificial intelligence software to perform the corresponding operations using the neural network model. Each processor mentioned above may execute the necessary software to work. Optionally, some processors, such as the ISP, may also be pure hardware devices. For apparatus 2 in FIG. 13, refer to the detailed description of smartphone 1 in the embodiments corresponding to FIG. 1 and FIG. 2, and to the detailed description of the terminal in the embodiments corresponding to FIG. 3 to FIG. 11.
It should be noted that, when the foregoing embodiments involve functions implemented by software, the related software or the modules in the software may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media and communication media, where a communication medium includes any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium accessible to a computer. By way of example and not limitation, computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disc storage, magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. In addition, any connection may appropriately be regarded as a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source by using a coaxial cable, an optical fiber cable, a twisted pair, a digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, optical fiber cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of the medium. As used in this application, disks and discs include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), floppy disks, and Blu-ray discs, where disks usually reproduce data magnetically, and discs reproduce data optically with lasers. Combinations of the above should also be included within the protection scope of computer-readable media.
In addition, the foregoing embodiments are merely intended to describe the technical solutions of this application, not to limit them. Although this application is described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some of the technical features; such modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of this application.

Claims (16)

1. A method for determining an image focus area, wherein the method comprises:
    acquiring an original video file, wherein the original video file comprises a video image stream and an audio stream, the video image stream comprises a plurality of image frames and is generated from image data collected by an image acquisition device, and the audio stream comprises a plurality of sound frames and is generated from sound data collected by a sound acquisition device;
    generating a scene analysis heat map according to the video image stream and the audio stream, wherein the scene analysis heat map is used to indicate, for each image unit of a plurality of image units on a target image frame of the video image stream, a probability that sound exists on the image unit; and
    determining, according to the scene analysis heat map, at least one image focus area on the target image frame of the video image stream that satisfies a preset condition, wherein each image area in the at least one image focus area comprises at least one image unit.
2. The method for determining an image focus area according to claim 1, wherein the target image frame is the last frame of the plurality of image frames in the video image stream.
3. The method for determining an image focus area according to claim 1 or 2, wherein the preset condition is that a probability that sound exists on each of one or more image units among the at least one image unit in the at least one image focus area reaches a preset probability threshold.
4. The method for determining an image focus area according to any one of claims 1 to 3, wherein after the determining, according to the scene analysis heat map, at least one image focus area on the target image frame of the video image stream that satisfies a preset condition, the method further comprises:
    controlling the image acquisition device to focus on the at least one image focus area.
5. The method for determining an image focus area according to any one of claims 1 to 4, wherein the generating a scene analysis heat map according to the video image stream and the audio stream comprises:
    splitting the original video file into the video image stream and the audio stream; and
    processing the video image stream and the audio stream by using a neural network model to obtain the scene analysis heat map.
6. The method for determining an image focus area according to any one of claims 1 to 5, wherein before the generating a scene analysis heat map according to the video image stream and the audio stream, the method further comprises:
    converting the audio stream from a time-domain form to a frequency-domain form.
7. The method for determining an image focus area according to claim 5, wherein the processing the video image stream and the audio stream by using a neural network model to obtain the scene analysis heat map comprises:
    processing the audio stream by using an audio network of the neural network model to obtain a three-dimensional audio matrix, wherein the three-dimensional audio matrix comprises a duration of the audio stream, a frequency distribution of the audio stream, and feature information of the audio stream;
    processing the video image stream by using an image network of the neural network model to obtain a four-dimensional image matrix, wherein the four-dimensional image matrix comprises a duration of the video image stream, a length of the video image stream, a width of the video image stream, and feature information of the video image stream; and
    fusing the three-dimensional audio matrix and the four-dimensional image matrix by using a fusion network of the neural network model to obtain the scene analysis heat map.
  8. An apparatus, comprising:
    an obtaining module, configured to obtain an original video file, wherein the original video file comprises a video image stream and an audio stream, the video image stream comprises a plurality of image frames and is generated from image data collected by an image collection apparatus, and the audio stream comprises a plurality of sound frames and is generated from sound data collected by a sound collection apparatus;
    a generation module, configured to generate a scene analysis heat map according to the video image stream and the audio stream, wherein the scene analysis heat map is used to indicate a probability that sound exists on each of a plurality of image units on a target image frame of the video image stream; and
    a determining module, configured to determine, according to the scene analysis heat map, at least one image focus area that satisfies a preset condition on the target image frame of the video image stream, wherein each image area in the at least one image focus area comprises at least one image unit.
  9. The apparatus according to claim 8, wherein the target image frame is the last frame of the plurality of image frames in the video image stream.
  10. The apparatus according to claim 8 or 9, wherein the preset condition is that the probability that sound exists on each of one or more image units among the at least one image unit in the at least one image focus area reaches a preset probability threshold.
  11. The apparatus according to any one of claims 8 to 10, wherein the apparatus further comprises:
    a focusing module, configured to control the image collection apparatus to focus on the at least one image focus area.
  12. The apparatus according to any one of claims 8 to 11, wherein the generation module comprises:
    a splitting module, configured to split the original video file into the video image stream and the audio stream; and
    a processing module, configured to process the video image stream and the audio stream by using a neural network model to obtain the scene analysis heat map.
  13. The apparatus according to any one of claims 8 to 12, wherein the apparatus further comprises:
    a conversion module, configured to convert the audio stream from a time domain form to a frequency domain form.
  14. The apparatus according to claim 12, wherein the processing module comprises:
    an audio processing module, configured to process the audio stream by using an audio network of the neural network model to obtain a three-dimensional audio matrix, wherein the three-dimensional audio matrix comprises a duration of the audio stream, a frequency distribution of the audio stream, and feature information of the audio stream;
    an image processing module, configured to process the video image stream by using an image network of the neural network model to obtain a four-dimensional image matrix, wherein the four-dimensional image matrix comprises a duration of the video image stream, a length of the video image stream, a width of the video image stream, and feature information of the video image stream; and
    a fusion processing module, configured to fuse the three-dimensional audio matrix and the four-dimensional image matrix by using a fusion network of the neural network model to obtain a scene analysis heat map.
  15. An apparatus, comprising one or more processors and a memory, wherein the one or more processors are configured to read software code stored in the memory and perform the method according to any one of claims 1 to 7.
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores software code, and the software code comprises code that, after being read by one or more processors, is capable of performing the method according to any one of claims 1 to 7.
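
The three sketches below are editorial illustrations of the claimed processing steps, not part of the application. First, claims 5 and 6 call for splitting the original video file into a video image stream and an audio stream, and converting the audio stream from a time domain form to a frequency domain form. A minimal Python sketch, assuming the ffmpeg command-line tool is available and choosing an illustrative 16 kHz mono PCM format and a Hann-windowed short-time Fourier transform (the file names, sample rate, window, and hop sizes are assumptions, not values from the application):

import subprocess
import numpy as np

def split_streams(video_path: str) -> None:
    """Demux the original video file into a video-only file and a raw audio file."""
    # -an drops audio, -vn drops video; stream copy avoids re-encoding the video.
    subprocess.run(["ffmpeg", "-y", "-i", video_path,
                    "-an", "-c:v", "copy", "video_only.mp4"], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", video_path,
                    "-vn", "-ac", "1", "-ar", "16000", "-f", "s16le", "audio.pcm"],
                   check=True)

def audio_to_spectrogram(pcm_path: str, win: int = 512, hop: int = 256) -> np.ndarray:
    """Convert the time-domain audio stream to a frequency-domain form (claim 6)."""
    samples = np.fromfile(pcm_path, dtype=np.int16).astype(np.float32) / 32768.0
    n_frames = 1 + (len(samples) - win) // hop
    window = np.hanning(win)
    frames = np.stack([samples[i * hop: i * hop + win] * window
                       for i in range(n_frames)])
    # Magnitude spectrogram of shape (time, frequency).
    return np.abs(np.fft.rfft(frames, axis=1))

A magnitude spectrogram of shape (time, frequency) is one plausible frequency domain form for the audio network input described next.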
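
Second, claim 7 names an audio network, an image network, and a fusion network, but the claims fix no layer structure. The PyTorch sketch below is a hypothetical arrangement that only reproduces the claimed data flow: the audio network consumes a spectrogram, the image network consumes a stack of frames, and the fusion network combines the two feature sets into a per-image-unit sound probability map over the target (last) frame. All channel counts and kernel sizes are invented for illustration.

import torch
import torch.nn as nn

class SceneHeatmapNet(nn.Module):
    """Minimal sketch of the audio/image/fusion structure named in claim 7."""
    def __init__(self, audio_feat: int = 64, image_feat: int = 64):
        super().__init__()
        # Audio network: spectrogram (B, 1, T_a, F) -> audio feature maps.
        self.audio_net = nn.Sequential(
            nn.Conv2d(1, audio_feat, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(audio_feat, audio_feat, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Image network: frame stack (B, 3, T_v, H, W) -> image feature maps.
        self.image_net = nn.Sequential(
            nn.Conv3d(3, image_feat, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(image_feat, image_feat, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Fusion network: concatenated features -> one-channel heat map.
        self.fusion = nn.Conv2d(audio_feat + image_feat, 1, kernel_size=1)

    def forward(self, spectrogram: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
        a = self.audio_net(spectrogram)        # (B, C_a, T_a, F)
        a = a.mean(dim=(2, 3))                 # global audio descriptor (B, C_a)
        v = self.image_net(frames)             # (B, C_v, T_v, H, W)
        v = v[:, :, -1]                        # target (last) frame features (B, C_v, H, W)
        # Broadcast the audio descriptor over every image unit before fusing.
        a_map = a[:, :, None, None].expand(-1, -1, v.size(2), v.size(3))
        heat = self.fusion(torch.cat([a_map, v], dim=1))   # (B, 1, H, W)
        return torch.sigmoid(heat)             # per-unit probability that sound exists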
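
Finally, claims 1 and 3 reduce the heat map to focus areas: image units whose sound probability reaches a preset probability threshold are kept, and each focus area comprises at least one image unit. A minimal sketch, assuming a 2-D NumPy heat map of per-unit probabilities and 4-connected grouping of above-threshold units (the claims do not specify how units are grouped into areas):

import numpy as np

def focus_regions(heatmap: np.ndarray, threshold: float = 0.5):
    """Keep image units whose probability reaches the preset threshold (claim 3)
    and group 4-connected units into candidate focus areas."""
    mask = heatmap >= threshold
    regions, seen = [], np.zeros_like(mask, dtype=bool)
    for r0, c0 in zip(*np.nonzero(mask)):
        if seen[r0, c0]:
            continue
        stack, units = [(r0, c0)], []
        seen[r0, c0] = True
        while stack:                           # simple iterative flood fill
            r, c = stack.pop()
            units.append((r, c))
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if (0 <= nr < mask.shape[0] and 0 <= nc < mask.shape[1]
                        and mask[nr, nc] and not seen[nr, nc]):
                    seen[nr, nc] = True
                    stack.append((nr, nc))
        regions.append(units)                  # each region holds >= 1 image unit
    return regions

For example, focus_regions(heatmap, threshold=0.5) returns a list of candidate areas, each a list of (row, column) image units, from which the focus area to pass to the image collection apparatus can be chosen.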
PCT/CN2018/120200 2018-12-11 2018-12-11 Method and apparatus for determining image focusing region WO2020118503A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2018/120200 WO2020118503A1 (en) 2018-12-11 2018-12-11 Method and apparatus for determining image focusing region
CN201880088065.2A CN111656275B (en) 2018-12-11 2018-12-11 Method and device for determining image focusing area

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/120200 WO2020118503A1 (en) 2018-12-11 2018-12-11 Method and apparatus for determining image focusing region

Publications (1)

Publication Number Publication Date
WO2020118503A1

Family

ID=71076197

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/120200 WO2020118503A1 (en) 2018-12-11 2018-12-11 Method and apparatus for determining image focusing region

Country Status (2)

Country Link
CN (1) CN111656275B (en)
WO (1) WO2020118503A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112857560B * 2021-02-06 2022-07-22 Hohai University Acoustic imaging method based on sound frequency
CN113255685B * 2021-07-13 2021-10-01 Tencent Technology (Shenzhen) Co., Ltd. Image processing method and device, computer equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009031951A (en) * 2007-07-25 2009-02-12 Sony Corp Information processor, information processing method, and computer program
JP4735991B2 (en) * 2008-03-18 2011-07-27 ソニー株式会社 Image processing apparatus and method, program, and recording medium
US20110311144A1 (en) * 2010-06-17 2011-12-22 Microsoft Corporation Rgb/depth camera for improving speech recognition
US10218954B2 (en) * 2013-08-15 2019-02-26 Cellular South, Inc. Video to data
JP6761230B2 (en) * 2015-08-21 2020-09-23 キヤノン株式会社 Image processing device, its control method, program and imaging device
CN108876672A * 2018-06-06 2018-11-23 Hefei Sibote Software Development Co., Ltd. Automatic teacher identification and image optimization tracking method and system for distance education

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101068308A * 2007-05-10 2007-11-07 Huawei Technologies Co., Ltd. System and method for controlling an image collector to perform target positioning
CN104036789A * 2014-01-03 2014-09-10 Beijing Zhigu Ruituo Tech Co., Ltd. Multimedia processing method and multimedia device
CN103905810A * 2014-03-17 2014-07-02 Beijing Zhigu Ruituo Tech Co., Ltd. Multimedia processing method and multimedia processing device
CN103957359A * 2014-05-15 2014-07-30 Shenzhen ZTE Mobile Telecom Co., Ltd. Camera shooting device and focusing method thereof
CN104378635A * 2014-10-28 2015-02-25 Xi'an Jiaotong-Liverpool University Video region-of-interest (ROI) encoding method based on microphone array assistance
CN108073875A * 2016-11-14 2018-05-25 Guangdong Polytechnic Normal University Noisy speech recognition system and method based on a monocular camera

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11463656B1 (en) * 2021-07-06 2022-10-04 Dell Products, Lp System and method for received video performance optimizations during a video conference session
CN113852756A * 2021-09-03 2021-12-28 Vivo Mobile Communication (Hangzhou) Co., Ltd. Image acquisition method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111656275B (en) 2021-07-20
CN111656275A (en) 2020-09-11

Similar Documents

Publication Publication Date Title
US10951833B2 (en) Method and device for switching between cameras, and terminal
WO2020118503A1 (en) Method and apparatus for determining image focusing region
TW202105244A (en) Image processing method and device, electronic equipment and storage medium
TW202030623A (en) Cross-modal information retrieval method and device, and storage medium
JP2019092147A (en) Information exchanging method and device, audio terminal, and computer-readable storage medium
CN105430247A (en) Method and device for taking photograph by using image pickup device
US11336826B2 (en) Method and apparatus having a function of constant automatic focusing when exposure changes
US20180260941A1 (en) Preserving color in image brightness adjustment for exposure fusion
US11348254B2 (en) Visual search method, computer device, and storage medium
WO2021190625A1 (en) Image capture method and device
US11405226B1 (en) Methods and apparatus for assessing network presence
CN111447360A (en) Application program control method and device, storage medium and electronic equipment
CN110780955A (en) Method and equipment for processing emoticon message
CN112381709B (en) Image processing method, model training method, device, equipment and medium
CN111784567B (en) Method, apparatus, electronic device, and computer-readable medium for converting image
CN110347597B (en) Interface testing method and device of picture server, storage medium and mobile terminal
CN113747086A (en) Digital human video generation method and device, electronic equipment and storage medium
US10783616B2 (en) Method and apparatus for sharing and downloading light field image
WO2021149238A1 (en) Information processing device, information processing method, and information processing program
US11798561B2 (en) Method, apparatus, and non-transitory computer readable medium for processing audio of virtual meeting room
US11631252B1 (en) Visual media management for mobile devices
US11601381B2 (en) Methods and apparatus for establishing network presence
CN114298931A (en) Image processing method, image processing device, electronic equipment and storage medium
TW202243461A (en) Method and apparatus for controlling camera, and medium and electronic device
CN111815656A (en) Video processing method, video processing device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 18942954
    Country of ref document: EP
    Kind code of ref document: A1

NENP Non-entry into the national phase
    Ref country code: DE

122 Ep: PCT application non-entry in European phase
    Ref document number: 18942954
    Country of ref document: EP
    Kind code of ref document: A1