CN111656275A - Method and device for determining image focusing area

Method and device for determining image focusing area

Info

Publication number
CN111656275A
CN111656275A CN201880088065.2A
Authority
CN
China
Prior art keywords
image
stream
audio
video image
video
Prior art date
Legal status
Granted
Application number
CN201880088065.2A
Other languages
Chinese (zh)
Other versions
CN111656275B (en)
Inventor
陈亮
孙凤宇
兰传骏
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN111656275A
Application granted
Publication of CN111656275B
Legal status: Active

Links

Images

Classifications

    • G PHYSICS
    • G03 PHOTOGRAPHY; CINEMATOGRAPHY; ANALOGOUS TECHNIQUES USING WAVES OTHER THAN OPTICAL WAVES; ELECTROGRAPHY; HOLOGRAPHY
    • G03B APPARATUS OR ARRANGEMENTS FOR TAKING PHOTOGRAPHS OR FOR PROJECTING OR VIEWING THEM; APPARATUS OR ARRANGEMENTS EMPLOYING ANALOGOUS TECHNIQUES USING WAVES OTHER THAN OPTICAL WAVES; ACCESSORIES THEREFOR
    • G03B13/00 Viewfinders; Focusing aids for cameras; Means for focusing for cameras; Autofocus systems for cameras
    • G03B13/32 Means for focusing
    • G03B13/34 Power focusing
    • G03B13/36 Autofocus systems

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Studio Devices (AREA)

Abstract

A method and device for determining an image focusing area. The method comprises: acquiring an original video file, the original video file including a video image stream and an audio stream (S11); generating a scene analysis heat map from the video image stream and the audio stream (S12), the scene analysis heat map indicating the probability that sound is present on each of a plurality of image units on a target image frame of the video image stream; and determining, from the scene analysis heat map, at least one image focusing area on the target image frame of the video image stream that satisfies a preset condition (S13), each of the at least one image focusing area comprising at least one image unit. The method can determine the area of the target image frame of the video image stream that is most likely to be producing sound, and requires no devices on the terminal beyond the image acquisition device and the sound acquisition device, thereby reducing device cost.

Description

Method and device for determining image focusing area
Technical Field
The embodiments of the present application relate to the technical field of image processing, and in particular to a method and a device for determining an image focusing area.
Background
At present, in a related image focusing technique, after camera software is opened on a terminal, a camera of the terminal captures an image and the touch screen of the terminal displays the captured image. The user then taps the image area of interest on the touch screen, and the touch screen focuses on the tapped image area. Once the image area tapped by the user has been focused successfully, the user can use the terminal to shoot videos or take photos.
In this related image focusing technique, although the terminal can perform image focusing, if the position of the image area of interest changes on the touch screen, the user needs to tap the area of interest again so that the terminal refocuses on it. Such techniques passively wait for a focusing instruction input by the user and cannot actively identify the image area the user is interested in.
At present, some auxiliary focusing methods exist in the industry, such as distance-measurement methods and multi-microphone-array methods. For a distance-measurement method, the terminal needs to actively emit infrared light waves or ultrasonic waves, which increases the device cost of the terminal's focusing system. For a multi-microphone-array method, the terminal needs more microphone units to obtain better performance, which likewise increases the device cost of the terminal's focusing system.
Therefore, the related focusing methods suffer from high terminal device cost and an inability to actively focus on the image area the user is interested in.
Disclosure of Invention
The embodiments of the present application provide a method and a device for determining an image focusing area, which can determine the image focusing area in an image that is most likely to be producing sound without increasing device cost.
The embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a method for determining an image focusing area, where the method includes: acquiring an original video file, wherein the original video file comprises a video image stream and an audio stream, the video image stream comprises a plurality of image frames and is generated by image data acquired by an image acquisition device, and the audio stream comprises a plurality of sound frames and is generated by sound data acquired by a sound acquisition device; generating a scene analysis heat map from the video image stream and the audio stream, the scene analysis heat map indicating a probability of sound being present on each of a plurality of image units on a target image frame of the video image stream; at least one image focusing area satisfying a preset condition on a target image frame of the video image stream is determined according to the scene analysis heat map, and each image area in the at least one image focusing area comprises at least one image unit.
The method provided in the first aspect can be applied to a terminal having an image acquisition device and a sound acquisition device. The terminal can actively acquire an original video file through the image acquisition device and the sound acquisition device, generate a scene analysis heat map from the video image stream and the audio stream, and finally determine, according to the scene analysis heat map, at least one image focusing area satisfying a preset condition on the target image frame of the video image stream, so that the area of the target image frame most likely to be producing sound can be determined. Moreover, apart from the image acquisition device and the sound acquisition device the terminal already has, no additional device is required, which reduces device cost.
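By way of illustration only, the three steps of the first aspect can be sketched as follows in Python; the function and parameter names (determine_focus_mask, heatmap_model, threshold) are assumptions made for this sketch and are not part of the application:

    def determine_focus_mask(video_frames, audio_samples, heatmap_model, threshold=0.5):
        # illustrative sketch; names and the 0.5 threshold are assumptions
        # S11: video_frames and audio_samples together correspond to the original video file
        # S12: the model maps both streams to a per-pixel probability of sound (scene analysis heat map)
        heatmap = heatmap_model(video_frames, audio_samples)
        # S13: image units whose probability reaches the preset threshold form the focus area(s)
        return heatmap >= threshold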
In one possible implementation, the target image frame is a last frame of a plurality of image frames in the video image stream.
In one possible implementation, the preset condition is that the probabilities of sound existing on one or more image units of the at least one image unit in the at least one image focus area all reach a preset probability threshold.
In one possible implementation, after determining, from the scene analysis heatmap, at least one image focus area on a target image frame of the video image stream that satisfies a preset condition, the method further comprises: and controlling the image acquisition device to focus on at least one image focusing area.
In one possible implementation, generating the scene analysis heatmap from the video image stream and the audio stream includes: splitting an original video file into a video image stream and an audio stream; and processing the video image stream and the audio stream by using the neural network model to obtain a scene analysis heat map.
In one possible implementation, before generating the scene analysis heatmap from the video image stream and the audio stream, the method further includes: the audio stream is converted from a time domain form to a frequency domain form.
In one possible implementation, the processing the video image stream and the audio stream using the neural network model to obtain the scene analysis heatmap includes: processing the audio stream by using an audio network of the neural network model to obtain a three-dimensional audio matrix, wherein the three-dimensional audio matrix comprises the duration of the audio stream, the frequency distribution of the audio stream and the characteristic information of the audio stream; processing the video image stream by using an image network of a neural network model to obtain a four-dimensional image matrix, wherein the four-dimensional image matrix comprises the duration of the video image stream, the length of the video image stream, the width of the video image stream and the characteristic information of the video image stream; and fusing the three-dimensional audio matrix and the four-dimensional image matrix by using a fusion network of the neural network model to obtain a scene analysis heat map.
In the process in which the terminal uses the neural network model to process the video image stream and the audio stream to obtain the scene analysis heat map, the audio network of the neural network model analyzes the sound information and draws a sound analysis conclusion, the image network of the neural network model analyzes the image information and draws an image analysis conclusion, and the fusion network of the neural network model analyzes and fuses the sound analysis conclusion obtained by the audio network with the image analysis conclusion obtained by the image network, finally determining the image focusing area of the target image frame of the video image stream that is most likely to be producing sound and presenting it through the scene analysis heat map. Identifying the most likely sound-producing image focusing areas of the target image frame with a scene analysis heat map generated by the neural network model in this way is more accurate.
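For orientation only, the shapes of the matrices mentioned above might look as follows in a NumPy sketch; the concrete dimensions (a 2-second clip, 63 spectral frames, 64 frequency bins, 10 image frames of 1920 x 1080 pixels) are illustrative assumptions rather than values fixed by the application:

    import numpy as np

    # illustrative sketch; all dimensions below are assumptions
    # three-dimensional audio matrix: duration x frequency distribution x audio features
    audio_matrix = np.zeros((63, 64, 1), dtype=np.float32)
    # four-dimensional image matrix: duration x height x width x image features (e.g. RGB)
    image_matrix = np.zeros((10, 1080, 1920, 3), dtype=np.float32)
    # fused output: per-pixel probability of sound on the target image frame
    scene_heatmap = np.zeros((1080, 1920), dtype=np.float32)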
In a second aspect, an embodiment of the present application provides an apparatus, including: an acquisition module, configured to acquire an original video file, where the original video file comprises a video image stream and an audio stream, the video image stream comprises a plurality of image frames and is generated from image data acquired by an image acquisition device, and the audio stream comprises a plurality of sound frames and is generated from sound data acquired by a sound acquisition device; a generating module, configured to generate a scene analysis heat map from the video image stream and the audio stream, the scene analysis heat map indicating a probability of sound being present on each of a plurality of image units on a target image frame of the video image stream; and a determining module, configured to determine, according to the scene analysis heat map, at least one image focusing area satisfying a preset condition on the target image frame of the video image stream, where each of the at least one image focusing area comprises at least one image unit.
In one possible implementation, the target image frame is a last frame of a plurality of image frames in the video image stream.
In one possible implementation, the preset condition is that the probabilities of sound existing on one or more image units of the at least one image unit in the at least one image focus area all reach a preset probability threshold.
In one possible implementation, the apparatus further includes: and the focusing module is used for controlling the image acquisition device to focus at least one image focusing area.
In one possible implementation, the generating module includes: the splitting module is used for splitting an original video file into a video image stream and an audio stream; and the processing module is used for processing the video image stream and the audio stream by utilizing the neural network model to obtain a scene analysis heat map.
In one possible implementation, the apparatus further includes: a conversion module for converting the audio stream from a time domain form to a frequency domain form.
In one possible implementation, the processing module includes: the audio processing module is used for processing the audio stream by utilizing an audio network of the neural network model to obtain a three-dimensional audio matrix, and the three-dimensional audio matrix comprises the duration of the audio stream, the frequency distribution of the audio stream and the characteristic information of the audio stream; the image processing module is used for processing the video image stream by utilizing an image network of the neural network model to obtain a four-dimensional image matrix, and the four-dimensional image matrix comprises the duration of the video image stream, the length of the video image stream, the width of the video image stream and the characteristic information of the video image stream; and the fusion processing module is used for carrying out fusion processing on the three-dimensional audio matrix and the four-dimensional image matrix by utilizing a fusion network of the neural network model to obtain a scene analysis heat map.
In a third aspect, an embodiment of the present application provides an apparatus, which includes one or more processors and is configured to execute the method in the first aspect or any one of the possible implementation manners of the first aspect. Optionally, the apparatus comprises a memory for storing software instructions for driving the processor to operate.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having instructions stored therein, which when executed on a computer or a processor, cause the computer or the processor to perform the method of the first aspect or any of the possible implementations of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product containing instructions, which when run on a computer or a processor, cause the computer or the processor to perform the method of the first aspect or any of the possible implementations of the first aspect.
Drawings
Fig. 1 is a schematic diagram illustrating that a focusing area of camera software of a smart phone is not determined yet according to an embodiment of the present application;
fig. 2 is a schematic diagram illustrating that camera software of a smart phone according to an embodiment of the present application has determined a focusing area;
FIG. 3 is a flowchart illustrating a method for determining an image focusing area according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram illustrating a video image stream and an audio stream of an original video file according to an embodiment of the present application;
fig. 5 is a schematic diagram illustrating a target image frame of a video image stream according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a scene analysis heatmap provided in an embodiment of the present application;
FIG. 7 is a schematic diagram illustrating a method for determining an image focusing area on a target image frame according to an embodiment of the present application;
fig. 8 is a schematic diagram illustrating a video image stream and an audio stream of another original video file provided by an embodiment of the present application;
fig. 9 is a schematic diagram illustrating a video image stream and an audio stream of another original video file provided by an embodiment of the present application;
FIG. 10 is a flowchart illustrating another method for determining an in-focus area of an image according to an embodiment of the present application;
FIG. 11 is a schematic diagram illustrating how a scene analysis heatmap may be generated according to an embodiment of the present application;
FIG. 12 is a schematic view of an apparatus according to an embodiment of the present disclosure;
fig. 13 is a schematic view of another apparatus provided in the embodiment of the present application.
Detailed Description
The embodiment of the application provides a method for determining an image focusing area, and the method provided by the embodiment of the application can be applied to different types of terminals. For example, the method provided by the embodiment of the application can be applied to terminals such as smart phones, tablet computers, digital cameras or smart cameras. Alternatively, the technical solution of the present application may also be applied to an electronic device without a communication function. The technical solution applied to the terminal is described as an example in the following.
The method provided by the embodiment of the present application is described below with reference to a specific technical scenario. Please refer to fig. 1 and fig. 2, wherein fig. 1 is a schematic diagram of a smart phone provided in an embodiment of the present application, in which a focusing area is not determined yet, and fig. 2 is a schematic diagram of a smart phone provided in an embodiment of the present application, in which a focusing area is determined already. It should be noted that the embodiments shown in fig. 1 and fig. 2 are used for the reader to quickly understand the technical principle of the embodiments of the present application, and are not used to limit the protection scope of the embodiments of the present application. The values of the specific parameters mentioned in the embodiments shown in fig. 1 and 2 may vary according to the principles of the embodiments of the present application, and the scope of the embodiments of the present application is not limited to the values of the specific parameters already mentioned.
Referring to fig. 1, it is assumed that a user wants to take a picture or record a video of a bird in a zoo using a smartphone 1. Firstly, a user opens the camera software of the smart phone 1 and aims a camera of the smart phone 1 at a bird; then, the smartphone 1 may invoke a camera to capture an image of the bird, and display the image of the bird on the touch screen of the smartphone 1.
Assuming that a software program implementing the method for determining an image focusing area provided by the embodiment of the present application is pre-installed in the smart phone 1, the smart phone 1 may automatically call the camera and the microphone to capture a 2-second original video file. The smart phone 1 of this embodiment provides a camera function; when executing the camera function it may be regarded as a camera comprising the camera module and an image signal processor (ISP). The smart phone 1 then acquires the 2-second original video file and, using a neural network model stored in advance on the smart phone 1, generates a scene analysis heat map from the video image stream and the audio stream of the 2-second original video file, where the scene analysis heat map is used to indicate the image area most likely to be producing sound in the video image stream.
Referring to fig. 2, assuming that the bird makes sound continuously while the smart phone 1 captures the 2-second original video file, the neural network model of the smart phone 1 calculates from the 2-second original video file that the bird's mouth is the image area most likely to be producing sound, so the generated scene analysis heat map indicates the bird's mouth in area A of fig. 2 as the most likely sound-producing image area. The smart phone 1 then focuses on the bird's mouth in area A of fig. 2, thereby achieving automatic focusing on the sound-producing area of the image. After the smart phone 1 has focused on the bird's mouth in area A of fig. 2, the user can take a picture or record a video. Moreover, the smart phone 1 may continuously repeat the method for determining the image focusing area of this embodiment, so as to keep generating scene analysis heat maps that indicate, in real time, the image area most likely to be producing sound.
The technical scenario provided by the embodiment of the present application is briefly described above by the application examples shown in fig. 1 and fig. 2, and the implementation process, technical principle and embodiment of the method for determining an image focusing area provided by the embodiment of the present application are described below.
Referring to fig. 3, fig. 3 is a flowchart illustrating a method for determining an image focusing area according to an embodiment of the present disclosure. The method shown in fig. 3 can determine the image focusing area in which the sound is most likely to be emitted in the image without increasing the cost of the device, and includes the following steps.
Step S11: acquire the original video file. In step S11, as can be seen from the embodiment shown in fig. 1, after the camera software is opened, the terminal automatically calls the camera and the microphone to capture an original video file of a predetermined time period and acquires that original video file.
In the example of fig. 1, the predetermined period of time is 2 seconds. The predetermined time period may be preset according to factors such as the hardware configuration of the terminal, and when the hardware configuration of the terminal is high, the predetermined time period may be appropriately shortened, for example, the predetermined time period is set to 1 second or 0.5 second, or even shorter; when the hardware configuration of the terminal is low, the predetermined period may be extended appropriately, for example, the predetermined period is set to 3 seconds or 4 seconds, or even longer. The specific length of the predetermined time period is not limited in the embodiments of the present application, and the specific values provided above are only used to illustrate the principle of adjusting the predetermined time period.
The original video file comprises a video image stream and an audio stream; the video image stream comprises a plurality of image frames and is generated from image data collected by an image acquisition device, and the audio stream comprises a plurality of sound frames and is generated from sound data collected by a sound acquisition device. The image acquisition device may be the camera of the terminal in the previous embodiment and may optionally include an ISP. The sound acquisition device may be the microphone of the terminal and may optionally include a speech processing channel or a circuit for processing the microphone signal.
For example, please refer to fig. 4, where fig. 4 is a schematic diagram illustrating a video image stream and an audio stream of an original video file according to an embodiment of the present application. In fig. 4, the horizontal axis represents a time axis, and a time point T1 is earlier than a time point T2, assuming that the time length of the original video file is T1 to T2. The original video file includes a video image stream having 10 image frames (image frame P1 through image frame P10) and an audio stream having 30 sound frames (not shown in the figure), which are used to generate a scene analysis heat map.
Step S12: generate a scene analysis heat map from the video image stream and the audio stream. The scene analysis heat map is used to indicate the probability of sound being present on each of a plurality of image units, which may be pixels, on a target image frame of the video image stream. The probability of sound being present on an image unit corresponds to the probability that the object depicted in that image unit is emitting sound. The object may be a human, an animal, a musical instrument, a device, or another object.
In the process that the terminal generates the scene analysis heat map according to the video image stream and the audio stream, the terminal can calculate the probability of sound existing on each image unit on the target image frame of the video image stream by combining the video image stream and the audio stream, and generate the scene analysis heat map according to the probability of sound existing on each image unit on the target image frame. The scene analysis heat map is a frame of image having a resolution that is the same as a resolution of a target image frame of the video image stream. The target image frame of the video image stream may be a last frame of a plurality of image frames in the video image stream. For example, as shown in fig. 4, the video image stream in fig. 4 has 10 image frames, and the target image frame is the last image frame P10 in the video image stream.
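A minimal sketch of this correspondence, assuming the image frames are NumPy arrays of shape height x width x channels (the helper name is an assumption for illustration):

    import numpy as np

    def init_heatmap(frames):
        # illustrative sketch; the function name and array layout are assumptions
        # the target image frame is the last frame of the video image stream
        target_frame = frames[-1]
        # the scene analysis heat map has the same resolution as the target image frame
        heatmap = np.zeros(target_frame.shape[:2], dtype=np.float32)
        return target_frame, heatmap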
Referring to fig. 5 and 6, fig. 5 is a schematic diagram of a target image frame of a video image stream according to an embodiment of the present disclosure, and fig. 6 is a schematic diagram of a scene analysis heat map according to an embodiment of the present disclosure. For ease of illustration, in fig. 5 and 6, it is assumed that the resolution of the target image frame of the video image stream and the resolution of the scene analysis heat map are both 5 pixels by 3 pixels. In an actual scene, the resolution of the target image frame of the video image stream and the resolution of the scene analysis heat map are the resolutions preset by the user, for example, assuming that the resolution preset by the user is 1920 pixels × 1080 pixels, the resolution of the target image frame of the video image stream and the resolution of the scene analysis heat map are both 1920 pixels × 1080 pixels.
Referring to fig. 5 and fig. 6, assuming that the resolution of the target image frame of the video image stream is 5 × 3 pixels, the terminal may calculate the probability of sound existing on 15 pixels on the target image frame of the video image stream by combining the video image stream and the audio stream.
In fig. 5, each circle represents 1 pixel, that is, 15 pixels in fig. 5, and each pixel in fig. 5 has an original color, for example, it is assumed that the original colors of the 15 pixels are red. It is assumed that the terminal performs calculation by combining the video image stream and the audio stream, and then the probability that sound exists on the pixel points (11, 12, 21, 22, 31, 32) is greater than or equal to 50%, and the probability that sound exists on the pixel points (13, 14, 15, 23, 24, 25, 33, 34, 35) is less than 50%. Then, the terminal generates a scene analysis heat map according to the probability of sound existing on 15 pixel points on the target image frame.
In fig. 6, in order to show the difference in the probability of sound being present at each pixel, the terminal, starting from the target image frame, changes the color of the pixels (11, 12, 21, 22, 31, 32) whose probability of sound being emitted by the object is 50% or more to black, and changes the color of the pixels (13, 14, 15, 23, 24, 25, 33, 34, 35) whose probability is below 50% to white, thereby converting the target image frame shown in fig. 5 into the scene analysis heat map shown in fig. 6.
Of course, the scene analysis heat map shown in fig. 6 is only an example, the color corresponding to each pixel in the scene analysis heat map may be set according to actual conditions, and the color for distinguishing each pixel is not limited to white and black.
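The colour mapping of figs. 5 and 6 can be sketched as follows; the 50% cut-off and the black/white colours come from the example above, while the array layout and the function name are assumptions for illustration:

    import numpy as np

    def heatmap_to_visualization(prob_map, cutoff=0.5):
        # illustrative sketch; prob_map is a per-pixel probability of sound,
        # e.g. shape (3, 5) for the 5 x 3 pixel example above
        vis = np.empty(prob_map.shape + (3,), dtype=np.uint8)
        vis[prob_map >= cutoff] = (0, 0, 0)        # probability >= 50%: black
        vis[prob_map < cutoff] = (255, 255, 255)   # probability < 50%: white
        return vis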
Before step S12, i.e., before generating the scene analysis heat map from the video image stream and the audio stream, the audio stream may be converted from a time-domain form to a frequency-domain form, and then the scene analysis heat map may be generated from the video image stream and the audio stream converted to the frequency-domain form. For example, before generating the scene analysis heat map from the video image stream and the audio stream, the audio stream is fourier-transformed to obtain a short-time fourier spectrum, and then the scene analysis heat map is generated from the video image stream and the short-time fourier spectrum. Of course, the generation of the scene analysis heat map may be performed by directly using the audio stream in the time domain without performing the fourier transform.
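As an illustrative sketch of the time-domain to frequency-domain conversion, the short-time Fourier transform from SciPy can be used; the sampling rate and window length below are assumptions, not values specified by the application:

    import numpy as np
    from scipy.signal import stft

    def audio_to_spectrum(samples, sample_rate=16000, window_len=512):
        # illustrative sketch; samples is one-dimensional time-domain audio
        # from the sound acquisition device
        freqs, times, spectrum = stft(samples, fs=sample_rate, nperseg=window_len)
        return np.abs(spectrum)   # short-time Fourier magnitude spectrum (frequency x time)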
Step S13: determine, according to the scene analysis heat map, at least one image focusing area satisfying a preset condition on the target image frame of the video image stream. In step S13, each of the at least one image focusing area includes at least one image unit, and the preset condition is that the probabilities of sound being present on one or more image units of the at least one image unit in the at least one image focusing area all reach a preset probability threshold. The preset probability threshold is a probability value set in advance. Alternatively, for a region, when the probability of sound being present on every image unit in the region reaches the preset probability threshold, the region may be determined to be an image focusing area. Alternatively, for a region, when the probabilities of sound being present on more than a predetermined number of image units in the region all reach the preset probability threshold, or when the proportion of image units whose probabilities reach the preset probability threshold exceeds a preset proportion threshold, the region may be determined to be an image focusing area. Other determination manners are also possible, as long as whether a region is an image focusing area is decided from the probabilities of sound being present on the at least one image unit in the region.
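The alternative forms of the preset condition described above can be sketched as follows; the helper name and parameter values are assumptions for illustration only:

    import numpy as np

    def region_satisfies_condition(region_probs, prob_threshold=0.5, mode="all", min_fraction=0.8):
        # illustrative sketch; region_probs holds the probabilities of sound for the
        # image units inside one candidate region
        region_probs = np.asarray(region_probs)
        if mode == "all":
            # every image unit in the region must reach the preset probability threshold
            return bool(np.all(region_probs >= prob_threshold))
        # otherwise, a sufficient proportion of image units must reach the threshold
        return float(np.mean(region_probs >= prob_threshold)) >= min_fraction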
After step S13, the terminal may control the image acquisition device to focus on the at least one image focusing area. For example, referring to fig. 7, fig. 7 is a schematic diagram illustrating determination of an image focusing area on a target image frame according to an embodiment of the present application. Assuming that the preset probability threshold is 50%, the terminal determines from the scene analysis heat map the mutually adjacent pixels (11, 12, 21, 22, 31, 32) on the target image frame of the video image stream whose probabilities reach the preset probability threshold of 50%; these pixels constitute region B in fig. 7, which is one image focusing area. The terminal may then control the image acquisition device to focus on the image focusing area (region B) in fig. 7.
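One way of grouping mutually adjacent pixels that reach the threshold (as in region B of fig. 7) is connected-component labelling; the sketch below uses scipy.ndimage.label and is only an illustration, not a technique prescribed by the application:

    import numpy as np
    from scipy import ndimage

    def extract_focus_regions(heatmap, prob_threshold=0.5):
        # illustrative sketch; names and the threshold are assumptions
        mask = heatmap >= prob_threshold              # image units reaching the preset threshold
        labels, num_regions = ndimage.label(mask)     # group adjacent qualifying pixels
        # one boolean mask per candidate image focusing area
        return [labels == i for i in range(1, num_regions + 1)]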
In the embodiment shown in fig. 3, the method provided by the embodiment of the present application can be applied to a terminal having an image acquisition device and a sound acquisition device. The terminal can actively acquire an original video file through the image acquisition device and the sound acquisition device, generate a scene analysis heat map according to the video image stream and the audio stream, and finally determine at least one image focusing area meeting preset conditions on a target image frame of the video image stream according to the scene analysis heat map, so that the area which is most likely to make sound on the target image frame of the video image stream can be determined. Besides, the terminal does not need to be additionally provided with an image acquisition device and a sound acquisition device, and the image acquisition device and the sound acquisition device can be original devices of the terminal, such as a camera or a microphone of a smart phone, so that the device cost is reduced.
In the embodiment shown in fig. 3, a method of generating a scene analysis heat map using a section of an original video file and determining an image focus area of a target image frame using the scene analysis heat map is described. Based on the above principle, a method of generating a plurality of scene analysis heat maps using a plurality of pieces of original video files and determining image focusing areas of a plurality of target image frames using the plurality of scene analysis heat maps will be described below.
In a first manner, please refer to fig. 8, where fig. 8 is a schematic diagram of a video image stream and an audio stream of another original video file provided by an embodiment of the present application. In fig. 8, the horizontal axis represents the time axis, and T1 to T4 represent 4 time points, where T1 is earlier than T2, T2 is earlier than T3, and T3 is earlier than T4. Assuming that the camera software of the terminal is opened by the user at time T1, the terminal automatically captures original video file A between T1 and T3, where original video file A includes video image stream A and audio stream A. When the time reaches T3, the terminal acquires original video file A, generates scene analysis heat map A from video image stream A (image frame P1 to image frame P10) and audio stream A of original video file A, and determines, according to scene analysis heat map A, at least one image focusing area satisfying the preset condition on image frame P10 of video image stream A; that is, the terminal determines the image focusing area on image frame P10 at time T3.
In order to continuously determine the image focusing area of the image frame currently displayed by the terminal, when the time reaches the time T4, the terminal acquires an original video file B automatically acquired between the time T2 and the time T4, generates a scene analysis heat map B according to a video image stream B (the image frame P2 to the image frame P11) and the audio stream B of the original video file B, and determines at least one image focusing area meeting preset conditions on an image frame P11 of the video image stream B according to the scene analysis heat map B, namely the terminal determines the image focusing area on the image frame P11 at the time T4.
In the first mode, the terminal may determine a corresponding scene analysis heat map for each image frame, and determine at least one image focusing area satisfying a preset condition for each image frame by using the scene analysis heat map corresponding to each image frame, so that the image focusing area may be determined for each image frame more accurately.
In a second manner, please refer to fig. 9, where fig. 9 is a schematic diagram of a video image stream and an audio stream of another original video file provided by an embodiment of the present application. In fig. 9, the horizontal axis represents the time axis, and T1 to T3 represent 3 time points, where T1 is earlier than T2 and T2 is earlier than T3. Assuming that the camera software of the terminal is opened by the user at time T1, the terminal automatically captures original video file A between T1 and T2, where original video file A includes video image stream A (image frame P1 to image frame P10) and audio stream A. When the time reaches T2, the terminal acquires original video file A, generates scene analysis heat map A from video image stream A and audio stream A of original video file A, and determines, according to scene analysis heat map A, at least one image focusing area satisfying the preset condition on image frame P10 of video image stream A; that is, the terminal determines the image focusing area on image frame P10 at time T2. Then, between T2 and T3, for each of image frames P11 to P18, at least one image focusing area satisfying the preset condition is determined from scene analysis heat map A.
In order to continuously determine the image focusing area of the image frame currently displayed by the terminal, when the time reaches T3, the terminal acquires original video file B automatically captured between T2 and T3, where original video file B includes video image stream B (image frame P10 to image frame P19) and audio stream B. The terminal then generates scene analysis heat map B from video image stream B and audio stream B, and determines, according to scene analysis heat map B, at least one image focusing area satisfying the preset condition on image frame P19 of video image stream B; that is, the terminal determines the image focusing area on image frame P19 at time T3.
In the second manner, the terminal generates one scene analysis heat map at intervals, so that multiple image frames can reuse a single scene analysis heat map. This reduces the number of times the scene analysis heat map has to be computed per unit time and occupies fewer processing resources of the terminal.
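The two manners can be contrasted with a small scheduling sketch; the window length, the reuse interval and the function name are assumptions made for illustration:

    def focus_over_time(frames, audio_chunks, heatmap_model, window=10, reuse_interval=1):
        # illustrative sketch; reuse_interval=1 recomputes the heat map for every new
        # frame (first manner), while a larger reuse_interval reuses one heat map for
        # several frames (second manner)
        heatmap, results = None, []
        for i in range(window - 1, len(frames)):
            if heatmap is None or (i - window + 1) % reuse_interval == 0:
                heatmap = heatmap_model(frames[i - window + 1:i + 1],
                                        audio_chunks[i - window + 1:i + 1])
            results.append(heatmap)
        return results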
Referring to fig. 10, a flowchart of another method for determining an image focusing area according to an embodiment of the present application is shown in fig. 10, where the method shown in fig. 10 is a specific implementation of step S12 of fig. 3, that is, a specific implementation of "generating a scene analysis heat map according to a video image stream and an audio stream", and the method includes the following steps.
Step S21: split the original video file into a video image stream and an audio stream. The original video file can be split into a video image stream and an audio stream using a multimedia video processing software tool.
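Such a split can be sketched with the FFmpeg command-line tool invoked from Python; FFmpeg is merely one example of a multimedia processing tool (the application does not name a specific tool), and the file names are assumptions:

    import subprocess

    def split_video_file(src="original.mp4"):
        # illustrative sketch; tool choice and file names are assumptions
        # keep only the video track (drop audio), copying the video codec unchanged
        subprocess.run(["ffmpeg", "-y", "-i", src, "-an", "-c:v", "copy", "video_stream.mp4"], check=True)
        # keep only the audio track (drop video), decoded to 16-bit PCM
        subprocess.run(["ffmpeg", "-y", "-i", src, "-vn", "-acodec", "pcm_s16le", "audio_stream.wav"], check=True)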
Step S22: acquire a prestored neural network model. The neural network model can be generated by a server with strong computing power, and the neural network model generated by the server comprises an audio network, an image network and a fusion network. The terminal can download the neural network model generated on the server into its local memory in advance, and when the terminal needs to use the neural network model, it can retrieve the prestored model from the local memory. In the process of generating the neural network model, the server inputs a large number of video samples into the neural network model so that the model learns from them. After the server generates the neural network model, the model can identify the image focusing areas of a target image frame of a video image stream that are most likely to be producing sound.
Step S23: process the audio stream with the audio network of the neural network model to obtain a three-dimensional audio matrix. The three-dimensional audio matrix comprises the duration of the audio stream, the frequency distribution of the audio stream and the characteristic information of the audio stream. For example, assume that the terminal processes an audio stream using the audio network of the neural network model to obtain a three-dimensional audio matrix in which the duration of the audio stream is 2 seconds, the frequency distribution of the audio stream is 0 to 8 kHz, and the characteristic information of the audio stream is, for example, at least one stereo channel.
Step S24: process the video image stream with the image network of the neural network model to obtain a four-dimensional image matrix. The four-dimensional image matrix comprises the duration of the video image stream, the length (image length) of the video image stream, the width (image width) of the video image stream and the characteristic information of the video image stream. For example, assume that the terminal processes a video image stream using the image network of the neural network model to obtain a four-dimensional image matrix in which the duration of the video image stream is 2 seconds, the length of the video image stream is 1920 pixels, the width of the video image stream is 1080 pixels, and the characteristic information of the video image stream is, for example, the three RGB channels or the channels of another color space.
Step S25: fuse the three-dimensional audio matrix and the four-dimensional image matrix with the fusion network of the neural network model to obtain the scene analysis heat map. In the embodiment shown in fig. 10, the process in which the terminal uses the neural network model to process the video image stream and the audio stream to obtain the scene analysis heat map resembles the way a human processes sound and image information. The audio network of the neural network model is analogous to the human ear: it analyzes the sound information and draws a sound analysis conclusion. The image network of the neural network model is analogous to the human eye: it analyzes the image information and draws an image analysis conclusion. The fusion network of the neural network model is analogous to the human brain: it analyzes and fuses the sound analysis conclusion obtained by the audio network with the image analysis conclusion obtained by the image network, finally determines the image focusing area of the target image frame of the video image stream that is most likely to be producing sound, and presents that area through the scene analysis heat map. Identifying the most likely sound-producing image focusing areas of the target image frame with the scene analysis heat map generated by the neural network model in this way is therefore more accurate.
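A highly simplified sketch of such a three-branch network in PyTorch follows; the layer choices, channel counts, pooling and upsampling strategy are assumptions made for illustration and do not reflect the actual network structure disclosed by the application:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AudioNet(nn.Module):
        # audio network: spectrogram (batch, 1, time, freq) -> audio feature vector
        def __init__(self, feat_dim=64):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1))
        def forward(self, spec):
            return self.conv(spec).flatten(1)              # (batch, feat_dim)

    class ImageStreamNet(nn.Module):
        # image network: frames (batch, 3, time, H, W) -> spatial feature map
        def __init__(self, feat_dim=64):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv3d(3, 16, 3, stride=(1, 2, 2), padding=1), nn.ReLU(),
                nn.Conv3d(16, feat_dim, 3, stride=(1, 2, 2), padding=1), nn.ReLU())
        def forward(self, frames):
            x = self.conv(frames)                          # (batch, feat_dim, T, H/4, W/4)
            return x.mean(dim=2)                           # pool over time -> (batch, feat_dim, H/4, W/4)

    class FusionNet(nn.Module):
        # fusion network: audio vector + image feature map -> per-pixel sound probability
        def __init__(self, feat_dim=64):
            super().__init__()
            self.head = nn.Conv2d(2 * feat_dim, 1, kernel_size=1)
        def forward(self, audio_feat, image_feat, out_size):
            b, c, h, w = image_feat.shape
            audio_map = audio_feat.view(b, -1, 1, 1).expand(b, audio_feat.shape[1], h, w)
            logits = self.head(torch.cat([image_feat, audio_map], dim=1))
            logits = F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)
            return torch.sigmoid(logits)                   # scene analysis heat map in [0, 1]

    class SceneAnalysisModel(nn.Module):
        # illustrative sketch; not the network architecture of the application
        def __init__(self, feat_dim=64):
            super().__init__()
            self.audio_net = AudioNet(feat_dim)
            self.image_net = ImageStreamNet(feat_dim)
            self.fusion_net = FusionNet(feat_dim)
        def forward(self, spec, frames):
            out_size = frames.shape[-2:]                   # heat map matches the frame resolution
            return self.fusion_net(self.audio_net(spec), self.image_net(frames), out_size)

Calling SceneAnalysisModel on a (1, 1, T, F) spectrogram tensor and a (1, 3, T, H, W) frame tensor yields a (1, 1, H, W) tensor of per-pixel sound probabilities, which plays the role of the scene analysis heat map in this sketch.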
Referring to fig. 11, fig. 11 is a schematic diagram illustrating how to generate a scene analysis heatmap according to an embodiment of the present application. In fig. 11, T1 and T2 represent 2 time points, respectively, with T1 being earlier than T2. Assuming that the camera software of the terminal is opened by the user at time T1, the terminal will automatically capture the original video file 10 between T1 and T2, wherein the original video file 10 includes the video image stream 101 and the audio stream 102.
When the time reaches T2, the terminal acquires the original video file 10, splits the original video file 10 into the video image stream 101 (image frame p1 to image frame p10) and the audio stream 102, inputs the video image stream 101 into the image network 201 of the neural network model 20, and inputs the audio stream 102 into the audio network 202 of the neural network model 20.
After the image network 201 and the audio network 202 of the neural network model 20 in the terminal receive the video image stream 101 and the audio stream 102, respectively, the image network 201 of the neural network model 20 processes the video image stream 101 to obtain a four-dimensional image matrix, the audio network 202 of the neural network model 20 processes the audio stream 102 to obtain a three-dimensional audio matrix, and the image network 201 and the audio network 202 send the obtained four-dimensional image matrix and the obtained three-dimensional audio matrix to the fusion network 203 of the neural network model 20, respectively. The fusion network 203 of the neural network model 20 fuses the three-dimensional audio matrix and the four-dimensional image matrix to obtain the scene analysis heat map 30.
For ease of understanding this embodiment, please refer to fig. 12, which is a schematic diagram of an apparatus provided in an embodiment of the present application. The apparatus may be located in the terminal described earlier and used in conjunction with an image acquisition device such as a camera and a sound acquisition device such as a microphone. The apparatus includes the following modules.
The acquiring module 11 is configured to acquire an original video file, where the original video file includes a video image stream and an audio stream, the video image stream includes a plurality of image frames and is generated by image data acquired by an image acquisition device, and the audio stream includes a plurality of sound frames and is generated by sound data acquired by a sound acquisition device. For detailed implementation, please refer to the detailed description of step S11 in the embodiment of the method shown in fig. 3.
A generating module 12 is configured to generate a scene analysis heat map according to the video image stream and the audio stream, wherein the scene analysis heat map is used to indicate a probability of sound existing on each of a plurality of image units on a target image frame of the video image stream. For detailed implementation, please refer to the detailed description of step S12 in the embodiment of the method shown in fig. 3.
A determining module 13, configured to determine, according to the scene analysis heatmap, at least one image focusing area that satisfies a preset condition on a target image frame of the video image stream, where each image area in the at least one image focusing area includes at least one image unit. For detailed implementation, please refer to the detailed description of step S13 in the embodiment of the method shown in fig. 3.
In one implementable embodiment, the target image frame is a last frame of a plurality of image frames in the video image stream. For detailed implementation, please refer to the detailed description of step S12 in the embodiment of the method shown in fig. 3.
In an implementable embodiment, the preset condition is that the probabilities of the presence of sound on one or more of the at least one image unit in the at least one image focus region each reach a preset probability threshold. For detailed implementation, please refer to the detailed description of step S13 in the embodiment of the method shown in fig. 3.
In an implementable embodiment, the apparatus shown in fig. 12 may further comprise: and the focusing module 14 is used for controlling the image acquisition device to focus at least one image focusing area. For detailed implementation, please refer to the detailed description of step S13 in the embodiment of the method shown in fig. 3.
In an implementable embodiment, the generation module 12 may further comprise: the splitting module is used for splitting an original video file into a video image stream and an audio stream; and the processing module is used for processing the video image stream and the audio stream by utilizing the neural network model to obtain a scene analysis heat map. For a detailed implementation, please refer to the detailed description of step S21 to step S25 in the embodiment of the method shown in fig. 10.
In an implementable embodiment, the apparatus shown in fig. 12 may further comprise: a conversion module 15 for converting the audio stream from a time domain form to a frequency domain form. For detailed implementation, please refer to the detailed description of step S12 in the embodiment of the method shown in fig. 3.
In an implementable embodiment, the processing module may further comprise: the audio processing module is used for processing the audio stream by utilizing an audio network of the neural network model to obtain a three-dimensional audio matrix, and the three-dimensional audio matrix comprises the duration of the audio stream, the frequency distribution of the audio stream and the characteristic information of the audio stream; the image processing module is used for processing the video image stream by utilizing an image network of the neural network model to obtain a four-dimensional image matrix, and the four-dimensional image matrix comprises the duration of the video image stream, the length of the video image stream, the width of the video image stream and the characteristic information of the video image stream; and the fusion processing module is used for carrying out fusion processing on the three-dimensional audio matrix and the four-dimensional image matrix by utilizing a fusion network of the neural network model to obtain a scene analysis heat map. For a detailed implementation, please refer to the detailed description of step S23 to step S25 in the embodiment of the method shown in fig. 10.
For example, any one or more of the modules in fig. 12 above may be implemented by software, hardware, or a combination of software and hardware. The software comprises software program instructions and is executed by one or more processors. The hardware may comprise digital logic circuitry, programmable gate arrays, processors, dedicated or arithmetic circuitry, or the like. The circuitry described above may be located in one or more chips.
For a clearer description of an embodiment of the present application, please refer to fig. 13, which is a schematic diagram of another apparatus provided in an embodiment of the present application. The device 2 includes an image acquisition device 21, an audio acquisition device 22, a central processing unit 23, an image processor 24, a memory (RAM) 25, a non-volatile memory (NVM) 26, and a bus 27. The bus 27 is used to communicate with the other components. The device 2 shown in fig. 13 corresponds to the circuit board, chip or chipset inside the smart phone 1 shown in fig. 1 and fig. 2, and can optionally run various types of software, such as application software, driver software or operating system software. The central processing unit 23 is configured to control one or more other components, and the image processor 24 is configured to perform the method of the present embodiment; one or more processors may be included in the image processor 24 to perform the method flows of the previous embodiments. By way of example only and not limitation, S11 in fig. 3 may be performed by an ISP or by a dedicated processor. S12 in fig. 3 and at least part of the process of fig. 10 may be performed by at least one of a neural processing unit (NPU), a digital signal processor, or the central processing unit 23. S13 in fig. 3 may be performed by the ISP or the central processing unit 23. The NPU is a device with the neural network model built in and is dedicated to neural network computation. Alternatively, the central processing unit 23 may run artificial intelligence software to perform the corresponding operations using the neural network model. Each of the processors mentioned above may execute the necessary software to work; alternatively, some processors, such as the ISP, may be pure hardware devices. With regard to the device 2 in fig. 13, reference may be made to the detailed description of the smart phone 1 in the embodiments corresponding to fig. 1 and fig. 2, and to the detailed description of the terminal in the embodiments corresponding to fig. 3 to fig. 11.
It should be noted that, when the above-mentioned embodiments refer to functions implemented by software, the relevant software or modules in the software may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include both computer storage media and communication media, where communication media include any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer. By way of example and not limitation, computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. In addition, any connection is properly termed a computer-readable medium. For example, if software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technology such as infrared, radio, or microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technology such as infrared, radio, or microwave is included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
Furthermore, the above embodiments are only intended to illustrate the technical solutions of the present application and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments can be modified, or some technical features can be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (16)

  1. A method of determining an image focus area, the method comprising:
    acquiring an original video file, wherein the original video file comprises a video image stream and an audio stream, the video image stream comprises a plurality of image frames and is generated by image data acquired by an image acquisition device, and the audio stream comprises a plurality of sound frames and is generated by sound data acquired by a sound acquisition device;
    generating a scene analysis heat map from the video image stream and the audio stream, the scene analysis heat map indicating a probability of sound being present on each of a plurality of image units on a target image frame of the video image stream;
    determining at least one image focus area satisfying a preset condition on a target image frame of the video image stream according to the scene analysis heat map, each image area of the at least one image focus area comprising at least one image unit.
  2. The method of determining an image focusing area of claim 1, wherein: the target image frame is a last frame of a plurality of image frames in the video image stream.
  3. The method of determining an image focusing area according to claim 1 or 2, wherein: the preset condition is that the probabilities of sound existing on one or more image units in the at least one image unit in the at least one image focus area all reach a preset probability threshold.
  4. The method of any of claims 1 to 3, wherein after determining at least one image focus area satisfying a preset condition on a target image frame of the video image stream from the scene analysis heat map, the method further comprises:
    and controlling the image acquisition device to focus the at least one image focusing area.
  5. The method of any one of claims 1 to 4, wherein generating a scene analysis heat map from the video image stream and the audio stream comprises:
    splitting the original video file into the video image stream and the audio stream;
    and processing the video image stream and the audio stream by utilizing a neural network model to obtain the scene analysis heat map.
  6. The method of any one of claims 1 to 5, further comprising, before generating a scene analysis heat map from the video image stream and the audio stream:
    converting the audio stream from a time domain form to a frequency domain form.
  7. The method of determining an image focusing area according to claim 5, wherein processing the video image stream and the audio stream using a neural network model to obtain the scene analysis heat map comprises:
    processing the audio stream by using an audio network of the neural network model to obtain a three-dimensional audio matrix, wherein the three-dimensional audio matrix comprises the duration of the audio stream, the frequency distribution of the audio stream and the characteristic information of the audio stream;
    processing the video image stream by using an image network of the neural network model to obtain a four-dimensional image matrix, wherein the four-dimensional image matrix comprises the duration of the video image stream, the length of the video image stream, the width of the video image stream and the characteristic information of the video image stream;
    fusing the three-dimensional audio matrix and the four-dimensional image matrix by using a fusion network of the neural network model to obtain the scene analysis heat map.
  8. An apparatus, comprising:
    an acquisition module, configured to acquire an original video file, wherein the original video file comprises a video image stream and an audio stream, the video image stream comprises a plurality of image frames and is generated from image data acquired by an image acquisition device, and the audio stream comprises a plurality of sound frames and is generated from sound data acquired by a sound acquisition device;
    a generating module, configured to generate a scene analysis heat map from the video image stream and the audio stream, wherein the scene analysis heat map indicates a probability of sound being present on each of a plurality of image units on a target image frame of the video image stream;
    a determining module, configured to determine, according to the scene analysis heat map, at least one image focusing area that satisfies a preset condition on a target image frame of the video image stream, where each image area in the at least one image focusing area includes at least one image unit.
  9. The apparatus of claim 8, wherein the target image frame is the last frame of the plurality of image frames in the video image stream.
  10. The apparatus of claim 8 or 9, wherein the preset condition is that the probability of sound being present on each of one or more image units among the at least one image unit in the at least one image focus area reaches a preset probability threshold.
  11. The apparatus of any one of claims 8 to 10, further comprising:
    a focusing module, configured to control the image acquisition device to focus on the at least one image focusing area.
  12. The apparatus of any one of claims 8 to 11, wherein the generating module comprises:
    a splitting module, configured to split the original video file into the video image stream and the audio stream;
    a processing module, configured to process the video image stream and the audio stream by using a neural network model to obtain the scene analysis heat map.
  13. The apparatus of any one of claims 8 to 12, further comprising:
    a conversion module, configured to convert the audio stream from a time domain form to a frequency domain form.
  14. The apparatus of claim 12, wherein the processing module comprises:
    an audio processing module, configured to process the audio stream by using an audio network of the neural network model to obtain a three-dimensional audio matrix, wherein the three-dimensional audio matrix comprises the duration of the audio stream, the frequency distribution of the audio stream, and the characteristic information of the audio stream;
    an image processing module, configured to process the video image stream by using an image network of the neural network model to obtain a four-dimensional image matrix, wherein the four-dimensional image matrix comprises the duration of the video image stream, the length of the video image stream, the width of the video image stream, and the characteristic information of the video image stream;
    a fusion processing module, configured to fuse the three-dimensional audio matrix and the four-dimensional image matrix by using a fusion network of the neural network model to obtain the scene analysis heat map.
  15. An apparatus comprising one or more processors and memory;
    wherein the one or more processors are configured to read software code stored in the memory and to perform the method of any of claims 1-7.
  16. A computer-readable storage medium, characterized in that the computer-readable storage medium stores software code, and the software code, when read by one or more processors, is capable of performing the method of any one of claims 1-7.
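
Illustrative note on claims 5 and 6: these claims split the original video file into a video image stream and an audio stream and convert the audio stream from a time-domain form to a frequency-domain form before the neural network processes it. The sketch below illustrates only that conversion step, assuming the audio has already been extracted as mono PCM samples; the window length, hop size, and log-magnitude representation are hypothetical choices, not values taken from the patent.

```python
import numpy as np

def audio_to_spectrogram(samples, frame_len=1024, hop=512):
    """Convert time-domain audio samples to a frequency-domain form
    (log-magnitude spectrogram) via a short-time Fourier transform."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))   # magnitude per frequency bin
        frames.append(np.log1p(spectrum))       # compress the dynamic range
    # Shape: (time frames, frequency bins) -- the frequency-domain form
    return np.stack(frames, axis=0)

# Example: 2 seconds of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(2 * sr) / sr
spec = audio_to_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)   # (61, 513) with these illustrative settings
```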
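Illustrative note on claims 7 and 14: these claims describe a neural network model with an audio network, an image network, and a fusion network whose output is the scene analysis heat map. The PyTorch sketch below is a minimal, hypothetical rendering of that three-branch structure; the patent does not specify layer types or sizes, so every architectural choice here (convolution shapes, pooling, sigmoid output) is an assumption made only to show how a per-image-unit sound probability could be produced from the two streams.

```python
import torch
import torch.nn as nn

class SceneHeatMapNet(nn.Module):
    """Minimal sketch: audio branch + image branch + fusion branch
    producing a per-unit sound-probability heat map."""
    def __init__(self, feat=32):
        super().__init__()
        # Audio network: spectrogram (1, time, freq) -> (feat, time', freq')
        self.audio_net = nn.Sequential(
            nn.Conv2d(1, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Image network: clip (3, time, H, W) -> (feat, time, H', W')
        self.image_net = nn.Sequential(
            nn.Conv3d(3, feat, 3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(feat, feat, 3, stride=(1, 2, 2), padding=1), nn.ReLU(),
        )
        # Fusion network: concatenated features -> probability per image unit
        self.fusion = nn.Conv2d(2 * feat, 1, 1)

    def forward(self, audio_spec, video_clip):
        a = self.audio_net(audio_spec)                 # (N, feat, T', F')
        a = a.mean(dim=(2, 3), keepdim=True)           # global audio descriptor
        v = self.image_net(video_clip)                 # (N, feat, T, H', W')
        v = v.mean(dim=2)                              # pool over time -> target-frame features
        a = a.expand(-1, -1, v.shape[2], v.shape[3])   # broadcast audio over the spatial grid
        heat = self.fusion(torch.cat([a, v], dim=1))   # (N, 1, H', W')
        return torch.sigmoid(heat)                     # probability of sound per image unit

net = SceneHeatMapNet()
heat_map = net(torch.randn(1, 1, 61, 513), torch.randn(1, 3, 8, 128, 128))
print(heat_map.shape)   # torch.Size([1, 1, 32, 32])
```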
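Illustrative note on claims 1, 3, and 4: the image focus area is determined by checking, for each image unit on the target frame, whether its sound probability in the heat map reaches a preset probability threshold, after which the image acquisition device is controlled to focus on the selected area. The snippet below sketches only the thresholding step, assuming the heat map is already available as a per-unit probability grid; the 4x4 grid and the 0.6 threshold are hypothetical.

```python
import numpy as np

def select_focus_units(heat_map, prob_threshold=0.6):
    """Return the image units whose sound probability reaches the preset
    probability threshold; a focus area is one or more such units."""
    rows, cols = np.where(heat_map >= prob_threshold)
    return list(zip(rows.tolist(), cols.tolist()))

# Example: a 4x4 grid of image units with one likely sound-emitting region
heat_map = np.array([
    [0.05, 0.10, 0.08, 0.02],
    [0.12, 0.75, 0.82, 0.09],
    [0.07, 0.68, 0.71, 0.11],
    [0.03, 0.06, 0.04, 0.01],
])
print(select_focus_units(heat_map))   # [(1, 1), (1, 2), (2, 1), (2, 2)]
```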
CN201880088065.2A 2018-12-11 2018-12-11 Method and device for determining image focusing area Active CN111656275B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/120200 WO2020118503A1 (en) 2018-12-11 2018-12-11 Method and apparatus for determining image focusing region

Publications (2)

Publication Number Publication Date
CN111656275A true CN111656275A (en) 2020-09-11
CN111656275B CN111656275B (en) 2021-07-20

Family

ID=71076197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880088065.2A Active CN111656275B (en) 2018-12-11 2018-12-11 Method and device for determining image focusing area

Country Status (2)

Country Link
CN (1) CN111656275B (en)
WO (1) WO2020118503A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11463656B1 (en) * 2021-07-06 2022-10-04 Dell Products, Lp System and method for received video performance optimizations during a video conference session
CN113852756B * 2021-09-03 2023-07-28 Vivo Mobile Communication (Hangzhou) Co., Ltd. Image acquisition method, device, equipment and storage medium


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036789B * 2014-01-03 2018-02-02 Beijing Zhigu Ruituo Tech Co., Ltd. Multi-media processing method and multimedia device
CN104378635B * 2014-10-28 2017-12-05 Xi'an Jiaotong-Liverpool University Microphone-array-assisted coding method for a video region of interest
CN108073875A * 2016-11-14 2018-05-25 Guangdong Polytechnic Normal University Noisy speech recognition system and method based on a monocular camera

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101068308A * 2007-05-10 2007-11-07 Huawei Technologies Co., Ltd. System and method for controlling an image collector to perform target positioning
CN101354569A * 2007-07-25 2009-01-28 Sony Corporation Information processing apparatus, information processing method, and computer program
CN101540859A * 2008-03-18 2009-09-23 Sony Corporation Image processing apparatus and method
CN102314595A * 2010-06-17 2012-01-11 Microsoft Corporation RGB/depth camera for improving speech recognition
US20160358632A1 * 2013-08-15 2016-12-08 Cellular South, Inc. Dba C Spire Wireless Video to data
CN103905810A * 2014-03-17 2014-07-02 Beijing Zhigu Ruituo Tech Co., Ltd. Multimedia processing method and multimedia processing device
CN103957359A * 2014-05-15 2014-07-30 Shenzhen ZTE Mobile Telecom Co., Ltd. Camera shooting device and focusing method thereof
JP2017041857A * 2015-08-21 2017-02-23 Canon Inc. Image processing system, control method of the same, program, and imaging apparatus
CN108876672A * 2018-06-06 2018-11-23 Hefei Sibote Software Development Co., Ltd. Automatic teacher identification and image-optimized tracking method and system for distance education

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112857560A * 2021-02-06 2021-05-28 Hohai University Acoustic imaging method based on sound frequency
CN113255685A * 2021-07-13 2021-08-13 Tencent Technology (Shenzhen) Co., Ltd. Image processing method and device, computer equipment and storage medium
CN113255685B * 2021-07-13 2021-10-01 Tencent Technology (Shenzhen) Co., Ltd. Image processing method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
WO2020118503A1 (en) 2020-06-18
CN111656275B (en) 2021-07-20

Similar Documents

Publication Publication Date Title
CN111656275B (en) Method and device for determining image focusing area
US10951833B2 (en) Method and device for switching between cameras, and terminal
CN111314733A (en) Method and apparatus for evaluating video sharpness
WO2019237657A1 (en) Method and device for generating model
TW202105244A (en) Image processing method and device, electronic equipment and storage medium
CN103139466A (en) Information processing apparatus, imaging apparatus, information processing method, and program
WO2020029608A1 (en) Method and apparatus for detecting burr of electrode sheet
CN105430247A (en) Method and device for taking photograph by using image pickup device
CN111251307B (en) Voice acquisition method and device applied to robot and robot
CN111163281A (en) Panoramic video recording method and device based on voice tracking
KR20160119218A (en) Sound image playing method and device
WO2021190625A1 (en) Image capture method and device
US9756421B2 (en) Audio refocusing methods and electronic devices utilizing the same
US20120242860A1 (en) Arrangement and method relating to audio recognition
CN107578000B (en) Method and device for processing image
JP2021528767A (en) Visual search methods, devices, computer equipment and storage media
CN110910419A (en) Automatic tracking method and device and electronic equipment
CN113761986A (en) Text acquisition method, text live broadcast equipment and storage medium
US10748554B2 (en) Audio source identification
CN108960130B (en) Intelligent video file processing method and device
US11637953B2 (en) Method, apparatus, electronic device, storage medium and system for vision task execution
CN111784567B (en) Method, apparatus, electronic device, and computer-readable medium for converting image
CN109740510B (en) Method and apparatus for outputting information
CN108848366B (en) Information acquisition device and method based on 3D camera
CN112073639A (en) Shooting control method and device, computer readable medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant