CN112565598A - Focusing method and apparatus, terminal, computer-readable storage medium, and electronic device - Google Patents


Info

Publication number
CN112565598A
CN112565598A (application CN202011355143.6A)
Authority
CN
China
Prior art keywords
sound source
target
scene image
target object
focusing
Prior art date
Legal status
Granted
Application number
CN202011355143.6A
Other languages
Chinese (zh)
Other versions
CN112565598B (en)
Inventor
孙禄
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202011355143.6A priority Critical patent/CN112565598B/en
Publication of CN112565598A publication Critical patent/CN112565598A/en
Application granted granted Critical
Publication of CN112565598B publication Critical patent/CN112565598B/en
Status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/67 Focus control based on electronic image sensor signals
    • H04N23/671 Focus control based on electronic image sensor signals in combination with active ranging signals, e.g. using light or sound signals emitted toward objects

Abstract

The disclosure provides a focusing method and apparatus, a terminal, a computer-readable storage medium, and an electronic device, and relates to the field of control technologies. The method comprises: receiving sound waves through the terminal where the camera module is located; performing localization according to the sound waves to determine a first sound source position; acquiring a first scene image in the viewfinder and identifying the first scene image to obtain the position of a target object in the first scene image; and matching the first sound source position with the position of the target object to determine a first focusing target, so that shooting is performed based on the first focusing target. With this technical solution, a specific target object can be focused on automatically, improving the convenience of focusing when shooting with the terminal.

Description

Focusing method and apparatus, terminal, computer-readable storage medium, and electronic device
Technical Field
The present disclosure relates to the field of control technologies, and in particular, to a focusing method and apparatus, a terminal, and a computer-readable storage medium and an electronic device for implementing the method.
Background
Shooting with terminals appears more and more often in work, daily life, and entertainment, such as video conferencing at work, photographing or recording beautiful scenery, and live streaming.
During shooting, a typical system can perform automatic focusing, but when the scene contains multiple candidate focusing targets it cannot automatically focus on a specified one. The user can focus on the designated target manually, but this requires an additional manual designation operation.
It can be seen that the convenience of the focusing scheme provided by the related art needs to be further improved.
Disclosure of Invention
The disclosure provides a focusing method and device, a terminal, and a computer-readable storage medium and an electronic device for implementing the method, so that a specific target object can be automatically focused, and the focusing convenience degree of the terminal during shooting is improved at least to a certain extent.
According to an aspect of the present disclosure, there is provided a focusing method including: receiving sound waves through a terminal where the camera module is located; performing localization according to the sound waves, and determining a first sound source position; acquiring a first scene image in a viewfinder of the camera module, and identifying the first scene image to obtain the position of a target object in the first scene image; and matching the first sound source position with the position of the target object to determine a first focusing target, so as to shoot based on the first focusing target.
According to an aspect of the present disclosure, there is provided a terminal including: the microphone array is used for receiving sound waves through a terminal where the camera module is located; the sound source positioning component is used for positioning according to the sound waves and determining a first sound source position; the camera module is used for acquiring a first scene image in the viewfinder and identifying the first scene image to obtain the position of a target object in the first scene image; and the system is used for matching the first sound source position with the position of the target object and determining a first focusing target so as to shoot based on the first focusing target.
According to an aspect of the present disclosure, there is provided a focusing apparatus including: the receiving module is used for receiving sound waves through a terminal where the camera module is located; the positioning module is used for positioning according to the sound waves and determining the position of a first sound source; the image identification module is used for acquiring a first scene image in the viewfinder and identifying the first scene image to obtain the position of a target object in the first scene image; and the matching module is used for matching the first sound source position with the position of the target object, determining a first focusing target and shooting based on the first focusing target.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the focusing method as described in the first aspect above.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the focusing method of the first aspect described above via execution of the executable instructions.
In the technical solutions provided in some embodiments of the present disclosure, on the one hand, for a target object being photographed by a camera module of a terminal, a sound wave of the target object is received by the terminal where the camera module is located, and a sound source position for preliminarily positioning the target object can be obtained by performing sound source positioning based on the sound wave. On the other hand, a scene image in a viewfinder of the camera module is obtained, the position of the target object in the scene image is identified, and the sound source position and the position of the target object are matched to determine the first focusing target. Therefore, the technical scheme combines the sound source position and the position of the target object to determine the focusing target, realizes automatic focusing on the specific target object on the basis of ensuring the accuracy of the focusing target, and improves the focusing convenience degree when the terminal shoots.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
fig. 1 shows a schematic diagram of an application scenario to which the focusing method of the embodiment of the present disclosure may be applied.
Fig. 2 schematically shows a flow chart of a focusing method in an embodiment of the present disclosure.
Fig. 3 schematically shows a flow chart of a sound source localization method in an embodiment of the present disclosure.
Fig. 4 schematically shows a mesh partition diagram in a sound source localization process in an embodiment of the present disclosure.
FIG. 5 schematically shows a diagram of a speech recognition model in an embodiment of the present disclosure.
Fig. 6 schematically shows a flow chart of a focusing method in another embodiment of the present disclosure.
Fig. 7 schematically shows a structure diagram of a terminal in an embodiment of the present disclosure.
Fig. 8 schematically shows a structure of a focusing device in an embodiment of the present disclosure.
FIG. 9 schematically illustrates a block diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the steps. For example, some steps may be decomposed, and some steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation. In addition, all of the following terms "first" and "second" are used for distinguishing purposes only and should not be construed as limiting the present disclosure.
In order to solve the technical problem of the related art that the focusing cannot be automatically and accurately performed on the specified target object, fig. 1 illustrates a schematic diagram of an application scenario to which the focusing method according to the embodiment of the present disclosure may be applied.
The focusing method can be applied to a photographing scene, a video-recording scene, a live-streaming scene, or an image-editing scene. Referring to fig. 1, it can be applied when a terminal 101 is used to photograph a plurality of target objects 103, 104, 105 to obtain a captured image or record a video; the focusing method is performed during the photographing/recording process. The terminal 101 may be any type of client capable of shooting, provided with a built-in or external microphone array to collect the sounds emitted by each target object in the shooting scene. The terminal 101 may be, for example, a smartphone, tablet computer, desktop computer, vehicle-mounted device, or wearable device that can capture and present images or videos. The target objects 103, 104, 105 may be any type of object to be photographed in various scenes, such as a person or an animal, and may be stationary or moving. In particular, a camera or a camera application on the terminal 101 may be used for image capture/video recording of the target object. The camera on the terminal may comprise a plurality of camera modules, any one or more of which can be called to capture images of, or record video of, the target object.
Specifically, after the user opens the camera module of the terminal 101, a scene image 102 can be presented in the viewfinder. The scene image contains target objects, which may emit sound waves. The sound waves can be acquired by the terminal's built-in or external microphone array for sound source localization, to determine the approximate bearing of the focusing target. At the same time, the scene image in the viewfinder of the camera module is acquired, so that the focusing target can be determined accurately by combining the scene image with the sound source position.
It should be noted that the focusing method provided by the embodiments of the present disclosure may be performed entirely by the terminal. Alternatively, it can be completed through interaction between the terminal and a server. For example, on the one hand, the terminal sends the received sound wave signal to the server, and the server performs sound source localization on it to obtain the sound source position. On the other hand, the terminal acquires the scene image in the viewfinder and sends it to the server. The server then identifies the target objects in the scene image (e.g., multiple person avatars), matches the sound source position with the target objects to determine the focusing target, and transmits the focusing target to the terminal so that the terminal shoots based on it.
The following describes an embodiment of the focusing method provided in the present disclosure.
Fig. 2 schematically shows a flow chart of a focusing method in an embodiment of the present disclosure, and referring to fig. 2, the embodiment shown in the figure includes: step S201-step S204.
In step S201, sound waves are received by a terminal where the camera module is located.
In an exemplary embodiment, since sound waves arrive at different positions with different intensities, the present technical solution obtains the intensity of the sound waves arriving at the terminal to support accurate focusing; the sound waves are therefore received by a microphone array arranged on the terminal. The microphone array may be built into the terminal device, so that sound wave collection can be completed conveniently with the terminal's own microphone array. Alternatively, the microphone array may be mounted on the surface of the terminal, so that it can be easily fitted to a variety of terminal devices to facilitate sound wave collection on different terminals.
In step S202, localization is performed according to the sound wave, and a first sound source position is determined.
In an exemplary embodiment, fig. 3 schematically illustrates a flow chart of a sound source localization method in an embodiment of the present disclosure, and referring to fig. 3, the illustrated embodiment of the figure includes:
step S301, collecting sound signals through a microphone array arranged on the terminal; step S302, processing the sound signals collected by each microphone in the microphone array to obtain beams; step S303, carrying out grid point division on an effective area corresponding to the terminal to obtain a plurality of assumed sound sources; step S304, calculating the output signal power of each assumed sound source at the microphone array based on the beams; in step S305, the position of the assumed sound source with the largest power is determined as the first sound source position.
In an exemplary embodiment, the power of each assumed sound source is determined according to the basic framework of fixed beamforming. Referring to fig. 4, the region D0 within a preset distance from the terminal, as described above, is divided into the mesh shown in fig. 4, in which each grid point (solid black dot in the figure) is an assumed sound source. A target sound source is then determined among these assumed sound sources by means of sound source localization. The denser the meshing, the higher the accuracy of sound source localization but the longer the computation takes; conversely, the sparser the meshing, the lower the accuracy but the shorter the time consumed.
It should be noted that, considering the real-time requirement on the terminal for determining the focusing target during shooting, and that the present technical solution combines the scene image with sound source localization to determine the focusing target, the precision requirement on sound source localization need not be high. In a specific implementation, the grid density can be chosen according to the actual situation.
Continuing with fig. 4, for any assumed sound source x, a time delay between each microphone and a reference point is calculated, where the reference point is the microphone in the microphone array closest to the assumed sound source; the signal of each microphone is then time-shifted according to its delay, and the signals are weighted and summed to obtain the power of the assumed sound source x. Specifically:
Illustratively, assume the microphone array contains m (m ≥ 4) microphones, and let y be the microphone closest to the assumed sound source x. The vector of differences between the travel time t_{i,x} from the assumed sound source x to the i-th microphone and the travel time t_{y,x} from x to microphone y is expressed as:

TDOA_x = [t_{y,1,x}, t_{y,2,x}, t_{y,3,x}, ..., t_{y,i,x}, ..., t_{y,m,x}]^T

where t_{y,i,x} = t_{y,x} - t_{i,x} = (d_{y,x} - d_{i,x}) / v. Here t_{y,i,x} denotes the time delay between microphone y, the microphone closest to the assumed sound source x, and the i-th microphone; v denotes the speed at which sound travels in air; and d_{i,x} denotes the straight-line distance from the assumed sound source x to the i-th microphone.

Meanwhile, the corresponding differences in sample points, denoted sd, form the vector:

sd_x = [sd_{y,2,x}, sd_{y,3,x}, sd_{y,4,x}, ..., sd_{y,i,x}, ..., sd_{y,m,x}]^T

Assuming the signal sampling frequency is sf, then sd_{y,i,x} = round(sf * abs(t_{y,i,x})), where round() is the rounding function and abs() is the absolute-value function. This yields the number of delay sample points between the assumed sound source x's nearest microphone y and each other microphone; the maximum delay max(sd_{y,i,x}) can then be computed. If the signal received by the nearest microphone y is s(t), the signal received by the i-th microphone is s(t - t_{y,i,x}). Each microphone's signal is shifted by its delay in sample points relative to microphone y, and all signals are then superimposed in the time domain to obtain the power of the assumed sound source x.

Similarly, the power of all assumed sound sources in the region D0 can be calculated, and the assumed sound source with the maximum power is taken as the target sound source of the region D0. In the present embodiment, referring to fig. 4, the sound source S0 is the assumed sound source with the maximum power in the region D0, so S0 is obtained as the target sound source of the region D0.
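To make the above computation concrete, the following Python sketch implements the grid-based delay-and-sum power calculation and picks the assumed source with maximum power. It is a minimal sketch under stated assumptions, not the disclosure's implementation: equal weights are used in the summation, and the array geometry, grid points, sampling rate, and names such as `source_power` are illustrative.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # v, speed of sound in air (m/s); assumed constant

def source_power(signals, mic_positions, source_pos, sample_rate):
    """Delay-and-sum output power for one assumed sound source x.

    signals:       (m, n) array, one row of n samples per microphone
    mic_positions: (m, 3) array of microphone coordinates
    source_pos:    (3,) coordinates of the assumed grid-point source
    """
    dists = np.linalg.norm(mic_positions - source_pos, axis=1)  # d_{i,x}
    y = int(np.argmin(dists))                                   # reference mic y
    # sd_{y,i,x} = round(sf * abs(t_{y,i,x})), t_{y,i,x} = (d_{y,x}-d_{i,x})/v
    shifts = np.round(sample_rate * np.abs(dists - dists[y])
                      / SPEED_OF_SOUND).astype(int)
    aligned = np.zeros(signals.shape)
    for i, s in enumerate(shifts):
        # advance channel i by its delay so all copies of the source line up
        aligned[i, : signals.shape[1] - s] = signals[i, s:]
    beam = aligned.sum(axis=0)        # equal-weight fixed beam
    return float(np.mean(beam ** 2))  # output signal power at the array

def localize(signals, mic_positions, grid_points, sample_rate):
    """Return the grid point (assumed sound source) with maximum power."""
    powers = [source_power(signals, mic_positions, g, sample_rate)
              for g in grid_points]
    return grid_points[int(np.argmax(powers))]
```

As noted above, a denser grid raises localization accuracy at the cost of latency, so the number of grid points can be chosen to fit the real-time budget of the shooting scenario.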
The present technical solution determines the approximate position of the focusing target by collecting sound waves at the terminal and localizing the sound source, an idea that maps conveniently onto practical use scenarios. For example, suppose three hosts stream live from the same terminal to promote a product. When host A speaks, the position of host A can be taken as the sound source position by means of the sound source localization embodiment above, and focus is placed there; when hosts A and B speak simultaneously, the position of the louder host is taken as the sound source position and focused on. In the related art, by contrast, focusing on the current speaker, or on the louder speaker, requires manual switching. The sound source localization provided by this embodiment therefore helps achieve focusing automatically and accurately, and is convenient, quick, and time-saving.
In an exemplary embodiment, in order to make focusing by the camera module more user-friendly and better match user needs, sound source localization can also be conditioned on whether the sound waves contain certain preset words. Specifically, speech recognition can be realized by means of artificial intelligence:
for example, the acoustic wave recognition described above may be implemented based on a pre-trained machine learning model, i.e., a speech recognition model. Referring specifically to fig. 5, the speech recognition model 500 may be a sequence-to-sequence (sqe2sqe) model with attention mechanism, which is a pre-trained machine learning model. The method specifically comprises the following steps: the encoder 510 and decoder 520, respectively, in the speech recognition model may be referred to as: "Listener" and "Speller". The encoder 310/decoder 320 may be a Recurrent Neural Network (RNN), a Long Short-Term Memory (LSTM) Neural Network, a modified Gated Recurrent Unit (GRU) Neural Network of the LSTM, or a Bi-directional Long Short-Term Memory (Bi LSTM) Network. For example, encoder 510 and decoder 520 are both GRU networks.
For example, referring to fig. 5, with the trained encoding and decoding functions of the pre-trained speech recognition model 500, the acquired sound wave is first processed to obtain the speech to be recognized X = [x1, x2, x3, ..., xa], which is input into the model 500. The encoding function of the encoder 510 yields the hidden state sequence H = [h1, h2, h3, ..., ha], and the decoding function of the decoder 520 yields the hidden state sequence S = [s1, s2, ..., sb], where <SOS> (Start Of Sequence) marks the start of decoding of a sentence and <EOS> (End Of Sequence) marks its end. The decoder 520 may then determine the output features Y = [y2, y3, ..., yb] at the different time steps based on the foregoing hidden state sequences.
Then, it is judged whether the output features corresponding to the sound wave contain a preset vocabulary item (for example, "film me"); if so, the position from which the sound wave was emitted is localized to obtain the first sound source position.
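A minimal PyTorch sketch of this keyword-triggered pipeline is given below. It assumes a GRU encoder ("Listener") and GRU decoder ("Speller") with greedy decoding; the attention mechanism of the full model 500, the acoustic front end, the layer sizes, and the keyword token ids are simplifications or assumptions, not details from the disclosure.

```python
import torch
import torch.nn as nn

class SpeechRecognizer(nn.Module):
    """Simplified Listener/Speller: GRU encoder + GRU decoder (no attention)."""

    def __init__(self, n_mels=80, hidden=256, vocab_size=4000, sos_id=1):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)  # "Listener"
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)  # "Speller"
        self.out = nn.Linear(hidden, vocab_size)
        self.sos_id = sos_id

    def forward(self, features, max_len=32):
        # features: (batch, time, n_mels), the speech to be recognized X
        _, h = self.encoder(features)       # encoder hidden states -> final h
        token = torch.full((features.size(0), 1), self.sos_id, dtype=torch.long)
        tokens = []
        for _ in range(max_len):            # greedy decoding from <SOS>
            s, h = self.decoder(self.embed(token), h)  # decoder states s_t
            token = self.out(s).argmax(dim=-1)         # output feature y_t
            tokens.append(token)
        return torch.cat(tokens, dim=1)     # decoded token ids

def contains_keyword(token_ids, keyword_ids):
    """Check whether the decoded sequence contains the preset vocabulary."""
    seq, kw = token_ids.tolist()[0], list(keyword_ids)
    return any(seq[i:i + len(kw)] == kw for i in range(len(seq) - len(kw) + 1))
```

In use, the decoded ids would be compared against the id sequence of the preset vocabulary (the phrase rendered above as "film me"); a match triggers localization of the position from which the sound wave was emitted.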
Taking the above usage scenario again as an example: three hosts stream live from the same terminal to promote a product; when host A says "film me", the position of host A is taken as the sound source position by the sound source localization described above, and focus is placed there. In the related art, by contrast, focusing on the current speaker requires manual switching. The sound source localization provided by this embodiment likewise helps achieve focusing automatically and accurately, and is convenient, quick, and time-saving.
It should be noted that the method for determining the sound source position in the present embodiment is not limited to the above two methods, and other sound source positioning methods may be used.
In an exemplary embodiment, in order to further improve focusing accuracy, the received sound wave is also identified before sound source localization, to ensure that the sound wave comes from the shooting target object. For example, if the target object currently being shot is a person, sound source localization is performed after sound wave recognition confirms that the sound wave comes from a person. Likewise, if the target object currently being shot is a cat, sound source localization is performed after sound wave recognition confirms that the sound wave comes from the cat.
In an exemplary embodiment, in order to improve the accuracy of the focusing target, the present technical solution combines the sound source position and the position of the target object in the scene image to determine the focusing target. After the sound source position is determined according to the above embodiment, referring to fig. 2, in step S203, a first scene image in the viewfinder of the camera module is obtained, and the first scene image is identified to obtain the position of the target object in the first scene image. It should be noted that there is no fixed execution order between step S203 and steps S201–S202: steps S201–S202 may be executed before step S203, step S203 may be executed first, or they may be executed simultaneously.
The first scene image is a current image in a viewfinder of the camera module. The scene image may be an image of a photographed scene, an image of a video recording scene, or an image of a live scene. For example, the scene image includes a plurality of target objects, and the target objects may be characters, animals, flowers, and the like, wherein the characters, the animals, or the flowers may be in a static state or a moving state.
In an exemplary embodiment, when the current shooting target object in the finder window is a person, the target object may be an avatar. A pre-trained machine learning model may be used, such as through an avatar recognition model to identify a plurality of target objects included in the first scene image, so as to obtain locations of the plurality of target objects in the first scene image.
The training process of the avatar recognition model may be as follows: obtain as many sample images as possible and label them manually, taking images containing an avatar as positive samples and images not containing one as negative samples; optimize the model parameters of the avatar recognition model with a cross-entropy loss function; and finally obtain an avatar recognition model whose prediction accuracy meets the practical requirement. The avatar recognition model may illustratively be implemented with neural network models such as ResNet-152, Inception-V4, AlexNet, VGG-19, or DenseNet; a minimal training sketch is given below. In an exemplary embodiment, the trained avatar recognition model can be stored on the terminal, so that the focusing process can be completed by the terminal alone, without interacting with a server.
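The sketch below illustrates such a training setup in PyTorch, assuming a hypothetical labeled dataset yielding (image, label) pairs; ResNet-18 from torchvision stands in for the heavier backbones listed above purely to keep the example small.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import models

def train_avatar_model(dataset, epochs=5, lr=1e-4):
    """Binary avatar / no-avatar classifier trained with cross-entropy.

    `dataset` is a hypothetical torch Dataset yielding (image, label)
    pairs: label 1 for positive samples (contain an avatar), 0 otherwise.
    """
    model = models.resnet18(weights=None)          # any listed backbone works
    model.fc = nn.Linear(model.fc.in_features, 2)  # two classes: avatar / none
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    criterion = nn.CrossEntropyLoss()              # the loss named in the text
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)  # prediction vs. label
            loss.backward()
            optimizer.step()
    return model
```

The trained model, or an equivalent detector that also outputs bounding boxes, can then be stored on the terminal so that recognition runs locally, as described above.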
The method of recognizing the target object in the present embodiment is not limited to the above; other image recognition methods may also be used, and no limitation is imposed here.
With continued reference to fig. 2, in step S204, the first sound source position and the position of the target object are matched, and a first focus target is determined to perform photographing based on the first focus target.
In an exemplary embodiment, a three-dimensional position relationship diagram is determined according to the position of the target object in the scene image, the sound source position, and the position of the lens in the camera module, so that the sound source position and the position of the target object can be matched based on this three-dimensional position relationship. Specifically:
in an exemplary embodiment, one scheme for matching the position of the target object with the position of the sound source is: and acquiring an included angle value between a first connecting line and a second connecting line based on the three-dimensional position relation graph, wherein the first connecting line is a connecting line between the sound source position and the lens, and the second connecting line is a connecting line between each target object and the lens. And determining the target object corresponding to the minimum included angle value as a focusing target.
In an exemplary embodiment, another scheme for matching the position of the target object with the position of the sound source is: respectively calculating the distance between each target object and the first sound source position based on the three-dimensional position relation graph; and determining the target object corresponding to the minimum distance value as the first focusing target.
Through either of the above two modes, the target object closest to the sound source position, or to the direction in which the sound source lies, is determined as the focusing target, so that the lens of the camera module automatically and accurately focuses on the target object that is emitting sound or whose sound is loudest. A geometric sketch of both matching rules is given below.
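Both matching rules reduce to simple geometry on the three-dimensional position relationship. The numpy sketch below shows the angle-based and distance-based variants; the lens-centered coordinate frame and the example values are purely illustrative assumptions.

```python
import numpy as np

def match_by_angle(lens, source, targets):
    """Pick the target whose lens-to-target line (second connecting line)
    makes the smallest angle with the lens-to-source line (first line)."""
    u = source - lens
    angles = []
    for t in targets:
        v = t - lens
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return int(np.argmin(angles))  # index of the first focusing target

def match_by_distance(source, targets):
    """Pick the target closest to the first sound source position."""
    return int(np.argmin([np.linalg.norm(t - source) for t in targets]))

# Illustrative coordinates (assumed): lens at the origin of the 3D frame
lens = np.zeros(3)
source = np.array([1.0, 0.2, 3.0])                  # first sound source
targets = [np.array([0.9, 0.1, 3.1]),               # target object 0
           np.array([-1.0, 0.0, 2.5])]              # target object 1
print(match_by_angle(lens, source, targets))   # -> 0
print(match_by_distance(source, targets))      # -> 0
```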
In an exemplary embodiment, identifiers are further set for the target objects identified in the first scene image, and the user may actively select a focusing target by touching an identifier. In other words, in response to a touch on any identifier, the focusing target is switched to the position of the target object corresponding to the touched identifier. This provides a way to determine the focusing target actively according to the user's intention, and it takes priority over the focusing scheme based on sound source localization.
In an exemplary embodiment, when the focusing target has been actively determined according to the user's intention, a control for an "autofocus mode" is further provided on the terminal; in response to this control being touched, the automatic focusing of the embodiment shown in fig. 2 is restored. This offers the user a convenient way to switch focusing modes and improves the shooting experience.
In an exemplary embodiment, fig. 6 schematically illustrates a flow chart of a focusing method in another embodiment of the present disclosure, and referring to fig. 6, the illustrated embodiment of the figure includes:
step S61, in response to the opening action, the switching action or each interval preset duration of the terminal lens, executing the technical solutions corresponding to steps S201 to S204.
That is, the scheme of focusing based on the sound of the target object provided by the embodiment shown in fig. 6 can be applied to the following scenarios:
when the lens of the terminal camera shooting module is opened: receiving sound waves through a terminal where the camera module is located; positioning according to the sound waves and determining the position of a sound source; and acquiring a scene image in the viewfinder, and determining a focus target according to the sound source position and the scene image so as to shoot based on the focus target.
When the lens of the terminal camera module is switched: receiving sound waves through a terminal where the camera module is located; positioning according to the sound waves and determining the position of a sound source; and acquiring a scene image in the viewfinder, and determining a focus target according to the sound source position and the scene image so as to shoot based on the focus target.
At every preset time interval during shooting by the terminal camera module: receiving sound waves through the terminal where the camera module is located; performing localization according to the sound waves to determine the sound source position; and acquiring a scene image in the viewfinder and determining a focusing target according to the sound source position and the scene image, so as to shoot based on the focusing target.
Specifically, on the one hand, sound waves around the terminal (within a preset distance from the terminal) are collected in step S621, and sound source localization is performed based on the sound waves in step S631 to obtain the sound source position. On the other hand, the scene image in the viewfinder is acquired in step S622 and identified in step S632 to obtain a plurality of avatars. The two are then matched to obtain the focusing target (step S64).
In the embodiment provided by the disclosure, on the one hand, for a target object being shot by a camera module of a terminal, a sound wave of the target object is received by the terminal where the camera module is located, and sound source positioning is performed based on the sound wave, so that a sound source position for preliminarily positioning the target object can be obtained. On the other hand, a scene image in a viewfinder of the camera module is obtained, the position of the target object in the scene image is identified, and the sound source position and the position of the target object are matched to determine the first focusing target. Therefore, the technical scheme combines the sound source position and the position of the target object to determine the focusing target, realizes automatic focusing on the specific target object on the basis of ensuring the accuracy of the focusing target, and improves the focusing convenience degree when the terminal shoots.
It should be noted that although the various steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Furthermore, the above-described figures are merely schematic illustrations of processes included in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
The terminal provided by the technical scheme is introduced as follows:
as shown in fig. 7, a terminal 700 provided in this exemplary embodiment includes: the microphone array 701 is used for receiving sound waves through a terminal where the camera module is located; a sound source positioning component 702, configured to perform positioning according to the sound wave, and determine a first sound source position; the camera module 703 is configured to acquire a first scene image in the finder window, and identify the first scene image to obtain a position of a target object in the first scene image; and the system is used for matching the first sound source position with the position of the target object and determining a first focusing target so as to shoot based on the first focusing target.
The embodiment of the focusing method can be completed based on the microphone array, the sound source positioning assembly and the camera module.
The focusing device provided by the technical scheme is introduced as follows:
a focusing apparatus provided in the present exemplary embodiment is shown in fig. 8, and the focusing apparatus 800 includes: a receiving module 801, a positioning module 802, an image recognition module 803, and a matching module 804.
The receiving module 801 is configured to receive sound waves through a terminal where the camera module is located. The positioning module 802 is configured to perform positioning according to the sound wave, and determine a first sound source position; the image identification module 803 is configured to acquire a first scene image in the finder window, and identify the first scene image to obtain a position of a target object in the first scene image. And the matching module 804 is configured to match the first sound source position and the target object position, and determine a first focus target, so as to perform shooting based on the first focus target.
In an exemplary embodiment of the present disclosure, based on the above scheme, the positioning module 802 includes: a sound source determination unit and a power calculation unit are assumed.
Wherein the above-mentioned assumed sound source determining unit is configured to: collecting sound signals through a microphone array arranged on the terminal; processing the sound signals collected by each microphone in the microphone array to obtain beams; carrying out grid point division on an effective area corresponding to the terminal to obtain a plurality of assumed sound sources;
the power calculating unit is configured to: calculating the output signal power of each assumed sound source at the microphone array based on the beams; and determining a position of an assumed sound source having a maximum power as the first sound source position.
In an exemplary embodiment of the present disclosure, based on the above scheme, the power calculating unit is specifically configured to: calculating a time delay between each microphone and a reference point for each assumed sound source, wherein the reference point is a position of a microphone closest to the assumed sound source in the microphone array; and carrying out signal time shift on each microphone according to the time delay, and carrying out weighted summation to obtain the power of the assumed sound source.
In an exemplary embodiment of the present disclosure, based on the above scheme, the focusing apparatus further includes a sound wave identification module 805 configured to: identify the sound wave based on a pre-trained machine learning model to determine whether the sound wave contains preset words. The positioning module 802 is further configured to: in response to the preset words being contained in the target sound wave, determine the first sound source position according to the target sound wave.
In an exemplary embodiment of the present disclosure, based on the above scheme, the receiving module 801 is specifically configured to: and receiving the sound wave through a microphone array arranged in the terminal, or receiving the sound wave through a microphone array mounted on the terminal.
In an exemplary embodiment of the present disclosure, based on the above scheme, the image identification module 803 is specifically configured to: identify the first scene image based on a pre-trained machine learning model, obtaining a plurality of target objects contained in the first scene image and their positions in the first scene image.
In an exemplary embodiment of the present disclosure, based on the above scheme, the focusing apparatus further includes a three-dimensional position determining module, configured to: before the matching module 804 matches the first sound source position and the target object position, a three-dimensional position relationship diagram is determined according to the target object position in the first scene image, the sound source position, and the position of a lens in the camera module, so as to match the first sound source position and the target object position based on the three-dimensional position relationship diagram.
In an exemplary embodiment of the present disclosure, based on the above scheme, the matching module 804 is specifically configured to: acquiring an included angle value between a first connecting line and a second connecting line based on the three-dimensional position relation graph, wherein the first connecting line is a connecting line between the first sound source position and the lens, and the second connecting line is a connecting line between each target object and the lens; and determining the target object corresponding to the minimum included angle value as the first focusing target.
In an exemplary embodiment of the present disclosure, based on the above scheme, the matching module 804 is further specifically configured to: respectively calculating the distance between each target object and the first sound source position based on the three-dimensional position relation graph; and determining the target object corresponding to the minimum distance value as the first focusing target.
In an exemplary embodiment of the present disclosure, based on the above scheme, the focusing apparatus 800 further includes: and a switching module.
Wherein, the switching module is used for: respectively setting identifiers for the target objects identified in the first scene image; and responding to the touch of the identifier, and switching the focus target to the position of the target object corresponding to the touched identifier.
In an exemplary embodiment of the present disclosure, based on the above scheme, in response to the lens switching action, the receiving module 801 is configured to: receiving sound waves through a terminal where the camera module is located to determine a second sound source position; and the image recognition module 803 is configured to: acquiring a second scene image in a viewfinder of the camera module, and identifying the second scene image to obtain the position of a target object in the second scene image; and the matching module 804 is configured to: and matching the second sound source position and the position of the target object, and determining a second focusing target to shoot based on the second focusing target.
In an exemplary embodiment of the present disclosure, based on the above scheme, at every preset time interval, the receiving module 801 is configured to: receiving sound waves through a terminal where the camera module is located to determine a third sound source position; and, the image recognition module 803 is configured to: acquiring a third scene image in a viewfinder of the camera module, and identifying the third scene image to obtain the position of a target object in the third scene image; and the matching module 804 is configured to: and matching the third sound source position with the position of the target object, and determining a third focusing target to shoot based on the third focusing target.
The specific details of each module or unit in the focusing apparatus have been described in detail in the corresponding focusing method, and therefore are not described herein again.
FIG. 9 shows a schematic diagram of an electronic device suitable for use in implementing exemplary embodiments of the present disclosure. It should be noted that the electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
The electronic device of the present disclosure includes at least a processor and a memory for storing one or more programs, which when executed by the processor, cause the processor to implement the focusing method of the exemplary embodiments of the present disclosure.
Specifically, as shown in fig. 9, the electronic device 200 may include: a processor 210, an internal memory 221, an external memory interface 222, a Universal Serial Bus (USB) interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 271, a receiver 272, a microphone 273, an earphone interface 274, a sensor module 280, a display screen 290, a camera module 291, an indicator 292, a motor 293, keys 294, and a Subscriber Identity Module (SIM) card interface 295. The sensor module 280 may include a depth sensor, a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, and the like.
It is to be understood that the illustrated structure of the embodiment of the present application does not specifically limit the electronic device 200. In other embodiments of the present application, the electronic device 200 may include more or fewer components than shown, or combine certain components, or split certain components, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 210 may include one or more processing units, such as: the Processor 210 may include an Application Processor (AP), a modem Processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband Processor, and/or a Neural Network Processor (NPU), and the like. The different processing units may be separate devices or may be integrated into one or more processors. Additionally, a memory may be provided in processor 210 for storing instructions and data.
The USB interface 230 is an interface conforming to the USB standard specification, and may specifically be a MiniUSB interface, a microsusb interface, a USB type c interface, or the like. The USB interface 230 may be used to connect a charger to charge the electronic device 200, and may also be used to transmit data between the electronic device 200 and a peripheral device. And the earphone can also be used for connecting an earphone and playing audio through the earphone. The interface can also be used for connecting other electronic equipment and the like.
The charge management module 240 is configured to receive a charging input from a charger. The charger may be a wireless charger or a wired charger. The power management module 241 is used for connecting the battery 242, the charging management module 240 and the processor 210. The power management module 241 receives the input of the battery 242 and/or the charging management module 240, and supplies power to the processor 210, the internal memory 221, the display screen 290, the camera module 291, the wireless communication module 260, and the like.
The wireless communication function of the electronic device 200 may be implemented by the antenna 1, the antenna 2, the mobile communication module 250, the wireless communication module 260, a modem processor, a baseband processor, and the like.
The mobile communication module 250 may provide a solution including 2G/3G/4G/5G wireless communication applied on the electronic device 200.
The Wireless Communication module 260 may provide a solution for Wireless Communication applied to the electronic device 200, including Wireless Local Area Networks (WLANs) (e.g., Wireless Fidelity (Wi-Fi) network), Bluetooth (BT), Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like.
The electronic device 200 implements the display function through the GPU, the display screen 290, the application processor, and the like. The GPU is a microprocessor for image processing, connected to the display screen 290 and the application processor, and is used to perform the mathematical and geometric calculations needed for graphics rendering. The processor 210 may include one or more GPUs that execute program instructions to generate or alter display information.
The electronic device 200 may implement a shooting function through the ISP, the camera module 291, the video codec, the GPU, the display screen 290, the application processor, and the like. In some embodiments, the electronic device 200 may include 1 or N camera modules 291, where N is a positive integer greater than 1, and if the electronic device 200 includes N cameras, one of the N cameras is a main camera, and the others may be sub cameras, such as a telephoto camera.
Internal memory 221 may be used to store computer-executable program code, including instructions. The internal memory 221 may include a program storage area and a data storage area. The external memory interface 222 may be used to connect an external memory card, such as a Micro SD card, to extend the memory capability of the electronic device 200.
The electronic device 200 may implement an audio function through the audio module 270, the speaker 271, the receiver 272, the microphone 273, the headphone interface 274, the application processor, and the like. Such as music playing, recording, etc.
Audio module 270 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. Audio module 270 may also be used to encode and decode audio signals. In some embodiments, the audio module 270 may be disposed in the processor 210, or some functional modules of the audio module 270 may be disposed in the processor 210.
The speaker 271 is used to convert an audio electrical signal into a sound signal. The electronic device 200 can play music or take a hands-free call through the speaker 271. The receiver 272, also called the "earpiece", is used to convert an audio electrical signal into a sound signal; when the electronic device 200 receives a call or a voice message, the voice can be heard by placing the receiver 272 close to the ear. The microphone 273, also called the "mic", is used to convert sound signals into electrical signals; when making a call or sending a voice message, the user can input a sound signal by speaking close to the microphone 273. The electronic device 200 may be provided with at least one microphone 273. The earphone interface 274 is used to connect wired earphones.
For sensors included with the electronic device 200, a depth sensor is used to obtain depth information of the scene. The pressure sensor is used for sensing a pressure signal and converting the pressure signal into an electric signal. The gyro sensor may be used to determine the motion pose of the electronic device 200. The air pressure sensor is used for measuring air pressure. The magnetic sensor includes a hall sensor. The electronic device 200 may detect the opening and closing of the flip holster using a magnetic sensor. The acceleration sensor may detect the magnitude of acceleration of the electronic device 200 in various directions (typically three axes). The distance sensor is used for measuring distance. The proximity light sensor may include, for example, a Light Emitting Diode (LED) and a light detector, such as a photodiode. The fingerprint sensor is used for collecting fingerprints. The temperature sensor is used for detecting temperature. The touch sensor can communicate the detected touch operation to the application processor to determine the touch event type. Visual output related to the touch operation may be provided through the display screen 290. The ambient light sensor is used for sensing the ambient light brightness. The bone conduction sensor may acquire a vibration signal.
The keys 294 include a power-on key, a volume key, and the like. The keys 294 may be mechanical keys. Or may be touch keys. The motor 293 may generate a vibration indication. The motor 293 may be used for both electrical vibration prompting and touch vibration feedback. Indicator 292 may be an indicator light that may be used to indicate a state of charge, a change in charge, or may be used to indicate a message, missed call, notification, etc. The SIM card interface 295 is used to connect a SIM card. The electronic device 200 interacts with the network through the SIM card to implement functions such as communication and data communication.
The present application also provides a computer-readable storage medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device.
A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable storage medium may transmit, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The computer-readable storage medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method as described in the embodiments below.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or by hardware, and the described units may also be disposed in a processor, where the names of the units do not, in some cases, constitute a limitation on the units themselves.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (16)

1. A focusing method, comprising:
receiving sound waves through a terminal where the camera module is located;
positioning according to the sound waves and determining a first sound source position;
acquiring a first scene image in a viewfinder, and identifying the first scene image to obtain the position of a target object in the first scene image;
and matching the first sound source position with the position of the target object to determine a first focusing target, so as to shoot based on the first focusing target.
2. The method of claim 1, wherein the positioning according to the sound waves and determining a first sound source position comprises:
collecting sound signals through a microphone array arranged on the terminal;
processing the sound signals collected by each microphone in the microphone array to obtain beams;
carrying out grid point division on an effective area corresponding to the terminal to obtain a plurality of assumed sound sources;
calculating the output signal power of each assumed sound source at the microphone array based on the beams;
and determining the position of the assumed sound source having the largest power as the first sound source position.
3. The method of claim 2, wherein calculating the output signal power of each assumed sound source at the microphone array based on the beams comprises:
calculating, for each assumed sound source, a time delay between each microphone and a reference point, wherein the reference point is the position of the microphone in the microphone array closest to the assumed sound source;
and time-shifting the signal of each microphone according to the time delay, and performing weighted summation on the shifted signals to obtain the power of the assumed sound source.
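
Claims 2 and 3 together describe a steered-response-power style search: the effective area in front of the terminal is divided into grid points, each grid point is treated as an assumed sound source, the microphone signals are time-shifted by their delays relative to the microphone closest to that assumed source and summed with weights, and the assumed source whose summed signal has the largest power becomes the first sound source position. The following is a minimal numpy sketch of that procedure; the grid extents and step, the uniform weights, and the integer-sample circular shift are illustrative assumptions, not values fixed by the patent.

    import numpy as np

    def steered_power(frames, mic_positions, source, fs, c=343.0, weights=None):
        # Delay-and-sum power for one assumed sound source (claim 3).
        # frames: (n_mics, n_samples) signals; mic_positions: (n_mics, 3) in metres.
        dists = np.linalg.norm(mic_positions - source, axis=1)
        delays = (dists - dists.min()) / c           # delay of each microphone relative
        shifts = np.round(delays * fs).astype(int)   # to the closest one, in whole samples
        n_mics, n_samples = frames.shape
        if weights is None:
            weights = np.full(n_mics, 1.0 / n_mics)  # uniform weighting (an assumption)
        summed = np.zeros(n_samples)
        for m in range(n_mics):
            # Advance later-arriving channels so all channels line up; np.roll's
            # circular wrap at the edges is a simplification for the sketch.
            summed += weights[m] * np.roll(frames[m], -shifts[m])
        return float(np.mean(summed ** 2))           # output signal power

    def locate_first_sound_source(frames, mic_positions, fs,
                                  x=(-2.0, 2.0), y=(-2.0, 2.0), z=(0.5, 3.0), step=0.25):
        # Divide the effective area into grid points, i.e. assumed sound sources
        # (claim 2), score each one, and return the most powerful one.
        grid = np.array([[i, j, k]
                         for i in np.arange(*x, step)
                         for j in np.arange(*y, step)
                         for k in np.arange(*z, step)])
        powers = [steered_power(frames, mic_positions, p, fs) for p in grid]
        return grid[int(np.argmax(powers))]

In use, frames would come from the terminal's microphone array at sampling rate fs; a real implementation would refine the shift to sub-sample accuracy and restrict the grid to the camera's field of view.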
4. The method of claim 1, wherein after receiving the sound waves through a terminal where the camera module is located, the method further comprises:
identifying the sound waves based on a pre-trained machine learning model to determine whether the sound waves contain a preset word;
wherein the positioning according to the sound waves and determining the position of a first sound source comprises:
in response to determining that a target sound wave contains the preset word, determining the first sound source position according to the target sound wave.
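
Claim 4 gates localization on a preset word: the sound waves are first run through a pre-trained recognizer, and only a target sound wave containing the preset word triggers sound source localization. Below is a sketch of that gating, reusing locate_first_sound_source from the sketch above; contains_preset_word is a stand-in for the pre-trained model, and its interface and the example word are assumptions.

    def contains_preset_word(frames, preset_words):
        # Placeholder for the pre-trained machine learning model of claim 4.
        # A real implementation would run keyword spotting on the signals;
        # returning True unconditionally just keeps the sketch runnable.
        return True

    def localize_if_keyword(frames, mic_positions, fs, preset_words=("focus here",)):
        # Only sound waves containing a preset word trigger localization.
        if not contains_preset_word(frames, preset_words):
            return None  # ambient sound: no refocusing
        return locate_first_sound_source(frames, mic_positions, fs)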
5. The method of claim 1, wherein receiving the sound waves through a terminal where the camera module is located comprises:
receiving the sound waves through a microphone array arranged inside the terminal; or,
receiving the sound waves through a microphone array mounted on the terminal.
6. The method according to any one of claims 1 to 5, wherein identifying the first scene image to obtain the position of the target object in the first scene image comprises:
identifying the first scene image based on a pre-trained machine learning model to obtain a plurality of target objects contained in the first scene image and the positions of the plurality of target objects in the first scene image.
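
Claim 6 runs a pre-trained model over the first scene image and returns every detected target object with its position. A sketch follows, assuming the detector returns (label, x0, y0, x1, y1) bounding boxes and that an object's position is taken as its box centre; both the interface and the centre convention are illustrative choices, not fixed by the patent.

    def detect_targets(scene_image, detector):
        # `detector` stands in for the pre-trained machine learning model of
        # claim 6; it is assumed to yield (label, x0, y0, x1, y1) boxes.
        # Each target object's position is reported as its box centre.
        return [(label, ((x0 + x1) / 2.0, (y0 + y1) / 2.0))
                for label, x0, y0, x1, y1 in detector(scene_image)]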
7. The method according to any one of claims 1 to 5, wherein prior to matching the first sound source position and the position of the target object, the method further comprises:
determining a three-dimensional position relation diagram according to the position of the target object in the first scene image, the first sound source position, and the position of a lens in the camera module, so as to match the first sound source position with the position of the target object based on the three-dimensional position relation diagram.
8. The method of claim 7, wherein matching the first sound source position and the position of the target object to obtain the first focusing target comprises:
acquiring an included angle value between a first connecting line and a second connecting line based on the three-dimensional position relation diagram, wherein the first connecting line is the connecting line between the first sound source position and the lens, and the second connecting line is the connecting line between each target object and the lens;
and determining the target object corresponding to the minimum included angle value as the first focusing target.
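
Claim 8 selects the target object whose connecting line to the lens makes the smallest included angle with the connecting line from the first sound source to the lens. A numpy sketch follows, under the assumption that all positions are 3-D coordinates in the relation diagram's frame of claim 7.

    import numpy as np

    def match_by_angle(source_pos, object_positions, lens_pos):
        # First connecting line: lens -> first sound source.
        u = np.asarray(source_pos, dtype=float) - np.asarray(lens_pos, dtype=float)
        best_obj, best_angle = None, np.inf
        for obj in object_positions:
            # Second connecting line: lens -> this target object.
            v = np.asarray(obj, dtype=float) - np.asarray(lens_pos, dtype=float)
            cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
            angle = np.arccos(np.clip(cos, -1.0, 1.0))
            if angle < best_angle:
                best_obj, best_angle = obj, angle
        return best_obj  # target object with the minimum included angle (claim 8)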
9. The method of claim 7, wherein matching the first sound source position and the position of the target object to obtain the first focusing target comprises:
respectively calculating, based on the three-dimensional position relation diagram, the distance between each target object and the first sound source position;
and determining the target object corresponding to the minimum distance value as the first focusing target.
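
Claim 9 is the simpler variant: the target object at the smallest straight-line distance from the first sound source in the relation diagram becomes the focusing target. A one-function sketch under the same coordinate assumption:

    import numpy as np

    def match_by_distance(source_pos, object_positions):
        # Target object nearest to the first sound source position (claim 9).
        src = np.asarray(source_pos, dtype=float)
        return min(object_positions,
                   key=lambda obj: float(np.linalg.norm(np.asarray(obj, dtype=float) - src)))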
10. The method according to any one of claims 1 to 5, further comprising:
setting an identifier for each target object identified in the first scene image;
and in response to a touch on an identifier, switching the focusing target to the position of the target object corresponding to the touched identifier.
11. The method according to any one of claims 1 to 5, further comprising:
in response to an opening action or a switching action of a lens in the camera module, receiving sound waves through the terminal where the camera module is located to determine a second sound source position;
acquiring a second scene image in the viewfinder, and identifying the second scene image to obtain the position of a target object in the second scene image;
and matching the second sound source position with the position of the target object to determine a second focusing target, so as to shoot based on the second focusing target.
12. The method according to any one of claims 1 to 5, further comprising:
receiving, at preset time intervals, sound waves through the terminal where the camera module is located to determine a third sound source position;
acquiring a third scene image in the viewfinder, and identifying the third scene image to obtain the position of a target object in the third scene image;
and matching the third sound source position with the position of the target object to determine a third focusing target, so as to shoot based on the third focusing target.
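
Claim 12 repeats the whole pipeline at a preset interval so the focusing target tracks a moving or changing speaker. The sketch below ties together the earlier sketches; the camera object and its record_audio, capture_frame, detector, and focus_on members are a hypothetical terminal API, not one defined by the patent, and the detected target positions are assumed to have been mapped into the same coordinate frame as the sound source (the relation diagram of claim 7).

    import time

    def periodic_refocus(camera, mic_positions, fs, interval_s=2.0):
        # Every preset interval: re-localize, re-detect, re-match, re-focus (claim 12).
        while True:
            frames = camera.record_audio()                    # hypothetical call
            source = locate_first_sound_source(frames, mic_positions, fs)
            targets = detect_targets(camera.capture_frame(), camera.detector)
            focus = match_by_distance(source, [pos for _, pos in targets])
            camera.focus_on(focus)                            # hypothetical call
            time.sleep(interval_s)                            # preset time interval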
13. A terminal, comprising:
a microphone array, configured to receive sound waves;
a sound source positioning component, configured to perform positioning according to the sound waves and determine a first sound source position;
and a camera module, configured to acquire a first scene image in a viewfinder and identify the first scene image to obtain the position of a target object in the first scene image, and further configured to match the first sound source position with the position of the target object and determine a first focusing target, so as to shoot based on the first focusing target.
14. A focusing apparatus, comprising:
a receiving module, configured to receive sound waves through a terminal where a camera module is located;
a positioning module, configured to perform positioning according to the sound waves and determine a first sound source position;
an image identification module, configured to acquire a first scene image in a viewfinder and identify the first scene image to obtain the position of a target object in the first scene image;
and a matching module, configured to match the first sound source position with the position of the target object and determine a first focusing target, so as to shoot based on the first focusing target.
15. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-12.
16. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1-12 via execution of the executable instructions.
CN202011355143.6A 2020-11-26 2020-11-26 Focusing method and apparatus, terminal, computer-readable storage medium, and electronic device Active CN112565598B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011355143.6A CN112565598B (en) 2020-11-26 2020-11-26 Focusing method and apparatus, terminal, computer-readable storage medium, and electronic device

Publications (2)

Publication Number Publication Date
CN112565598A true CN112565598A (en) 2021-03-26
CN112565598B CN112565598B (en) 2022-05-17

Family

ID=75046100

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011355143.6A Active CN112565598B (en) 2020-11-26 2020-11-26 Focusing method and apparatus, terminal, computer-readable storage medium, and electronic device

Country Status (1)

Country Link
CN (1) CN112565598B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000101901A (en) * 1998-09-25 2000-04-07 Canon Inc Image pickup device and method therefor, image pickup device control system and storage medium
US20040001137A1 (en) * 2002-06-27 2004-01-01 Ross Cutler Integrated design for omni-directional camera and microphone array
JP2008268732A (en) * 2007-04-24 2008-11-06 Canon Inc Imaging apparatus and range-finding control method for the imaging apparatus
CN102860041A (en) * 2010-04-26 2013-01-02 剑桥机电有限公司 Loudspeakers with position tracking
CN104284081A (en) * 2014-05-14 2015-01-14 深圳警翼数码科技有限公司 Law enforcement recorder and control method thereof
CN104092936A (en) * 2014-06-12 2014-10-08 小米科技有限责任公司 Automatic focusing method and apparatus
US20160277863A1 (en) * 2015-03-19 2016-09-22 Intel Corporation Acoustic camera based audio visual scene analysis
CN209496378U (en) * 2018-12-13 2019-10-15 北京小米移动软件有限公司 Terminal
CN109683135A (en) * 2018-12-28 2019-04-26 科大讯飞股份有限公司 A kind of sound localization method and device, target capturing system
CN111243283A (en) * 2019-09-27 2020-06-05 杭州爱华仪器有限公司 Automatic recognition device and method for whistling vehicle based on acoustic array

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113747074A (en) * 2021-09-17 2021-12-03 广东小天才科技有限公司 Tracking shooting method and device and wearable device

Also Published As

Publication number Publication date
CN112565598B (en) 2022-05-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant