CN112084929A - Speaker recognition method, device, electronic equipment, storage medium and system - Google Patents

Info

Publication number
CN112084929A
Authority
CN
China
Prior art keywords
zoom lens
position information
speaker
sound source
information
Prior art date
Legal status
Pending
Application number
CN202010923630.1A
Other languages
Chinese (zh)
Inventor
张国锋
邓魁元
韦国华
胡小鹏
Current Assignee
Suzhou Keda Technology Co Ltd
Original Assignee
Suzhou Keda Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Keda Technology Co Ltd filed Critical Suzhou Keda Technology Co Ltd
Priority to CN202010923630.1A priority Critical patent/CN112084929A/en
Publication of CN112084929A publication Critical patent/CN112084929A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/166 Detection; Localisation; Normalisation using acquisition arrangements
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S11/00 Systems for determining distance or velocity not using reflection or reradiation
    • G01S11/14 Systems for determining distance or velocity not using reflection or reradiation using ultrasonic, sonic, or infrasonic waves
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/30 Determining absolute distances from a plurality of spaced points of known location
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Studio Devices (AREA)

Abstract

The invention relates to the technical field of video conferencing, and in particular to a speaker recognition method, apparatus, electronic device, storage medium and system. The method comprises: acquiring sound source localization information in a conference venue and a panoramic image captured by a fixed-focus lens; determining the position of each face in the venue relative to the fixed-focus lens from the positions of the faces in the panoramic image, to obtain first position information; converting the sound source localization information and the first position information into positions of the sound source and of each face in the venue relative to the zoom lens, using the positional relationships between the sound source localization module and the zoom lens and between the fixed-focus lens and the zoom lens, to obtain second position information and third position information respectively; and determining the speaker and the rotation information of the zoom lens from the second and third position information, so that the zoom lens captures an image of the speaker. The method achieves accurate, real-time speaker recognition through the cooperation of the fixed-focus lens and the zoom lens.

Description

Speaker recognition method, device, electronic equipment, storage medium and system
Technical Field
The invention relates to the technical field of video conferences, in particular to a speaker identification method, a speaker identification device, electronic equipment, a storage medium and a speaker identification system.
Background
In a video conference, to ensure the quality of the conference, the current speaker in the conference venue is generally highlighted, which first requires identifying the speaker among the participants. In the prior art, a fixed-focus lens is generally used to capture a panoramic image of all participants in the venue, from which the participants' positions in the venue are determined; a sound source localization module determines the sound source in the venue; the sound source is matched against the participants' positions to determine the speaker; and finally the speaker is marked in the panoramic image.
However, in this approach the speaker is ultimately only marked within the panoramic image, and in a venue with many participants, even when the speaker is marked there, the other participants cannot accurately locate him or her. On this basis, the inventors sought to recognize the speaker by providing two or more lenses in the venue, i.e. a combination of a fixed-focus lens and a zoom lens. However, how to use the two lenses to recognize the speaker with high accuracy and in real time remains an urgent problem to be solved.
Disclosure of Invention
In view of this, embodiments of the present invention provide a speaker recognition method, apparatus, electronic device, storage medium, and system to solve the problem of speaker recognition.
According to a first aspect, an embodiment of the present invention provides a speaker identification method, including:
acquiring sound source positioning information in a meeting place and a panoramic image acquired by a fixed-focus lens; the sound source positioning information is position information of a sound source relative to a sound source positioning module;
determining the position information of each face in the meeting place relative to the fixed-focus lens based on the position information of the face in the panoramic image to obtain first position information;
converting the sound source positioning information and the first position information into position information of the sound source and each face in the meeting place relative to the zoom lens by using the sound source positioning module and the position relationship between the fixed-focus lens and the zoom lens to respectively obtain second position information and third position information;
and determining a speaker and rotation information of the zoom lens according to the second position information and the third position information, so that the zoom lens collects an image of the speaker.
In the speaker recognition method provided by this embodiment, position information is converted between coordinate systems using the positional relationships between the sound source localization module and the zoom lens and between the fixed-focus lens and the zoom lens, so that the speaker is recognized through the cooperation of the sound source localization module, the fixed-focus lens and the zoom lens. Because the recognition process mainly involves these conversions of position information, the accuracy and real-time performance of speaker recognition are guaranteed, and the camera can switch accurately to the speaker in the venue. Compared with the single fixed camera view of a traditional video conference, the speaker can be located and tracked as an object, increasing the intelligence of the video conference terminal.
With reference to the first aspect, in a first implementation manner of the first aspect, the determining, based on location information of a face image in the panoramic image, location information of each face in the conference venue relative to the fixed-focus lens to obtain first location information includes:
determining the position relation between each face center point in the panoramic image and the panoramic image center point by using the position information of the face in the panoramic image;
acquiring the field angle of the fixed-focus lens and the parameters of the panoramic image;
and determining the first position information based on the field angle of the fixed-focus lens, the parameters of the panoramic image and the position relationship between the center point of each face and the center point of the panoramic image.
According to the speaker recognition method provided by this embodiment, the position of each face in the venue relative to the fixed-focus lens is determined from the field angle of the fixed-focus lens. Since only the field angle, the parameters of the panoramic image and the positional relationship between the face centers and the image center are involved, the first position information can be determined with a small amount of data processing, improving the real-time performance of speaker recognition.
With reference to the first implementation manner of the first aspect, in a second implementation manner of the first aspect, the determining the first position information based on the field angle of the fixed-focus lens, the parameters of the panoramic image, and the position relationship between the center point of each human face and the center point of the panoramic image includes:
determining angle information of each face in the meeting place relative to the fixed focus lens by using the field angle of the fixed focus lens, the parameters of the panoramic image and the position relationship between the center point of each face and the center point of the panoramic image;
calculating the focal length of the fixed-focus lens by using the parameters of the panoramic image and the field angle of the fixed-focus lens;
acquiring a preset face height;
calculating the distance from each human face in the meeting place to the fixed focus lens by using the preset human face height, the focal length of the fixed focus lens and the parameters of the panoramic image;
and determining coordinate information of each face in the meeting place relative to the fixed-focus lens based on the distance from each face in the meeting place to the fixed-focus lens and the angle information of each face in the meeting place relative to the fixed-focus lens.
According to the speaker recognition method provided by this embodiment, after the angle of each face in the venue relative to the fixed-focus lens is determined, the coordinates of the face relative to the fixed-focus lens are computed from the preset face height, the focal length of the fixed-focus lens and the parameters of the panoramic image. Because these three quantities are fixed values, computing the coordinates directly from them guarantees both the accuracy and the real-time performance of the result.
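As an illustration of the steps above, the following minimal Python sketch estimates a face's angle and distance relative to the fixed-focus lens from its bounding box, using a pinhole-camera model and a preset real face height. All names and values here (the 0.20 m face height, the 90° horizontal field angle) are illustrative assumptions, not taken from the patent.

```python
import math

def face_position_rel_lens(face_box, img_w, hfov_deg, real_face_h=0.20):
    """Sketch of the claimed steps: angle from the pixel offset of the
    face center and the field angle, distance from an assumed preset
    real face height via similar triangles (pinhole model)."""
    x, y, w, h = face_box                       # top-left corner, width, height (px)
    cx = x + w / 2                              # face center, pixels
    # focal length in pixels from the horizontal field angle
    f_px = (img_w / 2) / math.tan(math.radians(hfov_deg) / 2)
    theta = math.atan2(cx - img_w / 2, f_px)    # horizontal angle to the face
    dist = real_face_h * f_px / h               # distance: H_real * f_px / h_px
    # Cartesian coordinates with the fixed-focus lens as origin
    return dist * math.sin(theta), dist * math.cos(theta), math.degrees(theta)

x, z, ang = face_position_rel_lens((900, 400, 80, 100), 1920, 90)
```

In a 1920-pixel-wide frame, a 100-pixel-tall face slightly left of center comes out roughly 1.9 m away at about -1.2°; the same formulas cover every face box returned by the detector.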
With reference to the first aspect, in a third implementation manner of the first aspect, the converting the sound source location information and the first location information into location information of the sound source and each human face in the conference venue relative to the zoom lens to obtain second location information and third location information respectively includes:
converting the sound source positioning information into angle information of the sound source relative to the zoom lens by utilizing the position relation between the sound source positioning module and the zoom lens to obtain second position information;
converting the first position information into coordinate information of the face in the meeting place relative to the zoom lens by utilizing the position relation between the fixed-focus lens and the zoom lens;
and determining angle information of each human face in the meeting place relative to the zoom lens based on coordinate information of the human face in the meeting place relative to the zoom lens to obtain the third position information.
According to the speaker recognition method provided by this embodiment, once the coordinates of each face in the venue relative to the zoom lens are determined, the corresponding angle information can be obtained by simple computation, reducing the amount of data processing and ensuring that the third position information is determined in real time.
With reference to the first aspect or any one of the first to third embodiments of the first aspect, in a fourth embodiment of the first aspect, the position information of the sound source and each human face within the meeting field with respect to the zoom lens includes angle information of the sound source and each human face within the meeting field with respect to the zoom lens; wherein, the determining a speaker and the rotation information of the zoom lens according to the second position information and the third position information so that the zoom lens collects the image of the speaker comprises:
calculating an absolute value of a difference between the second position information and the third position information;
judging whether the absolute value of the difference value is within a preset range or not;
when the absolute value of the difference value is within a preset range, determining a face corresponding to the sound source;
and determining the speaker and the rotation information of the zoom lens based on the face corresponding to the sound source so that the zoom lens collects the image of the speaker.
According to the speaker recognition method provided by this embodiment, the second position information corresponding to the sound source is compared with the third position information corresponding to each face by taking their difference; when the absolute value of the difference is within the preset range, the face corresponding to the sound source is determined. Because the sound source is checked against the face images captured by the fixed-focus lens, the accuracy of speaker recognition is improved.
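The matching step above can be sketched as follows: among all faces, pick the one whose angle relative to the zoom lens differs least from the sound source's angle, provided the absolute difference lies within the preset range. The 5° tolerance and the function name are illustrative assumptions.

```python
def match_speaker(src_angle, face_angles, tol_deg=5.0):
    """Return the index of the face whose angle (relative to the zoom
    lens, i.e. third position information) is closest to the sound
    source's angle (second position information), or None when no face
    falls within the assumed preset range `tol_deg`."""
    best, best_err = None, tol_deg
    for i, fa in enumerate(face_angles):
        err = abs(src_angle - fa)          # absolute value of the difference
        if err < best_err:
            best, best_err = i, err
    return best

idx = match_speaker(31.0, [-12.5, 8.0, 29.5, 55.0])
```

Here the face at 29.5° matches the 31.0° sound source; if no face were within the range, `None` would signal that the sound source has no corresponding face.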
With reference to the fourth implementation manner of the first aspect, in the fifth implementation manner of the first aspect, the determining, based on a face corresponding to the sound source, the speaker and rotation information of the zoom lens, so that the zoom lens captures an image of the speaker includes:
determining rotation information of the zoom lens based on angle information of a face corresponding to the sound source relative to the zoom lens;
acquiring an image collected by the rotated zoom lens;
determining the speaker based on the position of the face in the acquired image acquired by the zoom lens;
and controlling the image of the speaker collected by the zoom lens.
According to the speaker recognition method provided by this embodiment, an image is captured by the rotated zoom lens and the face positions in that image are used to confirm the speaker again, after which the zoom lens captures the speaker's image, improving the accuracy of the captured speaker image.
With reference to the fifth implementation manner of the first aspect, in the sixth implementation manner of the first aspect, the controlling the image of the speaker captured by the zoom lens includes:
acquiring an image of the speaker;
judging whether the position of the face area of the speaker in the image of the speaker meets a preset condition or not;
when the position of the face area of the speaker in the image of the speaker does not meet a preset condition, adjusting the zoom lens to adjust the position of the face area of the speaker in the image of the speaker, and determining the image of the speaker in the zoom lens;
and displaying the image of the speaker in the zoom lens.
According to the speaker recognition method provided by this embodiment, the zoom lens is finely adjusted so that the speaker is framed accurately, ensuring an optimal proportion of the speaker's face in the zoom-lens image.
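The fine-adjustment check might look like the following sketch: the offset of the face region from the frame center is compared against a tolerance, and a pan/tilt correction is returned only when the preset condition is not met. The tolerance value and the sign conventions are assumptions for illustration.

```python
def framing_adjustment(face_box, img_w, img_h, tol=0.05):
    """Judge whether the speaker's face region satisfies an assumed
    preset condition (centered within `tol` of frame center); if not,
    return normalised pan/tilt offsets for adjusting the zoom lens."""
    x, y, w, h = face_box
    dx = (x + w / 2) / img_w - 0.5     # horizontal offset from frame center
    dy = (y + h / 2) / img_h - 0.5     # vertical offset from frame center
    if abs(dx) <= tol and abs(dy) <= tol:
        return None                    # condition met: display the image as-is
    return dx, dy                      # pan right if dx > 0, tilt down if dy > 0

adj = framing_adjustment((1200, 300, 200, 260), 1920, 1080)
```

A face sitting right of center and above center yields a positive pan and negative tilt offset; a centered face yields `None`, i.e. no adjustment is needed.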
According to a second aspect, an embodiment of the present invention further provides a speaker recognition apparatus, including:
the acquisition module is used for acquiring sound source positioning information in a meeting place and a panoramic image acquired by a fixed-focus lens; the sound source positioning information is position information of a sound source relative to a sound source positioning module;
the first determining module is used for determining the position information of each face in the meeting place relative to the fixed-focus lens based on the position information of the face in the panoramic image to obtain first position information;
the conversion module is used for converting the sound source positioning information and the first position information into position information of the sound source and each human face in the meeting place relative to the zoom lens by utilizing the position relation between the sound source positioning module and the fixed-focus lens as well as the zoom lens, and respectively obtaining second position information and third position information;
and the second determining module is used for determining a speaker and the rotation information of the zoom lens according to the second position information and the third position information so that the zoom lens collects the image of the speaker.
According to the speaker recognition apparatus provided by this embodiment of the invention, position information is converted between coordinate systems using the positional relationships between the sound source localization module and the zoom lens and between the fixed-focus lens and the zoom lens, so that the speaker is recognized through the cooperation of the sound source localization module, the fixed-focus lens and the zoom lens. Because the recognition process mainly involves these conversions of position information, the accuracy and real-time performance of speaker recognition are guaranteed, and the camera can switch accurately to the speaker in the venue. Compared with the single fixed camera view of a traditional video conference, the speaker can be located and tracked as an object, increasing the intelligence of the video conference terminal.
According to a third aspect, an embodiment of the present invention provides an electronic device, including: a memory and a processor, the memory and the processor being communicatively connected to each other, the memory having stored therein computer instructions, and the processor executing the computer instructions to perform the speaker recognition method according to the first aspect or any one of the embodiments of the first aspect.
According to a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer instructions for causing a computer to execute the speaker identification method according to the first aspect or any one of the implementation manners of the first aspect.
According to a fifth aspect, an embodiment of the present invention provides a video conference system, including:
the sound source positioning module is used for determining sound source positioning information in a meeting place;
the fixed focus lens is used for acquiring panoramic images in the meeting place;
the zoom lens is used for collecting an image of a speaker;
the electronic device according to the third aspect of the present invention is connected to the sound source localization module, the fixed focus lens, and the zoom lens.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 shows a block diagram of a conference system in an embodiment of the invention;
fig. 2 is a flowchart of a speaker recognition method according to an embodiment of the present invention;
fig. 3 is a flowchart of a speaker recognition method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of sound source localization information relative to location information of a sound source localization module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of position information of a human face relative to a fixed-focus lens according to an embodiment of the invention;
FIG. 6 is a schematic diagram of position information of a human face relative to a zoom lens according to an embodiment of the present invention;
fig. 7 is a flowchart of a speaker recognition method according to an embodiment of the present invention;
figs. 8a-8b are schematic diagrams of images for determining the speaker in the zoom lens according to an embodiment of the invention;
fig. 9 is a block diagram of a structure of a speaker recognition apparatus according to an embodiment of the present invention;
fig. 10 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a conference system, as shown in fig. 1, the conference system includes: a sound source localization module 10, at least one fixed-focus lens 20, at least one zoom lens 30, and an electronic device 40. The sound source localization module 10, the at least one fixed-focus lens 20, the at least one zoom lens 30 and the electronic device 40 may be placed in the conference venue as required. Once placement is complete, the positional relationships among them may be stored in the electronic device 40, or stored elsewhere, in which case the electronic device retrieves the relevant positional relationships when needed.
The sound source positioning module 10, the at least one fixed-focus lens 20, and the at least one zoom lens 30 are all connected to the electronic device 40. The sound source positioning module 10 is configured to position a sound source in a meeting place, determine position information of the sound source relative to the sound source positioning module to obtain sound source positioning information, and send the sound source positioning information to the electronic device 40. The fixed-focus lens 20 is configured to collect a panoramic image in the venue and send the panoramic image to the electronic device 40, and the electronic device 40 determines, based on the position information of the faces in the panoramic image, position information of each face in the venue relative to the fixed-focus lens, so as to obtain first position information.
The electronic device 40 determines the speaker in the conference hall using the sound source localization information transmitted by the sound source localization module and the position information of each human face in the conference hall relative to the fixed-focus lens. After the speaker is determined, the zoom lens is controlled to collect an image of the speaker.
When more than one zoom lens is available, the electronic device may select a zoom lens for capturing the subsequent speaker image using the sound source localization information, or, after the speaker is determined, select a zoom lens for capturing the speaker's image using the speaker's position information, and so on.
In accordance with an embodiment of the present invention, there is provided a speaker recognition method embodiment, it is noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
In this embodiment, a speaker recognition method is provided, which can be used in the above-mentioned electronic device, such as a computer, a tablet, a server, etc., and fig. 2 is a flowchart of the speaker recognition method according to an embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:
and S11, acquiring sound source positioning information in the conference hall and the panoramic image collected by the fixed-focus lens.
The sound source positioning information is the position information of the sound source relative to the sound source positioning module.
The sound source positioning module collects sound sources in a meeting place in real time in the video conference to obtain sound source positioning information in the meeting place, the sound source positioning information is sent to the electronic equipment, and accordingly the electronic equipment can obtain the sound source positioning information in the meeting place. The sound source positioning information can be regarded as the coordinate information and/or the angle information of the sound source obtained by using the sound source positioning module as the origin of coordinates.
The fixed-focus lens collects the panoramic image in the meeting place in real time and sends the collected panoramic image to the electronic equipment, and accordingly the electronic equipment can obtain the panoramic image in the meeting place. The panoramic image comprises face images of all conference participants in the conference room.
And S12, determining the position information of each face in the conference hall relative to the fixed-focus lens based on the position information of the face in the panoramic image, and obtaining first position information.
After acquiring the panoramic image in the meeting place, the electronic equipment performs face recognition processing on the panoramic image to determine the position information of the face in the panoramic image. For example, the panoramic image may be input into a face recognition network, and output as position information of all faces in the panoramic image, where the position information of a face may be represented by an upper left-corner coordinate of an identification frame corresponding to the face and a width and a height of the identification frame, or may be represented by an upper left-corner coordinate and a lower right-corner coordinate of the identification frame corresponding to the face.
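The two bounding-box representations mentioned above (top-left corner plus width and height, versus top-left and bottom-right corners) are interchangeable; a small helper can convert between them (illustrative code, not part of the patent):

```python
def xywh_to_corners(box):
    """Convert (x, y, w, h) — top-left corner, width, height —
    into (x1, y1, x2, y2) — top-left and bottom-right corners."""
    x, y, w, h = box
    return x, y, x + w, y + h

corners = xywh_to_corners((640, 220, 80, 100))  # → (640, 220, 720, 320)
```

Either form suffices for the later steps, since both determine the face's center point and pixel height.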
Because the position information of the human face in the panoramic image depends on the position relationship between the human face and the fixed-focus lens, the electronic equipment can determine the position information of each human face in the conference hall relative to the fixed-focus lens by using the position information of the human face in the panoramic image to obtain the first position information. Details about this step will be described later.
And S13, converting the sound source positioning information and the first position information into the position information of the sound source and each human face in the meeting place relative to the zoom lens by using the sound source positioning module and the position relation between the fixed-focus lens and the zoom lens, and respectively obtaining second position information and third position information.
The electronic device acquires the position information of the sound source relative to the sound source localization module in S11, that is, acquires the position information of the sound source with the sound source localization module as the origin of coordinates; in S12, the electronic device obtains the position information of the face in the meeting place with respect to the fixed-focus lens, that is, the position information of the face in the meeting place with the fixed-focus lens as the origin of coordinates.
Because the position relationship between the sound source positioning module and the zoom lens is fixed, and the position relationship between the fixed-focus lens and the zoom lens is fixed, the electronic equipment can respectively convert the position information of the sound source relative to the sound source positioning module into the position information with the zoom lens as the origin of coordinates by utilizing the principle of coordinate translation to obtain second position information, and convert the first position information into the position information with the zoom lens as the origin of coordinates to obtain third position information. Wherein each face in the meeting place corresponds to a piece of third position information.
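The coordinate-translation principle described above amounts to subtracting the zoom lens's position from coordinates expressed in another device's frame. A minimal sketch, assuming the devices' axes are parallel (pure translation, no rotation) and using made-up distances:

```python
def translate(point, zoom_lens_offset):
    """Re-express `point` (given relative to the fixed-focus lens or the
    sound source localization module) relative to the zoom lens, whose
    position in the same frame is `zoom_lens_offset`."""
    return tuple(p - o for p, o in zip(point, zoom_lens_offset))

# Face at (1.0, 0.3, 2.5) m relative to the fixed-focus lens; the zoom
# lens sits 0.2 m along the x-axis from the fixed-focus lens.
third_pos = translate((1.0, 0.3, 2.5), (0.2, 0.0, 0.0))
```

The same call converts the sound source localization information into second position information, with the module-to-zoom-lens offset in place of the lens-to-lens offset. If the devices were mounted with different orientations, a rotation would also be needed; the patent's description only relies on the fixed positional relationships.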
And S14, determining the speaker and the rotation information of the zoom lens according to the second position information and the third position information, so that the zoom lens collects the image of the speaker.
The sound source originates from the speaker, so the error between the position of the sound source relative to the zoom lens and the position of the speaker's face relative to the zoom lens falls within a preset range. The electronic device can therefore determine the area of the venue where the speaker is located from the second position information and the third position information corresponding to each face, control the zoom lens to rotate and capture face images of that area, and finally confirm the image of the speaker.
This step will be described in detail below.
In the speaker recognition method provided by this embodiment, the conversion of position information is realized by using the position relationships among the sound source localization module, the fixed-focus lens and the zoom lens, so that the speaker is recognized through the cooperation of the sound source localization module, the fixed-focus lens and the zoom lens. Since the recognition process mainly handles the conversion of position information, the accuracy and real-time performance of speaker recognition are guaranteed, and accurate switching to the speaker in the meeting place is realized; compared with the single fixed camera view of the speaker in a traditional video conference, the method can switch to and track the speaker, increasing the intelligence of the video conference terminal.
In this embodiment, a speaker recognition method is provided, which can be used in the above-mentioned electronic device, such as a computer, a tablet, a server, etc., and fig. 3 is a flowchart of the speaker recognition method according to an embodiment of the present invention, as shown in fig. 3, the flowchart includes the following steps:
and S21, acquiring sound source positioning information in the conference hall and the panoramic image collected by the fixed-focus lens.
The sound source positioning information is the position information of the sound source relative to the sound source positioning module.
Fig. 4 shows a schematic diagram of the coordinates corresponding to the sound source localization information, where the origin O is the position of the sound source localization module and P(x, y, z) represents the coordinates of the sound source. θ is the horizontal angle of the sound, i.e., the angle between the direction of arrival and the horizontal line, with θ ranging from 0° to 180°; φ is the vertical angle of the sound, i.e., the angle between the direction of arrival and the vertical line, with φ ranging from 0° to 180°.
Please refer to S11 in fig. 2 for details, which are not described herein.
And S22, determining the position information of each face in the conference hall relative to the fixed-focus lens based on the position information of the face in the panoramic image, and obtaining first position information.
Specifically, the step S22 includes the following steps:
s221, determining the position relation between each face center point in the panoramic image and the center point of the panoramic image by using the position information of the face in the panoramic image.
As described above, after the electronic device acquires the panoramic image, the electronic device may identify the position information of the face in the panoramic image by using the face recognition network, and after the position information is obtained, the central point of each face in the panoramic image may be determined. Because the size of the panoramic image collected by the fixed-focus lens is fixed, the electronic equipment can determine the position of the central point of the panoramic image.
After the central point of each face in the panoramic image and the central point of the panoramic image are obtained by the electronic equipment, the vertical distance from the central point of each face in the panoramic image to the central point of the panoramic image and the horizontal distance from the central point of each face in the panoramic image to the central point of the panoramic image can be obtained through calculation.
For example, the position information of a face is expressed as (x, y, w, h), where (x, y) are the coordinates of the upper left corner of the face box, w is the width of the face, and h is the height of the face. The electronic equipment can obtain the face center point coordinates (FCenterX, FCenterY) from the face rectangular area:
FCenterX=x+w/2;
FCenterY=y+h/2。
the center point of the panoramic image is (CenterX, CenterY).
The horizontal distance between the center point of the face and the center point of the panoramic image is as follows: FCenterX-CenterX;
the vertical distance between the center point of the face and the center point of the panoramic image is: FCenterY - CenterY.
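The center-point arithmetic above can be sketched as follows. This is an illustrative sketch in Python; the function names and the (x, y, w, h) box convention are assumptions for illustration, not part of the patent text.

```python
def face_center(x, y, w, h):
    """Center of a face box given its top-left corner (x, y), width w and height h."""
    return (x + w / 2, y + h / 2)

def center_offsets(face_box, image_w, image_h):
    """Horizontal and vertical offsets of a face center from the panoramic image center."""
    fcx, fcy = face_center(*face_box)
    cx, cy = image_w / 2, image_h / 2  # panoramic image center (CenterX, CenterY)
    return (fcx - cx, fcy - cy)
```

For a face box (100, 50, 40, 60) in a 1920x1080 panoramic image, the face center is (120, 80) and the offsets from the image center (960, 540) are (-840, -460).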
S222, acquiring a field angle of the fixed focus lens and a parameter of the panoramic image.
The field angle of the fixed-focus lens includes a horizontal field angle of the panoramic lens and a vertical field angle of the panoramic lens, and the field angle of the fixed-focus lens may be stored in the electronic device or acquired from the outside by the electronic device. The parameters of the panoramic image are the height and the width of the panoramic image.
For example, horizontal field angle: Angle_H = 112.0°; vertical field angle: Angle_V = 76.0°.
And S223, determining first position information based on the angle of view of the fixed-focus lens, the parameters of the panoramic image and the position relationship between the center point of each face and the center point of the panoramic image.
As shown in fig. 5, α is a vertical angle of the face with respect to the fixed-focus lens, β is half of a vertical angle of view of the fixed-focus lens, F is a focal length of the fixed-focus lens, fh is a vertical distance between a center point of the face and a center point of the panoramic image, and H is a width of the panoramic image.
(1) And determining the angle information of each face in the meeting place relative to the fixed-focus lens by utilizing the field angle of the fixed-focus lens, the parameters of the panoramic image and the position relationship between the center point of each face and the center point of the panoramic image.
Fig. 5 shows a schematic diagram of calculating the vertical angle of the face with respect to the camera; according to the similar-triangle principle, the vertical angle of the face with respect to the fixed-focus lens is obtained by solving for α. As shown in fig. 5, according to the trigonometric functions:
tanα=fh/F,
tanβ=H/F,
tanα/tanβ=fh/H,
α=actan(tanβ*fh/H)。
wherein, β, fh and H are all known parameters, and the vertical angle α of the face relative to the fixed-focus lens can be calculated by using the formula. Similarly, the horizontal angle of the face relative to the fixed-focus lens can be obtained by using half of the horizontal field angle of the fixed-focus lens, the horizontal distance between the center point of the face and the center point of the panoramic image, and the height of the panoramic image.
(2) And calculating the focal length of the fixed-focus lens by using the parameters of the panoramic image and the field angle of the fixed-focus lens.
As shown in fig. 5, tanβ = H/F, so the focal length F of the fixed-focus lens is expressed as follows:
F=H/tanβ。
(3) and acquiring a preset face height.
The preset face height may be preset in the electronic device, or may be obtained by the electronic device externally. Since the height of a human face does not vary much, the heights of all faces in the panoramic image can be considered equal, estimated at 25 cm.
(4) And calculating the distance from each human face in the meeting place to the fixed-focus lens by utilizing the preset human face height, the focal length of the fixed-focus lens and the parameters of the panoramic image.
According to the principle of the similar triangle,
the height of the face/the actual distance from the face to the fixed-focus lens is equal to the width of the panoramic image/the focal length of the fixed-focus lens.
Accordingly, the following expression can be obtained:
h/d=H/F,
therefore, the actual distance d from the face to the fixed-focus lens can be obtained, namely d = h*F/H.
(5) And determining coordinate information of each face in the meeting place relative to the fixed-focus lens based on the distance from each face in the meeting place to the fixed-focus lens and the angle information of each face in the meeting place relative to the fixed-focus lens.
The actual distance from the face to the fixed-focus lens is taken as the y coordinate of the face, that is, y = h*F/H.
According to the trigonometric function, the x coordinate and the z coordinate of the face can be determined.
x = y/tan(180° - α)
z = y/(sin(180° - α) * tan(180° - α'))
wherein α' is the horizontal angle of the human face relative to the fixed-focus lens.
Therefore, the coordinate information (x, y, z) of each human face in the meeting place relative to the fixed-focus lens can be determined.
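The distance and coordinate computation above can be sketched as follows. This is a minimal sketch under the text's conventions; the 25 cm preset face height comes from the text, while the function name and parameter names are illustrative assumptions.

```python
import math

FACE_HEIGHT_CM = 25.0  # preset face height given in the text

def face_coordinates(alpha_deg, alpha_h_deg, F, H, face_h=FACE_HEIGHT_CM):
    """Coordinates of a face relative to the fixed-focus lens.

    d = face_h * F / H gives the distance, taken as the y coordinate;
    x and z then follow from the trigonometric relations in the text,
    with alpha the vertical and alpha' the horizontal angle of the face."""
    y = face_h * F / H
    x = y / math.tan(math.radians(180 - alpha_deg))
    z = y / (math.sin(math.radians(180 - alpha_deg)) *
             math.tan(math.radians(180 - alpha_h_deg)))
    return (x, y, z)
```

For example, with α = α' = 135° and F = H, the distance is y = 25 and the remaining coordinates follow from tan(45°) = 1 and sin(45°) = √2/2.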
After the angle information of the face in the meeting place relative to the fixed-focus lens is determined, the coordinate information of the face relative to the fixed-focus lens is determined using the preset face height, the focal length of the fixed-focus lens and the parameters of the panoramic image. Since these are all fixed values, using them to directly determine the coordinate information guarantees the accuracy and real-time performance of the calculation result.
And S23, converting the sound source positioning information and the first position information into the position information of the sound source and each human face in the meeting place relative to the zoom lens by using the sound source positioning module and the position relation between the fixed-focus lens and the zoom lens, and respectively obtaining second position information and third position information.
Specifically, the step S23 includes the following steps:
s231, converting the sound source positioning information into angle information of the sound source relative to the zoom lens by utilizing the position relation between the sound source positioning module and the zoom lens to obtain second position information.
Fig. 6 shows a schematic diagram of the converted position information of the sound source relative to the zoom lens. The electronic equipment obtains the coordinate information (x', y', z') of the sound source relative to the zoom lens by using the position relationship between the sound source localization module and the zoom lens, i.e., by the coordinate translation principle, and then obtains the angle information of the sound source relative to the zoom lens from this coordinate information. The details are as follows:

π - θ' = arctan(y'/x'),

where θ' is the horizontal angle information of the sound source relative to the zoom lens, and φ' is the vertical angle information of the sound source relative to the zoom lens, obtained from (x', y', z') by an analogous trigonometric relation [remaining formulas rendered as images in the original]. Thereby, the electronic equipment obtains the angle information of the sound source relative to the zoom lens, i.e., the second position information.
And S232, converting the first position information into coordinate information of the human face in the meeting field relative to the zoom lens by utilizing the position relation between the fixed-focus lens and the zoom lens.
The electronic device obtains, in S22, position information of each human face in the meeting place with respect to the fixed-focus lens, that is, coordinate information of each human face in the meeting place with respect to the fixed-focus lens. The electronic equipment converts the coordinate information of each human face in the scene relative to the fixed-focus lens into the coordinate information relative to the zoom lens by utilizing the position relation between the fixed-focus lens and the zoom lens.
And S233, determining angle information of each human face in the meeting place relative to the zoom lens based on the coordinate information of the human face in the meeting place relative to the zoom lens, and obtaining third position information.
Similar to the processing method of the sound source positioning information, after the electronic equipment determines the coordinate information of the faces in the meeting place relative to the zoom lens, the electronic equipment determines the angle information of each face in the meeting place relative to the zoom lens by using the relation of the trigonometric function, and third position information is obtained.
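The coordinate translation and the horizontal-angle relation π - θ' = arctan(y'/x') above can be sketched as follows. The mounting offset between origins is an assumed known quantity, and the function names are illustrative.

```python
import math

def translate(coords, offset):
    """Translate coordinates from one origin (e.g. the fixed-focus lens or the
    sound source localization module) to the zoom-lens origin by adding the
    known mounting offset of the old origin in zoom-lens coordinates."""
    x, y, z = coords
    ox, oy, oz = offset
    return (x + ox, y + oy, z + oz)

def horizontal_angle(x, y):
    """Horizontal angle theta' relative to the zoom lens, from pi - theta' = arctan(y/x)."""
    return math.degrees(math.pi - math.atan(y / x))
```

For example, a point at (1, 1) in the zoom-lens horizontal plane gives θ' = 180° - 45° = 135°.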
And S24, determining the speaker and the rotation information of the zoom lens according to the second position information and the third position information, so that the zoom lens collects the image of the speaker.
Please refer to S14 in fig. 2 for details, which are not described herein.
According to the speaker recognition method provided by this embodiment, the position information of the faces in the meeting place relative to the fixed-focus lens is determined based on the field angle of the fixed-focus lens. Since the processing involves only the field angle of the fixed-focus lens, the parameters of the panoramic image and the position relationships, the first position information can be determined with a small amount of data processing, which improves the real-time performance of speaker recognition.
In this embodiment, a speaker recognition method is provided, which can be used in the above-mentioned electronic device, such as a computer, a tablet, a server, etc., and fig. 7 is a flowchart of the speaker recognition method according to an embodiment of the present invention, as shown in fig. 7, the flowchart includes the following steps:
and S31, acquiring sound source positioning information in the conference hall and the panoramic image collected by the fixed-focus lens.
The sound source positioning information is the position information of the sound source relative to the sound source positioning module.
Please refer to S21 in fig. 3 for details, which are not described herein.
And S32, determining the position information of each face in the conference hall relative to the fixed-focus lens based on the position information of the face in the panoramic image, and obtaining first position information.
Please refer to S22 in fig. 3 for details, which are not described herein.
And S33, converting the sound source positioning information and the first position information into the position information of the sound source and each human face in the meeting place relative to the zoom lens by using the sound source positioning module and the position relation between the fixed-focus lens and the zoom lens, and respectively obtaining second position information and third position information.
Please refer to S23 in fig. 3 for details, which are not described herein.
And S34, determining the speaker and the rotation information of the zoom lens according to the second position information and the third position information, so that the zoom lens collects the image of the speaker.
Specifically, the step S34 includes the following steps:
s341, an absolute value of a difference between the second position information and the third position information is calculated.
The second position information is used for representing angle information of the sound source relative to the zoom lens, and the third position information is used for representing angle information of a human face in the meeting place relative to the zoom lens. The electronic device calculates the absolute value of the difference between the two angle information.
And S342, judging whether the absolute value of the difference value is within a preset range.
When the absolute value of the difference is within the preset range, that is, a face is detected at the speaking position, S343 is executed; otherwise, the position information of the next face relative to the zoom lens, i.e., the next third position information, is extracted and S341 is executed.
And S343, determining the face corresponding to the sound source.
It should be noted that the number of detected faces corresponding to the sound source may be 1 or more than 1.
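The matching of S341–S343 can be sketched as follows. The tolerance value is an illustrative assumption; the patent only says the absolute difference is compared against a preset range.

```python
def faces_matching_source(source_angles, face_angles_list, tol_deg=10.0):
    """Return the faces whose angle difference to the sound source is within tol_deg.

    source_angles and each entry of face_angles_list are (horizontal, vertical)
    angle pairs relative to the zoom lens; tol_deg is an assumed preset range."""
    matches = []
    for face in face_angles_list:
        dh = abs(face[0] - source_angles[0])
        dv = abs(face[1] - source_angles[1])
        if dh <= tol_deg and dv <= tol_deg:
            matches.append(face)
    return matches
```

A face at (95°, 48°) matches a sound source at (90°, 45°) within a 10° tolerance, while a face at (130°, 45°) does not.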
And S344, determining the speaker and the rotation information of the zoom lens based on the face corresponding to the sound source, so that the zoom lens collects the image of the speaker.
After the face corresponding to the sound source is determined, the electronic equipment determines the rotation information of the zoom lens by using the angle information of the face corresponding to the sound source relative to the zoom lens. After the electronic equipment determines the rotation information of the zoom lens, the holder of the zoom lens rotates to collect the image of the speaker. Specifically, the step S344 includes the following steps:
(1) and determining the rotation information of the zoom lens based on the angle information of the face corresponding to the sound source relative to the zoom lens.
When the number of faces corresponding to the sound source is 1, the rotation information of the zoom lens is determined directly from the angle information of that face relative to the zoom lens; when the number of faces corresponding to the sound source is greater than 1, the rotation information of the zoom lens can be determined statistically, for example, the center points of all the faces corresponding to the sound source can be averaged, and the resulting center point used to determine the rotation information of the zoom lens.
(2) And acquiring the image collected by the rotated zoom lens.
And the holder of the zoom lens rotates based on the rotation information determined in the step, and the zoom lens is used for collecting a corresponding image and sending the image to the electronic equipment.
When the zoom lens does not detect a human face, the zoom lens zooms out; if 1-3 human faces are detected, the following step (3) is executed directly; if more than 3 faces are detected, the zoom lens zooms in to magnify the faces.
(3) And determining the speaker based on the position of the face in the acquired image in the image acquired by the zoom lens.
As shown in fig. 8a, the image acquired by the zoom lens contains 3 faces, namely face A, face B and face C. The electronic equipment calculates the distance between the center point of each face and the center point of the image (i.e., the center point of the screen), and determines the face with the smallest distance as the face of the speaker, shown as face B in fig. 8a.
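The nearest-to-center selection above can be sketched as follows; the function name and the (x, y, w, h) box convention are illustrative assumptions.

```python
import math

def pick_speaker(face_boxes, image_w, image_h):
    """Among detected face boxes, pick the one whose center is closest to the image center."""
    cx, cy = image_w / 2, image_h / 2
    def dist(box):
        x, y, w, h = box
        # Euclidean distance from face center to image center
        return math.hypot(x + w / 2 - cx, y + h / 2 - cy)
    return min(face_boxes, key=dist)
```

For a 1920x1080 frame, a box centered exactly on (960, 540) wins over boxes near the corners, mirroring the choice of face B in fig. 8a.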
(4) And controlling the image of the speaker collected by the zoom lens.
After the electronic equipment determines that the face B is the face of the speaker, the zoom lens can be controlled to collect the image of the speaker, and the zoom lens can be finely adjusted first and then the image of the speaker can be collected.
Collecting images with the rotated zoom lens, determining the speaker again using the face positions in the collected image, and finally collecting the image of the speaker with the zoom lens improves the accuracy of the collected speaker image.
Specifically, the step (4) includes the steps of:
4.1) acquiring an image of the speaker.
As shown in fig. 8a, the electronic device acquires an image of the speaker captured by the zoom lens.
4.2) judging whether the position of the face area of the speaker in the image of the speaker meets a preset condition or not.
As shown in fig. 8a, the electronic device may calculate the proportion of the face area of the speaker in the entire image and the offset between the center point of the face and the center point of the entire image, and determine whether the face area satisfies the preset condition. For example, by calculating whether the center point and the width and height of the face B meet preset conditions, when the position of the face area of the speaker in the image of the speaker does not meet the preset conditions, 4.3) is executed; otherwise, 4.4) is executed.
4.3) adjusting the zoom lens to adjust the position of the face area of the speaker in the image of the speaker.
When the position of the face area of the speaker in the image of the speaker does not meet the preset condition, the electronic equipment adjusts the zoom lens, namely, the zoom lens is finely adjusted, so that the face of the speaker is accurately positioned.
4.4) determining the image of the speaker in the zoom lens.
After the zoom lens is fine-tuned, as shown in fig. 8b, the image of the speaker in the zoom lens, i.e., face B, is determined.
4.5) displaying the image of the speaker in the zoom lens.
After the image of the speaker in the zoom lens is determined, the image in the zoom lens is switched to be output, namely, the current image of the speaker is displayed.
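The preset-condition check of step 4.2) can be sketched as follows. The threshold values are illustrative assumptions; the patent only says the face-area proportion and the center-point offset are compared against preset conditions.

```python
def needs_adjustment(face_box, image_w, image_h,
                     min_ratio=0.05, max_ratio=0.4, max_offset_ratio=0.1):
    """True if the speaker's face region fails the preset condition and the
    zoom lens should be fine-tuned (step 4.3); thresholds are assumed values."""
    x, y, w, h = face_box
    ratio = (w * h) / (image_w * image_h)            # face-area proportion of the image
    off_x = abs(x + w / 2 - image_w / 2) / image_w   # normalized horizontal offset
    off_y = abs(y + h / 2 - image_h / 2) / image_h   # normalized vertical offset
    return not (min_ratio <= ratio <= max_ratio
                and off_x <= max_offset_ratio and off_y <= max_offset_ratio)
```

A well-centered face occupying a reasonable share of the frame passes; an off-center face triggers fine-tuning of the zoom lens.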
In the speaker recognition method provided by this embodiment, the second position information corresponding to the sound source is compared with the third position information corresponding to each face by taking their difference; if the absolute value of the error is within the preset range, the speaker is determined. Comparing the sound source with the face images acquired by the fixed-focus lens improves the accuracy of speaker recognition, and fine-tuning the zoom lens accurately positions the speaker and ensures an optimal face proportion in the zoom-lens image of the speaker.
In this embodiment, a speaker recognition apparatus is further provided, and the apparatus is used to implement the foregoing embodiments and preferred embodiments, and the description already made is omitted. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
The present embodiment provides a speaker recognition apparatus, as shown in fig. 9, including:
an obtaining module 41, configured to obtain sound source positioning information in a meeting place and a panoramic image collected by a fixed-focus lens; the sound source positioning information is the position information of the sound source relative to the sound source positioning module.
A first determining module 42, configured to determine, based on the position information of the faces in the panoramic image, position information of each face in the meeting place relative to the fixed-focus lens, so as to obtain first position information.
A conversion module 43, configured to convert, by using the sound source positioning module and the positional relationship between the fixed-focus lens and the zoom lens, the sound source positioning information and the first position information into position information of the sound source and each human face in the conference hall relative to the zoom lens, so as to obtain second position information and third position information, respectively.
A second determining module 44, configured to determine, according to the second position information and the third position information, a speaker and rotation information of the zoom lens, so that the zoom lens collects an image of the speaker.
The speaker recognition device provided by this embodiment realizes the conversion of position information by using the position relationships among the sound source localization module, the fixed-focus lens and the zoom lens, thereby recognizing the speaker through the cooperation of the fixed-focus lens and the zoom lens. Since the recognition process mainly handles the conversion of position information, the accuracy and real-time performance of speaker recognition are guaranteed, and accurate switching to the speaker in the meeting place is realized; compared with the single fixed camera view of the speaker in a traditional video conference, the device can switch to and track the speaker, increasing the intelligence of the video conference terminal.
The speaker recognition means in this embodiment is presented as a functional unit, where the unit refers to an ASIC, a processor and memory executing one or more software or firmware programs, and/or other devices that may provide the above-described functionality.
Further functional descriptions of the modules are the same as those of the corresponding embodiments, and are not repeated herein.
An embodiment of the present invention further provides an electronic device, which includes the speaker recognition apparatus shown in fig. 9.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an electronic device according to an alternative embodiment of the present invention, as shown in fig. 10, the electronic device may include: at least one processor 51, such as a CPU (Central Processing Unit), at least one communication interface 53, memory 54, at least one communication bus 52. Wherein a communication bus 52 is used to enable the connection communication between these components. The communication interface 53 may include a Display (Display) and a Keyboard (Keyboard), and the optional communication interface 53 may also include a standard wired interface and a standard wireless interface. The Memory 54 may be a high-speed RAM Memory (volatile Random Access Memory) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The memory 54 may alternatively be at least one memory device located remotely from the processor 51. Wherein the processor 51 may be in connection with the apparatus described in fig. 9, the memory 54 stores an application program, and the processor 51 calls the program code stored in the memory 54 for performing any of the above-mentioned method steps.
The communication bus 52 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The communication bus 52 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 10, but this is not intended to represent only one bus or type of bus.
The memory 54 may include a volatile memory, such as a random-access memory (RAM); the memory may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 54 may also comprise a combination of the above types of memories.
The processor 51 may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP.
The processor 51 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a General Array Logic (GAL), or any combination thereof.
Optionally, the memory 54 is also used to store program instructions. The processor 51 may invoke program instructions to implement the speaker recognition method as shown in the embodiments of fig. 2, 3 and 7 of the present application.
Embodiments of the present invention further provide a non-transitory computer storage medium, where the computer storage medium stores computer-executable instructions, and the computer-executable instructions may execute the speaker recognition method in any of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also comprise a combination of memories of the above kinds.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (11)

1. A speaker recognition method, comprising:
acquiring sound source positioning information in a meeting place and a panoramic image acquired by a fixed-focus lens; the sound source positioning information is position information of a sound source relative to a sound source positioning module;
determining the position information of each face in the meeting place relative to the fixed-focus lens based on the position information of the face in the panoramic image to obtain first position information;
converting the sound source positioning information and the first position information into position information of the sound source and each face in the meeting place relative to the zoom lens by using the sound source positioning module and the position relationship between the fixed-focus lens and the zoom lens to respectively obtain second position information and third position information;
and determining a speaker and rotation information of the zoom lens according to the second position information and the third position information, so that the zoom lens collects an image of the speaker.
2. The method according to claim 1, wherein the determining position information of each human face in the meeting place relative to the prime lens based on the position information of the human face image in the panoramic image to obtain first position information comprises:
determining the position relation between each face center point in the panoramic image and the panoramic image center point by using the position information of the face in the panoramic image;
acquiring the field angle of the fixed-focus lens and the parameters of the panoramic image;
and determining the first position information based on the field angle of the fixed-focus lens, the parameters of the panoramic image and the position relationship between the center point of each face and the center point of the panoramic image.
3. The method according to claim 2, wherein the determining the first position information based on the field angle of the fixed-focus lens, the parameters of the panoramic image, and the position relationship between the center point of each human face and the center point of the panoramic image comprises:
determining angle information of each face in the meeting place relative to the fixed focus lens by using the field angle of the fixed focus lens, the parameters of the panoramic image and the position relationship between the center point of each face and the center point of the panoramic image;
calculating the focal length of the fixed-focus lens by using the parameters of the panoramic image and the field angle of the fixed-focus lens;
acquiring a preset face height;
calculating the distance from each human face in the meeting place to the fixed focus lens by using the preset human face height, the focal length of the fixed focus lens and the parameters of the panoramic image;
and determining coordinate information of each face in the meeting place relative to the fixed-focus lens based on the distance from each face in the meeting place to the fixed-focus lens and the angle information of each face in the meeting place relative to the fixed-focus lens.
4. The method according to claim 1, wherein converting the sound source positioning information and the first position information into position information of the sound source and of each face in the conference hall relative to the zoom lens, to obtain second position information and third position information respectively, comprises:
converting the sound source positioning information into angle information of the sound source relative to the zoom lens, using the positional relationship between the sound source positioning module and the zoom lens, to obtain the second position information;
converting the first position information into coordinate information of each face in the conference hall relative to the zoom lens, using the positional relationship between the fixed-focus lens and the zoom lens;
and determining angle information of each face in the conference hall relative to the zoom lens based on the coordinate information of each face relative to the zoom lens, to obtain the third position information.
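The coordinate conversion in claim 4 amounts to a change of reference frame. As a hedged sketch (function and parameter names are hypothetical, and the two optical axes are assumed parallel), a face position known in the fixed-focus lens frame can be translated into the zoom lens frame and reduced to an angle:

```python
import math

def to_zoom_frame(face_xy_fixed, zoom_offset_xy):
    """Sketch of claim 4 (names hypothetical): re-express a face position,
    given in the fixed-focus lens frame, in the zoom lens frame.
    zoom_offset_xy is the zoom lens position in the fixed-lens frame;
    y points along the (assumed parallel) optical axes."""
    x, y = face_xy_fixed
    ox, oy = zoom_offset_xy
    # Translate coordinates into the zoom lens frame.
    xz, yz = x - ox, y - oy
    # Angle of the face relative to the zoom lens optical axis.
    angle_deg = math.degrees(math.atan2(xz, yz))
    return (xz, yz), angle_deg
```

A face directly ahead of the zoom lens yields an angle of 0; the returned angle is what claim 4 calls the third position information.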
5. The method according to any one of claims 1 to 4, wherein the position information of the sound source and of each face in the conference hall relative to the zoom lens comprises angle information of the sound source and of each face relative to the zoom lens; and wherein determining a speaker and rotation information of the zoom lens according to the second position information and the third position information, so that the zoom lens captures an image of the speaker, comprises:
calculating the absolute value of the difference between the second position information and the third position information;
determining whether the absolute value of the difference is within a preset range;
when the absolute value of the difference is within the preset range, determining the face corresponding to the sound source;
and determining the speaker and the rotation information of the zoom lens based on the face corresponding to the sound source, so that the zoom lens captures an image of the speaker.
6. The method according to claim 5, wherein determining the speaker and the rotation information of the zoom lens based on the face corresponding to the sound source, so that the zoom lens captures the image of the speaker, comprises:
determining the rotation information of the zoom lens based on the angle information, relative to the zoom lens, of the face corresponding to the sound source;
acquiring an image captured by the zoom lens after rotation;
determining the speaker based on the position of the face in the image captured by the zoom lens;
and controlling the zoom lens to capture the image of the speaker.
7. The method according to claim 6, wherein controlling the zoom lens to capture the image of the speaker comprises:
acquiring the image of the speaker;
determining whether the position of the speaker's face region in the image satisfies a preset condition;
when the position of the speaker's face region in the image does not satisfy the preset condition, adjusting the zoom lens so as to adjust the position of the face region in the image, and determining the image of the speaker in the zoom lens;
and displaying the image of the speaker captured by the zoom lens.
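The "preset condition" check of claim 7 is typically a framing test: is the face region near the frame center? The sketch below is one plausible reading, with a hypothetical 10% tolerance and a normalized pan/tilt correction as the adjustment signal; none of these specifics come from the claim itself.

```python
def framing_adjustment(face_box, frame_w, frame_h, tol=0.1):
    """Sketch of claim 7 (tolerance and output format hypothetical):
    test whether the speaker's face region sits near the frame center,
    and return a normalized pan/tilt correction when it does not."""
    x, y, w, h = face_box
    cx, cy = x + w / 2, y + h / 2
    # Offsets of the face center from the frame center, normalized to [-1, 1].
    dx = (cx - frame_w / 2) / (frame_w / 2)
    dy = (cy - frame_h / 2) / (frame_h / 2)
    if abs(dx) <= tol and abs(dy) <= tol:
        return None  # framing already satisfies the preset condition
    return dx, dy    # drive pan/tilt proportionally to re-center the face
```

A controller would call this each frame and stop adjusting once None is returned, keeping the speaker centered as they move.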
8. A speaker recognition apparatus, comprising:
an acquisition module, configured to acquire sound source positioning information in a conference hall and a panoramic image captured by a fixed-focus lens, the sound source positioning information being position information of a sound source relative to a sound source positioning module;
a first determining module, configured to determine position information of each face in the conference hall relative to the fixed-focus lens based on the position information of the faces in the panoramic image, to obtain first position information;
a conversion module, configured to convert the sound source positioning information and the first position information into position information of the sound source and of each face in the conference hall relative to a zoom lens, using the positional relationships of the sound source positioning module and the fixed-focus lens with the zoom lens, to obtain second position information and third position information respectively;
and a second determining module, configured to determine a speaker and rotation information of the zoom lens according to the second position information and the third position information, so that the zoom lens captures an image of the speaker.
9. An electronic device, comprising:
a memory and a processor communicatively connected to each other, the memory storing computer instructions, and the processor executing the computer instructions to perform the speaker recognition method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing computer instructions for causing a computer to perform the speaker recognition method according to any one of claims 1 to 7.
11. A video conferencing system, comprising:
a sound source positioning module, configured to determine sound source positioning information in a conference hall;
a fixed-focus lens, configured to capture a panoramic image of the conference hall;
a zoom lens, configured to capture an image of a speaker;
and the electronic device according to claim 9, the electronic device being connected to the sound source positioning module, the fixed-focus lens, and the zoom lens.
CN202010923630.1A 2020-09-04 2020-09-04 Speaker recognition method, device, electronic equipment, storage medium and system Pending CN112084929A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010923630.1A CN112084929A (en) 2020-09-04 2020-09-04 Speaker recognition method, device, electronic equipment, storage medium and system


Publications (1)

Publication Number Publication Date
CN112084929A true CN112084929A (en) 2020-12-15

Family

ID=73732359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010923630.1A Pending CN112084929A (en) 2020-09-04 2020-09-04 Speaker recognition method, device, electronic equipment, storage medium and system

Country Status (1)

Country Link
CN (1) CN112084929A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101442654A (en) * 2008-12-26 2009-05-27 深圳华为通信技术有限公司 Method, apparatus and system for switching video object of video communication
CN107948577A (en) * 2017-12-26 2018-04-20 深圳市保千里电子有限公司 A kind of method and its system of panorama video conference
CN109413359A (en) * 2017-08-16 2019-03-01 华为技术有限公司 Camera tracking method, device and equipment
CN110062200A (en) * 2018-01-19 2019-07-26 浙江宇视科技有限公司 Video monitoring method, device, web camera and storage medium


Cited By (2)

Publication number Priority date Publication date Assignee Title
CN113473061A (en) * 2021-06-10 2021-10-01 荣耀终端有限公司 Video call method and electronic equipment
CN113473061B (en) * 2021-06-10 2022-08-12 荣耀终端有限公司 Video call method and electronic equipment

Similar Documents

Publication Publication Date Title
WO2020199906A1 (en) Facial keypoint detection method, apparatus and device, and storage medium
WO2021208371A1 (en) Multi-camera zoom control method and apparatus, and electronic system and storage medium
US8749607B2 (en) Face equalization in video conferencing
WO2017215295A1 (en) Camera parameter adjusting method, robotic camera, and system
US20230132407A1 (en) Method and device of video virtual background image processing and computer apparatus
US20230021863A1 (en) Monitoring method, electronic device and storage medium
CN111260313A (en) Speaker identification method, conference summary generation method, device and electronic equipment
JP2019186929A (en) Method and device for controlling camera shooting, intelligent device, and storage medium
CN111263106B (en) Picture tracking method and device for video conference
TW201901527A (en) Video conference and video conference management method
WO2018153313A1 (en) Stereoscopic camera and height acquisition method therefor and height acquisition system
CN101511004A (en) Method and apparatus for monitoring camera shot
CN111815715B (en) Calibration method and device of zoom pan-tilt camera and storage medium
JPWO2015186519A1 (en) Image processing apparatus and image display apparatus
CN112257669A (en) Pedestrian re-identification method and device and electronic equipment
JP2021174554A (en) Image depth determination method and living creature recognition method, circuit, device, storage medium
CN109104588B (en) Video monitoring method, equipment, terminal and computer storage medium
WO2020103068A1 (en) Joint upper-body and face detection using multi-task cascaded convolutional networks
CN114640833A (en) Projection picture adjusting method and device, electronic equipment and storage medium
EP4135314A1 (en) Camera-view acoustic fence
CN112084929A (en) Speaker recognition method, device, electronic equipment, storage medium and system
CN111343360B (en) Correction parameter obtaining method
CN112328150B (en) Automatic screenshot method, device and equipment, and storage medium
CN114363522A (en) Photographing method and related device
CN112702513A (en) Double-optical-pan-tilt cooperative control method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201215