CN117011924B - Method and system for estimating number of speakers based on voice and image - Google Patents


Info

Publication number
CN117011924B
Authority
CN
China
Prior art keywords
vector
noise
frequency domain
speaker
data
Prior art date
Legal status
Active
Application number
CN202311278365.6A
Other languages
Chinese (zh)
Other versions
CN117011924A (en)
Inventor
白炳潮
宛敏红
宋伟
朱世强
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202311278365.6A priority Critical patent/CN117011924B/en
Publication of CN117011924A publication Critical patent/CN117011924A/en
Application granted granted Critical
Publication of CN117011924B publication Critical patent/CN117011924B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 40/166 Detection; Localisation; Normalisation using acquisition arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

A speaker number estimation method and system based on voice and image. The method includes: acquiring image data and microphone array data; detecting the number of faces in the image; generating a one-hot vector from the number of faces; calculating a frequency domain data vector from the microphone array data; inputting the frequency domain data vector into a noise estimation neural network to estimate a noise vector in a noise embedding space; inputting the noise vector and the frequency domain data vector into a neural network to estimate a human voice vector in a human voice embedding space; combining the one-hot vector of the face count and the voice vector into a mixed vector; passing the mixed vector through a plurality of fully connected layers; outputting the fully connected result to the speaker number embedding space to estimate a speaker number vector; and inputting the speaker number vector into a softmax classifier to estimate the number of speakers. The invention improves the accuracy and anti-interference capability of speaker number estimation in noisy environments.

Description

Method and system for estimating number of speakers based on voice and image
Technical Field
The invention belongs to the technical field of multi-modal voice and image signal processing, with applications such as video conferencing, sound source localization, and speaker number estimation, and particularly relates to a speaker number estimation method and system based on voice and image.
Background
In application scenarios involving image and audio signal processing, speaker direction localization and speaker number estimation are often required. In video conferencing, sound source localization is frequently an indispensable function, and its premise is knowing the number of sound sources; however, estimating the number of speakers has long been a difficult problem, and the accuracy of existing estimators often falls short of expectations. "A speaker number estimation method based on a DNN model and a support vector machine model" (publication number: CN201710123753.5) estimates the number of speakers from the voice signal alone; it neither considers the influence of noise on the estimate nor combines image information to improve accuracy. "A method and system for estimating the number of competing speakers based on a speaker embedding space" (publication number: CN202010009945.5) extracts information useful for speaker estimation from amplitude and phase signals via an embedding-space technique, and finally derives the speaker count from the speaker embedding space. Although this scheme uses the amplitude and phase information of the microphone array and mines information useful for speaker number estimation from it, it does not account for the inherent limitation that the number of array elements places on speaker number estimation, namely that the number of resolvable sound sources cannot exceed the number of microphones, and it likewise does not combine image information when estimating the speaker count.
In view of the above, there is currently no method that fully utilizes both image and voice information to improve the accuracy of speaker number estimation.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a speaker number estimation method and system based on voice and image. It fully exploits multi-modal image and voice information, processed by neural networks, to improve the accuracy of speaker number estimation.
In order to solve the above problems, the present invention provides a speaker number estimation method based on voice and image, comprising:
acquiring image data and microphone array data;
detecting the number of faces in an image;
generating one-hot vectors according to the number of faces;
calculating a frequency domain data vector using the microphone array data;
inputting the frequency domain data vector into a noise estimation neural network to estimate a noise vector in a noise embedding space;
inputting the noise vector and the frequency domain data vector into a neural network to estimate a human voice vector in a human voice embedding space;
combining one-hot vectors of the number of faces and the voice vectors into a mixed vector;
passing the mixed vector through a plurality of full connection layers;
outputting the full connection layer result to a speaker number embedding space, and estimating a speaker number vector;
finally, the speaker number vector is input into a softmax classifier, and the speaker number is estimated.
Wherein the step of acquiring image data and microphone array data comprises: and acquiring microphone array voice data and single-frame image data of the camera by using the camera and the microphone array.
The step of detecting the number of faces in the image comprises the following steps: and detecting faces in the image by using an image detection correlation model and an algorithm, and counting the number of the faces.
Wherein, generating a one-hot vector according to the number of faces comprises: and generating a one-hot vector according to the number of the faces, wherein the dimension of the one-hot vector is equal to the number M of the microphones of the microphone array, and if no face is detected, the subsequent steps are not started.
Wherein the step of calculating the frequency domain data vector using the microphone array data includes: the microphone data of each channel is converted into a frequency domain, then frequency domain signals in a designated frequency domain range are selected, and the frequency domain signals of a plurality of channels are spliced into a frequency domain data vector.
Wherein inputting the frequency domain data vector into the noise estimation neural network to estimate a noise vector in the noise embedding space comprises: and inputting the frequency domain data vector into a plurality of full-connection layers for dimension reduction, inputting the result of the full-connection layers into a plurality of GRU layers, and finally inputting the result of the GRU layers into a noise embedding space estimation noise vector.
The step of inputting the noise vector and the frequency domain data vector into the neural network to estimate the human voice vector in the human voice embedding space comprises: splicing the noise vector and the frequency domain data vector into one vector, passing it through a plurality of fully connected layers, and inputting the fully connected result into the voice embedding space to estimate the voice vector.
The method for fusing the one-hot vector of the number of faces and the voice vector into a mixed vector comprises the following steps: and splicing the one-hot vector of the number of faces and the voice vector into a mixed vector.
Wherein passing the mixed vector through the multi-layer fully connected layers and outputting the result to the speaker number embedding space to estimate the speaker number vector comprises: passing the mixed vector through a plurality of fully connected layers, and outputting the fully connected result to the speaker number embedding space to estimate the speaker number vector.
Finally, inputting the speaker number vector into a softmax classifier to estimate the number of speakers, wherein the method comprises the following steps: the speaker number vector is input into a softmax classifier, the speaker number is obtained according to the output of the softmax and the set probability threshold value, and the dimension of the output vector of the softmax is the array element number M.
A second aspect of the present invention relates to a speaker number estimation system based on voice and image, comprising:
the data acquisition module is used for acquiring image data and microphone array data;
the face quantity detection module is used for detecting the quantity of faces in the image;
the one-hot vector generation module is used for generating one-hot vectors according to the number of faces;
a frequency domain data vector calculation module that calculates a frequency domain data vector using the microphone array data;
the noise vector estimation module inputs the frequency domain signal into a noise estimation neural network to estimate noise vectors in the noise embedding space;
the human voice vector estimation module inputs the noise vector and the frequency domain signal into the neural network to estimate the human voice vector of the human voice embedded space;
the vector fusion module fuses the one-hot vectors of the number of faces and the voice vectors into a mixed vector;
the full-connection module is used for enabling the mixed vector to pass through a plurality of full-connection layers;
the speaker number vector estimation module is used for inputting the full connection layer result into the speaker number embedding space and estimating the speaker number vector;
the speaker number estimating module inputs the speaker number vector into the softmax classifier to estimate the speaker number.
A third aspect of the present invention relates to a speech and image based speaker number estimation apparatus comprising a memory and one or more processors, wherein the memory stores executable code and the one or more processors, when executing the executable code, implement the speech and image based speaker number estimation method of the present invention.
A fourth aspect of the present invention relates to a computer readable storage medium having stored thereon a program which, when executed by a processor, implements a method of estimating the number of speakers based on speech and images of the present invention.
The innovation points of the invention are as follows: and the accuracy of speaker number estimation in a noise environment is improved by fusion processing of the audio and image information.
The working principle of the invention is as follows: neural networks and embedding-space techniques extract, from the audio and image information, the face-count vector and voice vector relevant to speaker number estimation, which strengthens robustness to noise; fusing the face-count vector and the voice vector fully combines image and voice information for speaker number estimation; finally, estimating the speaker count with the neural network and the embedding space improves the accuracy of speaker number estimation.
The beneficial effects of the invention are as follows: the multi-channel voice information of the microphone array and the image information of the camera are fully utilized, and the information which is beneficial to estimating the number of speakers is extracted from the voice and the image information by using a neural network and an embedded space technology; the neural network is utilized to estimate the noise vector in the noise embedding space, and the neural network and the voice embedding space are utilized to extract the voice vector from the noise vector and the frequency domain data vector for voice quantity estimation, so that the accuracy and the anti-interference capability of speaker quantity estimation in the noise environment are improved; the method has the advantages that the one-hot vector and the voice vector of the number of faces are fused, the image and voice information are fully combined for speaker number estimation, the neural network and the embedded space are utilized for estimating the number of speakers, the complementarity of the voice and the image on speaker number estimation is fully utilized, the reliability of speaker number estimation is enhanced, and a reliable solution is provided for speaker number estimation in a noise environment.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic flow diagram of an embodiment of the method of the present invention;
FIG. 2 is a schematic diagram of the main model structure of an embodiment of the method of the present invention;
FIG. 3 is a schematic diagram of a terminal structure of the present invention;
FIG. 4 is a schematic diagram of the computer readable storage medium structure of the present invention;
fig. 5 is a system configuration diagram of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
Example 1
Referring to fig. 1, fig. 1 is a flow chart of an embodiment of a speaker number estimation method based on voice and image, as shown in fig. 1, the method includes the following steps:
step S11: image data and microphone array sample data are acquired.
On a voice and image interaction terminal device, image data and multi-channel voice data are acquired in real time through a camera and a microphone array, with the image data and voice data time-synchronized. The microphone array may be a one-dimensional array such as a linear array, or a two-dimensional array such as an equilateral triangle array, a T-shaped array, a uniform circular array, a uniform square array, a coaxial circular array, or a circular/rectangular planar array.
Step S12: the number of faces in the image is detected.
Faces in the image are detected with a face detection algorithm and counted. The face detection algorithm may employ an open-source YOLO-series detector. If no face is detected, the subsequent steps are not entered.
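As a minimal sketch of this counting step, the snippet below assumes a YOLO-style face detector has already produced (box, score) pairs; the pair format, the helper name `count_faces`, and the 0.5 confidence threshold are illustrative assumptions, not part of the patent.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), a common detector box format

def count_faces(detections: List[Tuple[Box, float]],
                conf_threshold: float = 0.5) -> int:
    """Count face detections whose confidence clears the threshold."""
    return sum(1 for _box, score in detections if score >= conf_threshold)

# Hypothetical detector output: one confident face, one low-confidence box.
dets = [((10, 20, 50, 60), 0.9), ((80, 15, 120, 70), 0.4)]
num_faces = count_faces(dets)  # -> 1; a count of zero would halt the pipeline
```

A zero count corresponds to the "no face detected" branch, after which the later steps are skipped.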
Step S13: generating a one-hot vector of the number of faces.
A one-hot vector is generated from the face count; the dimension of the one-hot vector is the number of array elements M. If one face is detected, the one-hot vector is v = [1, 0, ..., 0]; if two faces are detected, v = [0, 1, 0, ..., 0]; if M faces are detected, v = [0, ..., 0, 1]; if more than M faces are detected, v is likewise [0, ..., 0, 1].
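A sketch of this encoding (the clamping of counts above M to the last position follows the text; the helper name and use of NumPy are our own):

```python
from typing import Optional

import numpy as np

def face_count_one_hot(num_faces: int, M: int) -> Optional[np.ndarray]:
    """One-hot encode the face count into an M-dim vector, where M is the
    number of array elements: 1 face -> index 0, ..., M or more -> index M-1.
    Returns None when no face is detected, so later steps can be skipped."""
    if num_faces <= 0:
        return None
    v = np.zeros(M)
    v[min(num_faces, M) - 1] = 1.0
    return v
```

For example, with M = 4 microphones, `face_count_one_hot(2, 4)` yields `[0, 1, 0, 0]`.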
Step S14, calculating a frequency domain data vector using the microphone array data.
The multi-channel voice signal of the microphone array is converted into the frequency domain, frequency domain data in a specified range is selected, such as 500 Hz to 2000 Hz, and the frequency domain data of the multiple channels are spliced into a frequency domain data vector X = [X_1, X_2, ..., X_M], where X_i denotes the frequency domain data of the i-th microphone.
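The splicing can be sketched as below; taking magnitudes of the FFT bins is an assumption (the patent only specifies "frequency domain" data in the 500 Hz to 2000 Hz range), and the frame length and sample rate are illustrative.

```python
import numpy as np

def frequency_domain_vector(frames, fs, f_lo=500.0, f_hi=2000.0):
    """Concatenate per-channel band-limited spectra into X = [X_1, ..., X_M].

    frames: (M, N) array holding one time-domain frame per microphone.
    Each channel is transformed with an FFT, the bins inside [f_lo, f_hi] Hz
    are selected, and all M channel spectra are spliced into one vector."""
    frames = np.asarray(frames, dtype=float)
    _, N = frames.shape
    spectra = np.fft.rfft(frames, axis=1)          # (M, N//2 + 1) complex bins
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)          # bin center frequencies
    band = (freqs >= f_lo) & (freqs <= f_hi)        # keep the specified range
    return np.abs(spectra[:, band]).reshape(-1)     # splice channels end to end

# 4 microphones, 256-sample frames at 8 kHz: 49 bins per channel survive.
frames = np.random.default_rng(0).standard_normal((4, 256))
X = frequency_domain_vector(frames, fs=8000.0)      # shape (4 * 49,)
```

With a 256-point FFT at 8 kHz the bin spacing is 31.25 Hz, so the 500 to 2000 Hz band keeps bins 16 through 64 of each channel.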
Step S15: the frequency domain data vector is input into the noise estimation neural network to estimate the noise vector in the noise embedding space.
The frequency domain data vector is input into two fully connected layers, whose output is fed to two GRU layers; the GRU output is then projected into the noise embedding space to estimate the noise vector. The noise embedding space is a matrix, and the noise vector in the noise embedding space is obtained by multiplying the GRU output by this matrix. Throughout this step, the dimension of the finally estimated noise vector is kept equal to the dimension of the frequency domain data vector, although intermediate layers of the network may raise or lower the dimension.
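The FC, GRU, and embedding chain can be sketched in plain NumPy as follows. All weights are random placeholders, the layer widths are toy values, and the two GRU "layers" share weights purely to keep the sketch short; none of this reflects the patent's trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dense(x, W, b):
    return np.tanh(W @ x + b)                      # one fully connected layer

def gru_step(x, h, p):
    """One GRU cell update (standard gating equations, single time step)."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])     # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])     # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
    return (1.0 - z) * h + z * h_tilde

K, H = 32, 16                                       # toy vector / hidden widths
W1, b1 = rng.standard_normal((H, K)) * 0.1, np.zeros(H)
W2, b2 = rng.standard_normal((H, H)) * 0.1, np.zeros(H)
gru = {k: rng.standard_normal((H, H)) * 0.1
       for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
gru.update(bz=np.zeros(H), br=np.zeros(H), bh=np.zeros(H))
E_noise = rng.standard_normal((K, H)) * 0.1         # noise embedding matrix

x = rng.standard_normal(K)                          # frequency domain data vector
h = dense(dense(x, W1, b1), W2, b2)                 # two FC layers
h = gru_step(h, np.zeros(H), gru)                   # first GRU layer
h = gru_step(h, np.zeros(H), gru)                   # second GRU layer
noise_vec = E_noise @ h                             # project into noise embedding space
```

Multiplying the GRU output by the K-by-H embedding matrix is what the text calls projecting into the noise embedding space, and it restores the stated constraint that the noise vector's dimension equals that of the frequency domain data vector.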
Step S16: the noise vector and the frequency domain data vector are input into a neural network to estimate a human voice vector of a human voice embedding space.
The frequency domain data vector and the noise vector are spliced, then reduced through two fully connected layers down to K, the dimension of the frequency domain data vector; the reduced result is input into the human voice embedding space to estimate the human voice vector S, whose dimension is M. The human voice embedding space is a matrix whose two dimensions are M and K, respectively.
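A corresponding sketch of the voice-vector step, again with random placeholder weights (the function and parameter names are ours, not the patent's):

```python
import numpy as np

rng = np.random.default_rng(1)
K, M = 32, 4          # frequency-vector dimension, number of array elements

def voice_vector(x_freq, noise_vec, params):
    """Concatenate the frequency domain data vector and the noise vector,
    reduce back to dimension K with two FC layers, then multiply by the
    M-by-K voice embedding matrix to obtain the M-dim voice vector S."""
    v = np.concatenate([x_freq, noise_vec])         # dimension 2K
    v = np.tanh(params["W1"] @ v + params["b1"])    # 2K -> K
    v = np.tanh(params["W2"] @ v + params["b2"])    # K -> K
    return params["E_voice"] @ v                    # (M, K) @ (K,) -> (M,)

params = {
    "W1": rng.standard_normal((K, 2 * K)) * 0.1, "b1": np.zeros(K),
    "W2": rng.standard_normal((K, K)) * 0.1,     "b2": np.zeros(K),
    "E_voice": rng.standard_normal((M, K)) * 0.1,   # voice embedding space
}
S = voice_vector(rng.standard_normal(K), rng.standard_normal(K), params)
```

The final matrix product is the projection into the human voice embedding space; its M-by-K shape fixes the voice vector's dimension at M.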
Step S17: fusing the one-hot vector of the number of faces and the voice vector into a mixed vector.
The one-hot vector v of the number of faces and the human voice vector S are spliced into a vector h = [v; S], whose dimension is 2M.
Step S18: pass h through two fully connected layers.
The vector h is passed through two fully connected layers to obtain the fully connected output f, whose dimension is set by the width of the second layer.
Step S19: input f into the speaker number embedding space to estimate the speaker number vector.
The output f is input into the speaker number embedding space, i.e., multiplied by the speaker number embedding matrix, to obtain the speaker number vector p, whose dimension is the array element number M.
Step S20: the speaker number vector p is input into a softmax classifier to obtain the number of speakers.
The speaker number vector p is fed to the softmax classifier, which converts its values into probabilities between 0 and 1; the index of the maximum probability gives the number of speakers. A probability threshold is also set, and the recognized speaker count is judged valid only when the maximum probability exceeds this threshold.
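Steps S17 through S20 can be sketched end to end as below. The hidden width D, the placeholder weights, and the mapping "argmax index + 1 = speaker count" (chosen to mirror the face one-hot, where one face maps to index 0) are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
M, D = 4, 16    # array elements / softmax dimension, FC width (assumed)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def estimate_speakers(mixed, params, threshold=0.5):
    """mixed = [face one-hot ; voice vector] (dimension 2M). Two FC layers,
    projection by the M-by-D speaker number embedding matrix, then softmax.
    The count is accepted only if the winning probability clears the
    threshold; otherwise None is returned (estimate judged invalid)."""
    h = np.tanh(params["W1"] @ mixed + params["b1"])    # 2M -> D
    h = np.tanh(params["W2"] @ h + params["b2"])        # D -> D
    p = softmax(params["E_spk"] @ h)                    # (M, D) @ (D,) -> (M,)
    k = int(np.argmax(p))
    return k + 1 if p[k] >= threshold else None

params = {
    "W1": rng.standard_normal((D, 2 * M)) * 0.1, "b1": np.zeros(D),
    "W2": rng.standard_normal((D, D)) * 0.1,     "b2": np.zeros(D),
    "E_spk": rng.standard_normal((M, D)) * 0.1,   # speaker number embedding
}
mixed = np.concatenate([np.eye(M)[1], rng.standard_normal(M)])  # 2 faces + S
n = estimate_speakers(mixed, params, threshold=0.0)  # threshold 0: always accept
```

Because the softmax output has dimension M, the estimated count can never exceed the number of array elements, matching the constraint stated earlier.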
In this embodiment, the image data is used to detect the presence of faces, and the face count provides "maximum possible number of speakers" information for speaker number estimation. The voice vector is obtained from the microphone array data via the noise embedding space and the voice embedding space, reducing noise interference, while dimension reduction of the multi-channel data extracts the key information useful for speaker number estimation. Finally, the neural network and the speaker number embedding space estimate the speaker count from the face-count information provided by the image together with the voice vector, with a classifier producing the final estimate. Compared with estimating speakers from a video stream, the method needs only a single image frame plus voice data, greatly reducing computational complexity and the difficulty of practical application. The method fully exploits the audio and image information, deeply mines the information in voice and image data that helps estimate the number of speakers, and effectively improves estimation accuracy.
Referring to fig. 2, the present embodiment provides a schematic structure diagram of a speaker number estimation model based on voice and image, and implements a speaker number estimation method based on voice and image of the present embodiment 1, which mainly includes:
acquiring an image and microphone array sampling data;
generating a face-count one-hot vector;
the noise vector estimation structure uses two full connection layers, two GRUs and noise embedding space to estimate noise vectors;
the human voice vector estimation structure estimates human voice vectors by using vector splicing, two full-connection layers and human voice embedding space;
the speaker number vector estimation structure uses vector splicing, two layers of full-connection layers and speaker number embedding space to estimate speaker number vectors;
a softmax classifier estimates the number of speakers.
Example 2
Referring to fig. 5, the present embodiment provides a speaker number estimation system based on voice and image, implementing the speaker number estimation method based on voice and image of Embodiment 1, comprising:
the data acquisition module is used for acquiring image data and microphone array data;
the face quantity detection module is used for detecting the quantity of faces in the image;
the one-hot vector generation module is used for generating one-hot vectors according to the number of faces;
a frequency domain data vector calculation module that calculates a frequency domain data vector using the microphone array data;
the noise vector estimation module inputs the frequency domain signal into a noise estimation neural network to estimate noise vectors in the noise embedding space;
the human voice vector estimation module inputs the noise vector and the frequency domain signal into the neural network to estimate the human voice vector of the human voice embedded space;
the vector fusion module fuses the one-hot vectors of the number of faces and the voice vectors into a mixed vector;
the full-connection module is used for enabling the mixed vector to pass through a plurality of full-connection layers;
the speaker number vector estimation module is used for inputting the full connection layer result into the speaker number embedding space and estimating the speaker number vector;
the speaker number estimating module inputs the speaker number vector into the softmax classifier to estimate the speaker number.
Example 3
Referring to fig. 3, the present embodiment provides a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements a speaker number estimation method based on voice and image of the present embodiment 1.
Example 4
Referring to fig. 4, the present embodiment provides a speaker number estimation apparatus based on voice and image, including a memory and one or more processors, where the memory stores executable codes, and the one or more processors execute the executable codes to implement a speaker number estimation method based on voice and image of embodiment 1.
At the hardware level, the device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, although other hardware required by the service is possible. The processor reads the corresponding computer program from the non-volatile memory into the memory and then runs to implement the method described above with respect to fig. 1. Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded from the present invention, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or logic devices.
Improvements to a technology can be clearly distinguished as hardware improvements (e.g., improvements to circuit structures such as diodes, transistors, switches, etc.) or software improvements (improvements to the method flow). However, with the development of technology, many improvements of current method flows can be regarded as direct improvements of hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (e.g., a field programmable gate array (Field Programmable Gate Array, FPGA)) is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD without requiring the chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually manufacturing integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code before compiling must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), among which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used.
It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained by merely slightly programming the method flow into an integrated circuit using several of the hardware description languages described above.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller; examples include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller in pure computer-readable program code, it is entirely possible to implement the same functionality by logically programming the method steps so that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Or even the means for performing the various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described by dividing their functions into various units. Of course, when implementing the present invention, the functions of the units may be implemented in one or more pieces of software and/or hardware.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments of the present invention are described in a progressive manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant details, see the corresponding parts of the description of the method embodiments.
The foregoing is merely exemplary of the present invention and is not intended to limit it. Various modifications and variations of the present invention will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the invention shall be included in the scope of the claims of the present invention.

Claims (10)

1. A method for estimating the number of speakers based on speech and images, comprising:
acquiring synchronous image data and microphone array data, and performing face detection on the image data;
detecting the number of faces in an image;
generating one-hot vectors according to the number of faces;
calculating Fourier frequency-domain data vectors using the microphone array data, comprising: performing a Fourier transform on the time domain data of each array element, selecting frequency domain data of a plurality of array elements in a designated frequency domain range, and splicing the frequency domain data of the plurality of array elements into a frequency domain data vector;
inputting the frequency domain data vector into a noise estimation neural network to estimate a noise vector in a noise embedding space;
inputting the noise vector and the frequency domain data vector into a neural network to estimate a human voice vector in a human voice embedding space;
fusing the one-hot vector of the number of faces and the human voice vector into a mixed vector;
passing the mixed vector through a plurality of full connection layers;
inputting the full connection layer result into a speaker number embedding space, and estimating a speaker number vector;
finally, inputting the speaker number vector into a softmax classifier to estimate the number of speakers.
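As an illustrative sketch of the frequency-domain step of claim 1 (not the patent's implementation; the band limits, frame length, and real/imaginary stacking are hypothetical choices), the frequency-domain data vector can be built with NumPy as follows:

```python
import numpy as np

def frequency_domain_vector(array_frames, fs, f_lo=100.0, f_hi=4000.0):
    """Sketch of claim 1: Fourier-transform each array element's time-domain
    data, keep the bins inside a designated band [f_lo, f_hi] Hz, and splice
    the per-element spectra into one frequency-domain data vector.

    array_frames: (num_elements, num_samples), one row per microphone element.
    """
    num_elements, num_samples = array_frames.shape
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)          # bins in the designated range
    spectra = np.fft.rfft(array_frames, axis=1)[:, band]
    # Stack real and imaginary parts so the vector can feed a real-valued network.
    per_element = np.concatenate([spectra.real, spectra.imag], axis=1)
    return per_element.reshape(-1)                    # splice elements end to end

# Example: a 4-element array with 25 ms frames at 16 kHz (hypothetical sizes).
x = np.random.randn(4, 400)
v = frequency_domain_vector(x, fs=16000)
```

With 400-sample frames at 16 kHz the bin spacing is 40 Hz, so the 100–4000 Hz band keeps 98 bins per element, giving a 4 × 2 × 98 = 784-dimensional vector.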
2. The method of claim 1, wherein acquiring synchronous image data and microphone array data, and performing face detection on the image data, comprises: simultaneously obtaining image data and a microphone array data sequence, performing face detection on the image data, and counting the number of faces.
3. The method for estimating the number of speakers based on voice and image according to claim 1, wherein generating a one-hot vector based on the number of faces comprises: generating the one-hot vector according to the number of faces, and judging whether the number of faces in the image exceeds the number of elements of the microphone array.
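A minimal sketch of claim 3's one-hot step follows. The patent only says the face count is encoded and checked against the array size; clipping the count at the number of array elements and sizing the vector as `num_array_elements + 1` are hypothetical choices for illustration:

```python
import numpy as np

def face_count_one_hot(num_faces, num_array_elements):
    """Sketch of claim 3: one-hot encode the detected face count and flag
    whether it exceeds the number of microphone array elements.

    Assumption (not from the patent): counts above the array size are
    clipped, so the vector has num_array_elements + 1 slots (0..max).
    """
    exceeds = num_faces > num_array_elements
    idx = min(num_faces, num_array_elements)
    vec = np.zeros(num_array_elements + 1)
    vec[idx] = 1.0
    return vec, exceeds

# Example: 3 faces detected, a 6-element array.
v, exceeds = face_count_one_hot(3, num_array_elements=6)
```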
4. The method of claim 1, wherein inputting the frequency domain data vector into the noise estimation neural network to estimate the noise vector of the noise embedding space comprises: inputting the frequency domain data vector into a plurality of full connection layers and a plurality of GRU layers, and inputting the result of the GRU layers into the noise embedding space to estimate the noise vector.
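The FC-then-GRU structure of claim 4 can be sketched in plain NumPy. This is a conceptual illustration under stated assumptions (one full connection layer, one GRU layer, random weights, hypothetical sizes), not the patent's trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W, U, b):
    """One step of a standard GRU cell; W, U, b each stack the update,
    reset, and candidate parameters as three blocks."""
    z = sigmoid(W[0] @ x + U[0] @ h + b[0])        # update gate
    r = sigmoid(W[1] @ x + U[1] @ h + b[1])        # reset gate
    n = np.tanh(W[2] @ x + U[2] @ (r * h) + b[2])  # candidate state
    return (1.0 - z) * h + z * n

def estimate_noise_vector(freq_frames, fc_w, gru_w, gru_u, gru_b, emb_w):
    """Sketch of claim 4: each frequency-domain frame passes through a full
    connection layer, a GRU runs over the frame sequence, and the final
    hidden state is projected into the noise embedding space."""
    h = np.zeros(gru_u.shape[-1])
    for frame in freq_frames:
        x = relu(fc_w @ frame)                     # full connection layer
        h = gru_step(x, h, gru_w, gru_u, gru_b)
    return emb_w @ h                               # noise embedding projection

# Hypothetical sizes: 32-dim frames, 16 hidden units, 8-dim noise embedding.
D, H, E, T = 32, 16, 8, 10
frames = rng.standard_normal((T, D))
noise_vec = estimate_noise_vector(
    frames,
    fc_w=rng.standard_normal((H, D)) * 0.1,
    gru_w=rng.standard_normal((3, H, H)) * 0.1,
    gru_u=rng.standard_normal((3, H, H)) * 0.1,
    gru_b=np.zeros((3, H)),
    emb_w=rng.standard_normal((E, H)) * 0.1,
)
```

The recurrent layer lets the noise estimate integrate evidence across time frames, which is why a GRU sits between the full connection layers and the embedding projection.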
5. The method of claim 1, wherein inputting the noise vector and the frequency domain data vector into the neural network to estimate the human voice vector of the human voice embedding space comprises: splicing the noise vector and the frequency domain data vector into one vector, passing it through a plurality of full connection layers, and then inputting the result of the full connection layers into the human voice embedding space to estimate the human voice vector.
6. The method of claim 1, wherein fusing the one-hot vector of the number of faces and the voice vector into a mixed vector comprises: fusing the one-hot vector of the number of faces and the voice vector into a mixed vector by a method including, but not limited to: splicing along the vector dimension; element-wise vector addition, subtraction, multiplication, or division; or fusing the data through multiple full connection layers.
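Two of the fusion options listed in claim 6 can be sketched as follows. Zero-padding the shorter vector before element-wise arithmetic is a hypothetical choice to make shapes match; the patent does not specify how unequal lengths are handled:

```python
import numpy as np

def fuse(one_hot, voice_vec, method="concat"):
    """Sketch of claim 6: fuse the face-count one-hot vector with the voice
    vector, either by dimension splicing ('concat') or by element-wise
    arithmetic ('add' / 'mul'), padding with zeros so shapes match."""
    if method == "concat":
        return np.concatenate([one_hot, voice_vec])  # dimension splicing
    n = max(one_hot.size, voice_vec.size)
    a = np.pad(one_hot, (0, n - one_hot.size))       # assumption: zero-pad
    b = np.pad(voice_vec, (0, n - voice_vec.size))
    if method == "add":
        return a + b
    if method == "mul":
        return a * b
    raise ValueError(f"unknown fusion method: {method}")

one_hot = np.array([0.0, 1.0, 0.0])
voice = np.array([0.5, -0.5, 0.25, 0.1])
mixed = fuse(one_hot, voice)                         # 3 + 4 = 7 dimensions
```

The full-connection-layer fusion that the claim also mentions would simply apply a learned linear map to the spliced vector.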
7. The method of claim 1, wherein the step of inputting the mixed vector into the multi-layer fully-connected layer, then inputting the result of the fully-connected layer into the speaker number embedding space to obtain the speaker number vector, and finally inputting the speaker number vector into the softmax classifier comprises the steps of:
inputting the mixed vector into a plurality of full connection layers;
inputting the result of the full connection layer into a speaker number embedding space to obtain a speaker number vector;
the speaker number vector is input into a softmax classifier to obtain the speaker number.
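The head described in claim 7 (full connection layers, speaker-number embedding, softmax classifier) can be sketched with random illustrative weights; the layer counts and dimensions below are hypothetical, not from the patent:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())              # shift by the max for stability
    return e / e.sum()

def estimate_speaker_count(mixed_vec, fc_layers, emb_w, cls_w):
    """Sketch of claim 7: the mixed vector passes through several full
    connection layers, is projected into the speaker-number embedding
    space, and a softmax classifier over that embedding yields the count."""
    h = mixed_vec
    for w in fc_layers:
        h = np.maximum(w @ h, 0.0)       # full connection layer + ReLU
    emb = emb_w @ h                      # speaker-number embedding vector
    probs = softmax(cls_w @ emb)         # distribution over 0..K speakers
    return int(np.argmax(probs)), probs

# Hypothetical sizes: 16-dim mixed vector, up to 6 speakers (7 classes).
mixed = rng.standard_normal(16)
fc_layers = [rng.standard_normal((16, 16)) * 0.2 for _ in range(2)]
count, probs = estimate_speaker_count(
    mixed, fc_layers,
    emb_w=rng.standard_normal((8, 16)) * 0.2,
    cls_w=rng.standard_normal((7, 8)) * 0.2,
)
```

Treating the count as a softmax class rather than a regression target bounds the output to a fixed set of plausible speaker numbers.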
8. A speech and image based speaker number estimation system, comprising:
the data acquisition module is used for acquiring image data and microphone array data;
the face quantity detection module is used for detecting the quantity of faces in the image;
the one-hot vector generation module is used for generating one-hot vectors according to the number of faces;
a frequency domain data vector calculation module that calculates a frequency domain data vector using the microphone array data; comprising the following steps: performing Fourier transform on the time domain data of each array element, selecting frequency domain data of a plurality of array elements in a designated frequency domain range, and splicing the frequency domain data of the plurality of array elements into a frequency domain data vector;
the noise vector estimation module, which inputs the frequency domain data vector into a noise estimation neural network to estimate a noise vector in the noise embedding space;
the human voice vector estimation module, which inputs the noise vector and the frequency domain data vector into a neural network to estimate the human voice vector of the human voice embedding space;
the vector fusion module fuses the one-hot vectors of the number of faces and the voice vectors into a mixed vector;
the full-connection module is used for enabling the mixed vector to pass through a plurality of full-connection layers;
the speaker number vector estimation module is used for inputting the full connection layer result into the speaker number embedding space and estimating the speaker number vector;
the speaker number estimating module inputs the speaker number vector into the softmax classifier to estimate the speaker number.
9. A speech and image based speaker count estimation apparatus comprising a memory and one or more processors, the memory having executable code stored therein, the one or more processors, when executing the executable code, operative to implement a speech and image based speaker count estimation method as claimed in any one of claims 1-7.
10. A computer-readable storage medium, having stored thereon a program which, when executed by a processor, implements a method of estimating the number of speakers based on speech and images as claimed in any of claims 1-7.
CN202311278365.6A 2023-10-07 2023-10-07 Method and system for estimating number of speakers based on voice and image Active CN117011924B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311278365.6A CN117011924B (en) 2023-10-07 2023-10-07 Method and system for estimating number of speakers based on voice and image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311278365.6A CN117011924B (en) 2023-10-07 2023-10-07 Method and system for estimating number of speakers based on voice and image

Publications (2)

Publication Number Publication Date
CN117011924A CN117011924A (en) 2023-11-07
CN117011924B (en) 2024-02-13

Family

ID=88565777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311278365.6A Active CN117011924B (en) 2023-10-07 2023-10-07 Method and system for estimating number of speakers based on voice and image

Country Status (1)

Country Link
CN (1) CN117011924B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179959A (en) * 2020-01-06 2020-05-19 北京大学 Competitive speaker number estimation method and system based on speaker embedding space
CN111833887A (en) * 2020-07-14 2020-10-27 山东理工大学 Speaker confirmation method based on local hold discrimination projection
CN111862990A (en) * 2020-07-21 2020-10-30 苏州思必驰信息科技有限公司 Speaker identity verification method and system
CN111930992A (en) * 2020-08-14 2020-11-13 腾讯科技(深圳)有限公司 Neural network training method and device and electronic equipment
CN115457973A (en) * 2022-09-06 2022-12-09 云知声智能科技股份有限公司 Speaker segmentation method, system, terminal and storage medium
CN115810209A (en) * 2022-11-25 2023-03-17 之江实验室 Speaker recognition method and device based on multi-mode feature fusion network
CN116469182A (en) * 2023-04-06 2023-07-21 之江实验室 Biological feature matching method, device, equipment and medium crossing human face and voice

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6916130B2 (en) * 2018-03-02 2021-08-11 株式会社日立製作所 Speaker estimation method and speaker estimation device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multi-Modal Multi-Channel Target Speech Separation; Rongzhi Gu et al.; IEEE Journal of Selected Topics in Signal Processing, Vol. 14, No. 3; full text *
Robust speaker recognition method based on convolutional neural networks; Zeng Chunyan, Ma Chaofeng, Wang Zhifeng, Kong Xiangbin; Journal of Huazhong University of Science and Technology (Natural Science Edition), No. 6; full text *

Also Published As

Publication number Publication date
CN117011924A (en) 2023-11-07

Similar Documents

Publication Publication Date Title
CN110767247B (en) Voice signal processing method, sound acquisition device and electronic equipment
CN109034183B (en) Target detection method, device and equipment
CN103391347A (en) Automatic recording method and device
CN108831508A (en) Voice activity detection method, device and equipment
CN112949708A (en) Emotion recognition method and device, computer equipment and storage medium
CN116167461B (en) Model training method and device, storage medium and electronic equipment
CN115148197A (en) Voice wake-up method, device, storage medium and system
CN116343765A (en) Method and system for automatic context binding domain specific speech recognition
CN111615045B (en) Audio processing method, device, equipment and storage medium
CN117392485B (en) Image generation model training method, service execution method, device and medium
CN113345422B (en) Voice data processing method, device, equipment and storage medium
CN117011924B (en) Method and system for estimating number of speakers based on voice and image
CN117828360A (en) Model training method, model training device, model code generating device, storage medium and storage medium
CN115620706B (en) Model training method, device, equipment and storage medium
CN115810209A (en) Speaker recognition method and device based on multi-mode feature fusion network
CN115171735A (en) Voice activity detection method, storage medium and electronic equipment
CN112397073B (en) Audio data processing method and device
CN113889086A (en) Training method of voice recognition model, voice recognition method and related device
CN109325127B (en) Risk identification method and device
CN112233697B (en) Audio data detection method and device and audio data detection equipment
CN116844553B (en) Data processing method, device and equipment
CN115862668B (en) Method and system for judging interactive object based on sound source positioning by robot
CN117523323B (en) Detection method and device for generated image
Cano et al. Selective Hearing: A Machine Listening Perspective
CN116312457A (en) Audio frame loss detection method, training method and device of frame loss detection model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant