CN112487246A - Method and device for identifying speakers in multi-person video


Info

Publication number
CN112487246A
CN112487246A (application number CN202011373431.4A)
Authority
CN
China
Prior art keywords
image
data
face
speaker
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011373431.4A
Other languages
Chinese (zh)
Inventor
陈均 (Chen Jun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Kadoxi Technology Co ltd
Original Assignee
Shenzhen Kadoxi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Kadoxi Technology Co., Ltd.
Priority to CN202011373431.4A
Publication of CN112487246A
Legal status: Pending


Classifications

    • G06F16/784: Information retrieval of video data using metadata automatically derived from the content, the detected or recognised objects being people
    • G01S5/22: Position-fixing using sonic waves, the position of the source determined by co-ordinating a plurality of position lines defined by path-difference measurements
    • G06F16/685: Information retrieval of audio data using an automatically derived transcript of the audio data, e.g. lyrics
    • G06F16/7847: Information retrieval of video data using low-level visual features of the video content
    • G06F18/2414: Pattern recognition; classification based on distances to training or reference patterns, smoothing the distance, e.g. radial basis function networks [RBFN]
    • G06V40/168: Recognition of human faces in image or video data; feature extraction and face representation
    • G10L21/0208: Speech enhancement; noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L2021/02082: Noise filtering where the noise is echo or reverberation of the speech
    • G10L2021/02166: Noise filtering using microphone arrays; beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of camera device control, and in particular to a method and a device for identifying the speaker in a multi-person video. The method comprises: acquiring image data captured by a camera, invoking a preset face recognition model to recognize each frame of the image data, and determining the position parameter of each detected face feature within the image data; acquiring multi-channel audio data captured by a microphone array and, using a preset voice recognition model, determining the position parameter of the audio channel with the strongest human-voice energy; determining the position parameter of the speaker in the image according to the position parameter of that audio channel; and, according to the position parameter of the speaker in the image, obtaining cropped image data of the speaker's face and performing pixel amplification on the image within the cropped data. The video picture of a real-time live broadcast can thus be structured automatically, making the broadcast more engaging and enhancing human-computer interaction.

Description

Method and device for identifying speakers in multi-person video
Technical Field
The invention relates to the technical field of camera device control, in particular to a method and a device for identifying speakers in a multi-person video.
Background
As the state of the art rapidly advances, ever more intelligent audio and video analysis technologies are emerging to produce structured audio-video data, and presenting that structured data fused with the raw audio and video can deliver a more user-friendly application experience.
When the audio and video of multiple people are displayed in the same picture, however, a system cannot determine which specific person in the current video stream is speaking, so it cannot surface the structured audio-video data automatically. Such structured data is typically produced by manual post-processing and fusion of recorded material, which makes it ill-suited to real-time live-broadcast applications.
Disclosure of Invention
In view of the above, embodiments of the present invention are proposed to provide a method and apparatus for identifying a speaker in a multi-person video that overcome or at least partially solve the above problems.
In order to solve the above problem, an embodiment of the present invention discloses a method for identifying a speaker in a multi-person video, comprising:
acquiring image data captured by a camera, invoking a preset face recognition model to recognize each frame of the image data, and determining the position parameter of each detected face feature within the image data;
acquiring multi-channel audio data captured by a microphone array, and using a preset voice recognition model to determine the position parameter of the audio channel with the strongest human-voice energy;
determining the position parameter of the speaker in the image according to the position parameter of that audio channel;
and, according to the position parameter of the speaker in the image, obtaining cropped image data of the speaker's face and performing pixel amplification on the image within the cropped data.
Further, invoking the preset face recognition model to recognize each frame of the image data comprises:
extracting the face features in a sample image;
inputting the face features and the sample image data into a recognition network, and determining the position information of a face recognition frame and the face image information within that frame;
cropping the face image inside the face recognition frame to obtain a face crop frame, and feeding the image data within the face crop frame back into the recognition network;
and training on the face recognition frame and the face crop frame through the recognition network to obtain the face recognition model.
Further, acquiring the multi-channel audio data collected by the microphone array and using the preset voice recognition model to determine the position parameter of the audio channel with the strongest human-voice energy comprises:
performing echo cancellation on each acquired audio channel against a reference signal; specifically, the reference signal may be taken from a loudspeaker or from the sound card driver;
performing noise suppression on the signal remaining after echo cancellation, and applying automatic gain to obtain recognizable human-voice data;
processing the human-voice data in each audio channel with a beamforming algorithm to obtain multiple beam signals;
and performing voice recognition on each beam signal separately, determining the beam signal with the strongest human-voice energy, and obtaining the position parameter of the audio data corresponding to that beam signal.
Further, performing voice recognition on each beam signal separately comprises:
performing keyword recognition on each beam signal, and when the keyword information in a beam signal is detected to match a preset keyword training result, determining that beam signal to be the keyword beam signal.
Further, performing pixel amplification on the image within the cropped image data comprises:
acquiring the pixel proportion data of an image enlargement area;
calculating, from the pixel proportion data of the cropped image data, the magnification factor needed to enlarge the cropped image to the image enlargement area;
and performing pixel amplification on the image within the cropped image data according to that magnification factor.
There is also provided an apparatus for identifying a speaker in a multi-person video, comprising:
a face recognition module, configured to acquire image data captured by the camera, invoke a preset face recognition model to recognize each frame of the image data, and determine the position parameter of each detected face feature within the image data it belongs to;
a voice recognition module, configured to acquire the multi-channel audio data collected by the microphone array and, using a preset voice recognition model, determine the position parameter of the audio channel with the strongest human-voice energy;
a position confirmation module, configured to determine the position parameter of the speaker in the image according to the position parameter of the audio data;
and a pixel amplification module, configured to obtain cropped image data of the speaker's face according to the position parameter of the speaker in the image, and perform pixel amplification on the image within the cropped data.
Further, the face recognition module is configured for:
extracting the face features in a sample image;
inputting the face features and the sample image data into a recognition network, and determining the position information of a face recognition frame and the face image information within that frame;
cropping the face image inside the face recognition frame to obtain a face crop frame, and feeding the image data within the face crop frame back into the recognition network;
and training a multi-convolution-layer structure on the face recognition frame and the face crop frame through the recognition network to obtain the face recognition model.
Further, the pixel amplification module comprises:
an enlargement area acquisition module, configured to acquire the pixel proportion data of an image enlargement area;
a magnification calculation module, configured to calculate, from the pixel proportion data of the cropped image data, the magnification factor needed to enlarge the cropped image to the image enlargement area;
and an amplification submodule, configured to perform pixel amplification on the image within the cropped image data according to that magnification factor.
There is also provided an electronic device comprising a processor, a memory and a computer program stored on the memory and capable of running on the processor, the computer program, when executed by the processor, implementing the method of identifying a speaker in a multi-person video.
There is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of identifying a speaker in a multi-person video.
The embodiments of the invention have the following advantages:
The method and apparatus locate every face target in the image with face recognition technology and use the microphone array to locate the position of the specific speaker, thereby pinning down the speaker's exact position in the image; the speaker's face image is then enlarged by a magnification factor computed from the image. The video picture of a real-time live broadcast can thus be structured automatically, making the broadcast more engaging and enhancing human-computer interaction.
Drawings
FIG. 1 is a flow chart illustrating steps of an embodiment of a method for identifying a speaker in a multi-person video according to the present invention;
FIG. 2 is a block diagram of an embodiment of an apparatus for identifying a speaker in a multi-person video according to the present invention;
FIG. 3 is a block diagram of a computer apparatus for speaker identification in a multi-person video according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The method for identifying a speaker in a multi-person video, which locates the speaker through sound source positioning, can be applied to any terminal device having voice and image recognition functions, such as a smartphone, a tablet computer, or a smart-home device.
In the embodiments of the application, a single camera may be used to shoot in one direction only, with the microphone array arranged as a linear array; alternatively, multiple cameras may be arranged in a ring array, with the microphones likewise arranged in a ring array.
One application scenario of the embodiments is identifying the actual speaker when multiple people appear in the same video picture simultaneously. As shown in FIG. 1, a method for identifying a speaker in a multi-person video comprises the following steps (a code sketch of the steps follows the list):
S100, acquiring image data captured by a camera, invoking a preset face recognition model to recognize each frame of the image data, and determining the position parameter of each detected face feature within the image data;
S200, acquiring multi-channel audio data captured by a microphone array, and using a preset voice recognition model to determine the position parameter of the audio channel with the strongest human-voice energy;
S300, determining the position parameter of the speaker in the image according to the position parameter of that audio channel;
S400, according to the position parameter of the speaker in the image, obtaining cropped image data of the speaker's face and performing pixel amplification on the image within the cropped data.
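For orientation only, the four steps can be strung together as below. This is a minimal, self-contained Python sketch, not part of the disclosure: the stub face detector, the two-microphone delay estimate, and all array and camera parameters are illustrative assumptions standing in for the trained models and hardware described in the following paragraphs.

    # Minimal end-to-end sketch of steps S100-S400. Every helper here is a
    # placeholder; spacing, sample rate and camera span are assumed values.
    import numpy as np

    def detect_faces(frame):
        # S100 stub: a real system runs the CNN face recognition model.
        h, w = frame.shape[:2]
        return [(int(w * c) - 40, 50, int(w * c) + 40, 150) for c in (0.125, 0.375, 0.625, 0.875)]

    def strongest_voice_angle(ch0, ch1, spacing=0.05, fs=16000, c=343.0):
        # S200 stub: bearing from the inter-microphone delay (cross-correlation).
        corr = np.correlate(ch0, ch1, mode="full")
        delay = (np.argmax(corr) - (len(ch1) - 1)) / fs
        return np.degrees(np.arcsin(np.clip(delay * c / spacing, -1.0, 1.0)))

    def match_face(angle, faces, width, span=180.0):
        # S300: project the sound bearing onto an image column, pick nearest face.
        x = (angle / span + 0.5) * width
        return min(faces, key=lambda b: abs((b[0] + b[2]) / 2 - x))

    def amplify(frame, box, factor=2):
        # S400: crop the speaker's face and enlarge it by pixel repetition.
        x0, y0, x1, y1 = box
        crop = frame[y0:y1, x0:x1]
        return np.repeat(np.repeat(crop, factor, axis=0), factor, axis=1)

    frame = np.zeros((480, 640, 3), dtype=np.uint8)
    t = np.arange(16000) / 16000.0
    ch0, ch1 = np.sin(2 * np.pi * 220 * t), np.sin(2 * np.pi * 220 * (t - 0.0001))
    box = match_face(strongest_voice_angle(ch0, ch1), detect_faces(frame), 640)
    print(amplify(frame, box).shape)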
In step S100, the preset face recognition model is obtained by iteratively training sample images bearing face features on a convolutional neural network. Specifically, the training comprises:
extracting the face features in a sample image: chiefly the position information of the real faces in the sample image; the coordinate data and pixel proportion data of each face can be extracted with an existing image feature selection tool;
inputting the face features and the sample image data into a recognition network, and determining the position information of a face recognition frame and the face image information within that frame;
cropping the face image inside the face recognition frame to obtain a face crop frame, and feeding the image data within the face crop frame back into the recognition network;
training a multi-convolution-layer structure on the face recognition frame and the face crop frame through the recognition network to obtain the face recognition model.
The recognition network is a convolutional neural network whose structure is not limited to convolution layers; it also includes pooling layers, fully connected layers, and the like. Whichever structural combination is used for training, the aim is the same: in the embodiments of the application, inputting image data bearing face features into the face recognition model yields the position information of each face in the image and the image data of its face crop frame.
In step S200, image data acquisition and audio data acquisition proceed synchronously. The image data can be recognized and located quickly by the face recognition model, whereas the audio data requires preprocessing before recognition. Specifically, the preprocessing comprises:
performing echo cancellation on each acquired audio channel against a reference signal; specifically, the reference signal may be taken from a loudspeaker or from the sound card driver;
and performing noise suppression on the signal remaining after echo cancellation, then applying automatic gain to obtain recognizable human-voice data, the human voice here being taken to lie within the 20 Hz to 20 kHz frequency range.
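The preprocessing chain can be sketched as follows; the normalized-LMS echo canceller and the crude RMS-based automatic gain below are stand-ins under assumed parameters (filter length, step size, target level), not the embodiment's exact algorithms:

    # Sketch of the audio preprocessing: NLMS echo cancellation against a
    # reference signal, followed by a simple automatic gain stage.
    import numpy as np

    def nlms_echo_cancel(mic, reference, taps=128, mu=0.5, eps=1e-8):
        w = np.zeros(taps)
        out = np.zeros_like(mic)
        for n in range(taps, len(mic)):
            x = reference[n - taps:n][::-1]   # most recent reference samples
            e = mic[n] - w @ x                # error = mic minus estimated echo
            w += mu * e * x / (x @ x + eps)   # normalized LMS weight update
            out[n] = e
        return out

    def auto_gain(signal, target_rms=0.1, eps=1e-8):
        return signal * (target_rms / (np.sqrt(np.mean(signal ** 2)) + eps))

    fs = 16000
    reference = np.random.randn(fs)                          # loudspeaker reference
    echo = np.convolve(reference, np.linspace(0.5, 0.0, 64))[:fs]
    mic = 0.05 * np.random.randn(fs) + echo                  # near-end noise + echo
    clean = auto_gain(nlms_echo_cancel(mic, reference))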
After the collected audio data has been preprocessed into human-voice data, recognition of the audio data proceeds as follows:
processing the human-voice data in each audio channel with a beamforming algorithm to obtain multiple beam signals, where beamforming applies time-delay or phase compensation and amplitude weighting to the audio signal output by each microphone of the array to form a beam pointing in a specific direction;
and performing voice recognition on each beam signal separately, determining the beam signal with the strongest human-voice energy, and obtaining the position parameter of the audio data corresponding to that beam signal.
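A delay-and-sum beamformer illustrates this step; the array geometry, scan angles, and sample rate below are assumed values for illustration:

    # Delay-and-sum beamforming over a linear array: one beam per steering
    # angle; the beam with the highest energy marks the speaker direction.
    import numpy as np

    def delay_and_sum(channels, angle_deg, spacing=0.05, fs=16000, c=343.0):
        n_mics, n = channels.shape
        delays = spacing * np.arange(n_mics) * np.sin(np.radians(angle_deg)) / c
        shifts = np.round(delays * fs).astype(int)
        beam = np.zeros(n)
        for ch, s in zip(channels, shifts):
            beam += np.roll(ch, -s)           # integer-sample delay compensation
        return beam / n_mics

    def strongest_beam(channels, angles=range(-60, 61, 15)):
        beams = {a: delay_and_sum(channels, a) for a in angles}
        best = max(beams, key=lambda a: np.sum(beams[a] ** 2))  # max energy
        return best, beams[best]

    channels = np.random.randn(4, 16000)      # 4-microphone array, 1 s of audio
    angle, beam = strongest_beam(channels)
    print(f"strongest beam at {angle} degrees")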
In one embodiment, the beam signals also carry keyword information, and performing voice recognition on each beam signal further comprises:
performing keyword recognition on each beam signal, and when the keyword information in a beam signal is detected to match a preset keyword training result, determining that beam signal to be the keyword beam signal.
In the above embodiment, when the keyword result trained into the voice recognition model detects matching keyword information in one of the preprocessed human-voice channels, the position parameter of that audio channel is treated as the position parameter of the audio data with the strongest human-voice energy, and the position parameter of the audio data carrying the keyword information serves as the reference for subsequent localization.
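The keyword check itself reduces to matching each beam's recognized text against the preset keyword list. In this sketch the transcripts are assumed to come from any speech recognizer, and the keyword set is a hypothetical placeholder:

    # Sketch of the keyword beam check: the first beam whose transcript
    # contains a preset keyword is taken as the keyword beam signal.
    PRESET_KEYWORDS = {"hello", "question", "next slide"}   # illustrative only

    def keyword_beam(transcripts):
        """transcripts: mapping of beam index -> recognized text."""
        for beam_id, text in transcripts.items():
            if any(k in text.lower() for k in PRESET_KEYWORDS):
                return beam_id
        return None

    print(keyword_beam({0: "so the next slide please", 1: "background chatter"}))  # -> 0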
Once the position parameter of the audio channel with the strongest human-voice energy has been determined, the beamforming algorithm yields the angle and direction of that audio, which identifies the microphone in the array nearest to the source; from that microphone's position parameter, the correspondence between the real speaker and the microphone is obtained.
Specifically, suppose 4 microphones form a linear array with a 45-degree angle between adjacent microphones, and each microphone faces exactly one person, so that 4 face recognition frames are recognized in the image. If every person appears to be in a speaking state, the system cannot identify the actual speaker by face recognition alone; but once the position parameter of the audio with the strongest human-voice energy is obtained, the system can locate the specific microphone receiving that strongest voice and, combining its position parameter with the face recognition frames, determine the actual speaker's position parameter, as sketched below.
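Under those assumptions (a hypothetical 180-degree horizontal span for the array-plus-camera geometry, microphone bearings of ±22.5 and ±67.5 degrees, and face frames given as (left, top, right, bottom) pixels), the matching can be sketched as:

    # Sketch of step S300 for the 4-microphone example: project the selected
    # microphone's bearing onto an image column and pick the nearest face frame.
    def speaker_face(mic_index, face_boxes, image_width,
                     mic_angles=(-67.5, -22.5, 22.5, 67.5), span=180.0):
        x = (mic_angles[mic_index] / span + 0.5) * image_width
        # The face recognition frame whose centre is nearest that column wins.
        return min(face_boxes, key=lambda b: abs((b[0] + b[2]) / 2 - x))

    boxes = [(40, 50, 120, 150), (200, 50, 280, 150),
             (360, 50, 440, 150), (520, 50, 600, 150)]
    print(speaker_face(2, boxes, 640))   # -> (360, 50, 440, 150), the third face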
In step S400, the acquired image data includes an image enlargement area; that is, the identified crop is enlarged into the designated enlargement area. Specifically, the step comprises (see the sketch after this list):
acquiring the pixel proportion data of the image enlargement area;
calculating, from the pixel proportion data of the cropped image data, the magnification factor needed to enlarge the cropped image to the image enlargement area;
and performing pixel amplification on the image within the cropped image data according to that magnification factor.
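A sketch of this calculation, assuming the crop and the enlargement area are both given in pixels and using nearest-neighbour pixel repetition for the enlargement:

    # Sketch of step S400: derive the magnification factor from the pixel
    # proportions, then enlarge the crop by pixel repetition.
    import numpy as np

    def magnification_factor(crop_shape, target_shape):
        # Integer scale so the enlarged crop still fits the enlargement area.
        return min(target_shape[0] // crop_shape[0], target_shape[1] // crop_shape[1])

    def pixel_amplify(crop, factor):
        return np.repeat(np.repeat(crop, factor, axis=0), factor, axis=1)

    crop = np.zeros((80, 100, 3), dtype=np.uint8)          # speaker face crop
    factor = magnification_factor(crop.shape, (360, 480))  # enlargement area
    print(pixel_amplify(crop, factor).shape)               # -> (320, 400, 3)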
The actual speaker is thus enlarged and displayed within the multi-person video picture, which improves interactivity in multi-person video and broadens its range of applications.
As shown in FIG. 2, an embodiment of the application further provides an apparatus for identifying a speaker in a multi-person video, comprising:
a face recognition module 100, configured to acquire image data captured by a camera, invoke a preset face recognition model to recognize each frame of the image data, and determine the position parameter of each detected face feature within the image data it belongs to;
a voice recognition module 200, configured to acquire the multi-channel audio data collected by the microphone array and, using a preset voice recognition model, determine the position parameter of the audio channel with the strongest human-voice energy;
a position confirmation module 300, configured to determine the position parameter of the speaker in the image according to the position parameter of the audio data;
and a pixel amplification module 400, configured to obtain cropped image data of the speaker's face according to the position parameter of the speaker in the image, and perform pixel amplification on the image within the cropped data.
In one embodiment, the face recognition module 100 is configured for:
extracting the face features in a sample image;
inputting the face features and the sample image data into a recognition network, and determining the position information of a face recognition frame and the face image information within that frame;
cropping the face image inside the face recognition frame to obtain a face crop frame, and feeding the image data within the face crop frame back into the recognition network;
and training a multi-convolution-layer structure on the face recognition frame and the face crop frame through the recognition network to obtain the face recognition model.
In one embodiment, the pixel amplification module 400 comprises:
an enlargement area acquisition module, configured to acquire the pixel proportion data of an image enlargement area;
a magnification calculation module, configured to calculate, from the pixel proportion data of the cropped image data, the magnification factor needed to enlarge the cropped image to the image enlargement area;
and an amplification submodule, configured to perform pixel amplification on the image within the cropped image data according to that magnification factor.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts, but those skilled in the art will recognize that the invention is not limited by the described order of acts, as some steps may, according to the embodiments of the invention, be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the acts involved are not necessarily required by the invention.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
Referring to FIG. 3, a computer device for identifying a speaker in a multi-person video according to the invention is shown, which may specifically include the following.
In an embodiment of the invention, a computer device is further provided. The computer device 12 takes the form of a general-purpose computing device, and its components may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples the various system components, including the system memory 28, to the processing unit 16.
The bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 31 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (commonly referred to as a "hard drive"). Although not shown in FIG. 3, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 18 by one or more data media interfaces. The memory may include at least one program product having a set (e.g., at least one) of program modules 42 configured to carry out the functions of the embodiments of the invention.
A program/utility 41 having a set (at least one) of program modules 42 may be stored, for example, in the memory. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may comprise an implementation of a network environment. The program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
The computer device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, a camera, etc.), with one or more devices that enable a user to interact with the computer device 12, and/or with any devices (e.g., a network card, a modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Also, the computer device 12 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter 20. As shown, the network adapter 20 communicates with the other modules of the computer device 12 via the bus 18. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, implementing a method for speaker recognition in a multi-person video provided by an embodiment of the present invention.
That is, when executing the program, the processing unit 16 implements: acquiring image data captured by a camera, invoking a preset face recognition model to recognize each frame of the image data, and determining the position parameter of each detected face feature within the image data; acquiring multi-channel audio data captured by a microphone array, and using a preset voice recognition model to determine the position parameter of the audio channel with the strongest human-voice energy; determining the position parameter of the speaker in the image according to the position parameter of that audio channel; and, according to the position parameter of the speaker in the image, obtaining cropped image data of the speaker's face and performing pixel amplification on the image within the cropped data.
In an embodiment of the present invention, the present invention further provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements a method for identifying a speaker in a multi-person video as provided in all embodiments of the present application.
That is, the program, when executed by the processor, implements: acquiring image data captured by a camera, invoking a preset face recognition model to recognize each frame of the image data, and determining the position parameter of each detected face feature within the image data; acquiring multi-channel audio data captured by a microphone array, and using a preset voice recognition model to determine the position parameter of the audio channel with the strongest human-voice energy; determining the position parameter of the speaker in the image according to the position parameter of that audio channel; and, according to the position parameter of the speaker in the image, obtaining cropped image data of the speaker's face and performing pixel amplification on the image within the cropped data.
Any combination of one or more computer-readable media may be employed. The computer-readable medium may be a computer-readable storage medium or a computer-readable signal medium. A computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises", "comprising", and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or terminal that comprises the element.
The method for identifying speakers in a multi-person video provided by the invention has been described in detail above. Specific examples have been used herein to explain the principle and implementation of the invention, and the description of the above embodiments is only intended to help in understanding the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the invention, vary both the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (10)

1. A method for identifying a speaker in a multi-person video, comprising:
acquiring image data captured by a camera, invoking a preset face recognition model to recognize each frame of the image data, and determining the position parameter of each detected face feature within the image data;
acquiring multi-channel audio data captured by a microphone array, and using a preset voice recognition model to determine the position parameter of the audio channel with the strongest human-voice energy;
determining the position parameter of the speaker in the image according to the position parameter of that audio channel;
and, according to the position parameter of the speaker in the image, obtaining cropped image data of the speaker's face and performing pixel amplification on the image within the cropped data.
2. The method according to claim 1, wherein invoking the preset face recognition model to recognize each frame of the image data comprises:
extracting the face features in a sample image;
inputting the face features and the sample image data into a recognition network, and determining the position information of a face recognition frame and the face image information within that frame;
cropping the face image inside the face recognition frame to obtain a face crop frame, and feeding the image data within the face crop frame back into the recognition network;
and training on the face recognition frame and the face crop frame through the recognition network to obtain the face recognition model.
3. The method of claim 1, wherein acquiring the multi-channel audio data collected by the microphone array and using the preset voice recognition model to determine the position parameter of the audio channel with the strongest human-voice energy comprises:
performing echo cancellation on each acquired audio channel against a reference signal, wherein the reference signal may be taken from a loudspeaker or from the sound card driver;
performing noise suppression on the signal remaining after echo cancellation, and applying automatic gain to obtain recognizable human-voice data;
processing the human-voice data in each audio channel with a beamforming algorithm to obtain multiple beam signals;
and performing voice recognition on each beam signal separately, determining the beam signal with the strongest human-voice energy, and obtaining the position parameter of the audio data corresponding to that beam signal.
4. The method of claim 3, wherein performing voice recognition on each beam signal separately comprises:
performing keyword recognition on each beam signal, and when the keyword information in a beam signal is detected to match a preset keyword training result, determining that beam signal to be the keyword beam signal.
5. The method of claim 1, wherein performing pixel amplification on the image within the cropped image data comprises:
acquiring the pixel proportion data of an image enlargement area;
calculating, from the pixel proportion data of the cropped image data, the magnification factor needed to enlarge the cropped image to the image enlargement area;
and performing pixel amplification on the image within the cropped image data according to that magnification factor.
6. An apparatus for identifying a speaker in a multi-person video, comprising:
a face recognition module, configured to acquire image data captured by the camera, invoke a preset face recognition model to recognize each frame of the image data, and determine the position parameter of each detected face feature within the image data it belongs to;
a voice recognition module, configured to acquire the multi-channel audio data collected by the microphone array and, using a preset voice recognition model, determine the position parameter of the audio channel with the strongest human-voice energy;
a position confirmation module, configured to determine the position parameter of the speaker in the image according to the position parameter of the audio data;
and a pixel amplification module, configured to obtain cropped image data of the speaker's face according to the position parameter of the speaker in the image, and perform pixel amplification on the image within the cropped data.
7. The apparatus of claim 6, wherein the face recognition module is configured for:
extracting the face features in a sample image;
inputting the face features and the sample image data into a recognition network, and determining the position information of a face recognition frame and the face image information within that frame;
cropping the face image inside the face recognition frame to obtain a face crop frame, and feeding the image data within the face crop frame back into the recognition network;
and training a multi-convolution-layer structure on the face recognition frame and the face crop frame through the recognition network to obtain the face recognition model.
8. The apparatus of claim 6, wherein the pixel amplification module comprises:
an enlargement area acquisition module, configured to acquire the pixel proportion data of an image enlargement area;
a magnification calculation module, configured to calculate, from the pixel proportion data of the cropped image data, the magnification factor needed to enlarge the cropped image to the image enlargement area;
and an amplification submodule, configured to perform pixel amplification on the image within the cropped image data according to that magnification factor.
9. An electronic device, comprising a processor, a memory, and a computer program stored on the memory and capable of running on the processor, wherein the computer program, when executed by the processor, implements the method for identifying a speaker in a multi-person video according to any one of claims 1 to 5.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for identifying a speaker in a multi-person video according to any one of claims 1 to 5.
Application CN202011373431.4A, filed 2020-11-30 (priority 2020-11-30): Method and device for identifying speakers in multi-person video. Status: Pending. Publication: CN112487246A.

Priority Applications (1)

Application CN202011373431.4A (publication CN112487246A), priority date 2020-11-30, filing date 2020-11-30: Method and device for identifying speakers in multi-person video

Applications Claiming Priority (1)

Application CN202011373431.4A (publication CN112487246A), priority date 2020-11-30, filing date 2020-11-30: Method and device for identifying speakers in multi-person video

Publications (1)

CN112487246A, published 2021-03-12

Family

ID=74937375

Family Applications (1)

Application CN202011373431.4A (pending, published as CN112487246A), priority and filing date 2020-11-30: Method and device for identifying speakers in multi-person video

Country Status (1)

Country Link
CN (1) CN112487246A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101394679A (en) * 2007-09-17 2009-03-25 深圳富泰宏精密工业有限公司 Sound source positioning system and method
CN103841357A (en) * 2012-11-21 2014-06-04 中兴通讯股份有限公司 Microphone array sound source positioning method, device and system based on video tracking
US20150088515A1 (en) * 2013-09-25 2015-03-26 Lenovo (Singapore) Pte. Ltd. Primary speaker identification from audio and video data
CN108737719A (en) * 2018-04-04 2018-11-02 深圳市冠旭电子股份有限公司 Camera filming control method, device, smart machine and storage medium
CN109257559A (en) * 2018-09-28 2019-01-22 苏州科达科技股份有限公司 A kind of image display method, device and the video conferencing system of panoramic video meeting

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113301372A (en) * 2021-05-20 2021-08-24 广州繁星互娱信息科技有限公司 Live broadcast method, device, terminal and storage medium
CN114594892A (en) * 2022-01-29 2022-06-07 深圳壹秘科技有限公司 Remote interaction method, remote interaction device and computer storage medium
CN114594892B (en) * 2022-01-29 2023-11-24 深圳壹秘科技有限公司 Remote interaction method, remote interaction device, and computer storage medium

Similar Documents

Publication Publication Date Title
US10878824B2 (en) Speech-to-text generation using video-speech matching from a primary speaker
CN111370014B (en) System and method for multi-stream target-voice detection and channel fusion
JP6464449B2 (en) Sound source separation apparatus and sound source separation method
US6441825B1 (en) Video token tracking system for animation
CN112088402A (en) Joint neural network for speaker recognition
WO2021000498A1 (en) Composite speech recognition method, device, equipment, and computer-readable storage medium
CN112487246A (en) Method and device for identifying speakers in multi-person video
CN110611861B (en) Directional sound production control method and device, sound production equipment, medium and electronic equipment
CN112492207B (en) Method and device for controlling camera to rotate based on sound source positioning
CN112601045A (en) Speaking control method, device, equipment and storage medium for video conference
WO2021120190A1 (en) Data processing method and apparatus, electronic device, and storage medium
Yu et al. Audio-visual multi-channel integration and recognition of overlapped speech
CN111091845A (en) Audio processing method and device, terminal equipment and computer storage medium
CN110503957A (en) A kind of audio recognition method and device based on image denoising
CN111868823A (en) Sound source separation method, device and equipment
WO2019227552A1 (en) Behavior recognition-based speech positioning method and device
US20120242860A1 (en) Arrangement and method relating to audio recognition
CN110188179B (en) Voice directional recognition interaction method, device, equipment and medium
CN113014844A (en) Audio processing method and device, storage medium and electronic equipment
CN111383629B (en) Voice processing method and device, electronic equipment and storage medium
CN115516553A (en) System and method for multi-microphone automated clinical documentation
Cabañas-Molero et al. Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis
Ivanko et al. Designing advanced geometric features for automatic Russian visual speech recognition
TWI751866B (en) Audiovisual communication system and control method thereof
CN113035176B (en) Voice data processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination