CN112487246A - Method and device for identifying speakers in multi-person video


Info

Publication number
CN112487246A
CN112487246A (application number CN202011373431.4A)
Authority
CN
China
Prior art keywords
image
data
face
speaker
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011373431.4A
Other languages
Chinese (zh)
Inventor
陈均 (Chen Jun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Kadoxi Technology Co ltd
Original Assignee
Shenzhen Kadoxi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Kadoxi Technology Co., Ltd.
Priority to CN202011373431.4A
Publication of CN112487246A
Legal status: Pending


Classifications

    • G06F16/784: Information retrieval of video data using metadata automatically derived from the content, the detected or recognised objects being people
    • G01S5/22: Position-fixing using sonic waves, the position of the source determined by co-ordinating a plurality of position lines defined by path-difference measurements
    • G06F16/685: Information retrieval of audio data using an automatically derived transcript of the audio data, e.g. lyrics
    • G06F16/7847: Information retrieval of video data using low-level visual features of the video content
    • G06F18/2414: Pattern recognition; classification based on distances to training or reference patterns, smoothing the distance, e.g. radial basis function networks [RBFN]
    • G06V40/168: Recognition of human faces in image or video data; feature extraction and face representation
    • G10L21/0208: Speech enhancement; noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L2021/02082: Noise filtering where the noise is echo or reverberation of the speech
    • G10L2021/02166: Noise filtering using microphone arrays; beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of camera device control, and in particular to a method and a device for identifying the speaker in a multi-person video. The method comprises: acquiring image data captured by a camera, invoking a preset face recognition model to recognize each frame of the image data, and determining the position parameter of each detected face feature within the image data; acquiring multi-channel audio data captured by a microphone array and, using a preset voice recognition model, determining the position parameter of the audio channel with the strongest human-voice energy; determining the position parameter of the speaker in the image according to the position parameter of that audio channel; and, according to the position parameter of the speaker in the image, obtaining cropped image data of the speaker's face and performing pixel amplification on the image within the cropped data. The video picture of a real-time live broadcast can thus be structured automatically, making the broadcast more engaging and enhancing human-computer interaction.

Description

Method and device for identifying speakers in multi-person video
Technical Field
The invention relates to the technical field of camera device control, in particular to a method and a device for identifying speakers in a multi-person video.
Background
As the state of the art rapidly advances, ever more intelligent audio and video analysis technologies are emerging to produce structured audio-video data, and presenting that structured data fused with the raw audio and video can deliver a more user-friendly application experience.
When the audio and video of multiple people are displayed in the same picture, however, a system cannot determine which specific person in the current video stream is speaking, so it cannot surface the structured audio-video data automatically. Such structured data is typically produced by manual post-processing and fusion of recorded material, which makes it ill-suited to real-time live-broadcast applications.
Disclosure of Invention
In view of the above, embodiments of the present invention are proposed to provide a method and apparatus for identifying a speaker in a multi-person video that overcome or at least partially solve the above problems.
In order to solve the above problem, an embodiment of the present invention discloses a method for identifying a speaker in a multi-person video, comprising:
acquiring image data captured by a camera, invoking a preset face recognition model to recognize each frame of the image data, and determining the position parameter of each detected face feature within the image data;
acquiring multi-channel audio data captured by a microphone array, and using a preset voice recognition model to determine the position parameter of the audio channel with the strongest human-voice energy;
determining the position parameter of the speaker in the image according to the position parameter of that audio channel;
and, according to the position parameter of the speaker in the image, obtaining cropped image data of the speaker's face and performing pixel amplification on the image within the cropped data.
Further, invoking the preset face recognition model to recognize each frame of the image data comprises:
extracting the face features in a sample image;
inputting the face features and the sample image data into a recognition network, and determining the position information of a face recognition frame and the face image information within that frame;
cropping the face image inside the face recognition frame to obtain a face crop frame, and feeding the image data within the face crop frame back into the recognition network;
and training on the face recognition frame and the face crop frame through the recognition network to obtain the face recognition model.
Further, acquiring the multi-channel audio data collected by the microphone array and using the preset voice recognition model to determine the position parameter of the audio channel with the strongest human-voice energy comprises:
performing echo cancellation on each acquired audio channel against a reference signal; specifically, the reference signal may be taken from a loudspeaker or from the sound card driver;
performing noise suppression on the signal remaining after echo cancellation, and applying automatic gain to obtain recognizable human-voice data;
processing the human-voice data in each audio channel with a beamforming algorithm to obtain multiple beam signals;
and performing voice recognition on each beam signal separately, determining the beam signal with the strongest human-voice energy, and obtaining the position parameter of the audio data corresponding to that beam signal.
Further, performing voice recognition on each beam signal separately comprises:
performing keyword recognition on each beam signal, and when the keyword information in a beam signal is detected to match a preset keyword training result, determining that beam signal to be the keyword beam signal.
Further, performing pixel amplification on the image within the cropped image data comprises:
acquiring the pixel proportion data of an image enlargement area;
calculating, from the pixel proportion data of the cropped image data, the magnification factor needed to enlarge the cropped image to the image enlargement area;
and performing pixel amplification on the image within the cropped image data according to that magnification factor.
There is also provided an apparatus for identifying a speaker in a multi-person video, comprising:
a face recognition module, configured to acquire image data captured by the camera, invoke a preset face recognition model to recognize each frame of the image data, and determine the position parameter of each detected face feature within the image data it belongs to;
a voice recognition module, configured to acquire the multi-channel audio data collected by the microphone array and, using a preset voice recognition model, determine the position parameter of the audio channel with the strongest human-voice energy;
a position confirmation module, configured to determine the position parameter of the speaker in the image according to the position parameter of the audio data;
and a pixel amplification module, configured to obtain cropped image data of the speaker's face according to the position parameter of the speaker in the image, and perform pixel amplification on the image within the cropped data.
Further, the face recognition module is configured for:
extracting the face features in a sample image;
inputting the face features and the sample image data into a recognition network, and determining the position information of a face recognition frame and the face image information within that frame;
cropping the face image inside the face recognition frame to obtain a face crop frame, and feeding the image data within the face crop frame back into the recognition network;
and training a multi-convolution-layer structure on the face recognition frame and the face crop frame through the recognition network to obtain the face recognition model.
Further, the pixel amplification module comprises:
an enlargement area acquisition module, configured to acquire the pixel proportion data of an image enlargement area;
a magnification calculation module, configured to calculate, from the pixel proportion data of the cropped image data, the magnification factor needed to enlarge the cropped image to the image enlargement area;
and an amplification submodule, configured to perform pixel amplification on the image within the cropped image data according to that magnification factor.
There is also provided an electronic device comprising a processor, a memory and a computer program stored on the memory and capable of running on the processor, the computer program, when executed by the processor, implementing the method of identifying a speaker in a multi-person video.
There is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of identifying a speaker in a multi-person video.
The embodiments of the invention have the following advantages:
The method and apparatus locate every face target in the image with face recognition technology and use the microphone array to locate the position of the specific speaker, thereby pinning down the speaker's exact position in the image; the speaker's face image is then enlarged by a magnification factor computed from the image. The video picture of a real-time live broadcast can thus be structured automatically, making the broadcast more engaging and enhancing human-computer interaction.
Drawings
FIG. 1 is a flow chart illustrating steps of an embodiment of a method for identifying a speaker in a multi-person video according to the present invention;
FIG. 2 is a block diagram of an embodiment of an apparatus for identifying a speaker in a multi-person video according to the present invention;
FIG. 3 is a block diagram of a computer apparatus for speaker identification in a multi-person video according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The method for identifying a speaker in a multi-person video, which locates the speaker through sound source positioning, can be applied to any terminal device having voice and image recognition functions, such as a smartphone, a tablet computer, or a smart-home device.
In the embodiments of the application, a single camera may be used to shoot in one direction only, with the microphone array arranged as a linear array; alternatively, multiple cameras may be arranged in a ring array, with the microphones likewise arranged in a ring array.
One application scenario of the embodiments is identifying the actual speaker when multiple people appear in the same video picture simultaneously. As shown in FIG. 1, a method for identifying a speaker in a multi-person video comprises the following steps (a code sketch of the steps follows the list):
S100, acquiring image data captured by a camera, invoking a preset face recognition model to recognize each frame of the image data, and determining the position parameter of each detected face feature within the image data;
S200, acquiring multi-channel audio data captured by a microphone array, and using a preset voice recognition model to determine the position parameter of the audio channel with the strongest human-voice energy;
S300, determining the position parameter of the speaker in the image according to the position parameter of that audio channel;
S400, according to the position parameter of the speaker in the image, obtaining cropped image data of the speaker's face and performing pixel amplification on the image within the cropped data.
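For orientation only, the four steps can be strung together as below. This is a minimal, self-contained Python sketch, not part of the disclosure: the stub face detector, the two-microphone delay estimate, and all array and camera parameters are illustrative assumptions standing in for the trained models and hardware described in the following paragraphs.

    # Minimal end-to-end sketch of steps S100-S400. Every helper here is a
    # placeholder; spacing, sample rate and camera span are assumed values.
    import numpy as np

    def detect_faces(frame):
        # S100 stub: a real system runs the CNN face recognition model.
        h, w = frame.shape[:2]
        return [(int(w * c) - 40, 50, int(w * c) + 40, 150) for c in (0.125, 0.375, 0.625, 0.875)]

    def strongest_voice_angle(ch0, ch1, spacing=0.05, fs=16000, c=343.0):
        # S200 stub: bearing from the inter-microphone delay (cross-correlation).
        corr = np.correlate(ch0, ch1, mode="full")
        delay = (np.argmax(corr) - (len(ch1) - 1)) / fs
        return np.degrees(np.arcsin(np.clip(delay * c / spacing, -1.0, 1.0)))

    def match_face(angle, faces, width, span=180.0):
        # S300: project the sound bearing onto an image column, pick nearest face.
        x = (angle / span + 0.5) * width
        return min(faces, key=lambda b: abs((b[0] + b[2]) / 2 - x))

    def amplify(frame, box, factor=2):
        # S400: crop the speaker's face and enlarge it by pixel repetition.
        x0, y0, x1, y1 = box
        crop = frame[y0:y1, x0:x1]
        return np.repeat(np.repeat(crop, factor, axis=0), factor, axis=1)

    frame = np.zeros((480, 640, 3), dtype=np.uint8)
    t = np.arange(16000) / 16000.0
    ch0, ch1 = np.sin(2 * np.pi * 220 * t), np.sin(2 * np.pi * 220 * (t - 0.0001))
    box = match_face(strongest_voice_angle(ch0, ch1), detect_faces(frame), 640)
    print(amplify(frame, box).shape)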
In step S100, the preset face recognition model is obtained by iteratively training sample images bearing face features on a convolutional neural network. Specifically, the training comprises:
extracting the face features in a sample image: chiefly the position information of the real faces in the sample image; the coordinate data and pixel proportion data of each face can be extracted with an existing image feature selection tool;
inputting the face features and the sample image data into a recognition network, and determining the position information of a face recognition frame and the face image information within that frame;
cropping the face image inside the face recognition frame to obtain a face crop frame, and feeding the image data within the face crop frame back into the recognition network;
training a multi-convolution-layer structure on the face recognition frame and the face crop frame through the recognition network to obtain the face recognition model.
The recognition network is a convolutional neural network whose structure is not limited to convolution layers; it also includes pooling layers, fully connected layers, and the like. Whichever structural combination is used for training, the aim is the same: in the embodiments of the application, inputting image data bearing face features into the face recognition model yields the position information of each face in the image and the image data of its face crop frame.
In step S200, image data acquisition and audio data acquisition proceed synchronously. The image data can be recognized and located quickly by the face recognition model, whereas the audio data requires preprocessing before recognition. Specifically, the preprocessing comprises:
performing echo cancellation on each acquired audio channel against a reference signal; specifically, the reference signal may be taken from a loudspeaker or from the sound card driver;
and performing noise suppression on the signal remaining after echo cancellation, then applying automatic gain to obtain recognizable human-voice data, the human voice here being taken to lie within the 20 Hz to 20 kHz frequency range.
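The preprocessing chain can be sketched as follows; the normalized-LMS echo canceller and the crude RMS-based automatic gain below are stand-ins under assumed parameters (filter length, step size, target level), not the embodiment's exact algorithms:

    # Sketch of the audio preprocessing: NLMS echo cancellation against a
    # reference signal, followed by a simple automatic gain stage.
    import numpy as np

    def nlms_echo_cancel(mic, reference, taps=128, mu=0.5, eps=1e-8):
        w = np.zeros(taps)
        out = np.zeros_like(mic)
        for n in range(taps, len(mic)):
            x = reference[n - taps:n][::-1]   # most recent reference samples
            e = mic[n] - w @ x                # error = mic minus estimated echo
            w += mu * e * x / (x @ x + eps)   # normalized LMS weight update
            out[n] = e
        return out

    def auto_gain(signal, target_rms=0.1, eps=1e-8):
        return signal * (target_rms / (np.sqrt(np.mean(signal ** 2)) + eps))

    fs = 16000
    reference = np.random.randn(fs)                          # loudspeaker reference
    echo = np.convolve(reference, np.linspace(0.5, 0.0, 64))[:fs]
    mic = 0.05 * np.random.randn(fs) + echo                  # near-end noise + echo
    clean = auto_gain(nlms_echo_cancel(mic, reference))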
After the collected audio data has been preprocessed into human-voice data, recognition of the audio data proceeds as follows:
processing the human-voice data in each audio channel with a beamforming algorithm to obtain multiple beam signals, where beamforming applies time-delay or phase compensation and amplitude weighting to the audio signal output by each microphone of the array to form a beam pointing in a specific direction;
and performing voice recognition on each beam signal separately, determining the beam signal with the strongest human-voice energy, and obtaining the position parameter of the audio data corresponding to that beam signal.
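A delay-and-sum beamformer illustrates this step; the array geometry, scan angles, and sample rate below are assumed values for illustration:

    # Delay-and-sum beamforming over a linear array: one beam per steering
    # angle; the beam with the highest energy marks the speaker direction.
    import numpy as np

    def delay_and_sum(channels, angle_deg, spacing=0.05, fs=16000, c=343.0):
        n_mics, n = channels.shape
        delays = spacing * np.arange(n_mics) * np.sin(np.radians(angle_deg)) / c
        shifts = np.round(delays * fs).astype(int)
        beam = np.zeros(n)
        for ch, s in zip(channels, shifts):
            beam += np.roll(ch, -s)           # integer-sample delay compensation
        return beam / n_mics

    def strongest_beam(channels, angles=range(-60, 61, 15)):
        beams = {a: delay_and_sum(channels, a) for a in angles}
        best = max(beams, key=lambda a: np.sum(beams[a] ** 2))  # max energy
        return best, beams[best]

    channels = np.random.randn(4, 16000)      # 4-microphone array, 1 s of audio
    angle, beam = strongest_beam(channels)
    print(f"strongest beam at {angle} degrees")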
In one embodiment, the beam signals also carry keyword information, and performing voice recognition on each beam signal further comprises:
performing keyword recognition on each beam signal, and when the keyword information in a beam signal is detected to match a preset keyword training result, determining that beam signal to be the keyword beam signal.
In the above embodiment, when the keyword result trained into the voice recognition model detects matching keyword information in one of the preprocessed human-voice channels, the position parameter of that audio channel is treated as the position parameter of the audio data with the strongest human-voice energy, and the position parameter of the audio data carrying the keyword information serves as the reference for subsequent localization.
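The keyword check itself reduces to matching each beam's recognized text against the preset keyword list. In this sketch the transcripts are assumed to come from any speech recognizer, and the keyword set is a hypothetical placeholder:

    # Sketch of the keyword beam check: the first beam whose transcript
    # contains a preset keyword is taken as the keyword beam signal.
    PRESET_KEYWORDS = {"hello", "question", "next slide"}   # illustrative only

    def keyword_beam(transcripts):
        """transcripts: mapping of beam index -> recognized text."""
        for beam_id, text in transcripts.items():
            if any(k in text.lower() for k in PRESET_KEYWORDS):
                return beam_id
        return None

    print(keyword_beam({0: "so the next slide please", 1: "background chatter"}))  # -> 0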
Once the position parameter of the audio channel with the strongest human-voice energy has been determined, the beamforming algorithm yields the angle and direction of that audio, which identifies the microphone in the array nearest to the source; from that microphone's position parameter, the correspondence between the real speaker and the microphone is obtained.
Specifically, suppose 4 microphones form a linear array with a 45-degree angle between adjacent microphones, and each microphone faces exactly one person, so that 4 face recognition frames are recognized in the image. If every person appears to be in a speaking state, the system cannot identify the actual speaker by face recognition alone; but once the position parameter of the audio with the strongest human-voice energy is obtained, the system can locate the specific microphone receiving that strongest voice and, combining its position parameter with the face recognition frames, determine the actual speaker's position parameter, as sketched below.
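Under those assumptions (a hypothetical 180-degree horizontal span for the array-plus-camera geometry, microphone bearings of ±22.5 and ±67.5 degrees, and face frames given as (left, top, right, bottom) pixels), the matching can be sketched as:

    # Sketch of step S300 for the 4-microphone example: project the selected
    # microphone's bearing onto an image column and pick the nearest face frame.
    def speaker_face(mic_index, face_boxes, image_width,
                     mic_angles=(-67.5, -22.5, 22.5, 67.5), span=180.0):
        x = (mic_angles[mic_index] / span + 0.5) * image_width
        # The face recognition frame whose centre is nearest that column wins.
        return min(face_boxes, key=lambda b: abs((b[0] + b[2]) / 2 - x))

    boxes = [(40, 50, 120, 150), (200, 50, 280, 150),
             (360, 50, 440, 150), (520, 50, 600, 150)]
    print(speaker_face(2, boxes, 640))   # -> (360, 50, 440, 150), the third face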
In step S400, the acquired image data includes an image enlargement area; that is, the identified crop is enlarged into the designated enlargement area. Specifically, the step comprises (see the sketch after this list):
acquiring the pixel proportion data of the image enlargement area;
calculating, from the pixel proportion data of the cropped image data, the magnification factor needed to enlarge the cropped image to the image enlargement area;
and performing pixel amplification on the image within the cropped image data according to that magnification factor.
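A sketch of this calculation, assuming the crop and the enlargement area are both given in pixels and using nearest-neighbour pixel repetition for the enlargement:

    # Sketch of step S400: derive the magnification factor from the pixel
    # proportions, then enlarge the crop by pixel repetition.
    import numpy as np

    def magnification_factor(crop_shape, target_shape):
        # Integer scale so the enlarged crop still fits the enlargement area.
        return min(target_shape[0] // crop_shape[0], target_shape[1] // crop_shape[1])

    def pixel_amplify(crop, factor):
        return np.repeat(np.repeat(crop, factor, axis=0), factor, axis=1)

    crop = np.zeros((80, 100, 3), dtype=np.uint8)          # speaker face crop
    factor = magnification_factor(crop.shape, (360, 480))  # enlargement area
    print(pixel_amplify(crop, factor).shape)               # -> (320, 400, 3)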
The actual speaker is thus enlarged and displayed within the multi-person video picture, which improves interactivity in multi-person video and broadens its range of applications.
As shown in FIG. 2, an embodiment of the application further provides an apparatus for identifying a speaker in a multi-person video, comprising:
a face recognition module 100, configured to acquire image data captured by a camera, invoke a preset face recognition model to recognize each frame of the image data, and determine the position parameter of each detected face feature within the image data it belongs to;
a voice recognition module 200, configured to acquire the multi-channel audio data collected by the microphone array and, using a preset voice recognition model, determine the position parameter of the audio channel with the strongest human-voice energy;
a position confirmation module 300, configured to determine the position parameter of the speaker in the image according to the position parameter of the audio data;
and a pixel amplification module 400, configured to obtain cropped image data of the speaker's face according to the position parameter of the speaker in the image, and perform pixel amplification on the image within the cropped data.
In one embodiment, the face recognition module 100 is configured for:
extracting the face features in a sample image;
inputting the face features and the sample image data into a recognition network, and determining the position information of a face recognition frame and the face image information within that frame;
cropping the face image inside the face recognition frame to obtain a face crop frame, and feeding the image data within the face crop frame back into the recognition network;
and training a multi-convolution-layer structure on the face recognition frame and the face crop frame through the recognition network to obtain the face recognition model.
In one embodiment, the pixel amplification module 400 comprises:
an enlargement area acquisition module, configured to acquire the pixel proportion data of an image enlargement area;
a magnification calculation module, configured to calculate, from the pixel proportion data of the cropped image data, the magnification factor needed to enlarge the cropped image to the image enlargement area;
and an amplification submodule, configured to perform pixel amplification on the image within the cropped image data according to that magnification factor.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts, but those skilled in the art will recognize that the invention is not limited by the described order of acts, as some steps may, according to the embodiments of the invention, be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the acts involved are not necessarily required by the invention.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
Referring to FIG. 3, a computer device for identifying a speaker in a multi-person video according to the invention is shown, which may specifically include the following.
In an embodiment of the invention, a computer device is further provided. The computer device 12 takes the form of a general-purpose computing device, and its components may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples the various system components, including the system memory 28, to the processing unit 16.
The bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 31 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (commonly referred to as a "hard drive"). Although not shown in FIG. 3, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 18 by one or more data media interfaces. The memory may include at least one program product having a set (e.g., at least one) of program modules 42 configured to carry out the functions of the embodiments of the invention.
A program/utility 41 having a set (at least one) of program modules 42 may be stored, for example, in the memory. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may comprise an implementation of a network environment. The program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
The computer device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, a camera, etc.), with one or more devices that enable a user to interact with the computer device 12, and/or with any devices (e.g., a network card, a modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Also, the computer device 12 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter 20. As shown, the network adapter 20 communicates with the other modules of the computer device 12 via the bus 18. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, implementing a method for speaker recognition in a multi-person video provided by an embodiment of the present invention.
That is, when executing the program, the processing unit 16 implements: acquiring image data captured by a camera, invoking a preset face recognition model to recognize each frame of the image data, and determining the position parameter of each detected face feature within the image data; acquiring multi-channel audio data captured by a microphone array, and using a preset voice recognition model to determine the position parameter of the audio channel with the strongest human-voice energy; determining the position parameter of the speaker in the image according to the position parameter of that audio channel; and, according to the position parameter of the speaker in the image, obtaining cropped image data of the speaker's face and performing pixel amplification on the image within the cropped data.
In an embodiment of the present invention, the present invention further provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements a method for identifying a speaker in a multi-person video as provided in all embodiments of the present application.
That is, the program, when executed by the processor, implements: acquiring image data captured by a camera, invoking a preset face recognition model to recognize each frame of the image data, and determining the position parameter of each detected face feature within the image data; acquiring multi-channel audio data captured by a microphone array, and using a preset voice recognition model to determine the position parameter of the audio channel with the strongest human-voice energy; determining the position parameter of the speaker in the image according to the position parameter of that audio channel; and, according to the position parameter of the speaker in the image, obtaining cropped image data of the speaker's face and performing pixel amplification on the image within the cropped data.
Any combination of one or more computer-readable media may be employed. The computer-readable medium may be a computer-readable storage medium or a computer-readable signal medium. A computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises", "comprising", and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or terminal that comprises the element.
The method for identifying speakers in a multi-person video provided by the invention has been described in detail above. Specific examples have been used herein to explain the principle and implementation of the invention, and the description of the above embodiments is only intended to help in understanding the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the invention, vary both the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (10)

1. A method for identifying a speaker in a multi-person video, comprising:
acquiring image data captured by a camera, invoking a preset face recognition model to recognize each frame of the image data, and determining the position parameter of each detected face feature within the image data;
acquiring multi-channel audio data captured by a microphone array, and using a preset voice recognition model to determine the position parameter of the audio channel with the strongest human-voice energy;
determining the position parameter of the speaker in the image according to the position parameter of that audio channel;
and, according to the position parameter of the speaker in the image, obtaining cropped image data of the speaker's face and performing pixel amplification on the image within the cropped data.
2. The method according to claim 1, wherein invoking the preset face recognition model to recognize each frame of the image data comprises:
extracting the face features in a sample image;
inputting the face features and the sample image data into a recognition network, and determining the position information of a face recognition frame and the face image information within that frame;
cropping the face image inside the face recognition frame to obtain a face crop frame, and feeding the image data within the face crop frame back into the recognition network;
and training on the face recognition frame and the face crop frame through the recognition network to obtain the face recognition model.
3. The method of claim 1, wherein acquiring the multi-channel audio data collected by the microphone array and using the preset voice recognition model to determine the position parameter of the audio channel with the strongest human-voice energy comprises:
performing echo cancellation on each acquired audio channel against a reference signal, wherein the reference signal may be taken from a loudspeaker or from the sound card driver;
performing noise suppression on the signal remaining after echo cancellation, and applying automatic gain to obtain recognizable human-voice data;
processing the human-voice data in each audio channel with a beamforming algorithm to obtain multiple beam signals;
and performing voice recognition on each beam signal separately, determining the beam signal with the strongest human-voice energy, and obtaining the position parameter of the audio data corresponding to that beam signal.
4. The method of claim 3, wherein performing voice recognition on each beam signal separately comprises:
performing keyword recognition on each beam signal, and when the keyword information in a beam signal is detected to match a preset keyword training result, determining that beam signal to be the keyword beam signal.
5. The method of claim 1, wherein performing pixel amplification on the image within the cropped image data comprises:
acquiring the pixel proportion data of an image enlargement area;
calculating, from the pixel proportion data of the cropped image data, the magnification factor needed to enlarge the cropped image to the image enlargement area;
and performing pixel amplification on the image within the cropped image data according to that magnification factor.
6. An apparatus for identifying a speaker in a multi-person video, comprising:
a face recognition module, configured to acquire image data captured by the camera, invoke a preset face recognition model to recognize each frame of the image data, and determine the position parameter of each detected face feature within the image data it belongs to;
a voice recognition module, configured to acquire the multi-channel audio data collected by the microphone array and, using a preset voice recognition model, determine the position parameter of the audio channel with the strongest human-voice energy;
a position confirmation module, configured to determine the position parameter of the speaker in the image according to the position parameter of the audio data;
and a pixel amplification module, configured to obtain cropped image data of the speaker's face according to the position parameter of the speaker in the image, and perform pixel amplification on the image within the cropped data.
7. The apparatus of claim 6, wherein the face recognition module is configured for:
extracting the face features in a sample image;
inputting the face features and the sample image data into a recognition network, and determining the position information of a face recognition frame and the face image information within that frame;
cropping the face image inside the face recognition frame to obtain a face crop frame, and feeding the image data within the face crop frame back into the recognition network;
and training a multi-convolution-layer structure on the face recognition frame and the face crop frame through the recognition network to obtain the face recognition model.
8. The apparatus of claim 6, wherein the pixel amplification module comprises:
an enlargement area acquisition module, configured to acquire the pixel proportion data of an image enlargement area;
a magnification calculation module, configured to calculate, from the pixel proportion data of the cropped image data, the magnification factor needed to enlarge the cropped image to the image enlargement area;
and an amplification submodule, configured to perform pixel amplification on the image within the cropped image data according to that magnification factor.
9. An electronic device, comprising a processor, a memory, and a computer program stored on the memory and capable of running on the processor, wherein the computer program, when executed by the processor, implements the method for identifying a speaker in a multi-person video according to any one of claims 1 to 5.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for identifying a speaker in a multi-person video according to any one of claims 1 to 5.
Application CN202011373431.4A, filed 2020-11-30 (priority 2020-11-30): Method and device for identifying speakers in multi-person video. Status: Pending. Publication: CN112487246A.

Priority Applications (1)

Application CN202011373431.4A (publication CN112487246A), priority date 2020-11-30, filing date 2020-11-30: Method and device for identifying speakers in multi-person video

Applications Claiming Priority (1)

Application CN202011373431.4A (publication CN112487246A), priority date 2020-11-30, filing date 2020-11-30: Method and device for identifying speakers in multi-person video

Publications (1)

CN112487246A, published 2021-03-12

Family

ID=74937375

Family Applications (1)

Application CN202011373431.4A (pending, published as CN112487246A), priority and filing date 2020-11-30: Method and device for identifying speakers in multi-person video

Country Status (1)

Country Link
CN (1) CN112487246A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101394679A (en) * 2007-09-17 2009-03-25 深圳富泰宏精密工业有限公司 Sound source positioning system and method
CN103841357A (en) * 2012-11-21 2014-06-04 中兴通讯股份有限公司 Microphone array sound source positioning method, device and system based on video tracking
US20150088515A1 (en) * 2013-09-25 2015-03-26 Lenovo (Singapore) Pte. Ltd. Primary speaker identification from audio and video data
CN108737719A (en) * 2018-04-04 2018-11-02 深圳市冠旭电子股份有限公司 Camera filming control method, device, smart machine and storage medium
CN109257559A (en) * 2018-09-28 2019-01-22 苏州科达科技股份有限公司 A kind of image display method, device and the video conferencing system of panoramic video meeting

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113301372A (en) * 2021-05-20 2021-08-24 广州繁星互娱信息科技有限公司 Live broadcast method, device, terminal and storage medium
CN114594892A (en) * 2022-01-29 2022-06-07 深圳壹秘科技有限公司 Remote interaction method, remote interaction device and computer storage medium
CN114594892B (en) * 2022-01-29 2023-11-24 深圳壹秘科技有限公司 Remote interaction method, remote interaction device, and computer storage medium

Similar Documents

Publication Publication Date Title
US10878824B2 (en) Speech-to-text generation using video-speech matching from a primary speaker
CN111370014B (en) System and method for multi-stream target-voice detection and channel fusion
JP6464449B2 (en) Sound source separation apparatus and sound source separation method
US6441825B1 (en) Video token tracking system for animation
CN112088402A (en) Joint neural network for speaker recognition
WO2021000498A1 (en) Composite speech recognition method, device, equipment, and computer-readable storage medium
CN112487246A (en) Method and device for identifying speakers in multi-person video
CN110611861B (en) Directional sound production control method and device, sound production equipment, medium and electronic equipment
CN112492207B (en) Method and device for controlling camera to rotate based on sound source positioning
CN112601045A (en) Speaking control method, device, equipment and storage medium for video conference
WO2021120190A1 (en) Data processing method and apparatus, electronic device, and storage medium
Yu et al. Audio-visual multi-channel integration and recognition of overlapped speech
CN111091845A (en) Audio processing method and device, terminal equipment and computer storage medium
CN110503957A (en) A kind of audio recognition method and device based on image denoising
CN111868823A (en) Sound source separation method, device and equipment
WO2019227552A1 (en) Behavior recognition-based speech positioning method and device
US20120242860A1 (en) Arrangement and method relating to audio recognition
CN110188179B (en) Voice directional recognition interaction method, device, equipment and medium
CN113014844A (en) Audio processing method and device, storage medium and electronic equipment
CN111383629B (en) Voice processing method and device, electronic equipment and storage medium
CN115516553A (en) System and method for multi-microphone automated clinical documentation
Cabañas-Molero et al. Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis
Ivanko et al. Designing advanced geometric features for automatic Russian visual speech recognition
TWI751866B (en) Audiovisual communication system and control method thereof
CN113035176B (en) Voice data processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination