CN112466306A - Conference summary generation method and device, computer equipment and storage medium - Google Patents

Conference summary generation method and device, computer equipment and storage medium

Info

Publication number
CN112466306A
Authority
CN
China
Prior art keywords
voice
segment data
data stream
image
extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910766155.9A
Other languages
Chinese (zh)
Other versions
CN112466306B (en)
Inventor
许家铭
石晶
徐波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201910766155.9A priority Critical patent/CN112466306B/en
Publication of CN112466306A publication Critical patent/CN112466306A/en
Application granted granted Critical
Publication of CN112466306B publication Critical patent/CN112466306B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An embodiment of the invention relates to a conference summary generation method and device, computer equipment, and a storage medium. The method comprises the following steps: calling voice acquisition equipment to acquire the whole-course voice of a conference, and calling image acquisition equipment to acquire the whole-course images of the conference; extracting single-channel voice from the whole-course voice, and extracting a plurality of voice segment data streams from the single-channel voice; for each voice segment data stream, intercepting the corresponding image segment data stream from the whole-course images; inputting each voice segment data stream and the corresponding image segment data stream into a voice speaker detection model, and extracting the corresponding voice speaker identity information and position information; inputting each voice segment data stream together with the corresponding voice speaker identity information and position information into a voice recognition model, and extracting the corresponding voice transcription characters; and recording each voice transcription character and the corresponding voice speaker identity information in sequence to generate a conference summary.

Description

Conference summary generation method and device, computer equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of automatic processing of computer information, in particular to a conference summary generation method and device, computer equipment and a storage medium.
Background
In many working and living environments, a conference is an important setting in which people exchange information, hold discussions, and make plans. A multi-person conference typically involves multiple rounds of speech and conversation, and communication is accomplished through a series of contextually related utterances. Voice is the most natural and effective means of information interaction and is widely used in all kinds of meeting scenes.
In daily life, people communicate through language, and this communication actually relies on the joint stimulation of several sensory signals (such as hearing and vision). For example, in an everyday conversation, in addition to the speech signal on the auditory pathway, vision can also help confirm the identity of the speaker and improve speech recognition (e.g., with the assistance of lip movements).
Specifically, a conference scene involves more than one voice speaker. Existing approaches simply transcribe the voice present in the conference scene and neglect to confirm the identity of the voice speaker, so each section of voice lacks the identity information of its voice speaker. That identity information must subsequently be confirmed manually, which is inefficient.
Disclosure of Invention
In view of this, to solve the above technical problems, or at least some of them, embodiments of the present invention provide a method and an apparatus for generating a conference summary, a computer device, and a storage medium.
In a first aspect, an embodiment of the present invention provides a method for generating a conference summary, where the method includes:
calling voice acquisition equipment to acquire whole-course voices corresponding to a plurality of voice speakers in a conference process, and calling image acquisition equipment to acquire whole-course images corresponding to the plurality of voice speakers in the conference process;
extracting single-channel voice from the whole-course voice, and respectively extracting a plurality of voice segment data streams from the single-channel voice, wherein each voice segment data stream belongs to one voice speaker;
for each voice segment data stream, intercepting an image segment data stream corresponding to the voice segment data stream from the whole-course image, wherein each voice segment data stream and the corresponding image segment data stream belong to the same voice speaker;
inputting each voice segment data stream and the corresponding image segment data stream into a voice speaker detection model, and extracting a plurality of corresponding voice speaker identity information and position information;
inputting each voice segment data stream, corresponding voice speaker identity information and position information into a voice recognition model, and extracting a plurality of corresponding voice transcription characters;
and recording each voice transcription character and the corresponding voice speaker identity information in sequence to generate a conference summary.
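The steps of the first aspect can be sketched as a small pipeline. This is only an illustrative skeleton, not the patented implementation: the names `Segment`, `SummaryEntry`, `generate_summary`, `detect_speaker`, and `transcribe` are hypothetical, and the two models are passed in as plain callables so that any detector or recognizer could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Position = Tuple[int, int]  # hypothetical (row, column) location in the image


@dataclass
class Segment:
    """One voice segment data stream with its time-aligned image segment."""
    start: float                 # segment start time in seconds
    end: float                   # segment end time in seconds
    audio: Sequence[float]       # single-channel samples of the segment
    frames: Sequence[object]     # image frames from the same time period


@dataclass
class SummaryEntry:
    start: float
    speaker: str
    text: str


def generate_summary(
    segments: Sequence[Segment],
    detect_speaker: Callable[[Sequence[float], Sequence[object]], Tuple[str, Position]],
    transcribe: Callable[[Sequence[float], str, Position], str],
) -> List[SummaryEntry]:
    """Run speaker detection and recognition per segment, in time order."""
    entries: List[SummaryEntry] = []
    for seg in sorted(segments, key=lambda s: s.start):
        # Speaker detection model: audio + images -> identity and position.
        speaker, position = detect_speaker(seg.audio, seg.frames)
        # Recognition model: audio + identity + position -> transcription.
        text = transcribe(seg.audio, speaker, position)
        entries.append(SummaryEntry(seg.start, speaker, text))
    return entries
```

Sorting by start time before recording is one simple way to satisfy the "in sequence" requirement; the source does not prescribe how the ordering is obtained.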
In one possible embodiment, the extracting single-channel voice from the whole-course voice includes:
carrying out A/D conversion on the whole-course voice, and extracting according to a preset extraction rate to obtain single-channel voice.
In one possible embodiment, the extracting the plurality of voice segment data streams from the single-channel voice respectively includes:
performing sentence segmentation on the single-channel voice to extract a plurality of voice segment data streams.
In one possible embodiment, the intercepting, for each voice segment data stream, an image segment data stream corresponding to the voice segment data stream from the whole-course image includes:
for each voice segment data stream, intercepting from the whole-course image the corresponding image segment data stream that is located in the same time period as the voice segment data stream.
In one possible embodiment, the inputting each voice segment data stream, and the corresponding voice speaker identity information and position information into the voice recognition model, and extracting a plurality of corresponding voice transcription characters includes:
inputting each voice segment data stream and the corresponding voice speaker identity information and position information into a voice enhancement denoising model to obtain a plurality of denoised voices;
and inputting each voice segment data stream and the corresponding denoised voice into the voice recognition model, and extracting a plurality of corresponding voice transcription characters.
In a second aspect, an embodiment of the present invention provides a conference summary generation apparatus, where the apparatus includes:
the acquisition module is used for calling the voice acquisition equipment to acquire the whole-course voice corresponding to the plurality of voice speakers in the conference process and calling the image acquisition equipment to acquire the whole-course image corresponding to the plurality of voice speakers in the conference process;
the voice extraction module is used for extracting single-channel voice from the whole-course voice;
a data stream extraction module, configured to extract a plurality of voice segment data streams from the single-channel voice, where each voice segment data stream belongs to one voice speaker;
the data stream intercepting module is used for intercepting, for each voice segment data stream, an image segment data stream corresponding to the voice segment data stream from the whole-course image, wherein each voice segment data stream and the corresponding image segment data stream belong to the same voice speaker;
the information extraction module is used for inputting each voice segment data stream and the corresponding image segment data stream into the voice speaker detection model and extracting a plurality of corresponding voice speaker identity information and position information;
the character extraction module is used for inputting each voice segment data stream, corresponding voice speaker identity information and position information into the voice recognition model and extracting a plurality of corresponding voice transcription characters;
and the summary generation module is used for recording each voice transcription character and the corresponding voice speaker identity information in sequence to generate a conference summary.
In a possible implementation, the voice extraction module is specifically configured to:
carry out A/D conversion on the whole-course voice, and extract according to a preset extraction rate to obtain single-channel voice.
In a possible implementation manner, the data stream extraction module is specifically configured to:
perform sentence segmentation on the single-channel voice to extract a plurality of voice segment data streams.
In a third aspect, an embodiment of the present invention provides a storage medium, where the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the foregoing conference summary generation method.
In a fourth aspect, an embodiment of the present invention provides a computer device, including: a processor and a memory, the processor being configured to execute the conference summary generation program stored in the memory to implement the aforementioned conference summary generation method.
The technical scheme provided by the embodiment of the invention records each voice transcription character and the corresponding voice speaker identity information in sequence to generate the conference summary. The identity of the voice speaker is confirmed while the voice present in the conference scene is transcribed, so each section of voice carries the identity information of its voice speaker. That identity information no longer needs to be confirmed manually afterwards, which significantly improves efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the embodiments of the present specification, and other drawings can be obtained by those skilled in the art according to the drawings.
Fig. 1 is a schematic diagram of an implementation flow of a conference summary generation method according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a conference summary generation apparatus according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For the convenience of understanding of the embodiments of the present invention, the following description will be further explained with reference to specific embodiments, which are not to be construed as limiting the embodiments of the present invention.
Fig. 1 is a schematic flow diagram of an implementation of the conference summary generation method provided in an embodiment of the present invention. The method specifically includes the following steps:
S101, calling voice acquisition equipment to acquire whole-course voices corresponding to a plurality of voice speakers in a conference process, and calling image acquisition equipment to acquire whole-course images corresponding to the plurality of voice speakers in the conference process;
In the embodiment of the present invention, the voice acquisition equipment may be a microphone, and the image acquisition equipment may be a camera.
For example, a microphone is called to collect the whole-course voice corresponding to a plurality of voice speakers in the conference process, and a camera is called to collect the whole-course images corresponding to the plurality of voice speakers in the conference process, wherein the whole-course images can be stored in an RGB format.
S102, extracting single-channel voice from the whole-course voice, and respectively extracting a plurality of voice segment data streams from the single-channel voice, wherein each voice segment data stream belongs to one voice speaker;
For the collected whole-course voice, extracting single-channel voice from it may specifically be:
carrying out A/D conversion on the whole-course voice, and extracting according to a preset extraction rate to obtain single-channel voice.
For example, the whole-course voice is subjected to A/D conversion with the extraction (sampling) rate set to 16000, so that single-channel voice with an extraction rate of 16000 is obtained.
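A minimal sketch of this extraction step, under several assumptions that the source does not state: the digitized audio arrives as multi-channel sample frames, the channels are averaged to obtain a single channel, and the extraction rate of 16000 is read as 16,000 samples per second. The helper names `to_mono` and `decimate` are hypothetical, and block averaging stands in for a proper anti-aliasing filter.

```python
from typing import List, Sequence


def to_mono(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average the channels of each sample frame into one channel."""
    return [sum(frame) / len(frame) for frame in frames]


def decimate(samples: Sequence[float], source_rate: int,
             target_rate: int = 16000) -> List[float]:
    """Reduce the sample rate by an integer factor.

    Each output sample is the mean of `step` input samples, which acts
    as a crude low-pass filter. A production system would normally use
    a dedicated resampling filter instead.
    """
    if source_rate % target_rate != 0:
        raise ValueError("source_rate must be a multiple of target_rate")
    step = source_rate // target_rate
    return [
        sum(samples[i:i + step]) / step
        for i in range(0, len(samples) - step + 1, step)
    ]
```

For instance, stereo audio captured at 48000 samples per second would be reduced by a factor of 3 after `to_mono`, giving one channel at 16000 samples per second.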
From the obtained single-channel voice, a plurality of voice segment data streams are then extracted, wherein each voice segment data stream belongs to one voice speaker.
For example, if voice segment data stream 1, voice segment data stream 2, and voice segment data stream 3 are extracted from the single-channel voice, then voice segment data stream 1 belongs to user A, voice segment data stream 2 belongs to user B, and voice segment data stream 3 belongs to user C.
As an alternative embodiment, sentence segmentation may be performed on the single-channel voice to extract a plurality of voice segment data streams.
As another alternative, voice activity detection and sentence segmentation may both be performed on the single-channel voice to extract a plurality of voice segment data streams. Using Voice Activity Detection (VAD) technology from voice processing, a neural network detection model can be trained to judge whether each frame of voice is the voice of a single voice speaker, the mixed voice of a plurality of voice speakers, or no voice. The frames are judged with this model and only the frames containing a single voice speaker are kept, so that a plurality of voice segment data streams can be obtained.
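The frame-filtering idea can be illustrated as follows. The sketch assumes that some detector (not shown) has already labelled each frame; the label constants and the function `single_speaker_segments` are hypothetical names. It only groups consecutive single-speaker frames into segments and does not attempt sentence segmentation or speaker-change detection.

```python
from typing import List, Sequence, Tuple

# Hypothetical per-frame labels produced by a detection model.
SILENCE, SINGLE, MIXED = 0, 1, 2


def single_speaker_segments(labels: Sequence[int],
                            min_frames: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, of SINGLE runs.

    Frames labelled SILENCE or MIXED close the current run and are
    discarded. Runs shorter than `min_frames` are dropped.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, label in enumerate(labels):
        if label == SINGLE:
            if start is None:
                start = i          # a new run of single-speaker frames begins
        elif start is not None:
            if i - start >= min_frames:
                segments.append((start, i))
            start = None
    # Close a run that extends to the end of the recording.
    if start is not None and len(labels) - start >= min_frames:
        segments.append((start, len(labels)))
    return segments
```

Each returned range can then be cut out of the single-channel voice as one voice segment data stream.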
S103, for each voice segment data stream, intercepting an image segment data stream corresponding to the voice segment data stream from the whole-course image, wherein each voice segment data stream and the corresponding image segment data stream belong to the same voice speaker;
For each obtained voice segment data stream, an image segment data stream corresponding to it is intercepted from the whole-course image, wherein each voice segment data stream and the corresponding image segment data stream belong to the same voice speaker.
For example, the voice segment data stream 1 and the image segment data stream a correspond one-to-one, both belonging to the same voice speaker a.
As an optional implementation, for each voice segment data stream, the image segment data stream located in the same time period as the voice segment data stream is intercepted from the whole-course image. Based on the time sequence, multiple groups of mutually corresponding voice segment data streams and image segment data streams can thus be obtained, with both streams in each group belonging to the same voice speaker.
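Time-based interception can be sketched as a mapping from a segment's start and end times to frame indices. The function name `frames_for_segment` and the assumption of a constant frame rate `fps` are illustrative only; the source does not specify how the video is indexed.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def frames_for_segment(frames: Sequence[T], fps: float,
                       start_s: float, end_s: float) -> List[T]:
    """Return the image frames covering the period [start_s, end_s).

    The first index is rounded down and the last index is rounded up,
    so a frame that only partly overlaps the period is still included.
    """
    first = max(0, math.floor(start_s * fps))
    last = min(len(frames), math.ceil(end_s * fps))
    return list(frames[first:last])
```

With a 25 frames-per-second recording, a voice segment from 1.0 s to 2.0 s maps to frames 25 through 49.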
S104, inputting each voice segment data stream and the corresponding image segment data stream into a voice speaker detection model, and extracting a plurality of corresponding voice speaker identity information and position information;
Each voice segment data stream can be regarded as an auditory signal, and the image segment data stream corresponding to it can be regarded as a visual signal. Each voice segment data stream and the corresponding image segment data stream are input into the voice speaker detection model, and the corresponding voice speaker identity information and position information are extracted.
Specifically, the auditory signal and the visual signal are processed by different sub-network modules to obtain their respective feature representations. Feature fusion is then carried out to obtain a saliency mask on the visual pathway, and finally a fused representation based on that mask is obtained.
The auditory processing sub-network model mainly maps the raw auditory signal into a high-dimensional space for further processing. The voice input to this sub-network model is a common voice feature, such as short-time Fourier transform (STFT), mel-frequency cepstral coefficient (MFCC), or Fbank features. Specifically, the sub-network model extracts the features of the input voice signal into high-dimensional hidden layer vectors through a multi-layer convolutional neural network, pooling operations, and a fully connected layer. The number of input channels of the first convolutional layer can be set according to the number of channels of the input voice features. Throughout the sub-network model, the convolutional layers keep the size of the data unchanged, while the number of channels either stays the same or increases until a preset number of channels is reached. The pooling operations compress the input voice features on the time scale, gradually reducing the time dimension of the voice signal to the same length as that of the visual signal, and they also compress the spectral dimension of the voice features to 1. At this point, the auditory hidden layer representation of the segment is obtained.
The image processing sub-network model mainly maps the raw visual signal into a high-dimensional space for further processing. The image input of this sub-network model is a common RGB image feature. Specifically, the sub-network model consists of a bottom-layer feature extraction network, context-dependent (time-sequence-related) convolutional layers, and fully connected layers, and extracts the features of the input image signal into high-dimensional hidden layer vectors. The bottom-layer feature extraction network is a network pre-trained on tasks such as image classification and object recognition, and is used for extracting low-level features of the image input in this scene. Using a network pre-trained on a large-scale image data set helps speed up the convergence of training. In addition, this feature extraction step also normalizes the size of the original image input and represents each frame of the original image signal at a smaller spatial size (a typical value is 13 × 13). The invention then uses a time-sequence-related convolutional network layer to model the temporal context across multiple frames, capturing information such as salient changes and actions in the images. Finally, each pixel point of each frame is projected through two fully connected layers to a preset number of feature channels, which is consistent with the final preset number of channels of the voice processing sub-network. At this point, the visual hidden layer representation of the segment is obtained.
For the visual and auditory hidden layer features obtained through the sub-network models, a fusion method is adopted to obtain the final fused feature representation. Specifically, the consistency between the auditory channel and the visual channel is exploited: the similarity between each pixel point of each frame in the image features and the voice feature at the corresponding moment is first calculated, which yields a mask. The mask represents the degree of correspondence between the image at different positions in the frame and the voice at the current moment, which in the normal case is the position information of the voice speaker. The mask is then multiplied with the hidden layer features obtained by the visual sub-network model, and the features of pixel points with smaller values are filtered out according to a set threshold. Next, all pixel points of each frame are compressed to obtain one feature vector per frame. Finally, the filtered visual feature vector is fused with the original auditory hidden layer features. Various fusion modes can be adopted, for example direct splicing, or splicing followed by an LSTM network and a fully connected layer, to obtain the final fused feature representation.
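The mask-based fusion for a single frame might look like the sketch below. Several choices here are assumptions rather than details from the source: cosine similarity is used as the similarity measure, mean pooling is used to compress the pixel points, and the fusion is the direct-splicing variant (no LSTM or fully connected layer). The names `cosine` and `fuse_frame` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fuse_frame(pixels: Sequence[Vector], audio: Vector,
               threshold: float = 0.5) -> Tuple[List[float], int, List[float]]:
    """Fuse one frame of visual features with one audio feature vector.

    `pixels` holds one feature vector per pixel point (flattened H*W),
    with the same channel count as `audio`.
    Returns (mask, index of the most speech-consistent pixel, fused vector).
    """
    # 1. Similarity between every pixel point and the audio feature.
    mask = [cosine(p, audio) for p in pixels]
    # 2. Suppress pixel points whose mask value is below the threshold.
    kept = [m if m >= threshold else 0.0 for m in mask]
    # 3. Apply the mask and mean-pool all pixel points into one vector.
    channels = len(audio)
    pooled = [
        sum(k * p[c] for k, p in zip(kept, pixels)) / len(pixels)
        for c in range(channels)
    ]
    # 4. The mask peak serves as a rough speaker position in the frame.
    peak = max(range(len(mask)), key=mask.__getitem__)
    # 5. Direct splicing of the pooled visual vector and the audio vector.
    return mask, peak, pooled + list(audio)
```

Applying `fuse_frame` to every time step of a segment would give one fused vector per frame, which could then be passed to whatever sequence model follows.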
S105, inputting each voice segment data stream, corresponding voice speaker identity information and position information into a voice recognition model, and extracting a plurality of corresponding voice transcription characters;
Each voice segment data stream and the corresponding voice speaker identity information and position information are input into a voice enhancement denoising model to obtain a plurality of denoised voices; each voice segment data stream and the corresponding denoised voice are then input into the voice recognition model, and the corresponding voice transcription characters are extracted.
For example, the obtained voice speaker information is combined with the voice segment data stream and input into a denoising network built from a multi-layer CNN combined with an LSTM. Because the visual signal participates, the face information, lip information, and even limb information of the voice speaker in the image can be used to filter out background noise in the environment that is unrelated to the voice. It is even possible to separate out only the voice of the target voice speaker when it is mixed with other sounds. After passing through this network, relatively clean voice of the voice speaker in the segment, with high audibility and a high signal-to-noise ratio, is obtained.
For example, the obtained voice speaker information is combined with the voice segment data stream and input into a network model for voice recognition. Because the visual signal participates, the face information, lip information, and even limb information of the voice speaker in the image can be used to supplement the recognition process of a purely auditory voice recognition channel. On the visual side, the method further uses an approach similar to lip reading (Lipreading) and combines it with the original single-channel voice recognition, which further improves accuracy and stability. After this step, high-quality text content corresponding to the segment is output.
S106, recording each voice transcription character and the corresponding voice speaker identity information in sequence to generate a conference summary.
Each obtained voice transcription character and the corresponding voice speaker identity information are recorded in sequence to generate the conference summary, wherein each voice transcription character and its voice speaker identity information correspond to one voice speaker.
For example, the segments are aggregated by voice speaker according to the voice speaker identity information. In detail, if the voice speaker identity information is represented by distributed hidden layer features, the segments are processed in order: using a preset threshold, each subsequent segment is either assigned to a previously seen category according to its distance from the cluster centers formed by the earlier segments, or treated as a new voice speaker. If the voice speaker identity information has already been marked with labels, all segments with the same label are simply gathered together. Besides the voice transcription of each segment, the corresponding visual information can be recorded at the same time, so that the video segment provides a more intuitive and vivid record. Depending on the output required by the final task, different functions can be realized, such as text records and querying video clips by text.
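The threshold-based aggregation of hidden layer features can be illustrated with a simple sequential clustering sketch. Euclidean distance, running-mean cluster centers, and the function name `assign_speakers` are assumptions made for the example; the source does not name a distance measure or an update rule.

```python
import math
from typing import List, Sequence


def assign_speakers(embeddings: Sequence[Sequence[float]],
                    threshold: float) -> List[int]:
    """Assign each segment embedding, in order, to a speaker index.

    A segment joins the nearest existing cluster if its distance to that
    cluster center is within `threshold`; otherwise it starts a new
    cluster, i.e. a new voice speaker.
    """
    centers: List[List[float]] = []
    counts: List[int] = []
    labels: List[int] = []
    for emb in embeddings:
        best, best_dist = -1, math.inf
        for idx, center in enumerate(centers):
            dist = math.dist(emb, center)
            if dist < best_dist:
                best, best_dist = idx, dist
        if best >= 0 and best_dist <= threshold:
            # Update the running mean of the matched cluster center.
            n = counts[best]
            centers[best] = [(c * n + e) / (n + 1)
                             for c, e in zip(centers[best], emb)]
            counts[best] = n + 1
        else:
            best = len(centers)
            centers.append(list(emb))
            counts.append(1)
        labels.append(best)
    return labels
```

The returned indices can be used to group transcribed segments by speaker when the summary is organized by voice speaker rather than purely by time.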
Through the above description of the technical solutions provided by the embodiments of the present invention, the conference summary generation method provided by the embodiments of the present invention has the following beneficial effects:
1. Most existing schemes for automatically generating a conference summary use only the information of the voice channel and do not make good use of the information provided by the visual pathway. In the invention, the visual pathway signal is introduced to complete the generation of the conference summary.
2. Existing schemes for automatically generating a conference summary can only output the text directly; the voice speaker of each section of speech must be confirmed later through manual participation or other cumbersome calibration methods. The invention integrates voice activity detection, voice speaker detection, voice denoising, and voice recognition into one technical scheme, and finally organizes the resulting multi-segment conference records flexibly by time and by voice speaker. The technical scheme greatly improves the practicability and efficiency of current conference summary generation schemes. The final presentation also supports various interactions through text and video, which greatly improves convenience and user experience.
3. In the process of voice speaker detection, the position information of the voice speaker is obtained through the consistency between the visual and auditory signals. In this way, most of the irrelevant information in a wide image input is filtered out, and dynamic voice speakers are detected in scenes with a plurality of voice speakers. This in turn benefits the subsequent voice denoising, voice separation, and voice recognition. It can also support real-time voice speaker tracking and denoising in remote video conferences.
Corresponding to the method embodiment, an embodiment of the present invention further provides a conference summary generation apparatus. As shown in Fig. 2, the apparatus may include: an acquisition module 210, a voice extraction module 220, a data stream extraction module 230, a data stream interception module 240, an information extraction module 250, a text extraction module 260, and a summary generation module 270.
The acquisition module 210 is configured to invoke the voice acquisition device to acquire whole-course voices corresponding to the multiple voice speakers in the conference process, and invoke the image acquisition device to acquire whole-course images corresponding to the multiple voice speakers in the conference process;
a voice extraction module 220, configured to extract single-channel voice from the whole-course voice;
a data stream extraction module 230, configured to extract a plurality of voice segment data streams from the single-channel voice, where each voice segment data stream belongs to one voice speaker;
a data stream interception module 240, configured to intercept, for each voice segment data stream, an image segment data stream corresponding to the voice segment data stream from the whole-course image, where each voice segment data stream and the corresponding image segment data stream both belong to the same voice speaker;
an information extraction module 250, configured to input each voice segment data stream and the corresponding image segment data stream into a voice speaker detection model, and extract a plurality of corresponding voice speaker identity information and position information;
a text extraction module 260, configured to input each voice segment data stream, and the corresponding voice speaker identity information and position information into a voice recognition model, and extract a plurality of corresponding voice transcription characters;
and a summary generation module 270, configured to record each voice transcription character and the corresponding voice speaker identity information in sequence, and generate a conference summary.
According to a specific embodiment provided by the present invention, the voice extraction module 220 is specifically configured to:
carry out A/D conversion on the whole-course voice, and extract according to a preset extraction rate to obtain single-channel voice.
According to a specific embodiment of the present invention, the data stream extraction module 230 is specifically configured to:
perform sentence segmentation on the single-channel voice to extract a plurality of voice segment data streams.
Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention, where the computer device 300 shown in fig. 3 includes: at least one processor 301, memory 302, at least one network interface 304, and other user interfaces 303. The various components in computer device 300 are coupled together by a bus system 305. It will be appreciated that the bus system 305 is used to enable communications among the components connected. The bus system 305 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 305 in fig. 3.
The user interface 303 may include a display, a keyboard, or a pointing device (e.g., a mouse, trackball, touch pad, or touch screen).
It will be appreciated that the memory 302 in embodiments of the invention may be either volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which functions as an external cache. By way of example, but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The memory 302 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, memory 302 stores the following elements, executable units or data structures, or a subset thereof, or an expanded set thereof: an operating system 3021 and application programs 3022.
The operating system 3021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application programs 3022 include various application programs such as a media player (MediaPlayer), a Browser (Browser), and the like, for implementing various application services. A program implementing the method of an embodiment of the present invention may be included in the application program 3022.
In the embodiment of the present invention, by calling a program or an instruction stored in the memory 302, specifically, a program or an instruction stored in the application 3022, the processor 301 is configured to execute the method steps provided by the method embodiments, for example, including:
calling voice acquisition equipment to acquire the whole-course voice corresponding to a plurality of voice speakers in a conference process, and calling image acquisition equipment to acquire the whole-course image corresponding to the plurality of voice speakers in the conference process; extracting single-channel voice from the whole-course voice, and extracting a plurality of voice segment data streams from the single-channel voice, wherein each voice segment data stream belongs to one voice speaker; for each voice segment data stream, intercepting an image segment data stream corresponding to the voice segment data stream from the whole-course image, wherein each voice segment data stream and the corresponding image segment data stream belong to the same voice speaker; inputting each voice segment data stream and the corresponding image segment data stream into a voice speaker detection model, and extracting a plurality of pieces of corresponding voice speaker identity information and position information; inputting each voice segment data stream, together with the corresponding voice speaker identity information and position information, into a voice recognition model, and extracting a plurality of corresponding voice transcription texts; and recording each voice transcription text and the corresponding voice speaker identity information in sequence to generate a conference summary.
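The interception step can be pictured as a time-aligned slice of the video. The Python sketch below assumes the image segment is taken from the same time period as the voice segment, that the whole-course image is a sequence of frames captured at a constant frame rate, and that both streams share a common time origin. The function names and the `fps` parameter are hypothetical.

```python
import math
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def frame_range(start_s: float, end_s: float, fps: float, total_frames: int) -> Tuple[int, int]:
    """Map a voice segment's time span to a [first, last) frame index range."""
    if end_s < start_s:
        raise ValueError("segment ends before it starts")
    first = max(0, math.floor(start_s * fps))         # first frame overlapping the segment
    last = min(total_frames, math.ceil(end_s * fps))  # clamp to the recorded frames
    return first, last


def intercept_image_segment(
    frames: Sequence[Frame], start_s: float, end_s: float, fps: float
) -> List[Frame]:
    """Return the frames that lie in the same time period as the voice segment."""
    first, last = frame_range(start_s, end_s, fps, len(frames))
    return list(frames[first:last])
```

Rounding the start down and the end up keeps any frame that partially overlaps the voice segment; a stricter policy could round the other way.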
The method disclosed in the above embodiments of the present invention may be applied to the processor 301, or implemented by the processor 301. The processor 301 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 301 or by instructions in the form of software. The processor 301 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the various methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software units in the decoding processor. The software units may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, erasable programmable read-only memory, or registers. The storage medium is located in the memory 302, and the processor 301 reads the information in the memory 302 and completes the steps of the method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units performing the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The computer device provided in this embodiment may be the computer device shown in Fig. 3, and may execute all steps of the conference summary generation method shown in Fig. 1, thereby achieving the technical effects of that method. For brevity, refer to the description of Fig. 1; the details are not repeated here.
An embodiment of the present invention also provides a storage medium (a computer-readable storage medium) that stores one or more programs. The storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid state disk; it may also include a combination of the above kinds of memory.
The one or more programs in the storage medium are executable by one or more processors to implement the conference summary generation method performed on the conference summary generation device side.
The processor is configured to execute the conference summary generation program stored in the memory, so as to implement the following steps of the conference summary generation method performed on the conference summary generation device side:
calling voice acquisition equipment to acquire the whole-course voice corresponding to a plurality of voice speakers in a conference process, and calling image acquisition equipment to acquire the whole-course image corresponding to the plurality of voice speakers in the conference process; extracting single-channel voice from the whole-course voice, and extracting a plurality of voice segment data streams from the single-channel voice, wherein each voice segment data stream belongs to one voice speaker; for each voice segment data stream, intercepting an image segment data stream corresponding to the voice segment data stream from the whole-course image, wherein each voice segment data stream and the corresponding image segment data stream belong to the same voice speaker; inputting each voice segment data stream and the corresponding image segment data stream into a voice speaker detection model, and extracting a plurality of pieces of corresponding voice speaker identity information and position information; inputting each voice segment data stream, together with the corresponding voice speaker identity information and position information, into a voice recognition model, and extracting a plurality of corresponding voice transcription texts; and recording each voice transcription text and the corresponding voice speaker identity information in sequence to generate a conference summary.
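To make the data flow between the two models concrete, the following Python sketch wires the steps together with pluggable callables. It is not the patented implementation: the `Segment` record, the `SpeakerDetector` and `Recognizer` type aliases, and the `speaker: text` output format are all hypothetical, and the models themselves are left as caller-supplied functions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Tuple


@dataclass
class Segment:
    """A voice segment paired with its image segment (names are illustrative)."""
    start_s: float            # segment start time, in seconds
    audio: Sequence[float]    # voice segment data stream
    frames: Sequence[object]  # image segment data stream for the same speaker


# (audio, frames) -> (speaker identity, position); stands in for the detection model.
SpeakerDetector = Callable[[Sequence[float], Sequence[object]], Tuple[str, str]]
# (audio, identity, position) -> transcription; stands in for the recognition model.
Recognizer = Callable[[Sequence[float], str, str], str]


def build_summary(
    segments: Iterable[Segment],
    detect_speaker: SpeakerDetector,
    recognize: Recognizer,
) -> List[str]:
    """Run detection, then recognition, on each segment and record the results in order."""
    lines: List[str] = []
    for seg in sorted(segments, key=lambda s: s.start_s):
        identity, position = detect_speaker(seg.audio, seg.frames)
        text = recognize(seg.audio, identity, position)
        lines.append(f"{identity}: {text}")
    return lines
```

Passing the models in as callables keeps the orchestration testable with stubs and leaves room for an extra denoising stage in front of the recognizer.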
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), main memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of generating a conference summary, the method comprising:
calling voice acquisition equipment to acquire whole-course voices corresponding to a plurality of voice speakers in a conference process, and calling image acquisition equipment to acquire whole-course images corresponding to the plurality of voice speakers in the conference process;
extracting single-channel voice from the whole-course voice, and extracting a plurality of voice segment data streams from the single-channel voice, wherein each voice segment data stream belongs to one voice speaker;
for each voice segment data stream, intercepting an image segment data stream corresponding to the voice segment data stream from the whole-course image, wherein each voice segment data stream and the corresponding image segment data stream belong to the same voice speaker;
inputting each voice segment data stream and the corresponding image segment data stream into a voice speaker detection model, and extracting a plurality of pieces of corresponding voice speaker identity information and position information;
inputting each voice segment data stream, together with the corresponding voice speaker identity information and position information, into a voice recognition model, and extracting a plurality of corresponding voice transcription texts;
and recording each voice transcription text and the corresponding voice speaker identity information in sequence to generate a conference summary.
2. The method of claim 1, wherein the extracting single-channel voice from the whole-course voice comprises:
performing A/D conversion on the whole-course voice, and extracting at a preset extraction rate to obtain the single-channel voice.
3. The method of claim 1, wherein the extracting a plurality of voice segment data streams from the single-channel voice comprises:
performing sentence segmentation on the single-channel voice to extract the plurality of voice segment data streams.
4. The method of claim 1, wherein the intercepting, for each voice segment data stream, an image segment data stream corresponding to the voice segment data stream from the whole-course image comprises:
for each voice segment data stream, intercepting from the whole-course image the corresponding image segment data stream that lies in the same time period as the voice segment data stream.
5. The method of claim 1, wherein the inputting each voice segment data stream, together with the corresponding voice speaker identity information and position information, into a voice recognition model, and extracting a plurality of corresponding voice transcription texts comprises:
inputting each voice segment data stream, together with the corresponding voice speaker identity information and position information, into a voice enhancement denoising model to obtain a plurality of denoised voices;
and inputting each voice segment data stream and the corresponding denoised voice into the voice recognition model, and extracting the plurality of corresponding voice transcription texts.
6. An apparatus for generating a conference summary, the apparatus comprising:
an acquisition module, configured to call voice acquisition equipment to acquire the whole-course voice corresponding to a plurality of voice speakers in a conference process, and to call image acquisition equipment to acquire the whole-course image corresponding to the plurality of voice speakers in the conference process;
a voice extraction module, configured to extract single-channel voice from the whole-course voice;
a data stream extraction module, configured to extract a plurality of voice segment data streams from the single-channel voice, where each voice segment data stream belongs to one voice speaker;
a data stream intercepting module, configured to intercept, for each voice segment data stream, an image segment data stream corresponding to the voice segment data stream from the whole-course image, where each voice segment data stream and the corresponding image segment data stream belong to the same voice speaker;
an information extraction module, configured to input each voice segment data stream and the corresponding image segment data stream into a voice speaker detection model, and extract a plurality of pieces of corresponding voice speaker identity information and position information;
a text extraction module, configured to input each voice segment data stream, together with the corresponding voice speaker identity information and position information, into a voice recognition model, and extract a plurality of corresponding voice transcription texts;
and a summary generation module, configured to record each voice transcription text and the corresponding voice speaker identity information in sequence to generate a conference summary.
7. The apparatus of claim 6, wherein the voice extraction module is specifically configured to:
perform A/D conversion on the whole-course voice, and extract at a preset extraction rate to obtain the single-channel voice.
8. The apparatus of claim 6, wherein the data stream extraction module is specifically configured to:
perform sentence segmentation on the single-channel voice to extract the plurality of voice segment data streams.
9. A computer device, comprising: a processor and a memory, the processor being configured to execute a conference summary generation program stored in the memory to implement the conference summary generation method of any one of claims 1 to 5.
10. A storage medium storing one or more programs executable by one or more processors to implement the method of generating a conference summary according to any one of claims 1 to 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910766155.9A CN112466306B (en) 2019-08-19 2019-08-19 Conference summary generation method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910766155.9A CN112466306B (en) 2019-08-19 2019-08-19 Conference summary generation method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112466306A true CN112466306A (en) 2021-03-09
CN112466306B CN112466306B (en) 2023-07-04

Family

ID=74807086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910766155.9A Active CN112466306B (en) 2019-08-19 2019-08-19 Conference summary generation method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112466306B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113689855A (en) * 2021-08-18 2021-11-23 北京铁道工程机电技术研究所股份有限公司 Conference record generation system, method, device and storage medium
CN113722425A (en) * 2021-07-23 2021-11-30 阿里巴巴达摩院(杭州)科技有限公司 Data processing method, computer device and computer-readable storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030212556A1 (en) * 2002-05-09 2003-11-13 Nefian Ara V. Factorial hidden markov model for audiovisual speech recognition
WO2014199596A1 (en) * 2013-06-10 2014-12-18 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Speaker identification method, speaker identification device, and speaker identification system
US20150205568A1 (en) * 2013-06-10 2015-07-23 Panasonic Intellectual Property Corporation Of America Speaker identification method, speaker identification device, and speaker identification system
CN106657865A (en) * 2016-12-16 2017-05-10 联想(北京)有限公司 Method and device for generating conference summary and video conference system
CN107451110A (en) * 2017-07-10 2017-12-08 珠海格力电器股份有限公司 A kind of method, apparatus and server for generating meeting summary
CN109361825A (en) * 2018-11-12 2019-02-19 平安科技(深圳)有限公司 Meeting summary recording method, terminal and computer storage medium
CN109817245A (en) * 2019-01-17 2019-05-28 深圳壹账通智能科技有限公司 Generation method, device, computer equipment and the storage medium of meeting summary

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JON BARKER et al.: "Energetic and Informational Masking Effects in an Audiovisual Speech Recognition System", IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, vol. 17, no. 3, XP011251202, DOI: 10.1109/TASL.2008.2011534 *
QIN ZHENGPENG: "Research on Lip Shape Recognition Technology Based on Deep Learning Methods", China Excellent Master's Theses Full-text Database *

Also Published As

Publication number Publication date
CN112466306B (en) 2023-07-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant