CN111986677A - Conference summary generation method and device, computer equipment and storage medium - Google Patents

Info

Publication number
CN111986677A
Authority
CN
China
Prior art keywords
conference
teleconference
speaker
text
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010910529.2A
Other languages
Chinese (zh)
Inventor
王金燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202010910529.2A
Publication of CN111986677A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Psychiatry (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Social Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Telephonic Communication Services (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a conference summary generation method, device, equipment and storage medium. The conference summary generation method comprises: receiving a media data stream corresponding to a teleconference in real time; carrying out speech detection on the media data stream and determining a speaker identifier; calling a voice recognition module to recognize the speaking content of the speaker according to the speaker identifier, and acquiring a speech text corresponding to the speaker identifier; and extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference. The method can effectively improve the efficiency of recording conference summaries and makes omissions less likely.

Description

Conference summary generation method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a conference summary generation method, a conference summary generation device, computer equipment and a storage medium.
Background
Modern enterprises are getting larger and larger in scale, and people are distributed more and more widely and more dispersedly, so that the demand of people on the cooperation of different places is more and more urgent. The remote conference is a conference form which realizes the cooperation of working in different places by utilizing a modern communication means.
At present, participants discuss the conference topics during the conference, and after the teleconference ends, the conference summary, including resolutions, task allocation and other work content, has to be organized manually by a staff member responsible for recording. This consumes working time, is inefficient, and can leave key conference content incompletely recorded.
Disclosure of Invention
The embodiment of the invention provides a conference summary generation method, a conference summary generation device, computer equipment and a storage medium, and aims to solve the problems that conference summary in a current teleconference needs to be manually sorted by a recorder, the efficiency is low, and omission is easy.
A conference summary generation method, comprising:
receiving a media data stream corresponding to the teleconference in real time;
carrying out speech detection on the media data stream and determining a speaker identifier;
calling a voice recognition module to recognize the speaking content of the speaker according to the speaker identification, and acquiring a speaking text corresponding to the speaker identification;
and extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference.
A conference summary generation apparatus comprising:
the data receiving module is used for receiving the media data stream corresponding to the teleconference in real time;
a speaker identifier determining module, configured to perform speech detection on the media data stream and determine a speaker identifier;
the speech text recognition module is used for calling the voice recognition module to recognize speech contents of a speaker according to the speaker identification and acquiring a speech text corresponding to the speaker identification;
and the summary text acquisition module is used for extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference.
A computer device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, said processor implementing the steps of the above-mentioned conference summary generation method when executing said computer program.
A computer storage medium, storing a computer program which, when executed by a processor, implements the steps of the above-described conference summary generation method.
In the conference summary generation method, the conference summary generation device, the computer equipment and the storage medium, the media data stream corresponding to the teleconference is received in real time so as to carry out speech detection on the media data stream, the speaker identification is automatically determined, then the voice recognition module is called to recognize the effective speech content of the speaker in the speech state according to the speaker identification so as to acquire the speech text corresponding to the speaker identification, and the speech content is associated with the speaker so as to facilitate the follow-up review of conference content by participants. And finally, extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference, so that the problems that the conference summary needs to be manually arranged in the current teleconference, the efficiency is low, and omission is easy to occur are effectively solved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic diagram of an application environment of a conference summary generation method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of generating a conference summary in one embodiment of the present invention;
FIG. 3 is a flow chart of a method of generating a conference summary in one embodiment of the present invention;
FIG. 4 is a detailed flowchart of step S302 in FIG. 3;
FIG. 5 is a flow chart of a method of generating a conference summary in one embodiment of the present invention;
FIG. 6 is a detailed flowchart of step S204 in FIG. 2;
FIG. 7 is a flow chart of a method of generating a conference summary in one embodiment of the present invention;
FIG. 8 is a flow chart of a method of generating a conference summary in one embodiment of the present invention;
FIG. 9 is a schematic diagram of a conference summary generation apparatus in accordance with an embodiment of the present invention;
FIG. 10 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The conference summary generation method is applicable in an application environment as in fig. 1, where a computer device communicates with a server over a network. The computer device may be, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server may be implemented as a stand-alone server.
In an embodiment, as shown in fig. 2, a conference summary generation method is provided, which is described by taking the server in fig. 1 as an example, and includes the following steps:
S201: Receiving the media data stream corresponding to the teleconference in real time.
The method can be applied to a remote conference tool, automatically generates the conference summary aiming at the conference content in the remote conference, does not need manual intervention, saves time and can effectively ensure the comprehensiveness of the conference summary. Specifically, when the teleconference starts to be performed, the server may receive, in real time, a media data stream fed back by the terminal of each participant in the teleconference, where the media data stream includes an audio stream or a video stream, and this is not limited here.
S202: Carrying out speech detection on the media data stream and determining the speaker identifier.
The speaker identifier uniquely identifies the speaker. Determining the speaker identifier also determines whether the voice recognition module needs to be started to recognize speech content, so the module does not have to run for the whole conference, which reduces memory usage. Because the media data stream includes an audio stream or a video stream, speech detection is performed differently for each stream type.
S203: Calling a voice recognition module to recognize the speaking content of the speaker according to the speaker identifier, and acquiring a speech text corresponding to the speaker identifier.
Specifically, in the teleconference, when the terminal corresponding to the speaker identifier records the audio stream, device noise or background noise from the speaker's surroundings can interfere with the recording. In this embodiment, a speech enhancement algorithm (including but not limited to spectral subtraction, ensemble empirical mode decomposition (EEMD), and singular value decomposition (SVD)) may therefore be applied to denoise the speech content (i.e., the audio data carried by the media data stream), so as to ensure the accuracy of speech recognition and thereby the accuracy and reliability of the subsequently generated conference summary.
It can be understood that, when the voice recognition module is called, a voice activity detection (VAD) algorithm may be used to detect silence and measure how long the speaker has been silent after starting to speak. If the silence duration exceeds a preset threshold, the speaker is considered to have finished speaking, the recognition process of the voice recognition module is stopped, and the recognized text corresponding to this segment of audio data is used as the speech text corresponding to the speaker identifier. Associating the speech content with the speaker makes it easier for participants to review the conference content later.
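The silence-duration check described above can be sketched as follows. This is a minimal illustration only: it assumes a simple energy-based voice activity decision on fixed-length frames, and the function name, energy threshold, frame length and silence threshold are all hypothetical values that the description does not specify.

```python
from typing import Optional, Sequence


def end_of_speech_index(
    frame_energies: Sequence[float],
    energy_threshold: float = 0.01,  # assumed: below this a frame counts as silence
    frame_ms: int = 20,              # assumed frame length in milliseconds
    max_silence_ms: int = 600,       # assumed preset silence threshold
) -> Optional[int]:
    """Return the index of the frame at which the speaker is considered
    to have finished speaking, or None if speech has not ended yet."""
    started = False
    silence_ms = 0
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            started = True   # speech detected, reset the silence counter
            silence_ms = 0
        elif started:
            silence_ms += frame_ms
            if silence_ms > max_silence_ms:
                return i     # silence exceeded the preset threshold
    return None
```

Silence before the speaker starts is ignored, so the recognition process would only be stopped after speech has actually begun.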
In the case of a remote video conference, the speaker is determined to have finished speaking when the recognition sequence output by the speech detection model contains a recognition result in which the mouth is closed for two or more consecutive frames, for example 001011111. When the speaker is determined to have finished speaking, the recognition process of the voice recognition module is stopped and the recognized text corresponding to this segment of audio data is used as the speech text corresponding to the speaker identifier.
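A minimal sketch of this end-of-speech check is given here, using the labels defined elsewhere in the description ("0" for an open mouth, "1" for a closed mouth). The function name is hypothetical, and inspecting only the trailing run of the sequence is an assumption made so that an isolated closed frame in the middle of an utterance is not treated as the end of speech.

```python
def speaker_finished(recognition_sequence: str, min_closed_frames: int = 2) -> bool:
    """Return True when the most recent frames show a closed mouth.

    Labels: "0" = mouth open, "1" = mouth closed.
    Assumption: only the trailing run of closed frames is counted.
    """
    stripped = recognition_sequence.rstrip("1")
    trailing_closed = len(recognition_sequence) - len(stripped)
    return trailing_closed >= min_closed_frames
```

For the example sequence 001011111 the trailing run contains five closed frames, so the speaker is treated as having finished.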
S204: Extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference.
Specifically, keyword extraction is performed on the speech text corresponding to each speaker identifier according to the plurality of preset extraction fields corresponding to the conference type, so as to obtain the target summary text corresponding to the teleconference. The conference summary may be extracted in real time, or keyword extraction may be performed on all speech text after the conference ends; this is not limited here.
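As an illustration of step S204, the following sketch keeps every sentence that mentions one of the preset extraction fields and records which speaker said it. The conference type names, the field lists and the function name are hypothetical, and plain whole-word matching is an assumption; the description does not prescribe a particular matching technique.

```python
import re
from typing import Dict, List

# Hypothetical preset extraction fields per conference type.
PRESET_FIELDS: Dict[str, List[str]] = {
    "work_progress": ["task", "completed", "incomplete"],
    "project_review": ["risk", "milestone", "decision"],
}


def extract_summary(conference_type: str, speech_texts: Dict[str, str]) -> List[dict]:
    """Collect sentences that mention any preset extraction field,
    keeping the speaker identifier with each match."""
    fields = PRESET_FIELDS.get(conference_type, [])
    summary = []
    for speaker_id, text in speech_texts.items():
        # Split the speech text into sentences at ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            matched = [
                field for field in fields
                if re.search(rf"\b{re.escape(field)}\b", sentence, re.IGNORECASE)
            ]
            if matched:
                summary.append({"speaker": speaker_id, "fields": matched, "text": sentence})
    return summary
```

Whole-word matching keeps "completed" from matching inside "incomplete", which matters when both are preset fields.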
In the embodiment, the media data stream corresponding to the teleconference is received in real time so as to perform speaking detection on the media data stream, the speaker identifier is automatically determined, then the voice recognition module is called to recognize the effective speaking content of the speaker in the speaking state in a targeted manner according to the speaker identifier so as to acquire the speaking text corresponding to the speaker identifier, and the speaking content is associated with the speaker so as to facilitate the participants to review the conference content subsequently. And finally, extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference, so that the problems that the current teleconference summary needs to be manually arranged, the efficiency is low, and omission is easy are effectively solved.
In one embodiment, as shown in FIG. 3, the teleconference includes a remote video conference. The media data stream comprises a video stream transmitted in real time by the terminal corresponding to each participant once the remote video conference starts. The conference summary generation method specifically comprises the following steps:
S301: Receiving a video stream corresponding to the remote video conference in real time.
S302: Carrying out speech detection on the video stream with a pre-trained speech detection model and determining the speaker identifier.
S303: Calling a voice recognition module to recognize the speaking content of the speaker according to the speaker identifier, and acquiring a speech text corresponding to the speaker identifier.
When the remote video conference starts, the camera on the terminal corresponding to each participant is turned on, and the video stream is recorded in real time and transmitted to the server. Specifically, if the conference form of the current conference is a remote video conference, the server receives in real time the video stream recorded by the terminal corresponding to each participant, and uses the pre-trained speech detection model to carry out speech detection on each participant's video stream and determine the speaker identifier corresponding to that video stream, that is, which participant is speaking. While the speaker is speaking, the voice recognition module is called to recognize the speaker's speech content, that is, to recognize the audio data fed back by the terminal corresponding to the speaker identifier, and the speech text corresponding to the speaker identifier is acquired.
S304: Extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference.
Specifically, step S304 is consistent with step S204, and is not described herein again to avoid repetition.
In one embodiment, as shown in FIG. 4, the video stream includes multiple video frame images carrying time tags. Step S302, that is, carrying out speech detection on the video stream with a pre-trained speech detection model and determining the speaker identifier, specifically includes the following steps:
S401: Inputting the video frame images carrying time tags into the speech detection model in chronological order for recognition, and acquiring a recognition result for each video frame image.
Specifically, the time-ordered video frame images are input into the speech detection model in chronological order, and a recognition result is obtained for each frame. The recognition result reflects the mouth state of the participant at that moment, namely whether the mouth is open. The speech detection model can be obtained by deep learning training on a large number of pre-labeled images of speakers with the mouth open (label "0") and images with the mouth closed (label "1"). The time tag indicates the position of each video frame image on the time axis; for example, the time tag of the first video frame image corresponds to the first second on the time axis, the time tag of the second video frame image corresponds to the second second, and so on.
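Step S401 can be sketched as sorting the frames by time tag and classifying each one in turn. In this illustration `classify_mouth` is a placeholder standing in for the trained model, and the frame representation and function names are assumptions rather than part of the described method.

```python
from typing import Callable, Iterable, Tuple

Frame = Tuple[float, object]  # (time tag in seconds, image data)


def build_recognition_sequence(
    frames: Iterable[Frame],
    classify_mouth: Callable[[object], int],
) -> str:
    """Sort frames by time tag and classify each one.

    classify_mouth is a placeholder for the trained model; it must return
    0 for an open mouth and 1 for a closed mouth.
    """
    ordered = sorted(frames, key=lambda frame: frame[0])
    return "".join(str(classify_mouth(image)) for _, image in ordered)
```

Sorting by time tag first means frames that arrive out of order over the network still yield a recognition sequence in chronological order.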
S402: Judging the recognition results according to a preset judgment strategy to obtain a speech detection result corresponding to the video stream.
S403: Determining the corresponding speaker identifier based on the speech detection result.
The speech detection result reflects the lip movement state, which in turn indicates whether the participant is currently speaking. For example, assume that the recognition sequence formed by the recognition results of the video frame images in the video stream is "001101". If the recognition sequence contains two or more consecutive frames whose mouth state is open, the participant is considered to be speaking at that time, and the participant identifier corresponding to the video stream may be taken as the identifier of the current speaker.
In the embodiment, whether the current participant is in the speaking state or not is judged by the mouth state corresponding to the identification result of the video frame images of the continuous multiple frames, so that the accuracy of the determination of the speaker is ensured, and the problem of misjudgment caused by the fact that the judgment is carried out only by adopting the identification result of the single frame video frame image is avoided.
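The multi-frame judgment strategy can be sketched as a search for a run of consecutive open-mouth frames. The function name and the default run length of two frames follow the example in the description; treating the recognition sequence as a string of "0" and "1" labels is an assumption made for illustration.

```python
from itertools import groupby


def is_speaking(recognition_sequence: str, min_open_frames: int = 2) -> bool:
    """Return True if the sequence contains a run of at least
    min_open_frames consecutive open-mouth frames (label "0")."""
    return any(
        label == "0" and sum(1 for _ in run) >= min_open_frames
        for label, run in groupby(recognition_sequence)
    )
```

With the example sequence "001101", the leading "00" is a run of two open frames, so the participant is judged to be speaking; a sequence such as "010101" contains only isolated open frames and is not.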
In one embodiment, as shown in FIG. 5, the teleconference includes a remote voice conference. The media data stream comprises an audio stream transmitted in real time by the terminal corresponding to each participant once the remote voice conference starts. The conference summary generation method specifically comprises the following steps:
S501: Receiving an audio stream corresponding to the remote voice conference in real time.
S502: Extracting target voiceprint features from the audio stream, comparing the target voiceprint features with the pre-stored original voiceprint features corresponding to each participant identifier, and determining the speaker identifier.
The original voiceprint features and the target voiceprint features include, but are not limited to, mel-frequency spectrum features, mel-filter features, and the like.
When the remote voice conference starts, the microphone on the terminal corresponding to each participant is turned on, and the audio stream is recorded in real time and transmitted to the server. Specifically, once the remote voice conference starts, the server receives in real time the audio stream recorded by the terminal corresponding to each participant, extracts voiceprint features from the audio stream with a voiceprint feature extraction algorithm to obtain the target voiceprint features, compares the target voiceprint features with the pre-stored original voiceprint features corresponding to each participant identifier, and determines the speaker identifier. The speech content is thereby matched to the person speaking, so that the conference summary can show which speaker each piece of text belongs to, making it easier for participants to review the conference content later. The voiceprint feature extraction algorithm may use a Fourier transform or a fast Fourier transform; this is not limited here.
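The feature comparison in step S502 can be illustrated with a small sketch. Cosine similarity and the acceptance threshold are assumptions chosen for illustration, since the description does not name a comparison metric, and the feature vectors are taken as already extracted.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(
    target: Sequence[float],
    enrolled: Dict[str, Sequence[float]],
    min_similarity: float = 0.8,  # assumed acceptance threshold
) -> Optional[str]:
    """Return the participant identifier whose stored voiceprint is most
    similar to the target, or None if no stored voiceprint is close enough."""
    best_id, best_score = None, min_similarity
    for participant_id, original in enrolled.items():
        score = cosine_similarity(target, original)
        if score >= best_score:
            best_id, best_score = participant_id, score
    return best_id
```

Returning None when no stored voiceprint passes the threshold leaves room to skip recognition for audio that does not belong to any enrolled participant.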
Furthermore, by determining the speaker identifier, the server can directly identify the audio stream fed back by the terminal corresponding to the speaker identifier, and noise interference generated by other terminals can be avoided.
S503: Calling a voice recognition module to recognize the speaking content of the speaker according to the speaker identifier, and acquiring a speech text corresponding to the speaker identifier.
Specifically, step S503 is the same as step S203 and is not described again here to avoid repetition.
S504: Extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference.
Specifically, step S504 is the same as step S204 and is not described again here to avoid repetition.
In the embodiment, conference contents are recorded by calling the voice recognition module through adopting corresponding conference recording strategies for conferences of different conference forms, and the content extraction is carried out on the collected text data by adopting a keyword extraction scheme, so that a conference summary is automatically generated, and the generalization is improved.
In an embodiment, as shown in FIG. 6, step S204, that is, extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference, specifically includes the following steps:
S601: Extracting keywords from the speech text according to the preset extraction fields to obtain a first text matching the preset extraction fields.
The first text is the text formed by the character strings that match the preset extraction fields. Specifically, the preset extraction fields are set in advance according to the conference type. To ensure that important content is not lost from the conference summary, a plurality of preset extraction fields can be set for keyword extraction. As a result, some speech text that matches a preset extraction field may not actually be important conference content. To keep the conference summary from becoming too long, this embodiment adds a content screening mechanism: the first text is screened according to a preset screening dimension, unimportant conference records are removed, and the screened text is mapped into the conference summary template corresponding to the conference type as the content of the conference summary. This yields the target summary text corresponding to the teleconference and effectively reduces the length of the conference record.
S602: Screening the first text according to a preset screening dimension to obtain a target summary text corresponding to the teleconference.
The preset filtering dimension includes, but is not limited to, the number of occurrences of the keyword (i.e., the preset extraction field) matched with the first text in the recorded speech text of the conference, the discussion duration of the topic reflected by the keyword (the discussion duration can be reflected by counting the number of continuous occurrences of the keyword in one sentence/paragraph), and the like.
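The screening in step S602 can be sketched using one of the dimensions named above, the number of times a keyword occurs in the recorded speech text. The data layout, function name and minimum occurrence count are hypothetical choices for illustration.

```python
import re
from typing import Dict, List


def screen_first_text(
    first_text: List[Dict[str, str]],  # items like {"field": ..., "text": ...}
    full_transcript: str,
    min_occurrences: int = 2,          # assumed screening threshold
) -> List[Dict[str, str]]:
    """Keep only matches whose field occurs often enough in the whole transcript."""
    counts: Dict[str, int] = {}
    kept = []
    for item in first_text:
        field = item["field"]
        if field not in counts:
            pattern = rf"\b{re.escape(field)}\b"
            counts[field] = len(re.findall(pattern, full_transcript, re.IGNORECASE))
        if counts[field] >= min_occurrences:
            kept.append(item)
    return kept
```

A field mentioned only once in the whole conference is treated as incidental and dropped, which shortens the resulting summary.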
Further, when the conference is of the work progress statistics type, key fields such as each work task, "completed" and "incomplete" can be preset as the extraction fields for this conference type. In addition, when the voice recognition module recognizes the "incomplete" field, the semantic analysis module can be called to analyze the semantics of the audio data at the next moment, so as to determine the reason the task is incomplete and automatically record it in the conference summary. The semantic analysis module can use natural language processing (NLP) to implement semantic analysis.
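The handling of the "incomplete" field can be sketched as pairing each triggering utterance with the one that follows it. Here `analyze_reason` is a placeholder for the semantic analysis module, and working on already recognized utterance text rather than raw audio data is a simplifying assumption.

```python
from typing import Callable, List, Tuple


def collect_incomplete_reasons(
    utterances: List[str],
    analyze_reason: Callable[[str], str],
    trigger: str = "incomplete",
) -> List[Tuple[str, str]]:
    """Pair each utterance containing the trigger field with the reason
    extracted from the utterance that follows it."""
    results = []
    for current, following in zip(utterances, utterances[1:]):
        if trigger in current.lower():
            results.append((current, analyze_reason(following)))
    return results
```

An utterance at the very end of the list has no following utterance, so this sketch records no reason for it.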
In an embodiment, as shown in fig. 7, before step S201, the method for generating a conference summary further includes the following steps:
S701: Acquiring a teleconference initiation request, wherein the request comprises a conference type, a conference form and participant identifiers.
The conference type can be selected by the user according to actual needs. Conference types include, but are not limited to, project conferences and regular (routine) meetings; project conferences include project kickoff conferences, project condition review conferences, project technical review conferences, project problem-solving conferences, and the like, which are not listed exhaustively here. Conference forms include, but are not limited to, remote video conferences, remote voice conferences, and the like.
Specifically, the conference initiator selects the participants, the conference type and the conference form according to actual needs, so that the server can respond to the teleconference initiation request.
S702: Determining the conference priority corresponding to each participant identifier according to the conference type.
Wherein the meeting priority is used for reflecting the importance of the meeting personnel to the meeting. The importance of different participants to the conference is different, that is, the participants have different priorities when participating in the conference of different conference types.
Specifically, the server may set in advance the important roles for each conference type. For example, a project condition review conference focuses on reporting the progress of a project, and its important roles include the person in charge of each project module, so the conference priority of those participants may be set high. Other members of the project modules may join optionally, since the report is mainly given by the persons in charge of the modules, so the conference priority of these optional participants may be set low.
It can be understood that, when a conference is initiated, the conference priority of the role corresponding to each participant is determined according to the conference type, so as to subsequently judge whether the conference can be held normally according to the conference priority.
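Step S702 can be sketched as a lookup from conference type to important roles. The conference type keys, role names and two-level "high"/"low" priority are hypothetical values used only to illustrate the mapping.

```python
from typing import Dict, Set

# Hypothetical mapping from conference type to the roles treated as important.
IMPORTANT_ROLES: Dict[str, Set[str]] = {
    "project_condition_review": {"module_owner"},
    "project_kickoff": {"project_manager", "module_owner"},
}


def conference_priorities(
    conference_type: str,
    participant_roles: Dict[str, str],  # participant identifier -> role
) -> Dict[str, str]:
    """Assign "high" priority to participants whose role is important for
    this conference type and "low" priority to everyone else."""
    important = IMPORTANT_ROLES.get(conference_type, set())
    return {
        participant_id: "high" if role in important else "low"
        for participant_id, role in participant_roles.items()
    }
```

The same participant can therefore receive a different priority in conferences of different types.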
S703: and receiving a conference connection result corresponding to the participant identification.
As an example, in this embodiment, when a teleconference initiation request is triggered, a teleconference may be initiated by selecting a plurality of participants, pulling the participants into a discussion group (i.e., a conference group), and then selecting a conference form and a conference type, so as to establish a communication connection with a terminal of each participant; or, the conference initiator initiates a shared two-dimensional code in the workgroup, and each participant enters through the code scanning, but in order to ensure that the relevant information of the conference is not leaked, the identity of the participant needs to be verified when entering the teleconference through the shared two-dimensional code, and if the participant selected by the initiator can enter.
S704: If the conference priority corresponding to the participant identifier is high and the conference connection result is failure, calling the communication module to establish communication with the terminal corresponding to the participant identifier.
Specifically, if the terminal corresponding to a participant refuses the connection or does not answer for a long time, a conference connection result of failure is fed back. It can be understood that, if the terminal corresponding to a participant identifier with a high conference priority feeds back a connection failure, the communication connection to that terminal is initiated again; if a connection failure is still fed back, the communication module (including but not limited to mail or telephone) is called to contact the terminal corresponding to the participant identifier and remind the participant to join the teleconference, since otherwise the conference cannot be held effectively.
S705: If the conference priority corresponding to the participant identifier is high and the conference connection result is success, responding to the teleconference initiation request, and creating a conference group corresponding to the conference form so as to receive the media data stream corresponding to the teleconference in real time.
Specifically, if the terminal corresponding to a participant identifier with a high conference priority feeds back that the conference connection succeeded, the conference can be started normally, and a conference holding thread is started; that is, a conference group corresponding to the conference form is created so as to hold the teleconference.
In this embodiment, before the teleconference is held, the priorities of the participants are checked to determine whether the teleconference can be held successfully, which avoids the problem that the teleconference cannot be held effectively because important participants fail to join in time due to external factors.
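The decision logic of steps S703 to S705 can be sketched as below, under stated assumptions: `reconnect` and `notify` are hypothetical callbacks standing in for the second connection attempt and for the communication module (mail or telephone), and priorities are represented as the strings "high" and "low".

```python
from typing import Callable, Dict, List, Tuple


def check_attendance(priorities: Dict[str, str],
                     connected: Dict[str, bool],
                     reconnect: Callable[[str], bool],
                     notify: Callable[[str], None]) -> Tuple[bool, List[str]]:
    """Decide whether the teleconference can start.

    Returns (can_start, absent_high_priority_ids).
    """
    absent: List[str] = []
    for participant_id, priority in priorities.items():
        if priority != "high":
            continue  # low-priority participants never block the conference
        if connected.get(participant_id, False):
            continue  # first connection attempt succeeded
        if reconnect(participant_id):
            continue  # second connection attempt succeeded
        notify(participant_id)  # fall back to mail/telephone reminder
        absent.append(participant_id)
    return (not absent, absent)
```

The conference group would be created only when the first element of the result is true; otherwise the caller waits for the reminded participants to join.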
In an embodiment, after the target conference summary is generated, an export verification function is further provided to ensure the security of the conference record and make important data less likely to leak. As shown in fig. 8, after step S204, the conference summary generation method further includes the following steps:
S801: Acquiring a conference summary export request, wherein the conference summary export request comprises a conference summary identifier, an exporter identifier and an export path.
The conference summary export request may be triggered manually by a user or automatically after the teleconference ends, which is not limited herein.
S802: Performing identity verification according to the biometric features corresponding to the exporter identifier, so as to determine whether to respond to the conference summary export request.
The biometric features include, but are not limited to, voiceprint features or face features. Specifically, a voiceprint feature extraction algorithm may be used to extract the voiceprint features from audio data input by the exporter, or a collected face image may be input into a face feature extraction model created in advance to extract the face features, which is not limited herein. It should be understood that the purpose here is to authenticate the exporter, and other authentication methods may also be adopted as long as authentication can be achieved.
S803: If the identity verification passes, responding to the conference summary export request, and exporting the target conference summary corresponding to the conference summary identifier according to the export path.
The export path includes, but is not limited to, a notepad, a mailbox, or a discussion group of the participants. Specifically, an exportable role corresponding to the conference summary identifier can be set, so that the identity of the exporter is verified when the conference summary is exported.
Illustratively, the identity of the exporter can be verified by collecting the biometric features of the exporter and comparing them with the biometric features stored in advance in a database for the exportable role. If the identity verification passes, the exporter is considered to have export permission for the conference summary, the conference summary export request is responded to, and the target conference summary corresponding to the conference summary identifier is exported according to the export path.
In this embodiment, the one-click export function allows the conference summary to be exported along different export paths, which makes it convenient for participants to view it on their respective terminals; in addition, an export verification function is added to verify the identity of the exporter, which effectively guarantees the security of the conference summary.
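A minimal sketch of steps S801 to S803 follows. The names `ExportRequest`, `export_summary`, and the `verify_biometric` callback are assumptions made for illustration; the callback stands in for whatever voiceprint or face comparison is actually used, and the in-memory dictionaries stand in for the database.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class ExportRequest:
    summary_id: str    # conference summary identifier
    exporter_id: str   # exporter identifier
    export_path: str   # e.g. "notepad", "mailbox", "discussion_group"


def export_summary(request: ExportRequest,
                   summaries: Dict[str, str],
                   exportable_roles: Dict[str, Set[str]],
                   role_of: Dict[str, str],
                   verify_biometric: Callable[[str], bool]) -> Optional[Tuple[str, str]]:
    """Return (export_path, summary_text) if the exporter may export, else None."""
    # The exporter must hold a role that is allowed to export this summary.
    role = role_of.get(request.exporter_id)
    if role not in exportable_roles.get(request.summary_id, set()):
        return None
    # The exporter's live biometric sample must match the stored one.
    if not verify_biometric(request.exporter_id):
        return None
    summary = summaries.get(request.summary_id)
    if summary is None:
        return None
    return request.export_path, summary
```

Checking the role before running the biometric comparison avoids collecting biometric data from users who could not export the summary anyway; this ordering is a design choice of the sketch, not something the text prescribes.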
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, a conference summary generation apparatus is provided, where the conference summary generation apparatus corresponds to the conference summary generation method in the above embodiment one to one. As shown in fig. 9, the conference summary generation apparatus includes a data receiving module 10, a speaker identification determination module 20, a speech text recognition module 30, and a summary text acquisition module 40. The functional modules are explained in detail as follows:
The data receiving module 10 is configured to receive a media data stream corresponding to the teleconference in real time.
The speaker identification determination module 20 is configured to perform speech detection on the media data stream and determine a speaker identifier.
The speech text recognition module 30 is configured to call the voice recognition module to recognize the speech content of the speaker according to the speaker identifier, and acquire a speech text corresponding to the speaker identifier.
The summary text acquisition module 40 is configured to extract keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference, so as to obtain a target summary text corresponding to the teleconference.
Specifically, the teleconference includes a remote video conference; the media data stream includes a video stream transmitted in real time by the terminal corresponding to each participant when the remote video conference starts; the speaker identification determination module includes a first speaker identification determination unit.
The first speaker identification determination unit is configured to perform speech detection on the video stream using a pre-trained speech detection model and determine the speaker identifier.
Specifically, the first speaker identification determination unit includes a recognition result acquisition subunit, a speech detection result acquisition subunit, and a speaker identifier determination subunit.
The recognition result acquisition subunit is configured to input a plurality of video frame images carrying time tags into the speech detection model in time order for recognition, and acquire a recognition result for each video frame image.
The speech detection result acquisition subunit is configured to judge each recognition result according to a preset judgment strategy, and acquire a speech detection result corresponding to the video stream.
The speaker identifier determination subunit is configured to determine the corresponding speaker identifier based on the speech detection result.
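The three subunits above can be illustrated with a small sketch. The preset judgment strategy is not specified in this passage, so the rule used here (a participant counts as speaking once at least `min_run` consecutive frames are recognized as speaking) is purely an assumption, as are the function name and the data layout; the per-frame boolean stands in for the output of the speech detection model.

```python
from typing import Dict, List, Tuple

# One recognition result per frame: (time tag, recognized as speaking?)
FrameResult = Tuple[float, bool]


def speaking_participants(results: Dict[str, List[FrameResult]],
                          min_run: int = 3) -> List[str]:
    """Return identifiers of participants judged to be speaking.

    Assumed judgment strategy: a participant is speaking if at least
    `min_run` consecutive frames (in time order) are recognized as speaking.
    """
    speakers: List[str] = []
    for participant_id, frames in results.items():
        run = 0
        for _, is_speaking in sorted(frames, key=lambda frame: frame[0]):
            run = run + 1 if is_speaking else 0
            if run >= min_run:
                speakers.append(participant_id)
                break
    return speakers
```

Requiring a run of consecutive frames rather than a single positive frame filters out isolated misrecognitions, at the cost of a short detection delay.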
Specifically, the teleconference includes a remote voice conference; the media data stream includes an audio stream transmitted in real time by the terminal corresponding to each participant when the remote voice conference starts; the speaker identification determination module includes a second speaker identification determination unit.
The second speaker identification determination unit is configured to extract the target voiceprint feature corresponding to the audio stream, compare the target voiceprint feature with the pre-stored original voiceprint feature corresponding to each participant identifier, and determine the speaker identifier.
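The voiceprint comparison can be sketched as a nearest-match search. The use of cosine similarity, the threshold value, and the names `cosine_similarity` and `identify_speaker` are assumptions for illustration; the text does not say which similarity measure is used, and the feature vectors are assumed to have been extracted already.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(target: Sequence[float],
                     enrolled: Dict[str, Sequence[float]],
                     threshold: float = 0.75) -> Optional[str]:
    """Return the participant identifier whose stored voiceprint best matches
    the target voiceprint, or None if no match reaches the threshold."""
    best_id: Optional[str] = None
    best_score = threshold
    for participant_id, original in enrolled.items():
        score = cosine_similarity(target, original)
        if score >= best_score:
            best_id, best_score = participant_id, score
    return best_id
```

Returning `None` below the threshold lets the caller treat the audio as coming from an unknown speaker instead of forcing a wrong attribution.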
Specifically, the conference type corresponds to a plurality of preset extraction fields; the summary text acquisition module includes a first text acquisition unit and a target summary text acquisition unit.
The first text acquisition unit is configured to extract keywords from the speech text according to the preset extraction fields to obtain a first text matching the preset extraction fields.
The target summary text acquisition unit is configured to screen the first text according to a preset screening dimension to obtain the target summary text corresponding to the teleconference.
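The two-stage extraction can be sketched as follows. Everything concrete here is an assumption: the field table `PRESET_FIELDS`, the use of simple substring matching as the keyword extraction, and the choice of "speaker" as the screening dimension are illustrative only and are not taken from the original text.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical preset extraction fields per conference type.
PRESET_FIELDS: Dict[str, List[str]] = {
    "project_review": ["progress", "risk", "deadline"],
}

# (matched field, speaker identifier, sentence)
Match = Tuple[str, str, str]


def extract_first_text(utterances: List[Tuple[str, str]],
                       fields: List[str]) -> List[Match]:
    """Keep every sentence of the speech text that mentions a preset field."""
    matches: List[Match] = []
    for speaker_id, text in utterances:
        # Split on whitespace that follows sentence-ending punctuation.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            for field in fields:
                if field.lower() in sentence.lower():
                    matches.append((field, speaker_id, sentence))
    return matches


def screen_by_speaker(first_text: List[Match], keep: Set[str]) -> List[Match]:
    """Assumed screening dimension: keep only sentences from selected speakers."""
    return [match for match in first_text if match[1] in keep]
```

For example, with the fields of `"project_review"`, the utterance "Progress is on track. Lunch was good." contributes only its first sentence to the first text.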
Specifically, the conference summary generation apparatus further includes a teleconference initiation module, a conference priority determination module, a conference connection feedback module, a first processing module and a second processing module.
The teleconference initiation module is configured to acquire a teleconference initiation request, where the teleconference initiation request includes a conference type, a conference form and participant identifiers.
The conference priority determination module is configured to determine the conference priority corresponding to each participant identifier according to the conference type.
The conference connection feedback module is configured to receive the conference connection result corresponding to each participant identifier.
The first processing module is configured to call the communication module to establish communication with the terminal corresponding to the participant identifier if the conference priority corresponding to the participant identifier is high and the conference connection result is failure.
The second processing module is configured to, if the conference priority corresponding to the participant identifier is high and the conference connection result is success, respond to the teleconference initiation request and create a conference group corresponding to the conference form so as to receive the media data stream corresponding to the teleconference in real time.
Specifically, the conference summary generation apparatus further includes an export request acquisition module, an identity verification module and a conference summary export module.
The export request acquisition module is configured to acquire a conference summary export request, where the conference summary export request includes a conference summary identifier, an exporter identifier and an export path.
The identity verification module is configured to perform identity verification according to the biometric features corresponding to the exporter identifier, so as to determine whether to respond to the conference summary export request.
The conference summary export module is configured to, if the identity verification passes, respond to the conference summary export request and export the target conference summary corresponding to the conference summary identifier according to the export path.
For the specific definition of the conference summary generation apparatus, reference may be made to the above definition of the conference summary generation method, which is not repeated here. The modules in the conference summary generation apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a computer storage medium and an internal memory. The computer storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the computer storage media. The database of the computer device is used to store data generated or obtained during the execution of the conference summary generation method, such as a target conference summary. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a conference summary generation method.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor executes the computer program to implement the steps of the conference summary generation method in the above-mentioned embodiments, such as steps S201-S204 shown in fig. 2 or steps shown in fig. 3 to 8. Alternatively, the processor implements the functions of each module/unit in the embodiment of the conference summary generation apparatus when executing the computer program, for example, the functions of each module/unit shown in fig. 9, and are not described herein again to avoid repetition.
In an embodiment, a computer storage medium is provided, where a computer program is stored on the computer storage medium, and when executed by a processor, the computer program implements the steps of the conference summary generation method in the foregoing embodiments, such as steps S201 to S204 shown in fig. 2 or steps shown in fig. 3 to fig. 8, which are not described herein again to avoid repetition. Alternatively, the computer program, when executed by the processor, implements the functions of each module/unit in the embodiment of the conference summary generation apparatus, for example, the functions of each module/unit shown in fig. 9, and are not described herein again to avoid repetition.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A conference summary generation method, comprising:
receiving a media data stream corresponding to the teleconference in real time;
carrying out speech detection on the media data stream and determining a speaker identifier;
calling a voice recognition module to recognize the speaking content of the speaker according to the speaker identification, and acquiring a speaking text corresponding to the speaker identification;
and extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference.
2. The conference summary generation method of claim 1, wherein the teleconference includes a remote video conference; the media data stream comprises a video stream transmitted by a terminal corresponding to each participant in real time when the remote video conference starts to be carried out;
the performing speech detection on the media data stream and determining a speaker identifier includes:
and adopting a pre-trained speech detection model to perform speech detection on the video stream, and determining a speaker identifier.
3. The method of generating a conference summary according to claim 2, wherein the video stream includes a plurality of frames of video frame images carrying time tags;
the detecting the speech of the video stream by adopting the pre-trained speech detection model to determine the identifier of the speaker comprises the following steps:
inputting the multiple frames of video frame images carrying the time labels into the speech detection model according to a time sequence for identification, and acquiring an identification result of each frame of video frame image;
judging each recognition result according to a preset judgment strategy to obtain a speech detection result corresponding to the video stream;
and determining a corresponding speaker identifier based on the speech detection result.
4. The conference summary generation method of claim 1, wherein the teleconference includes a remote voice conference; the media data stream comprises an audio stream transmitted by a terminal corresponding to each participant in real time when the remote voice conference starts to be carried out;
the performing speech detection on the media data stream and determining a speaker identifier includes:
and extracting target voiceprint characteristics corresponding to the audio stream, comparing the target voiceprint characteristics with original voiceprint characteristics corresponding to prestored participant identifications, and determining the speaker identification.
5. The method for generating a conference summary according to claim 1, wherein the extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference comprises:
extracting keywords from the speech text according to the preset extraction field to obtain a first text matched with the preset extraction field;
and screening the first text according to a preset screening dimension to obtain a target summary text corresponding to the teleconference.
6. The method of generating a conference summary according to claim 1, wherein prior to the step of receiving a media data stream corresponding to a teleconference in real time, the method of generating a conference summary further comprises:
acquiring a remote conference initiating request, wherein the remote conference initiating request comprises a conference type, a conference form and participant identifications;
determining the conference priority corresponding to the participant identification according to the conference type;
receiving a conference connection result corresponding to the participant identification;
if the conference priority corresponding to the participant identification is high and the conference connection result is failure, calling a communication module to establish communication with the terminal corresponding to the participant identification;
and if the conference priority corresponding to the participant identification is high and the conference connection result is successful, responding to the remote conference initiating request, and creating a conference group corresponding to the conference form so as to receive the media data stream corresponding to the remote conference in real time.
7. The method for generating a conference summary according to claim 1, wherein the method for generating a conference summary further comprises, after extracting keywords from the utterance text according to a plurality of preset extraction fields corresponding to a conference type of the teleconference to obtain a target summary text corresponding to the teleconference:
acquiring a conference summary export request, wherein the conference summary export request comprises a conference summary identifier, an exporter identifier and an export path;
performing identity verification according to the biometric features corresponding to the exporter identifier to determine whether to respond to the conference summary export request;
and if the identity verification passes, responding to the conference summary export request, and exporting a target conference summary corresponding to the conference summary identifier according to the export path.
8. A conference summary generation apparatus, comprising:
the data receiving module is used for receiving the media data stream corresponding to the teleconference in real time;
a speaker identifier determining module, configured to perform speech detection on the media data stream, and determine a speaker identifier;
the speech text recognition module is used for calling the voice recognition module to recognize speech contents of a speaker according to the speaker identification and acquiring a speech text corresponding to the speaker identification;
and the summary text acquisition module is used for extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference.
9. A computer device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, characterized in that said processor, when executing said computer program, carries out the steps of the conference summary generation method according to any one of claims 1 to 7.
10. A computer storage medium storing a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the conference summary generation method according to any one of claims 1 to 7.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010910529.2A CN111986677A (en) 2020-09-02 2020-09-02 Conference summary generation method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010910529.2A CN111986677A (en) 2020-09-02 2020-09-02 Conference summary generation method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111986677A true CN111986677A (en) 2020-11-24

Family

ID=73447839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010910529.2A Pending CN111986677A (en) 2020-09-02 2020-09-02 Conference summary generation method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111986677A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109788232A (en) * 2018-12-18 2019-05-21 视联动力信息技术股份有限公司 A kind of summary of meeting recording method of video conference, device and system
CN110298252A (en) * 2019-05-30 2019-10-01 平安科技(深圳)有限公司 Meeting summary generation method, device, computer equipment and storage medium

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022115087A1 (en) * 2020-11-25 2022-06-02 Turkcell Teknoloji Arastirma Ve Gelistirme Anonim Sirketi A system for creating autocontent in video conference interviews
CN112687272A (en) * 2020-12-18 2021-04-20 北京金山云网络技术有限公司 Conference summary recording method and device and electronic equipment
WO2022146378A1 (en) * 2020-12-28 2022-07-07 Turkcell Teknoloji Arastirma Ve Gelistirme Anonim Sirketi A system for performing automatic translation in video conference server
CN112836016A (en) * 2021-02-05 2021-05-25 北京字跳网络技术有限公司 Conference summary generation method, device, equipment and storage medium
CN112802480A (en) * 2021-04-15 2021-05-14 广东际洲科技股份有限公司 Voice data text conversion method based on multi-party communication
CN113259619A (en) * 2021-05-07 2021-08-13 北京字跳网络技术有限公司 Information sending and displaying method, device, storage medium and conference system
WO2022237381A1 (en) * 2021-05-08 2022-11-17 聚好看科技股份有限公司 Method for saving conference record, terminal, and server
CN113256133A (en) * 2021-06-01 2021-08-13 平安科技(深圳)有限公司 Conference summary management method and device, computer equipment and storage medium
CN113256133B (en) * 2021-06-01 2023-08-04 平安科技(深圳)有限公司 Conference summary management method, device, computer equipment and storage medium
CN113449513B (en) * 2021-06-17 2024-04-05 上海明略人工智能(集团)有限公司 Automatic work summary generation method, system, computer device and storage medium
CN113449513A (en) * 2021-06-17 2021-09-28 上海明略人工智能(集团)有限公司 Method, system, computer device and storage medium for automatically generating work summary
CN113746822A (en) * 2021-08-25 2021-12-03 安徽创变信息科技有限公司 Teleconference management method and system
CN113794853A (en) * 2021-09-07 2021-12-14 广州朗堃电子科技有限公司 Conference summary generation method for remote video conference and conference terminal
CN113779234A (en) * 2021-09-09 2021-12-10 京东方科技集团股份有限公司 Method, device, equipment and medium for generating speech summary of conference speaker
CN114093383B (en) * 2022-01-17 2022-04-12 北京远鉴信息技术有限公司 Method and device for determining participant voice, electronic equipment and storage medium
CN114093383A (en) * 2022-01-17 2022-02-25 北京远鉴信息技术有限公司 Method and device for determining participant voice, electronic equipment and storage medium
WO2023160288A1 (en) * 2022-02-25 2023-08-31 京东方科技集团股份有限公司 Conference summary generation method and apparatus, electronic device, and readable storage medium
CN114936001A (en) * 2022-04-14 2022-08-23 阿里巴巴(中国)有限公司 Interaction method and device and electronic equipment
CN114757155B (en) * 2022-06-14 2022-09-27 深圳乐播科技有限公司 Conference document generation method and device
CN114757155A (en) * 2022-06-14 2022-07-15 深圳乐播科技有限公司 Method and device for generating conference document
CN115828907A (en) * 2023-02-16 2023-03-21 南昌航天广信科技有限责任公司 Intelligent conference management method, system, readable storage medium and computer equipment
CN117312612A (en) * 2023-10-07 2023-12-29 广东鼎尧科技有限公司 Multi-mode-based teleconference data recording method, system and medium
CN117312612B (en) * 2023-10-07 2024-04-02 广东鼎尧科技有限公司 Multi-mode-based teleconference data recording method, system and medium

Similar Documents

Publication Publication Date Title
CN111986677A (en) Conference summary generation method and device, computer equipment and storage medium
WO2020140665A1 (en) Method and apparatus for quality detection of double-recorded video, and computer device and storage medium
US11646038B2 (en) Method and system for separating and authenticating speech of a speaker on an audio stream of speakers
Jayagopi et al. Modeling dominance in group conversations using nonverbal activity cues
CN112037791B (en) Conference summary transcription method, apparatus and storage medium
WO2020077885A1 (en) Identity authentication method and apparatus, computer device and storage medium
CN110853646A (en) Method, device and equipment for distinguishing conference speaking roles and readable storage medium
CN109949818A (en) Conference management method and related device based on voiceprint recognition
CN109815489A (en) Collection information generating method, device, computer equipment and storage medium
CN110265032A (en) Conferencing data analysis and processing method, device, computer equipment and storage medium
CN110766442A (en) Client information verification method, device, computer equipment and storage medium
CN113840040B (en) Man-machine cooperation outbound method, device, equipment and storage medium
CN111444323A (en) Accident information rapid acquisition method and device, computer equipment and storage medium
CN109766474A (en) Inquest signal auditing method, device, computer equipment and storage medium
CN109785834B (en) Voice data sample acquisition system and method based on verification code
CN109389028A (en) Face identification method, device, equipment and storage medium based on motion analysis
WO2024032159A1 (en) Speaking object detection in multi-human-machine interaction scenario
CN111626061A (en) Conference record generation method, device, equipment and readable storage medium
CN113571096B (en) Speech emotion classification model training method and device, computer equipment and medium
US11611554B2 (en) System and method for assessing authenticity of a communication
CN113873088A (en) Voice call interaction method and device, computer equipment and storage medium
CN112580459A (en) Service processing method, device, computer equipment and medium based on biological recognition
CN112637613A (en) Live broadcast audio processing method and device, computer equipment and storage medium
US20220046128A1 (en) Method and system for remote interaction between at least one user and a human operator and between at least one user and at least one automated agent
CN113542509B (en) Emergency processing method, device, storage medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination