CN111986677A - Conference summary generation method and device, computer equipment and storage medium

Info

Publication number: CN111986677A
Application number: CN202010910529.2A
Authority: CN
Inventors: 王金燕
Original assignee: OneConnect Financial Technology Co Ltd Shanghai
Current assignee: OneConnect Smart Technology Co Ltd; OneConnect Financial Technology Co Ltd Shanghai
Priority date: 2020-09-02
Filing date: 2020-09-02
Publication date: 2020-11-24

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a conference summary generation method, device, equipment and storage medium. The conference summary generation method comprises: receiving a media data stream corresponding to a teleconference in real time; performing speech detection on the media data stream and determining a speaker identifier; calling a voice recognition module to recognize the speech content of the speaker according to the speaker identifier, and acquiring a speech text corresponding to the speaker identifier; and extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference. The method can effectively improve the efficiency of recording conference summaries and reduces the risk of omitting content.

Description

Conference summary generation method and device, computer equipment and storage medium

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a conference summary generation method, a conference summary generation device, computer equipment and a storage medium.

Background

As modern enterprises grow in scale, their staff are distributed ever more widely, so the demand for collaboration across locations is increasingly urgent. A teleconference (remote conference) is a form of conference that uses modern communication means to enable such collaboration.

At present, participants discuss the conference topics during the teleconference, and after it ends the conference summary, including resolutions, task assignments and other work content, must be organized manually by staff specifically responsible for recording. This consumes working time, is inefficient, and can leave key conference content incompletely recorded.

Disclosure of Invention

The embodiments of the invention provide a conference summary generation method, a conference summary generation device, computer equipment and a storage medium, to solve the problem that conference summaries in current teleconferences must be organized manually by a recorder, which is inefficient and prone to omissions.

A conference summary generation method, comprising:

receiving a media data stream corresponding to the teleconference in real time;

carrying out speech detection on the media data stream and determining a speaker identifier;

calling a voice recognition module to recognize the speaking content of the speaker according to the speaker identification, and acquiring a speaking text corresponding to the speaker identification;

and extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference.

A conference summary generation apparatus comprising:

the data receiving module is used for receiving the media data stream corresponding to the teleconference in real time;

a speaker identifier determining module, configured to perform speech detection on the media data stream and determine a speaker identifier;

the speech text recognition module is used for calling the voice recognition module to recognize speech contents of a speaker according to the speaker identification and acquiring a speech text corresponding to the speaker identification;

and the summary text acquisition module is used for extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference.

A computer device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, said processor implementing the steps of the above-mentioned conference summary generation method when executing said computer program.

A computer storage medium, storing a computer program which, when executed by a processor, implements the steps of the above-described conference summary generation method.

In the conference summary generation method, the conference summary generation device, the computer equipment and the storage medium, the media data stream corresponding to the teleconference is received in real time so as to carry out speech detection on the media data stream, the speaker identification is automatically determined, then the voice recognition module is called to recognize the effective speech content of the speaker in the speech state according to the speaker identification so as to acquire the speech text corresponding to the speaker identification, and the speech content is associated with the speaker so as to facilitate the follow-up review of conference content by participants. And finally, extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference, so that the problems that the conference summary needs to be manually arranged in the current teleconference, the efficiency is low, and omission is easy to occur are effectively solved.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive labor.

FIG. 1 is a schematic diagram of an application environment of a conference summary generation method according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method of generating a conference summary in one embodiment of the present invention;

FIG. 3 is a flow chart of a method of generating a conference summary in one embodiment of the present invention;

FIG. 4 is a detailed flowchart of step S302 in FIG. 3;

FIG. 5 is a flow chart of a method of generating a conference summary in one embodiment of the present invention;

FIG. 6 is a detailed flowchart of step S204 in FIG. 2;

FIG. 7 is a flow chart of a method of generating a conference summary in one embodiment of the present invention;

FIG. 8 is a flow chart of a method of generating a conference summary in one embodiment of the present invention;

FIG. 9 is a schematic diagram of a conference summary generation apparatus in accordance with an embodiment of the present invention;

FIG. 10 is a schematic diagram of a computer device according to an embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The conference summary generation method is applicable in an application environment as in fig. 1, where a computer device communicates with a server over a network. The computer device may be, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server may be implemented as a stand-alone server.

In an embodiment, as shown in fig. 2, a conference summary generation method is provided, which is described by taking the server in fig. 1 as an example, and includes the following steps:

S201: and receiving the media data stream corresponding to the teleconference in real time.

The method can be applied to a teleconference tool to automatically generate a conference summary from the conference content without manual intervention, which saves time and helps ensure that the summary is comprehensive. Specifically, when the teleconference starts, the server may receive in real time the media data stream fed back by the terminal of each participant. The media data stream includes an audio stream or a video stream, which is not limited here.

S202: and carrying out speech detection on the media data stream and determining the identifier of the speaker.

The speaker identifier uniquely identifies the speaker. Determining the speaker identifier also determines whether the voice recognition module needs to be started to recognize speech content, so the module does not have to run for the entire conference, which reduces memory usage. It is to be understood that, because the media data stream includes an audio stream or a video stream, speech detection is performed in a different manner for each.

S203: and calling a voice recognition module to recognize the speaking content of the speaker according to the speaker identification, and acquiring a speaking text corresponding to the speaker identification.

Specifically, in a teleconference, the audio recorded by the terminal corresponding to the speaker identifier may contain device noise or background noise from the speaker's surroundings. In this embodiment, a speech enhancement algorithm (including but not limited to spectral subtraction, an EEMD decomposition algorithm, or an SVD singular value algorithm) may therefore be applied to denoise the speech content (i.e., the audio data carried by the video stream), so as to ensure the accuracy of speech recognition and thereby the accuracy and reliability of the subsequently generated conference summary.

It can be understood that when the speech recognition module is called, the VAD algorithm may be used to perform silence detection to count the silence duration of the speaker after the speech is started, and if the silence duration is detected to exceed the preset threshold, the speaker is considered to finish speaking at this time, the recognition program of the speech recognition module is stopped, and the recognition text corresponding to the segment of audio data is used as the speech text corresponding to the speaker identifier. By associating the speaking content with the speaker, the participant can subsequently review the meeting content.
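The silence-based end-of-speech rule described above can be sketched as follows. This is a minimal illustration only: the source does not specify a VAD algorithm, so a simple per-frame energy threshold is assumed as a stand-in, and the names `frame_energies`, `energy_threshold`, `frame_ms` and `max_silence_ms`, together with their default values, are hypothetical.

```python
from typing import Optional, Sequence


def find_speech_end(
    frame_energies: Sequence[float],
    energy_threshold: float = 0.01,  # assumed: below this a frame counts as silent
    frame_ms: int = 20,              # assumed frame length in milliseconds
    max_silence_ms: int = 600,       # assumed preset silence threshold
) -> Optional[int]:
    """Return the frame index at which the speaker is considered finished.

    Silence is only counted after speech has started. Returns None if the
    silence duration never exceeds the preset threshold.
    """
    speech_started = False
    silence_ms = 0
    for index, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            speech_started = True
            silence_ms = 0  # any speech frame resets the silence counter
        elif speech_started:
            silence_ms += frame_ms
            if silence_ms > max_silence_ms:
                return index  # recognition would be stopped here
    return None
```

In such a sketch, the text recognized up to the returned frame would be taken as the speech text for the current speaker identifier.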

In a remote video conference, whether the speaker has finished speaking may be determined from the recognition sequence output by the speech detection model: if the sequence contains recognition results in which the mouth is closed for at least two consecutive frames, for example 001011111, the speaker is considered to have finished speaking. The recognition process of the voice recognition module is then stopped, and the recognized text corresponding to that segment of audio data is used as the speech text corresponding to the speaker identifier.

S204: and extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference.

Specifically, keyword extraction is performed on the speech text corresponding to each speaker identifier according to the plurality of preset extraction fields corresponding to the conference type, so that a target summary text corresponding to the teleconference is obtained. The summary may be extracted in real time, or keyword extraction may be performed on the speech text in one pass after the conference ends, which is not limited here.

In the embodiment, the media data stream corresponding to the teleconference is received in real time so as to perform speaking detection on the media data stream, the speaker identifier is automatically determined, then the voice recognition module is called to recognize the effective speaking content of the speaker in the speaking state in a targeted manner according to the speaker identifier so as to acquire the speaking text corresponding to the speaker identifier, and the speaking content is associated with the speaker so as to facilitate the participants to review the conference content subsequently. And finally, extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference, so that the problems that the current teleconference summary needs to be manually arranged, the efficiency is low, and omission is easy are effectively solved.
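The overall flow of steps S201 to S204 can be sketched as a small pipeline. This is an illustration under stated assumptions, not the patented implementation: `MediaChunk`, `generate_summary` and the three injected callables are hypothetical names, and the actual detection, recognition and extraction components are left abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class MediaChunk:
    participant_id: str  # terminal/participant that sent the chunk
    payload: bytes       # raw audio or video data


def generate_summary(
    chunks: Iterable[MediaChunk],
    detect_speaker: Callable[[MediaChunk], Optional[str]],  # S202
    recognize: Callable[[MediaChunk], str],                 # S203
    extract: Callable[[Dict[str, List[str]]], List[str]],   # S204
) -> List[str]:
    """Run S201-S204 over a stream of media chunks."""
    utterances: Dict[str, List[str]] = {}
    for chunk in chunks:  # S201: chunks arrive in real time
        speaker_id = detect_speaker(chunk)
        if speaker_id is None:
            continue  # nobody is speaking, so the recognizer is not called
        text = recognize(chunk)
        utterances.setdefault(speaker_id, []).append(text)
    return extract(utterances)
```

Skipping the recognizer when no speaker is detected mirrors the point made in S202 that the voice recognition module does not need to run for the whole conference.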

In one embodiment, as shown in FIG. 3, the teleconference includes a remote video conference. The media data stream comprises the video stream transmitted in real time by the terminal corresponding to each participant once the remote video conference starts. The conference summary generation method specifically comprises the following steps:

S301: and receiving a video stream corresponding to the teleconference in real time.

S302: and adopting a pre-trained speech detection model to detect speech of the video stream and determining the identifier of the speaker.

S303: and calling a voice recognition module to recognize the speaking content of the speaker according to the speaker identification, and acquiring a speaking text corresponding to the speaker identification.

When the remote video conference starts, the camera on the terminal corresponding to each participant is turned on, a video stream is recorded in real time, and the video stream is transmitted to the server. Specifically, if the conference form of the current conference is a remote video conference, the server receives in real time the video stream recorded by the terminal of each participant. A pre-trained speech detection model then performs speech detection on each participant's video stream to determine the speaker identifier corresponding to that stream, that is, which participant is speaking. While the speaker is speaking, the voice recognition module is called to recognize the speech content, that is, to recognize the audio data fed back by the terminal corresponding to the speaker identifier, and the speech text corresponding to the speaker identifier is acquired.

S304: and extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference.

Specifically, step S304 is consistent with step S204, and is not described herein again to avoid repetition.

In one embodiment, as shown in FIG. 4, the video stream includes a plurality of frames of video frame images carrying time tags; in step S302, that is, a pre-trained speech detection model is used to perform speech detection on the video stream, and a speaker identifier is determined, which specifically includes the following steps:

S401: and inputting a plurality of frames of video frame images carrying time labels into the speech detection model according to a time sequence for identification, and acquiring an identification result of each frame of video frame image.

Specifically, the video frame images carrying time labels are input into the speech detection model in chronological order, and a recognition result is obtained for each frame. The recognition result reflects the mouth state of the participant at that moment, namely whether the mouth is open. The speech detection model can be obtained by deep learning training on a large number of pre-labeled images of speakers with the mouth open (label "0") and with the mouth closed (label "1"). The time label reflects the position of each video frame image on the time axis; for example, the time label of the first video frame image corresponds to the first second on the time axis, the time label of the second video frame image corresponds to the second second, and so on.

S402: and judging each recognition result according to a preset judgment strategy to obtain a speech detection result corresponding to the video stream.

S403: based on the utterance detection result, a corresponding speaker identification is determined.

The speech detection result reflects the lip movement state, which indicates whether the participant is currently speaking. For example, assume that the recognition sequence formed by the recognition results of the video frame images in the video stream is "001101". If the sequence contains recognition results in which the mouth is open for two or more consecutive frames, the participant is considered to be speaking at that time, and the participant identifier corresponding to the video stream is taken as the identifier of the current speaker.

In the embodiment, whether the current participant is in the speaking state or not is judged by the mouth state corresponding to the identification result of the video frame images of the continuous multiple frames, so that the accuracy of the determination of the speaker is ensured, and the problem of misjudgment caused by the fact that the judgment is carried out only by adopting the identification result of the single frame video frame image is avoided.
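The consecutive-frame judgment strategy can be sketched as a run-length check over the recognition sequence. The sketch assumes the labeling described in S401 ("0" for an open mouth, "1" for a closed mouth) and a minimum run of two frames; `has_consecutive` and `is_speaking` are hypothetical helper names.

```python
OPEN, CLOSED = "0", "1"  # label convention described in S401


def has_consecutive(sequence: str, label: str, min_run: int = 2) -> bool:
    """Return True if `label` appears at least `min_run` times in a row."""
    run = 0
    for result in sequence:
        run = run + 1 if result == label else 0
        if run >= min_run:
            return True
    return False


def is_speaking(sequence: str) -> bool:
    """A participant is speaking if the mouth is open in consecutive frames."""
    return has_consecutive(sequence, OPEN)
```

The same helper, called with `CLOSED`, would express the end-of-speech check on a sequence such as 001011111 mentioned in S203.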

In one embodiment, as shown in FIG. 5, the teleconference includes a remote voice conference; the media data stream comprises the audio stream transmitted in real time by the terminal corresponding to each participant once the remote voice conference starts; the conference summary generation method specifically comprises the following steps:

S501: and receiving an audio stream corresponding to the teleconference in real time.

S502: and extracting target voiceprint characteristics corresponding to the audio stream, comparing the target voiceprint characteristics with the voiceprint characteristics corresponding to the pre-stored participant identifications, and determining the speaker identifications.

The original voiceprint features and the target voiceprint features include, but are not limited to, mel-frequency spectrum features, mel-filter features, and the like.

When the remote voice conference starts, the microphone on the terminal corresponding to each participant is turned on, an audio stream is recorded in real time, and the audio stream is transmitted to the server. Specifically, once the remote voice conference starts, the server receives in real time the audio stream recorded by the terminal of each participant, extracts voiceprint features from the audio stream with a voiceprint feature extraction algorithm to obtain the target voiceprint features, and compares them with the pre-stored original voiceprint features corresponding to each participant identifier to determine the speaker identifier. The speech content is thereby associated with its speaker, so the conference summary can show which speaker each text belongs to, which makes it easier for participants to review the conference content later. The voiceprint feature extraction algorithm may use a Fourier transform or a fast Fourier transform, which is not limited here.
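The feature comparison in S502 can be sketched as a nearest-match search over the pre-stored voiceprints. The source does not name a comparison metric, so cosine similarity and the acceptance threshold `min_similarity` are assumptions made for illustration; the feature vectors are taken as already extracted, and `identify_speaker` is a hypothetical name.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(
    target: Sequence[float],
    enrolled: Dict[str, Sequence[float]],  # participant id -> original voiceprint
    min_similarity: float = 0.8,           # assumed acceptance threshold
) -> Optional[str]:
    """Return the participant whose stored voiceprint best matches `target`."""
    best_id, best_score = None, min_similarity
    for participant_id, original in enrolled.items():
        score = cosine_similarity(target, original)
        if score >= best_score:
            best_id, best_score = participant_id, score
    return best_id
```

Returning `None` when no stored voiceprint is similar enough leaves it to the caller to decide how to handle an unrecognized voice.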

Furthermore, by determining the speaker identifier, the server can directly identify the audio stream fed back by the terminal corresponding to the speaker identifier, and noise interference generated by other terminals can be avoided.

S503: and calling a voice recognition module to recognize the speaking content of the speaker according to the speaker identification, and acquiring a speaking text corresponding to the speaker identification.

Specifically, step S503 is consistent with step S203, and is not described herein again to avoid repetition.

S504: and extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference to obtain a target summary text corresponding to the teleconference.

Specifically, step S504 is consistent with step S204, and is not described herein again to avoid repetition.

In this embodiment, a conference recording strategy suited to each conference form is used when calling the voice recognition module to record conference content, and a keyword extraction scheme is applied to the collected text data to generate the conference summary automatically, which improves the generality of the method.

In an embodiment, as shown in fig. 6, in step S204, that is, extracting keywords from the utterance text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference, to obtain a target summary text corresponding to the teleconference specifically includes the following steps:

S601: and extracting keywords of the speech text according to the preset extraction field to obtain a first text matched with the preset extraction field.

The first text is the text formed by the character strings that match the preset extraction fields. Specifically, the preset extraction fields are set in advance according to the conference type. To ensure that important content is not lost from the conference summary, a plurality of preset extraction fields may be used for keyword extraction. However, speech text that matches a preset extraction field may not actually be important conference content. To keep the conference summary from becoming too long, this embodiment therefore adds a content screening mechanism: the first text is screened according to a preset screening dimension, unimportant conference records are removed, and the screened text is mapped into and filled in the conference summary template corresponding to the conference type as the summary content. This yields the target summary text corresponding to the teleconference and effectively reduces the length of the conference record.

S602: and screening the first text according to a preset screening dimension to obtain a target summary text corresponding to the teleconference.

The preset filtering dimension includes, but is not limited to, the number of occurrences of the keyword (i.e., the preset extraction field) matched with the first text in the recorded speech text of the conference, the discussion duration of the topic reflected by the keyword (the discussion duration can be reflected by counting the number of continuous occurrences of the keyword in one sentence/paragraph), and the like.
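Steps S601 and S602 can be sketched as a match step followed by a screening step. The sketch assumes plain substring matching for the extraction fields, sentence splitting on end punctuation, and the occurrence count as the only screening dimension; the function names and the `min_occurrences` threshold are hypothetical.

```python
import re
from typing import Dict, List, Sequence


def extract_first_text(
    utterances: Sequence[str], fields: Sequence[str]
) -> Dict[str, List[str]]:
    """S601: collect, per extraction field, the sentences that mention it."""
    matches: Dict[str, List[str]] = {field: [] for field in fields}
    for utterance in utterances:
        for sentence in re.split(r"(?<=[.!?])\s+", utterance.strip()):
            for field in fields:
                if field.lower() in sentence.lower():
                    matches[field].append(sentence)
    return matches


def screen_by_occurrence(
    first_text: Dict[str, List[str]],
    utterances: Sequence[str],
    min_occurrences: int = 2,  # assumed screening threshold
) -> Dict[str, List[str]]:
    """S602: keep only fields whose keyword occurs often enough overall."""
    full_text = " ".join(utterances).lower()
    return {
        field: sentences
        for field, sentences in first_text.items()
        if sentences and full_text.count(field.lower()) >= min_occurrences
    }
```

A discussion-duration dimension could be added in the same way, for example by counting consecutive sentences that mention the same field.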

Further, when the conference is of the work progress statistics type, key fields such as the name of each work task, "completed" and "incomplete" can be preset as the extraction fields for that conference type. In addition, when the voice recognition module recognizes the "incomplete" field, a semantic analysis module can be called to analyze the semantics of the audio data at the next moment, so that the reason for non-completion is determined and automatically recorded in the conference summary. The semantic analysis module may use natural language processing (NLP) to perform the semantic analysis.
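The handling of the "incomplete" field can be illustrated with a deliberately naive sketch. Real semantic analysis is outside its scope: as a stand-in, the sentence that follows a match is simply recorded as the candidate reason. The function name and the field text `incomplete` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def collect_incomplete_reasons(
    sentences: Sequence[str],
    incomplete_field: str = "incomplete",  # assumed field text
) -> List[Tuple[str, Optional[str]]]:
    """Pair each sentence containing the field with the sentence after it.

    The following sentence stands in for the output of a semantic analysis
    module; None is used when nothing follows the match.
    """
    records: List[Tuple[str, Optional[str]]] = []
    for index, sentence in enumerate(sentences):
        if incomplete_field in sentence.lower():
            following = sentences[index + 1] if index + 1 < len(sentences) else None
            records.append((sentence, following))
    return records
```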

In an embodiment, as shown in fig. 7, before step S201, the method for generating a conference summary further includes the following steps:

S701: and acquiring a remote conference initiating request, wherein the remote conference initiating request comprises a conference type, a conference form and participant identifications.

The conference type can be selected by the user according to actual needs. Conference types include, but are not limited to, project conferences and regular meetings; project conferences include project kickoff conferences, project status review conferences, project technical review conferences, project problem-solving conferences, and the like, which are not listed exhaustively here. Conference forms include, but are not limited to, remote video conferences and remote voice conferences.

Specifically, the conference initiator selects the participants, the conference type and the conference form according to actual needs, so that the server can respond to the remote conference initiating request.

S702: and determining the conference priority corresponding to the participant identification according to the conference type.

Wherein the meeting priority is used for reflecting the importance of the meeting personnel to the meeting. The importance of different participants to the conference is different, that is, the participants have different priorities when participating in the conference of different conference types.

Specifically, the server may preset the important roles for each conference type. For example, a project status review conference focuses on reporting the progress of a project, so its important roles include the persons in charge of each project module, and the conference priority of those participants may be set to high. Other members of the project modules may join optionally; because the reporting is done mainly by the persons in charge of each module, the conference priority of these optional participants may be set to low.

It can be understood that, when a conference is initiated, the conference priority of the role corresponding to each participant is determined according to the conference type, so as to subsequently judge whether the conference can be held normally according to the conference priority.
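The priority determination in S702 can be sketched as a lookup from conference type to important roles. The table contents, the role names and the "high"/"low" values below are hypothetical placeholders; the source only states that important roles are preset per conference type.

```python
from typing import Dict, Set

# Hypothetical configuration: important roles per conference type.
IMPORTANT_ROLES: Dict[str, Set[str]] = {
    "project_status_review": {"module_owner"},
}


def conference_priorities(
    conference_type: str, participant_roles: Dict[str, str]
) -> Dict[str, str]:
    """S702: map each participant identifier to "high" or "low"."""
    important = IMPORTANT_ROLES.get(conference_type, set())
    return {
        participant_id: "high" if role in important else "low"
        for participant_id, role in participant_roles.items()
    }
```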

S703: and receiving a conference connection result corresponding to the participant identification.

As an example, in this embodiment, a teleconference initiation request may be triggered by selecting a plurality of participants, pulling them into a discussion group (i.e., a conference group), and then selecting a conference form and a conference type, so as to establish a communication connection with the terminal of each participant. Alternatively, the conference initiator shares a two-dimensional code in the workgroup, and each participant joins by scanning the code. To ensure that conference information is not leaked, the identity of each participant is verified when joining through the shared two-dimensional code, and only participants selected by the initiator are allowed to enter.

S704: and if the conference priority corresponding to the participant identification is high and the conference connection result is failure, calling the communication module to establish communication with the terminal corresponding to the participant identification.

Specifically, if the terminal corresponding to a participant refuses the connection or does not answer for a long time, a conference connection result of failure is fed back. It can be understood that if the terminal corresponding to a participant identifier with high conference priority feeds back a connection failure, the communication connection to that terminal is initiated again. If failure is still fed back, the communication module (including but not limited to mail or telephone) is called to contact the terminal corresponding to the participant identifier and remind the participant to join the teleconference; otherwise, the conference cannot be held effectively.

S705: and if the conference priority corresponding to the participant identification is high and the conference connection result is successful, responding to the remote conference initiating request, and creating a conference group corresponding to the conference form so as to receive the media data stream corresponding to the remote conference in real time.

Specifically, if the terminal corresponding to each participant identifier with high conference priority feeds back that the conference connection is successful, the conference can start normally, and a conference holding thread is started, that is, a conference group corresponding to the conference form is created in order to hold the teleconference.
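The decision logic of S704 and S705 can be sketched as follows. The `reconnect` and `remind` callables are hypothetical stand-ins for the retry of the communication connection and for the communication module (mail or telephone); the sketch assumes that every high-priority participant must be connected before the conference group is created.

```python
from typing import Callable, Dict


def handle_connection_results(
    priorities: Dict[str, str],        # participant id -> "high" / "low"
    connected: Dict[str, bool],        # participant id -> connection result
    reconnect: Callable[[str], bool],  # hypothetical: retry, return success
    remind: Callable[[str], None],     # hypothetical: mail or phone reminder
) -> bool:
    """Return True if the conference group can be created (S705)."""
    can_start = True
    for participant_id, priority in priorities.items():
        if priority != "high" or connected.get(participant_id, False):
            continue  # low priority or already connected
        if reconnect(participant_id):
            continue  # second connection attempt succeeded
        remind(participant_id)  # S704: fall back to the communication module
        can_start = False
    return can_start
```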

In the embodiment, before the teleconference is held, the priorities of the participants are judged to determine whether the teleconference can be successfully held, so that the problem that the teleconference cannot be effectively held due to the fact that important participants cannot participate in the conference in time due to other external factors is avoided.

In an embodiment, after the target conference summary is generated, an export verification function is further provided to ensure the security of conference records and to prevent important data from leaking. As shown in fig. 8, after step S204, the conference summary generation method further includes the following steps:

S801: acquiring a conference summary export request, wherein the conference summary export request comprises a conference summary identifier, an exporter identifier and an export path.

The conference summary derivation request may be triggered manually by a user or automatically after the teleconference is ended, which is not limited herein.

S802: and performing identity verification according to the biological characteristics corresponding to the derived person identification to determine whether to respond to the conference summary derivation request.

The biological features include, but are not limited to, voiceprint features or human face features. Specifically, the voiceprint feature extraction algorithm may be used to perform feature extraction on the audio data input by the exporter to obtain the voiceprint feature, or the collected face image may be input into a face feature extraction model created in advance to extract the face feature, which is not limited herein. It should be understood that, here, the exporter is authenticated, and other authentication methods can be adopted as long as authentication can be realized.

S803: If the identity verification is passed, responding to the conference summary export request, and exporting the target conference summary corresponding to the conference summary identifier according to the export path.

The export path includes, but is not limited to, a notepad, a mailbox, or a discussion group of the participants, and is not limited herein. Specifically, an exportable role corresponding to the conference summary identifier can be set, so that the identity of the exporter is verified when the conference summary is exported.

Illustratively, the identity of the exporter can be verified by collecting the biological characteristics of the exporter and comparing them with the biological characteristics that are stored in a database in advance and correspond to the exportable role. If the identity verification is passed, the exporter is considered to have the authority to export the conference summary, the conference summary export request is responded to, and the target conference summary corresponding to the conference summary identifier is exported according to the export path.

In this embodiment, a one-click export function allows the conference summary to be exported along different export paths, which makes it convenient for participants to view it on their respective terminals. In addition, an export verification function is added to verify the identity of the exporter, so as to effectively guarantee the security of the conference summary.
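Steps S801 to S803 can be sketched as follows. This is a hedged illustration under stated assumptions: the request keys (`summary_id`, `exporter_id`, `export_path`), the function `handle_export_request`, the `verify_biometric` callback and the `exporters` mapping are hypothetical names, and the actual biometric comparison is left to the caller.

```python
from typing import Callable, Dict, Set


def handle_export_request(
    request: Dict[str, str],
    summaries: Dict[str, str],                  # summary identifier -> summary text
    exportable_ids: Dict[str, Set[str]],        # summary identifier -> allowed exporters
    verify_biometric: Callable[[str], bool],    # hypothetical voiceprint/face check
    exporters: Dict[str, Callable[[str], None]],  # export path -> delivery function
) -> bool:
    """Return True if the summary was exported, False if the request was refused."""
    summary_id = request["summary_id"]
    exporter_id = request["exporter_id"]

    # The exporter must hold an exportable role for this summary.
    if exporter_id not in exportable_ids.get(summary_id, set()):
        return False
    # S802: identity verification based on the exporter's biological characteristics.
    if not verify_biometric(exporter_id):
        return False
    # S803: export along the requested path, if both path and summary are known.
    send = exporters.get(request["export_path"])
    if send is None or summary_id not in summaries:
        return False
    send(summaries[summary_id])
    return True
```

Keeping the verification behind a callback means the same flow works whether voiceprint features, face features or another authentication method is used.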

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.

In an embodiment, a conference summary generation apparatus is provided, where the conference summary generation apparatus corresponds to the conference summary generation method in the above embodiment one to one. As shown in fig. 9, the conference summary generation apparatus includes a data receiving module 10, a speaker identification determination module 20, a speech text recognition module 30, and a summary text acquisition module 40. The functional modules are explained in detail as follows:

The data receiving module 10 is configured to receive a media data stream corresponding to the teleconference in real time.

And a speaker identifier determining module 20, configured to perform speech detection on the media data stream, and determine a speaker identifier.

And the speech text recognition module 30 is configured to invoke the voice recognition module to recognize speech content of the speaker according to the speaker identifier, and acquire a speech text corresponding to the speaker identifier.

And the summary text acquisition module 40 is configured to extract keywords from the utterance text according to a plurality of preset extraction fields corresponding to the conference type of the teleconference, so as to obtain a target summary text corresponding to the teleconference.

Specifically, the teleconference includes a remote video conference; the media data stream comprises a video stream transmitted in real time by the terminal corresponding to each participant once the remote video conference has started; the speaker identification determination module includes a first speaker identification determination unit.

And the first speaker identification determining unit is used for detecting the speech of the video stream by adopting a pre-trained speech detection model and determining the speaker identification.

Specifically, the first speaker identification determining unit includes a recognition result acquiring subunit, an utterance detection result acquiring subunit, and a speaker identification determining subunit.

And the identification result acquisition subunit is used for inputting a plurality of frames of video frame images carrying the time labels into the speech detection model according to the time sequence for identification, and acquiring the identification result of each frame of video frame image.

And the speech detection result acquisition subunit is used for judging each recognition result according to a preset judgment strategy and acquiring a speech detection result corresponding to the video stream.

And the speaker identification determining subunit is used for determining the corresponding speaker identification based on the speech detection result.
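The three subunits above can be illustrated with a small sketch. The preset judgment strategy is not specified in this passage, so the rule used here (a participant counts as speaking after a minimum number of consecutive frames recognized as speaking) is purely an assumption, as are the names `detect_speakers` and `min_consecutive` and the tuple layout of the per-frame recognition results.

```python
from typing import Dict, List, Tuple

# (time label, participant identifier, recognized as speaking?)
FrameResult = Tuple[float, str, bool]


def detect_speakers(frame_results: List[FrameResult], min_consecutive: int = 3) -> List[str]:
    """Return participant identifiers judged to be speaking, in order of detection."""
    streak: Dict[str, int] = {}
    speakers: List[str] = []
    # Process the recognition results in time order, as the frames are fed to the model.
    for _, pid, speaking in sorted(frame_results, key=lambda r: r[0]):
        # Assumed strategy: count consecutive "speaking" frames per participant.
        streak[pid] = streak.get(pid, 0) + 1 if speaking else 0
        if streak[pid] == min_consecutive and pid not in speakers:
            speakers.append(pid)
    return speakers
```

A threshold on consecutive frames is one simple way to suppress single-frame misrecognitions; other strategies, such as majority voting over a time window, would fit the same interface.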

Specifically, the teleconference includes a remote voice conference; the media data stream comprises an audio stream transmitted by a terminal corresponding to each participant in real time when the remote voice conference starts to be carried out; the speaker identification determination module includes a second speaker identification determination unit.

And the second speaker identification determining unit is used for extracting the target voiceprint characteristics corresponding to the audio stream, comparing the target voiceprint characteristics with the original voiceprint characteristics corresponding to the prestored participant identification, and determining the speaker identification.
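The voiceprint comparison performed by the second speaker identification determining unit can be sketched as below. This assumes that voiceprint features are already available as numeric vectors and uses cosine similarity with a threshold; the similarity measure, the threshold value and the names `cosine_similarity` and `identify_speaker` are assumptions for illustration, not details given in this document.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_speaker(
    target: Sequence[float],
    enrolled: Dict[str, Sequence[float]],  # participant identifier -> original voiceprint
    threshold: float = 0.75,               # assumed acceptance threshold
) -> Optional[str]:
    """Return the best-matching participant identifier, or None if nothing is close enough."""
    best_id, best_score = None, threshold
    for participant_id, original in enrolled.items():
        score = cosine_similarity(target, original)
        if score >= best_score:
            best_id, best_score = participant_id, score
    return best_id
```

Returning None when no stored voiceprint exceeds the threshold lets the caller treat the audio as coming from an unknown speaker rather than forcing a match.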

Specifically, the conference type corresponds to a plurality of preset extraction fields; the summary text acquisition module includes a first text acquisition unit and a target summary text acquisition unit.

And the first text acquisition unit is used for extracting keywords of the speech text according to the preset extraction field to obtain a first text matched with the preset extraction field.

And the target summary text acquisition unit is used for screening the first text according to the preset screening dimension to obtain a target summary text corresponding to the teleconference.
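The two units above can be sketched as a two-stage filter. This is an illustrative sketch only: it treats each preset extraction field as a keyword matched against sentences of the speech text, and it assumes the preset screening dimension is the speaker. The function names, the sentence-splitting rule and the example fields are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple

Utterance = Tuple[str, str]  # (speaker identifier, speech text)


def extract_first_text(
    utterances: List[Utterance], fields: List[str]
) -> Dict[str, List[Tuple[str, str]]]:
    """Collect, per preset extraction field, the sentences that mention that field."""
    result: Dict[str, List[Tuple[str, str]]] = {field: [] for field in fields}
    for speaker, text in utterances:
        # Split the speech text into sentences at ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            for field in fields:
                if field.lower() in sentence.lower():
                    result[field].append((speaker, sentence))
    return result


def screen_by_speaker(
    first_text: Dict[str, List[Tuple[str, str]]], keep_speakers: Set[str]
) -> Dict[str, List[str]]:
    """Assumed screening dimension: keep only sentences from the selected speakers."""
    return {
        field: [sentence for speaker, sentence in items if speaker in keep_speakers]
        for field, items in first_text.items()
    }
```

For example, with fields such as "deadline" and "risk" for a hypothetical project review conference type, the first stage gathers matching sentences from all speakers and the second stage narrows them to the speakers of interest.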

Specifically, the conference summary generation device further comprises a teleconference initiating module, a conference priority determining module, a conference connection feedback module, a first processing module and a second processing module.

And the teleconference initiating module is used for acquiring a teleconference initiating request, and the teleconference initiating request comprises a conference type, a conference form and participant identifications.

And the conference priority determining module is used for determining the conference priority corresponding to the participant identification according to the conference type.

And the conference connection feedback module is used for receiving the conference connection result corresponding to the participant identification.

And the first processing module is used for calling the communication module to establish communication with the terminal corresponding to the participant identification if the participant priority corresponding to the participant identification is high and the conference connection result is failure.

And the second processing module is used for responding to the remote conference initiating request and creating a conference group corresponding to the conference form to receive the media data stream corresponding to the remote conference in real time if the conference priority corresponding to the participant identification is high and the conference connection result is successful.

Specifically, the conference summary generation device further comprises an export request acquisition module, an identity verification module and a conference summary export module.

The export request acquisition module is used for acquiring a conference summary export request, and the conference summary export request comprises a conference summary identifier, an exporter identifier and an export path.

The identity verification module is used for performing identity verification according to the biological characteristics corresponding to the exporter identifier, so as to determine whether to respond to the conference summary export request.

The conference summary export module is used for responding to the conference summary export request if the identity verification is passed, and exporting the target conference summary corresponding to the conference summary identifier according to the export path.

For specific definition of the conference summary generation apparatus, reference may be made to the above definition of the conference summary generation method, which is not described herein again. The modules in the conference summary generation apparatus may be implemented in whole or in part by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a computer storage medium and an internal memory. The computer storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the computer storage media. The database of the computer device is used to store data generated or obtained during the execution of the conference summary generation method, such as a target conference summary. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a conference summary generation method.

In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor executes the computer program to implement the steps of the conference summary generation method in the above-mentioned embodiments, such as steps S201-S204 shown in fig. 2 or steps shown in fig. 3 to 8. Alternatively, the processor implements the functions of each module/unit in the embodiment of the conference summary generation apparatus when executing the computer program, for example, the functions of each module/unit shown in fig. 9, and are not described herein again to avoid repetition.

In an embodiment, a computer storage medium is provided, where a computer program is stored on the computer storage medium, and when executed by a processor, the computer program implements the steps of the conference summary generation method in the foregoing embodiments, such as steps S201 to S204 shown in fig. 2 or steps shown in fig. 3 to fig. 8, which are not described herein again to avoid repetition. Alternatively, the computer program, when executed by the processor, implements the functions of each module/unit in the embodiment of the conference summary generation apparatus, for example, the functions of each module/unit shown in fig. 9, and are not described herein again to avoid repetition.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims

1. A conference summary generation method, comprising:

receiving a media data stream corresponding to the teleconference in real time;

2. The conference summary generation method of claim 1, wherein the teleconference includes a remote video conference; the media data stream comprises a video stream transmitted in real time by the terminal corresponding to each participant once the remote video conference has started;

the performing speech detection on the media data stream and determining a speaker identifier includes:

and adopting a pre-trained speech detection model to perform speech detection on the video stream, and determining a speaker identifier.

3. The method of generating a conference summary according to claim 2, wherein the video stream includes a plurality of frames of video frame images carrying time tags;

the detecting the speech of the video stream by adopting the pre-trained speech detection model to determine the identifier of the speaker comprises the following steps:

inputting the multiple frames of video frame images carrying the time labels into the speech detection model according to a time sequence for identification, and acquiring an identification result of each frame of video frame image;

judging each recognition result according to a preset judgment strategy to obtain a speech detection result corresponding to the video stream;

and determining a corresponding speaker identifier based on the speech detection result.

4. The conference summary generation method of claim 1, wherein the teleconference includes a remote voice conference; the media data stream comprises an audio stream transmitted in real time by the terminal corresponding to each participant once the remote voice conference has started;

and extracting target voiceprint characteristics corresponding to the audio stream, comparing the target voiceprint characteristics with original voiceprint characteristics corresponding to prestored participant identifications, and determining the speaker identification.

5. The method for generating a conference summary according to claim 1, wherein the extracting keywords from the speech text according to a plurality of preset extraction fields corresponding to a conference type of the teleconference to obtain a target summary text corresponding to the teleconference comprises:

extracting keywords from the speech text according to the preset extraction field to obtain a first text matched with the preset extraction field;

and screening the first text according to a preset screening dimension to obtain a target summary text corresponding to the teleconference.

6. The method of generating a conference summary according to claim 1, wherein prior to the step of receiving a media data stream corresponding to a teleconference in real time, the method of generating a conference summary further comprises:

acquiring a remote conference initiating request, wherein the remote conference initiating request comprises a conference type, a conference form and participant identifications;

determining the conference priority corresponding to the participant identification according to the conference type;

receiving a conference connection result corresponding to the participant identification;

if the conference priority corresponding to the participant identifier is high and the conference connection result is failure, calling a communication module to establish communication with the terminal corresponding to the participant identifier;

and if the conference priority corresponding to the participant identifier is high and the conference connection result is success, responding to the teleconference initiating request, and creating a conference group corresponding to the conference form so as to receive the media data stream corresponding to the teleconference in real time.

7. The method for generating a conference summary according to claim 1, wherein the method for generating a conference summary further comprises, after extracting keywords from the utterance text according to a plurality of preset extraction fields corresponding to a conference type of the teleconference to obtain a target summary text corresponding to the teleconference:

acquiring a conference summary export request, wherein the conference summary export request comprises a conference summary identifier, an exporter identifier and an export path;

performing identity verification according to the biological characteristics corresponding to the exporter identifier to determine whether to respond to the conference summary export request;

and if the identity verification is passed, responding to the conference summary export request, and exporting the target conference summary corresponding to the conference summary identifier according to the export path.

8. A conference summary generation apparatus, comprising:

9. A computer device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, characterized in that said processor, when executing said computer program, carries out the steps of the conference summary generation method according to any one of claims 1 to 7.

10. A computer storage medium storing a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the conference summary generation method according to any one of claims 1 to 7.