CN114125368B - Conference audio participant association method and device and electronic equipment

Info

Publication number
CN114125368B
Authority
CN
China
Prior art keywords
identification information
audio
voice identification
sound
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111448173.6A
Other languages
Chinese (zh)
Other versions
CN114125368A (en)
Inventor
Wang Bin (王斌)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202111448173.6A priority Critical patent/CN114125368B/en
Publication of CN114125368A publication Critical patent/CN114125368A/en
Application granted granted Critical
Publication of CN114125368B publication Critical patent/CN114125368B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • H04N 7/15 Conference systems (television systems for two-way working)
    • G10L 17/02 Speaker identification or verification: preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/04 Speaker identification or verification: training, enrolment or model building
    • G10L 17/14 Speaker identification or verification: use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L 17/18 Speaker identification or verification: artificial neural networks; connectionist approaches
    • G10L 25/60 Speech or voice analysis techniques specially adapted for comparison or discrimination, for measuring the quality of voice signals
    • H04M 3/53 Centralised arrangements for recording incoming messages, i.e. mailbox systems
    • H04M 3/563 Conference facilities: user guidance or feature selection
    • H04M 3/567 Conference facilities: multimedia conference systems
    • H04M 3/568 Conference facilities: audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The disclosure relates to a conference audio participant association method, a conference audio participant association device and an electronic device, in the technical field of audio and video conferences, and includes the following steps: extracting a plurality of audio clips from conference audio; acquiring target sound identification information in a target audio clip, where the target audio clip is any one of the plurality of audio clips; comparing the target sound identification information with the sound identification information stored in a sound identification library; and if the target sound identification information matches first sound identification information stored in the sound identification library, associating the target audio clip with a first participant, the first sound identification information being the sound identification information of the first participant.

Description

Conference audio participant association method and device and electronic equipment
Technical Field
The disclosure relates to the technical field of audio and video conferences, and in particular relates to a conference audio participant association method, a conference audio participant association device and electronic equipment.
Background
The internet has become a major carrier for voice and video services, and remote multi-person conferencing, which includes both, is an important branch of internet applications; as a common mode of working communication, multi-person conferencing is assuming increasingly important office functions. However, because the content in conference audio has long not been associated with the participants, querying the communication content of a particular participant after the conference ends may require repeatedly listening to the recorded conference audio manually, which wastes time. Associating conference audio with the respective participants is therefore a problem to be solved.
Disclosure of Invention
In order to solve the above technical problems, or at least partially solve them, the disclosure provides a conference audio participant association method and device and an electronic device. The method acquires target sound identification information of a target audio clip in conference audio and, when the target sound identification information matches first sound identification information stored in a sound identification library, associates the target audio clip with the first participant to whom the stored first sound identification information belongs. Audio clips in the conference audio can thus be associated with specific participants, which makes querying the communication content of a specific participant simpler and more convenient.
In order to achieve the above object, the technical solution provided by the embodiments of the present disclosure is as follows:
in a first aspect, an embodiment of the present disclosure provides a method for associating a participant with conference audio, including:
extracting a plurality of audio clips from conference audio;
acquiring target sound identification information in a target audio fragment, wherein the target audio fragment is any one of the plurality of audio fragments;
comparing the target voice identification information with the voice identification information stored in the voice identification library;
and if the target voice identification information matches first voice identification information stored in the voice identification library, associating the target audio clip with a first participant, the first voice identification information being the voice identification information of the first participant.
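To make the claimed flow concrete, the following is a minimal sketch of the four steps above, assuming the sound identification information is a fixed-length speaker embedding compared by cosine similarity; the function names, the embedding interface and the 0.7 threshold are illustrative assumptions, not part of the claims:

```python
# Minimal sketch of the claimed flow; embeddings, cosine similarity and
# the 0.7 threshold are assumptions for illustration only.
from typing import Callable, Optional
import numpy as np

def associate_clips(audio_clips: list,
                    voice_library: dict,          # participant id -> embedding
                    extract_id: Callable,         # clip -> embedding
                    threshold: float = 0.7) -> dict:
    """Map each clip index to a matched participant id, or None."""
    result: dict = {}
    for i, clip in enumerate(audio_clips):
        target = extract_id(clip)                 # acquire target sound id
        best_id: Optional[str] = None
        best_score = threshold
        for participant, stored in voice_library.items():  # compare one by one
            score = float(np.dot(target, stored) /
                          (np.linalg.norm(target) * np.linalg.norm(stored)))
            if score > best_score:
                best_id, best_score = participant, score
        result[i] = best_id                       # associate (or leave None)
    return result
```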
As an optional implementation manner of the embodiment of the present disclosure, the comparing the target voice identification information with the voice identification information stored in the voice identification library includes:
determining sound identification information of a plurality of participants in a current conference from the sound identification library, wherein the current conference is a conference corresponding to the conference audio;
comparing the target voice identification information with the voice identification information of the plurality of participants;
the first voice identification information is specifically voice identification information of the first participant.
As an optional implementation manner of the embodiment of the present disclosure, before comparing the target voice identification information with the voice identification information stored in the voice identification library, the method further includes:
determining a sound channel playing scene of the target audio fragment;
the comparing the target voice identification information with the voice identification information stored in the voice identification library comprises the following steps:
determining sound identification information stored by a plurality of participants in the current conference in the sound channel playing scene from the sound identification library;
and comparing the target sound identification information with sound identification information stored by the plurality of participants in the sound channel playing scene.
As an optional implementation manner of the embodiment of the present disclosure, the channel playing scene includes: earphone play scenes or speaker play scenes.
As an optional implementation manner of the embodiment of the present disclosure, the obtaining the target sound identification information in the target audio segment includes:
inputting the target audio fragment into a first voice identification extraction model, and acquiring the target voice identification information output by the first voice identification extraction model;
the comparing the target voice identification information with the voice identification information stored in the voice identification library comprises the following steps:
acquiring second sound identification information from the sound identification library;
judging whether the second sound identification information is extracted by the first sound identification extraction model;
and if the second sound identification information is acquired through the first sound identification extraction model, the second sound identification information is used as the first sound identification information, and the target sound identification information is compared with the first sound identification information.
As an alternative implementation of the embodiments of the present disclosure, the method further includes: if the second sound identification information is not obtained through the first sound identification extraction model, obtaining an original audio fragment which is stored corresponding to the second sound identification information;
inputting the original audio fragment into the first voice identification extraction model, acquiring the first voice identification information output by the first voice identification extraction model, and comparing the target voice identification information with the first voice identification information.
As an optional implementation manner of the embodiment of the present disclosure, the extracting a plurality of audio segments from conference audio includes:
dividing the conference audio into a plurality of audio subfragments;
extracting the audio characteristic information of each audio sub-segment in the plurality of audio sub-segments, and performing audio clustering according to the audio characteristic information of each audio sub-segment to obtain the plurality of audio segments, wherein the audio segments are obtained by clustering the audio sub-segments.
As an optional implementation manner of the embodiment of the present disclosure, before comparing the target voice identification information with the voice identification information stored in the voice identification library, the method further includes:
acquiring registered audio of the first participant;
when the audio quality of the registered audio meets the preset quality condition, judging whether sound identification information associated with the identification of the first participant is stored in a sound identification library;
extracting the first voice identification information from the registered audio if the voice identification information associated with the identification of the first participant does not exist in the voice identification library;
and storing the first voice identification information, the registered audio and the identification of the first participant in a correlation way.
As an alternative implementation of the embodiments of the present disclosure, the method further includes:
if the voice identification information related to the identification of the first participant exists in the voice identification library and the storage time of the voice identification information related to the identification of the first participant is earlier than the preset time, extracting the first voice identification information from the registered audio;
and updating sound identification information associated with the identification of the first participant by adopting the first sound identification information, and storing the first sound identification information, the registered audio and the identification of the first participant in an associated manner.
As an alternative implementation of the embodiments of the present disclosure, the method further includes:
acquiring a sound channel playing scene of the registered audio;
the storing the first voice identification information, the registered audio and the identification of the first participant in association includes:
storing the sound channel playing scene, the first voice identification information, the registered audio and the identification of the first participant in association;
wherein the identification of the first participant and the first sound identification information are stored in different databases.
In a second aspect, an embodiment of the present disclosure provides a conference audio participant association device, comprising:
the extraction module is used for extracting a plurality of audio clips from conference audio;
the acquisition module is used for acquiring target sound identification information in a target audio clip, wherein the target audio clip is any one of the plurality of audio clips;
the association module is used for comparing the target voice identification information with the voice identification information stored in the voice identification library;
and if the target voice identification information is matched with the first voice identification information stored in the voice identification library, the target audio clip is associated to a first participant, and the first voice identification information is the voice identification information of the first participant.
As an optional implementation manner of the embodiment of the present disclosure, the association module is specifically configured to:
determining sound identification information of a plurality of participants in a current conference from the sound identification library, wherein the current conference is a conference corresponding to the conference audio;
comparing the target voice identification information with the voice identification information of the plurality of participants;
the first voice identification information is specifically voice identification information of the first participant.
As an optional implementation manner of the embodiment of the disclosure, the obtaining module is further configured to:
determining a sound channel playing scene of the target audio fragment;
the association module is specifically configured to:
determining sound identification information stored by a plurality of participants in the current conference in the sound channel playing scene from the sound identification library;
and comparing the target sound identification information with sound identification information stored by the plurality of participants in the sound channel playing scene.
As an optional implementation manner of the embodiment of the present disclosure, the obtaining module is specifically configured to:
inputting the target audio fragment into a first voice identification extraction model, and acquiring the target voice identification information output by the first voice identification extraction model;
the association module is specifically configured to:
acquiring second sound identification information from the sound identification library;
judging whether the second sound identification information is extracted by the first sound identification extraction model;
and if the second sound identification information is acquired through the first sound identification extraction model, the second sound identification information is used as the first sound identification information, and the target sound identification information is compared with the first sound identification information.
As an optional implementation manner of the embodiment of the disclosure, the association module is further configured to:
if the second sound identification information is not obtained through the first sound identification extraction model, obtaining an original audio fragment which is stored corresponding to the second sound identification information;
inputting the original audio fragment into the first voice identification extraction model, acquiring the first voice identification information output by the first voice identification extraction model, and comparing the target voice identification information with the first voice identification information.
As an optional implementation manner of the embodiment of the disclosure, the extraction module is specifically configured to:
dividing the conference audio into a plurality of audio subfragments;
extracting the audio characteristic information of each audio sub-segment in the plurality of audio sub-segments, and performing audio clustering according to the audio characteristic information of each audio sub-segment to obtain the plurality of audio segments, wherein the audio segments are obtained by clustering the audio sub-segments.
As an alternative implementation of the disclosed embodiment, the apparatus further includes: a registration module for:
acquiring registered audio of the first participant;
when the audio quality of the registered audio meets the preset quality condition, judging whether sound identification information associated with the identification of the first participant is stored in a sound identification library;
extracting the first voice identification information from the registered audio if the voice identification information associated with the identification of the first participant does not exist in the voice identification library;
and storing the first voice identification information, the registered audio and the identification of the first participant in a correlation way.
As an optional implementation manner of the embodiment of the disclosure, the registration module is further configured to:
if the voice identification information related to the identification of the first participant exists in the voice identification library and the storage time of the voice identification information related to the identification of the first participant is earlier than the preset time, extracting the first voice identification information from the registered audio;
and updating sound identification information associated with the identification of the first participant by adopting the first sound identification information, and storing the first sound identification information, the registered audio and the identification of the first participant in an associated manner.
As an optional implementation manner of the embodiment of the disclosure, the registration module is further configured to:
acquiring the sound channel playing scene of the registered audio;
the association module is specifically configured to:
the audio channel playing scene, the first voice identification information, the registered audio and the identification of the first participant are stored in a correlated mode;
wherein the identification of the first participant and the first sound identification information are stored in different databases.
Compared with the prior art, the technical solutions provided by the embodiments of the present disclosure have the following advantages: the method acquires target sound identification information of a target audio clip in conference audio and, when the target sound identification information matches first sound identification information stored in a sound identification library, associates the target audio clip with the first participant to whom the stored first sound identification information belongs, so that audio clips in the conference audio can be associated with specific participants and querying the communication content of a specific participant becomes simpler and more convenient. The sound identification information stored in the sound identification library is extracted from a user's voice and stored only after the user has authorized it.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, the drawings that are required for the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of a method for associating a participant with conference audio according to an embodiment of the disclosure;
fig. 2 is a schematic diagram of a voice identifier registration process of a method for associating a participant with conference audio according to an embodiment of the disclosure;
fig. 3 is a schematic view of a scenario for implementing a method according to an embodiment of the present disclosure;
fig. 4 is a second flow chart of a method for associating a participant with conference audio according to an embodiment of the disclosure;
fig. 5 is a block diagram of a conference audio participant association device according to an embodiment of the disclosure;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, a further description of aspects of the present disclosure will be provided below. It should be noted that, without conflict, the embodiments of the present disclosure and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced otherwise than as described herein; it will be apparent that the embodiments in the specification are only some, but not all, embodiments of the disclosure.
The embodiments of the disclosure provide a conference audio participant association method and device and an electronic device, which can acquire target sound identification information of a target audio clip in conference audio and, when the target sound identification information matches first sound identification information stored in a sound identification library, associate the target audio clip with the first participant to whom the stored first sound identification information belongs, so that audio clips in the conference audio can be associated with specific participants and the communication content of a specific participant can be queried more simply and conveniently.
The conference audio participant association method can be applied to a conference audio participant association device or electronic equipment, and the conference audio participant association device can be a functional module or a functional entity in the electronic equipment, which can realize the conference audio participant association method.
The electronic device may be a server, a tablet computer, a mobile phone, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device, an ultra-mobile personal computer (ultra-mobile personal computer, UMPC), a netbook or a personal digital assistant (personal digital assistant, PDA), a personal computer (personal computer, PC), etc., which is not particularly limited in this disclosure.
As shown in fig. 1, a flowchart of a method for associating a participant with conference audio according to an embodiment of the disclosure may include the following steps 101 to 104.
101. A plurality of audio clips are extracted from the conference audio.
The conference audio may be recorded during the conference.
In some embodiments, extracting a plurality of audio clips from conference audio includes: dividing conference audio into a plurality of audio subfragments; extracting the audio characteristic information of each audio sub-segment in the plurality of audio sub-segments, and performing audio clustering according to the audio characteristic information of each audio sub-segment to obtain a plurality of audio segments, wherein the audio segments are obtained by clustering the audio sub-segments.
In some embodiments, the division of conference audio into several audio sub-segments may be: and continuously and uniformly dividing the conference audio from the starting time of the conference audio according to the time sequence, wherein the time length of each audio sub-segment obtained by dividing is a first preset time length (for example, 1.5 seconds) until the time length of the last audio sub-segment is smaller than or equal to the first preset time length.
In some embodiments, dividing the conference audio into several audio sub-segments may be: intercepting audio from the starting point of the conference audio according to a preset sliding window to obtain the audio sub-segments. Here, the preset sliding window may include a window length and a sliding step, and the window length may be greater than the sliding step. For example, the window length may be 1.5 seconds and the sliding step 0.75 seconds. The window length is the duration of each intercepted audio sub-segment, and the sliding step is the time difference between the starting times of two adjacent intercepting operations. Because the window length is greater than the sliding step, adjacent audio sub-segments overlap, so all audio data in the conference audio is reflected in the sub-segments and it can be ensured that no information in the conference audio is lost.
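As a rough illustration of the sliding-window division just described, the following sketch assumes 16 kHz mono samples in a NumPy array; the function name and defaults mirror the 1.5-second window and 0.75-second step given above as examples:

```python
import numpy as np

def split_into_subsegments(audio: np.ndarray, sr: int = 16000,
                           window_s: float = 1.5,
                           step_s: float = 0.75) -> list:
    """Cut overlapping sub-segments; because window > step, adjacent
    sub-segments overlap and no audio between them is lost."""
    win, step = int(window_s * sr), int(step_s * sr)
    segments = []
    for start in range(0, len(audio), step):
        segments.append(audio[start:start + win])
        if start + win >= len(audio):   # last (possibly shorter) window
            break
    return segments
```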
The audio feature information may be short-time spectral features such as Mel Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP) features or Filter Banks (FBank), or features extracted by a Time Delay Neural Network (TDNN), such as the identity vector (i-vector).
In some embodiments, various clustering algorithms may be employed in the process of performing audio clustering according to the feature information of each audio sub-segment to obtain a plurality of audio segments.
Wherein, the clustering algorithm may include a distance-based clustering algorithm, a density-based clustering algorithm, or the like, which the present disclosure does not specifically limit. For example, spectral clustering, K-Means clustering, mean-shift clustering, Expectation-Maximization (EM) clustering with a Gaussian Mixture Model (GMM), agglomerative hierarchical clustering, graph community detection, and the like may be employed.
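As one hedged sketch of the feature-extraction and clustering step, the example below uses mean MFCC vectors with spectral clustering, one of the algorithm choices listed above; librosa and scikit-learn are assumed available, and for simplicity the number of speakers is assumed known (in practice it may have to be estimated):

```python
import numpy as np
import librosa
from sklearn.cluster import SpectralClustering

def cluster_subsegments(subsegments, sr=16000, n_speakers=3):
    """Group sub-segments by speaker using mean MFCC vectors.
    Assumes enough sub-segments for the nearest-neighbors affinity."""
    feats = np.stack([
        librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=20).mean(axis=1)
        for seg in subsegments
    ])
    labels = SpectralClustering(n_clusters=n_speakers,
                                affinity="nearest_neighbors").fit_predict(feats)
    # Each resulting "audio segment" is the set of sub-segments in one cluster.
    clusters = {}
    for seg, label in zip(subsegments, labels):
        clusters.setdefault(int(label), []).append(seg)
    return clusters
```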
102. Target sound identification information in the target audio clip is acquired.
Wherein the target audio clip is any one of the plurality of audio clips. That is, each audio clip in the conference audio may be associated with a corresponding participant by performing steps 102 to 104 for each of the plurality of audio clips.
In some embodiments, obtaining target sound identification information in a target audio clip includes: and inputting the target audio fragment into the first voice identification extraction model, and acquiring target voice identification information output by the first voice identification extraction model.
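The patent does not fix a particular model; as an assumption for illustration, the first voice identification extraction model can be any pre-trained network that maps a clip to a fixed-length speaker embedding (e.g., a TDNN/x-vector style model as mentioned above). A minimal wrapper might look like this:

```python
import numpy as np
import torch

def extract_sound_id(model: torch.nn.Module, clip: np.ndarray) -> np.ndarray:
    """Feed the target audio clip to the extraction model and return its
    fixed-length embedding as the target sound identification information."""
    with torch.no_grad():
        wav = torch.from_numpy(clip).float().unsqueeze(0)  # (1, n_samples)
        emb = model(wav)                                   # (1, embed_dim)
    return emb.squeeze(0).cpu().numpy()
```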
103. The target voice identification information is compared with the voice identification information stored in the voice identification library.
A large amount of sound identification information may be stored in the sound identification library. Comparing the target sound identification information with the sound identification information stored in the library may be done by comparing the target sound identification information with the stored sound identification information one by one.
By way of example, the sound identification information in the embodiments of the present disclosure may be any information capable of identifying the sound characteristics of the user.
In the embodiment of the disclosure, personnel identification information corresponding to the voice identification information may also be stored in a voice identification library or other databases.
For example, the voice identification library stores the following 4 pieces of voice identification information: sound identification A, sound identification B, sound identification C and sound identification D, and the other database stores the following person identifications: person A, person B, person C and person D, where sound identification A is the sound identification information of person A, sound identification B that of person B, sound identification C that of person C, and sound identification D that of person D; the association storage of sound identification information and person identifications is shown in Table 1. Specifically, the sound identification information may be associated with the person identifications according to the storage order in the databases, or according to some association information (e.g., certain association identifiers).
TABLE 1

    Sound identification information    Person identification
    Sound identification A              Person A
    Sound identification B              Person B
    Sound identification C              Person C
    Sound identification D              Person D
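A brief sketch of this lookup, with the sound identifications and person identifications held in two stores linked by a shared key as in Table 1; the cosine measure and the 0.75 threshold are assumptions for illustration:

```python
# Sketch of the one-by-one comparison against the library of Table 1.
import numpy as np

def find_matching_person(target_id: np.ndarray,
                         id_library: dict,      # key -> stored sound id
                         person_db: dict,       # key -> person identification
                         threshold: float = 0.75):
    for key, stored_id in id_library.items():
        cos = float(np.dot(target_id, stored_id) /
                    (np.linalg.norm(target_id) * np.linalg.norm(stored_id)))
        if cos >= threshold:
            return person_db.get(key)   # e.g. key "A" -> "Person A"
    return None                         # no stored id matched
```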
For the embodiment in which the target voice identification information is output by the first voice identification extraction model, the comparing the target voice identification information with the voice identification information stored in the voice identification library includes:
in some embodiments, second sound identification information is obtained from a sound identification library; judging whether the second sound identification information is extracted through the first sound identification extraction model; and if the second sound identification information is acquired through the first sound identification extraction model, the second sound identification information is taken as the first sound identification information, and the target sound identification information is compared with the first sound identification information.
Further, if the second sound identification information is not obtained through the first sound identification extraction model, obtaining an original audio fragment which is stored corresponding to the second sound identification information; the original audio clip is input into a first voice identification extraction model, first voice identification information output by the first voice identification extraction model is obtained, and target voice identification information is compared with the first voice identification information.
In the above embodiment, when it is determined that the second sound identification information was acquired through the first sound identification extraction model, it can be confirmed that the pre-stored sound identification information and the sound identification information extracted from the current conference audio were acquired by the same model, so the two can be compared directly. When it is determined that the second sound identification information was not acquired through the first sound identification extraction model, the models used for the two extractions differ, and comparing them may introduce a large error and cause misidentification. In that case, the original audio clip from which the second sound identification information was extracted at storage time may be acquired, and the first sound identification information re-extracted from it with the same model and compared with the target sound identification information extracted from the current conference audio, so that no large error exists in the comparison and misidentification is avoided.
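One way to realize this check, assuming each library entry records which extraction model produced it along with the original registered audio (the "model_tag" field and the entry schema are illustrative assumptions):

```python
# Sketch of the model-consistency check; the entry schema is an assumption.
def get_comparable_id(entry: dict, first_model, extract_fn):
    """Return sound identification information comparable with ids produced
    by `first_model`, re-extracting from the stored original audio clip if
    the library entry was produced by a different model."""
    if entry["model_tag"] == getattr(first_model, "tag", None):
        return entry["sound_id"]              # same model: compare directly
    # Different model: re-extract from the original audio to avoid the
    # cross-model comparison error described above.
    return extract_fn(first_model, entry["original_audio"])
```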
In some embodiments, comparing the target sound identification information with sound identification information stored in the sound identification library includes: determining sound identification information of a plurality of participants in a current conference from a sound identification library, wherein the current conference is a conference corresponding to conference audio; the target sound identification information is compared with sound identification information of a plurality of participants.
The first voice identification information is specifically voice identification information of a first participant in voice identification information of a plurality of participants in the current conference.
In general, the sound identification library contains sound identification information for a large number of people. Before comparing the target sound identification information with the stored sound identification information, the library may first be screened to determine the sound identification information of the several participants of the current conference, and the target sound identification information compared only within this range. This narrows the range of data compared and can improve comparison efficiency.
In some embodiments, before determining the sound identification information of the plurality of participants in the current conference from the sound identification library, a channel playing scene of the target audio clip may also be determined, where the channel playing scene includes a headphone playing scene or a speaker playing scene. Then, before comparing the target sound identification information with the stored sound identification information, the sound identification information in the library may be screened according to the channel playing scene, to determine the stored sound identification information whose channel playing scene is consistent with that of the target audio clip. This narrows the range of data compared; moreover, because the comparison is performed on sound identification information extracted under a consistent channel playing scene, the comparison result is more accurate, avoiding misidentification caused by errors in the comparison result due to differences in channel playing scenes.
The sound channel playing scene corresponding to the sound identification information refers to the sound channel playing scene of the original audio of the extracted sound identification information.
In some embodiments, before determining the sound identification information of the plurality of participants in the current conference from the sound identification library, the channel playing scene of the target audio clip may also be determined; then the sound identification information stored by the plurality of participants in the current conference under that channel playing scene is determined from the library; further, the target sound identification information is compared with the sound identification information stored by the plurality of participants under that channel playing scene.
In the above embodiment, the correspondence between the sound identification information in the sound identification library and its channel playing scene may be stored in advance, so that before comparing sound identification information it can be confirmed whether the corresponding channel playing scene is consistent with that of the target audio clip; if so, the comparison is performed, and if not, it is skipped.
In some embodiments, the sound identification information of the plurality of participants in the current conference is determined from the sound identification library, where that sound identification information is stored in association with a channel playing scene.
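The two screening steps can be sketched together as a simple filter; the entry schema ("person", "scene", "sound_id") and the scene labels are assumptions for illustration:

```python
# Sketch of library screening: keep only entries belonging to the current
# conference's participants whose channel playing scene matches the clip's.
def screen_library(library: list, participant_ids: set,
                   target_scene: str) -> list:
    return [entry for entry in library
            if entry["person"] in participant_ids
            and entry["scene"] == target_scene]   # "headphone" or "speaker"
```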
104. If the target voice identification information is matched with the first voice identification information stored in the voice identification library, the target audio clip is associated with the first participant.
Wherein the first voice identification information is the voice identification information of the first participant.
For example, when the target voice identification information matches sound identification A in Table 1 above, since sound identification A is stored in association with the identification of person A, it can be determined that sound identification A is the sound identification information of person A, and the target audio clip may then be associated with person A; that is, person A is the first participant.
The embodiment of the disclosure provides the above method, which can acquire target sound identification information of a target audio clip in conference audio and, when the target sound identification information is determined to match first sound identification information stored in the sound identification library, associate the target audio clip with the first participant to whom the stored first sound identification information belongs. Audio clips in the conference audio can thus be associated with specific participants, making it simpler and more convenient to query the communication content of a specific participant.
In the embodiment of the disclosure, before comparing the target voice identification information with the voice identification information stored in the voice identification library, a voice identification registration procedure may be further included.
As shown in fig. 2, a schematic voice identifier registration flow chart of a method for associating a participant with conference audio according to an embodiment of the disclosure may include the following steps 201 to 206.
201. Registration audio of the first participant is obtained.
202. Whether the audio quality of the registered audio meets the preset quality condition is judged.
When it is determined that the audio quality of the registered audio satisfies the preset quality condition, the following 203 to 206 continue to be executed; when it is determined that the audio quality does not satisfy the preset quality condition, the method returns to 201.
In some embodiments, determining whether the audio quality meets the preset quality condition may include: identifying the human voice in the registered audio, detecting the duration of the voice, and judging whether that duration is greater than or equal to a preset duration. If so, the voice lasts long enough that voice identification information can be successfully extracted from the registered audio, and the audio quality of the registered audio is considered to meet the preset quality condition.
In some embodiments, the energy ratio of human voice to noise in the registered audio may be detected, and when the energy ratio is greater than a preset value, the voice identification information may be considered to be successfully extracted from the registered audio, where the audio quality of the registered audio is considered to satisfy a preset quality condition.
In some embodiments, the ratio of the energy of the voice to the noise in the registered audio and the duration of the voice may be considered simultaneously to determine whether the audio quality meets the preset quality condition.
Other factors of the registered audio quality may also be considered in the embodiments of the present disclosure; the embodiments of the present disclosure are not limited in this respect.
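A rough sketch of such a quality gate, combining voiced duration and the voice-to-noise energy ratio; the energy-threshold VAD and all numeric thresholds are illustrative assumptions:

```python
# Sketch of the registration-audio quality gate: voiced duration and
# voice-to-noise energy ratio; VAD method and thresholds are assumptions.
import numpy as np

def audio_quality_ok(audio: np.ndarray, sr: int = 16000,
                     min_voice_s: float = 3.0, min_ratio: float = 5.0,
                     frame_s: float = 0.03, energy_floor: float = 1e-4) -> bool:
    frame = int(frame_s * sr)
    energies = np.array([np.mean(audio[i:i + frame] ** 2)
                         for i in range(0, len(audio) - frame, frame)])
    voiced = energies > energy_floor          # crude energy-based VAD
    voice_s = voiced.sum() * frame_s          # total voiced duration
    noise_e = energies[~voiced].sum() + 1e-12
    ratio = energies[voiced].sum() / noise_e  # voice-to-noise energy ratio
    return voice_s >= min_voice_s and ratio >= min_ratio
```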
203. It is determined whether sound identification information associated with the identification of the first participant is stored in the sound identification library.
If no voice identification information associated with the identification of the first participant exists in the voice identification library, 204 is executed and then 206 directly, skipping 205. If voice identification information associated with the identification of the first participant exists in the library and its storage time is earlier than a preset time, the following 204, 205 and 206 are executed. If such voice identification information exists and its storage time is not earlier than the preset time, 204, 205 and 206 need not be executed, and the existing voice identification information associated with the identification of the first participant is retained.
204. First sound identification information is extracted from the registered audio.
205. The voice identification information associated with the identification of the first participant is updated with the first voice identification information.
206. The first sound identification information, the registered audio and the identification of the first participant are stored in association.
In the case where no voice identification information associated with the identification of the first participant exists in the voice identification library, the first voice identification information is extracted from the registered audio, and the first voice identification information, the registered audio and the identification of the first participant are stored in association, completing the voice identification registration process for the first participant.
In the case where voice identification information associated with the identification of the first participant already exists in the voice identification library, the voice identification registration flow has been completed before, so the first voice identification information need not be extracted from the registered audio again.
In some embodiments, when voice identification information associated with the identification of the first participant exists in the voice identification library, whether its storage time is earlier than a preset time is further considered. When the storage time is earlier than the preset time, the previously completed registration may no longer be applicable because too much time has passed; in that case the first voice identification information may be extracted from the registered audio, the voice identification information associated with the identification of the first participant updated, and the first voice identification information, the registered audio and the identification of the first participant stored in association, thereby updating the voice identification registration and keeping the first participant's registration current. Accordingly, when the storage time is not earlier than the preset time, the previously completed registration is considered recent and still valid, so no update need be performed.
In some embodiments, the channel playing scene of the registered audio may also be acquired in the voice identification registration process; the storing in association at 206 may then include storing the channel playing scene, the first voice identification information, the registered audio and the identification of the first participant in association. That is, association storage of the channel playing scene is added, so that later comparisons of sound identification information can be performed under the same channel playing scene, improving comparison accuracy.
In some embodiments, the identification of the first participant and the first sound identification information may be stored in separate databases. That is, the user identity information corresponding to a voice identifier and the voice identifier information itself can be stored in separate databases, isolating them from each other and ensuring the security of user data.
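The record layout implied above can be sketched as two stores linked by an opaque key; the field names and key scheme are illustrative assumptions:

```python
# Sketch of the registration record layout: sound ids in one store, person
# identities in another, linked by an opaque key; the schema is illustrative.
import time
import uuid

def register(sound_id_db: dict, identity_db: dict,
             person_id: str, sound_id, audio, scene: str) -> str:
    key = uuid.uuid4().hex                    # association identifier
    sound_id_db[key] = {
        "sound_id": sound_id,                 # e.g. speaker embedding
        "original_audio": audio,              # kept for later re-extraction
        "scene": scene,                       # "headphone" or "speaker"
        "stored_at": time.time(),             # for the staleness check
    }
    identity_db[key] = person_id              # identity kept separately
    return key
```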
The conference audio participant association method provided in the embodiments of the disclosure can be realized through interaction between devices, for example between a terminal device and a server. As shown in fig. 3, a schematic view of a scenario for implementing the method of the embodiment of the disclosure, the scenario involves a server 30 and three terminal devices: terminal device 31, terminal device 32 and terminal device 33, through which user A, user B and user C respectively attend an online conference. The terminal device 31 used by user A records the conference audio and sends it to the server 30; the server 30 performs participant association on the conference audio and sends the association result back to terminal device 31.
As shown in fig. 4, a flowchart of a method for associating a participant with conference audio may include the following steps:
401. The terminal device sends the recorded conference audio to the server.
The terminal device here may be the terminal device 31 in fig. 3 described above.
402. The server extracts a plurality of audio clips from the conference audio.
403. The server obtains target sound identification information in the target audio piece.
Wherein the target audio clip is any one of a plurality of audio clips.
404. The server compares the target voice identification information with the voice identification information stored in the voice identification library.
405. If the target voice identification information matches the first voice identification information stored in the voice identification library, the server associates the target audio clip to the first participant.
The descriptions of 402 to 405 may refer to the descriptions of 101 to 104 in the above embodiments, and are not repeated here.
406. The server sends the association result to the terminal device.
The server may perform the steps 403 to 405 above for each audio segment in the conference audio, so as to associate each audio segment with a corresponding participant, obtain the association result corresponding to the conference audio, and send the association result to the terminal device.
407. The terminal device displays the association result.
The association result may be displayed on the terminal device. For example, for the scenario shown in fig. 3, all audio segments corresponding to participating user A, all audio segments corresponding to participating user B, and all audio segments corresponding to participating user C may be displayed.
Displaying all the audio clips corresponding to each user makes it convenient, when the conference audio is reviewed later, to query and organize the communication content of a particular user, and facilitates subsequent editing.
Furthermore, voice recognition may be performed on the audio clips corresponding to each user to generate corresponding text content, finally yielding the speech content corresponding to each participant's audio clips, so that follow-up personnel can conveniently edit using the text information, for example to sort meeting notes.
In the above embodiment, the server performs the participant association of the conference audio, and the terminal device is only used to display the association result; such an implementation scheme is beneficial to the distribution of data by the server. For example, in the scenario shown in fig. 3, the server may issue the association result to terminal device 31, terminal device 32 and terminal device 33 at the same time.
Furthermore, the server is used for carrying out the participant association of the conference audio, and the terminal equipment is only used for displaying the scheme of the association result, so that the calculation amount of the terminal equipment can be saved.
As shown in fig. 5, an embodiment of the present disclosure provides a conference audio participant association device, the device comprising:
an extracting module 501, configured to extract a plurality of audio clips from conference audio;
an obtaining module 502, configured to obtain target sound identification information in a target audio segment, where the target audio segment is any one of a plurality of audio segments;
a correlation module 503, configured to compare the target voice identification information with the voice identification information stored in the voice identification library;
and if the target voice identification information is matched with the first voice identification information stored in the voice identification library, the target audio clip is associated to the first participant, and the first voice identification information is the voice identification information of the first participant.
As an optional implementation manner of the embodiment of the present disclosure, the association module 503 is specifically configured to:
determining sound identification information of a plurality of participants in a current conference from a sound identification library, wherein the current conference is a conference corresponding to conference audio;
Comparing the target voice identification information with voice identification information of a plurality of participants;
the first voice identification information is specifically voice identification information of a first participant in the voice identification information of the plurality of participants.
As an optional implementation manner of the embodiment of the present disclosure, the obtaining module 502 is further configured to:
determining a sound channel playing scene of a target audio fragment;
the association module 503 is specifically configured to:
determining sound identification information stored by a plurality of participants in the current conference in the sound channel playing scene from the sound identification library;
and comparing the target sound identification information with sound identification information stored by the plurality of participants in the sound channel playing scene.
As an optional implementation manner of the embodiment of the present disclosure, the channel playing scene includes: earphone play scenes or speaker play scenes.
As an optional implementation manner of the embodiment of the present disclosure, the obtaining module 502 is specifically configured to:
inputting the target audio fragment into a first voice identification extraction model, and acquiring target voice identification information output by the first voice identification extraction model;
the association module 503 is specifically configured to:
acquiring second sound identification information from the sound identification library;
judging whether the second sound identification information is extracted through the first sound identification extraction model;
and if the second sound identification information is acquired through the first sound identification extraction model, the second sound identification information is taken as the first sound identification information, and the target sound identification information is compared with the first sound identification information.
As an alternative implementation of the embodiment of the present disclosure, the association module 503 is further configured to:
if the second sound identification information is not obtained through the first sound identification extraction model, obtaining the original audio fragment stored in correspondence with the second sound identification information;
inputting the original audio clip into the first voice identification extraction model, acquiring the first voice identification information output by the first voice identification extraction model, and comparing the target voice identification information with the first voice identification information.
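A minimal sketch of this model-consistency fallback is given below; it assumes each library entry records which model produced its voiceprint, and the LibraryEntry fields and the extract_voiceprint placeholder are hypothetical:

```python
# A sketch of the model-consistency check: voiceprints are only comparable
# when produced by the same extraction model, so an entry extracted by a
# different model is re-derived from its stored original audio fragment.
# extract_voiceprint() is a placeholder for the real model inference call.
from dataclasses import dataclass

@dataclass
class LibraryEntry:
    voiceprint: list[float]
    model_version: str
    original_audio_path: str

def extract_voiceprint(audio_path: str, model_version: str) -> list[float]:
    raise NotImplementedError  # stands in for the extraction model

def comparable_voiceprint(entry: LibraryEntry, current_model: str) -> list[float]:
    if entry.model_version == current_model:
        return entry.voiceprint
    # Different model: re-extract from the original registered audio so both
    # sides of the comparison live in the same feature space.
    return extract_voiceprint(entry.original_audio_path, current_model)
```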
As an optional implementation manner of the embodiment of the present disclosure, the extracting module 501 is specifically configured to:
dividing conference audio into a plurality of audio subfragments;
extracting the audio characteristic information of each of the plurality of audio sub-segments, and performing audio clustering according to the audio characteristic information of each audio sub-segment to obtain the plurality of audio segments, where each audio segment is obtained by clustering audio sub-segments.
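For illustration, this segment-then-cluster step could be sketched as below; it assumes scikit-learn 1.2 or later for agglomerative clustering, and the window length, distance threshold, and toy feature extractor are illustrative rather than taken from this disclosure:

```python
# A sketch of segment-then-cluster diarization: the conference audio is cut
# into fixed-length sub-segments, each sub-segment is embedded, and
# agglomerative clustering groups sub-segments by presumed speaker.
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # scikit-learn >= 1.2

def split_into_subsegments(samples: np.ndarray, sr: int,
                           window_s: float = 1.5) -> list[np.ndarray]:
    step = int(sr * window_s)
    return [samples[i:i + step] for i in range(0, len(samples), step)]

def embed(subsegment: np.ndarray) -> np.ndarray:
    # Toy spectral-magnitude feature; a real system would use a trained
    # speaker-embedding model here.
    spectrum = np.abs(np.fft.rfft(subsegment, n=512))[:64]
    return spectrum / (np.linalg.norm(spectrum) + 1e-9)

def cluster_subsegments(subsegments: list[np.ndarray]) -> np.ndarray:
    features = np.stack([embed(s) for s in subsegments])
    clustering = AgglomerativeClustering(
        n_clusters=None,         # number of speakers is unknown up front
        distance_threshold=1.0,  # illustrative; tune on real data
        metric="cosine",
        linkage="average",
    )
    return clustering.fit_predict(features)  # one cluster label per sub-segment
```

Sub-segments sharing a cluster label are then concatenated to form one per-speaker audio segment.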
As an alternative implementation of the embodiment of the disclosure, the apparatus further includes: a registration module 504 for:
acquiring registered audio of a first participant;
when the audio quality of the registered audio meets a preset quality condition, judging whether second sound identification information associated with the identification of the first participant is stored in the sound identification library;
if the second voice identification information associated with the identification of the first participant does not exist in the voice identification library, extracting the first voice identification information from the registered audio;
storing the first sound identification information, the registered audio, and the identification of the first participant in an associated manner.
The registration module 504 is further configured to:
if voice identification information associated with the identification of the first participant exists in the voice identification library and its storage time is earlier than a preset time, extracting the first voice identification information from the registered audio;
and updating the second sound identification information with the first sound identification information, and storing the first sound identification information, the registered audio and the identification of the first participant in an associated manner.
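A condensed sketch of this registration-and-refresh decision path follows; the 90-day validity window, the 0.6 quality score, and the callable placeholders are assumptions made for illustration:

```python
# A sketch of the registration flow: a quality gate, then either first-time
# registration or a refresh when the stored voiceprint is older than a
# validity window. Window length and quality threshold are illustrative.
import time
from typing import Callable

VALIDITY_SECONDS = 90 * 24 * 3600  # illustrative voiceprint lifetime

def register(participant_id: str,
             registered_audio: bytes,
             library: dict,
             quality_of: Callable[[bytes], float],
             extract: Callable[[bytes], list[float]]) -> None:
    if quality_of(registered_audio) < 0.6:  # preset quality condition
        return                              # reject low-quality enrolment
    entry = library.get(participant_id)
    stale = entry is not None and time.time() - entry["stored_at"] > VALIDITY_SECONDS
    if entry is None or stale:
        # Store (or refresh) voiceprint, registered audio, and timestamp
        # under the participant's identification.
        library[participant_id] = {
            "voiceprint": extract(registered_audio),
            "registered_audio": registered_audio,
            "stored_at": time.time(),
        }
```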
As an alternative implementation of the disclosed embodiments, the registration module 504 is further configured to:
acquiring a sound channel playing scene of the registered audio;
the registration module 504 is specifically configured to:
associating and storing the sound channel playing scene, the first sound identification information, the registered audio and the identification of the first participant;
wherein the identification of the first participant and the first sound identification information are stored in different databases.
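One way the split storage could be realized is sketched below, assuming two SQLite databases joined only by an opaque key, so that neither store alone ties a voiceprint to a person; the schema and file names are illustrative:

```python
# A sketch of split storage: participant identity and voiceprint live in two
# separate SQLite databases linked only by an opaque key. Schema and file
# names are illustrative, not part of this disclosure.
import sqlite3
import uuid

identity_db = sqlite3.connect("identity.db")
voiceprint_db = sqlite3.connect("voiceprints.db")

identity_db.execute(
    "CREATE TABLE IF NOT EXISTS participants "
    "(link_key TEXT PRIMARY KEY, participant_id TEXT)")
voiceprint_db.execute(
    "CREATE TABLE IF NOT EXISTS voiceprints "
    "(link_key TEXT PRIMARY KEY, scene TEXT, voiceprint BLOB, registered_audio BLOB)")

def store(participant_id: str, scene: str,
          voiceprint: bytes, registered_audio: bytes) -> None:
    link_key = uuid.uuid4().hex  # opaque key joining the two stores
    identity_db.execute("INSERT INTO participants VALUES (?, ?)",
                        (link_key, participant_id))
    voiceprint_db.execute("INSERT INTO voiceprints VALUES (?, ?, ?, ?)",
                          (link_key, scene, voiceprint, registered_audio))
    identity_db.commit()
    voiceprint_db.commit()
```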
As shown in fig. 6, an embodiment of the present invention further provides a terminal device, including: a processor 601, a memory 602, and a computer program stored on the memory 602 and executable on the processor 601; when the computer program is executed by the processor 601, it implements the conference audio participant association method in the method embodiments described above.
An embodiment of the present invention provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, it implements each process of the conference audio participant association method in the method embodiments above and can achieve the same technical effect. To avoid repetition, the details are not repeated here.
The computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
An embodiment of the present invention provides a computer program product storing a computer program; when the computer program is executed by a processor, it implements each process of the conference audio participant association method in the method embodiments above and can achieve the same technical effect. To avoid repetition, no further description is provided here.
It will be appreciated by those skilled in the art that embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied therein.
In this disclosure, the processor may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In the present disclosure, the memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
In the present disclosure, computer-readable media include both permanent and non-permanent, removable and non-removable storage media. A storage medium may store information by any method or technology; the information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should be noted that in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
The above is merely a specific embodiment of the disclosure to enable one skilled in the art to understand or practice the disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method of participant association for conference audio, comprising:
extracting a plurality of audio clips from conference audio;
inputting a target audio fragment into a first voice identification extraction model, and acquiring target voice identification information output by the first voice identification extraction model, wherein the target audio fragment is any one of the plurality of audio fragments;
acquiring second sound identification information from a sound identification library;
judging whether the second sound identification information is extracted by the first sound identification extraction model;
if the second sound identification information is acquired through the first sound identification extraction model, using the second sound identification information as first sound identification information, and comparing the target sound identification information with the first sound identification information; and if the target voice identification information matches the first voice identification information stored in the voice identification library, associating the target audio fragment with a first participant, wherein the first voice identification information is the voice identification information of the first participant.
2. The method of claim 1, wherein the comparing the target voice identification information with voice identification information stored in a voice identification library comprises:
determining sound identification information of a plurality of participants in a current conference from the sound identification library, wherein the current conference is a conference corresponding to the conference audio;
comparing the target voice identification information with the voice identification information of the plurality of participants;
the first voice identification information is specifically voice identification information of the first participant.
3. The method of claim 2, wherein prior to comparing the target voice identification information with voice identification information stored in a voice identification library, further comprising:
determining a sound channel playing scene of the target audio fragment;
the comparing the target voice identification information with the voice identification information stored in the voice identification library comprises the following steps:
determining sound identification information stored by a plurality of participants in the current conference in the sound channel playing scene from the sound identification library;
and comparing the target sound identification information with sound identification information stored by the plurality of participants in the sound channel playing scene.
4. The method of claim 3, wherein the sound channel playing scene comprises: an earphone playing scene or a speaker playing scene.
5. The method according to claim 1, wherein the method further comprises:
if the second sound identification information is not obtained through the first sound identification extraction model, obtaining an original audio fragment which is stored corresponding to the second sound identification information;
inputting the original audio fragment into the first voice identification extraction model, acquiring the first voice identification information output by the first voice identification extraction model, and comparing the target voice identification information with the first voice identification information.
6. The method of claim 1, wherein extracting a plurality of audio clips from conference audio comprises:
dividing the conference audio into a plurality of audio subfragments;
extracting the audio characteristic information of each audio sub-segment in the plurality of audio sub-segments, and performing audio clustering according to the audio characteristic information of each audio sub-segment to obtain the plurality of audio segments, wherein the audio segments are obtained by clustering the audio sub-segments.
7. The method of claim 1, wherein prior to comparing the target voice identification information with voice identification information stored in a voice identification library, the method further comprises:
acquiring registered audio of the first participant;
when the audio quality of the registered audio meets a preset quality condition, judging whether sound identification information associated with the identification of the first participant is stored in the sound identification library;
extracting the first voice identification information from the registered audio if the voice identification information associated with the identification of the first participant does not exist in the voice identification library;
and storing the first voice identification information, the registered audio and the identification of the first participant in an associated manner.
8. The method of claim 7, wherein the method further comprises:
if voice identification information associated with the identification of the first participant exists in the voice identification library and the storage time of that voice identification information is earlier than a preset time, extracting the first voice identification information from the registered audio;
and updating the voice identification information associated with the identification of the first participant by adopting the first voice identification information, and storing the first voice identification information, the registered audio and the identification of the first participant in an associated manner.
9. The method according to claim 7 or 8, characterized in that the method further comprises:
acquiring a sound channel playing scene of the registered audio;
the storing the first voice identification information, the registered audio and the identification of the first participant in an associated manner comprises:
storing the sound channel playing scene, the first voice identification information, the registered audio and the identification of the first participant in an associated manner;
wherein the identification of the first participant and the first sound identification information are stored in different databases.
10. A participant-associated device for conference audio, comprising:
the extraction module is used for extracting a plurality of audio clips from conference audio;
the acquisition module is used for inputting a target audio fragment into the first voice identification extraction model and acquiring target voice identification information output by the first voice identification extraction model, wherein the target audio fragment is any one of the plurality of audio fragments;
the association module is used for acquiring second sound identification information from the sound identification library; judging whether the second sound identification information is extracted by the first sound identification extraction model; if the second sound identification information is acquired through the first sound identification extraction model, the second sound identification information is used as first sound identification information, and the target sound identification information is compared with the first sound identification information;
and if the target voice identification information matches the first voice identification information stored in the voice identification library, associate the target audio clip with a first participant, wherein the first voice identification information is the voice identification information of the first participant.
11. An electronic device, comprising: a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the conference audio participant association method according to any one of claims 1 to 9.
12. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the conference audio participant association method according to any one of claims 1 to 9.
CN202111448173.6A 2021-11-30 2021-11-30 Conference audio participant association method and device and electronic equipment Active CN114125368B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111448173.6A CN114125368B (en) 2021-11-30 2021-11-30 Conference audio participant association method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN114125368A CN114125368A (en) 2022-03-01
CN114125368B true CN114125368B (en) 2024-01-30

Family

ID=80369032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111448173.6A Active CN114125368B (en) 2021-11-30 2021-11-30 Conference audio participant association method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114125368B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201417093A (en) * 2012-10-19 2014-05-01 Hon Hai Prec Ind Co Ltd Electronic device with video/audio files processing function and video/audio files processing method
CN110335612A (en) * 2019-07-11 2019-10-15 招商局金融科技有限公司 Minutes generation method, device and storage medium based on speech recognition
CN113421552A (en) * 2021-06-22 2021-09-21 中国联合网络通信集团有限公司 Audio recognition method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8762150B2 (en) * 2010-09-16 2014-06-24 Nuance Communications, Inc. Using codec parameters for endpoint detection in speech recognition

Also Published As

Publication number Publication date
CN114125368A (en) 2022-03-01

Similar Documents

Publication Publication Date Title
US9832523B2 (en) Commercial detection based on audio fingerprinting
McLaren et al. The speakers in the wild (SITW) speaker recognition database.
US20180197548A1 (en) System and method for diarization of speech, automated generation of transcripts, and automatic information extraction
JP2019216408A (en) Method and apparatus for outputting information
US20090094029A1 (en) Managing Audio in a Multi-Source Audio Environment
US9311914B2 (en) Method and apparatus for enhanced phonetic indexing and search
US20130318071A1 (en) Apparatus and Method for Recognizing Content Using Audio Signal
CN111161758B (en) Song listening and song recognition method and system based on audio fingerprint and audio equipment
CN107507626B (en) Mobile phone source identification method based on voice frequency spectrum fusion characteristics
US20150310008A1 (en) Clustering and synchronizing multimedia contents
Alexander et al. The effect of mismatched recording conditions on human and automatic speaker recognition in forensic applications
CN108197319A (en) A kind of audio search method and system of the characteristic point based on time-frequency local energy
CN112632318A (en) Audio recommendation method, device and system and storage medium
Venkatesh et al. Artificially synthesising data for audio classification and segmentation to improve speech and music detection in radio broadcast
CN114491140A (en) Audio matching detection method and device, electronic equipment and storage medium
CN111859008A (en) Music recommending method and terminal
Pandey et al. Cell-phone identification from audio recordings using PSD of speech-free regions
Guzman-Zavaleta et al. A robust audio fingerprinting method using spectrograms saliency maps
CN114125368B (en) Conference audio participant association method and device and electronic equipment
Gharib et al. VOICe: A sound event detection dataset for generalizable domain adaptation
CN115116467A (en) Audio marking method and device and electronic equipment
CN114155841A (en) Voice recognition method, device, equipment and storage medium
Benatan et al. Cross-covariance-based features for speech classification in film audio
Kotsakis et al. Contribution of stereo information to feature-based pattern classification for audio semantic analysis
Senevirathna et al. Radio Broadcast Monitoring to Ensure Copyright Ownership

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant