CN114792522A - Audio signal processing method, conference recording and presenting method, apparatus, system and medium


Info

Publication number
CN114792522A
CN114792522A (Application CN202110105959.1A)
Authority
CN
China
Prior art keywords
audio
audio segments
clustering
segments
layer
Prior art date
Legal status
Pending
Application number
CN202110105959.1A
Other languages
Chinese (zh)
Inventor
郑斯奇
索宏彬
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN202110105959.1A
Priority to PCT/CN2022/073092 (published as WO2022161264A1)
Publication of CN114792522A
Legal status: Pending

Classifications

    • G10L17/00 Speaker identification or verification
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for comparison or discrimination
    • H04N7/15 Conference systems

Abstract

The embodiments of the present application provide an audio signal processing method, a conference recording and presenting method, an apparatus, a system and a medium. In the embodiments, for an audio signal from a scene in which multiple speakers talk, the audio signal is first cut into a plurality of audio segments at the speaker change points; the audio segments are then hierarchically clustered according to their durations and voiceprint features, the audio segments corresponding to the same speaker are identified, and a user mark is added. Rather than clustering on voiceprint features alone, hierarchical clustering combines the durations of the audio segments with their voiceprint features, so that the segments with more stable voiceprint features are clustered first. Compared with clustering all audio segments at once, hierarchical clustering reduces the errors introduced by segments with unstable voiceprint features, identifies the audio segments corresponding to the same speaker more accurately, improves recognition efficiency, and makes the user marking results more accurate.

Description

Audio signal processing method, conference recording and presenting method, apparatus, system and medium
Technical Field
The present application relates to the field of audio signal processing technologies, and in particular, to an audio signal processing method, a conference recording method, a conference presenting method, an audio signal processing apparatus, a conference recording system, and a conference presenting system.
Background
In conferences, court trials and other scenes where multiple people speak, products with a voice collection function, such as sound pickups and recording pens, are commonly used to collect the speech signals in real time so that the content can be recorded. Based on the collected speech signals, the speech content can be queried directly, or the signals can be transcribed into text for querying.
To make it easy to see, at query time, which speaker produced which speech content, the speaker must be identified after the speech signal is collected. In the prior art, a neural network model extracts voiceprint features from the speech signal, and the speech content belonging to the same speaker is distinguished according to those features.
In practical applications, however, a multi-person speaking scene may contain strong noise interference, and a speaker's voiceprint features may also change under the influence of emotion. Both effects can cause recognition results based on voiceprint features to be misjudged, so the recognition accuracy is low.
Disclosure of Invention
Aspects of the present application provide an audio signal processing method, a conference recording and presenting method, an apparatus, a system and a medium, so as to identify the audio segments corresponding to the same speaker more accurately and to improve recognition efficiency.
An embodiment of the application provides an audio signal processing method, comprising: identifying speaker change points in an audio signal collected in a multi-person speaking scene; segmenting the audio signal into a plurality of audio segments according to the speaker change points, and extracting the voiceprint features of the plurality of audio segments; performing hierarchical clustering on the plurality of audio segments according to their durations and voiceprint features to obtain the audio segments corresponding to the same speaker; and adding the same user mark to the audio segments corresponding to the same speaker to obtain the user-marked audio signal.
An embodiment of the present application further provides an audio signal processing method, comprising: performing sound source localization on an audio signal collected in a multi-person speaking scene to obtain the change points of the sound source position; segmenting the audio signal into a plurality of audio segments according to the change points of the sound source position, and extracting the voiceprint features of the plurality of audio segments; clustering the plurality of audio segments according to their voiceprint features and sound source positions to obtain the audio segments corresponding to the same speaker; and adding the same user mark to the audio segments corresponding to the same speaker to obtain the user-marked audio signal.
An embodiment of the present application further provides a conference recording method, comprising: collecting an audio signal in a multi-person conference scene, and identifying speaker change points in the audio signal; segmenting the audio signal into a plurality of audio segments according to the speaker change points, and extracting the voiceprint features of the plurality of audio segments; performing hierarchical clustering on the plurality of audio segments according to their durations and voiceprint features to obtain the audio segments corresponding to the same speaker; adding the same user mark to the audio segments corresponding to the same speaker, and generating conference recording information, including a conference identifier, from the user-marked audio signal.
An embodiment of the present application further provides a method for presenting a conference record, comprising: receiving a conference look-up request that includes the identifier of the conference to be presented; acquiring the conference recording information to be presented according to the conference identifier; and presenting the conference recording information, where the conference recording information is generated from a user-marked audio signal recorded in a multi-person conference scene. The same user mark is added to the audio segments corresponding to the same speaker among the plurality of audio segments cut at the speaker change points in the audio signal, and the audio segments corresponding to the same speaker are obtained by hierarchically clustering the plurality of audio segments according to their durations and voiceprint features.
An embodiment of the present application further provides an audio processing system, comprising: a sound pickup device and a server-side device. The sound pickup device is deployed in a multi-person speaking scene and is used for collecting the audio signal in that scene, identifying the speaker change points in the audio signal, segmenting the audio signal into a plurality of audio segments according to the speaker change points, and extracting the voiceprint features corresponding to the plurality of audio segments. The server-side device is used for performing hierarchical clustering on the plurality of audio segments according to their durations and voiceprint features to obtain the audio segments corresponding to the same speaker, and adding the same user mark to the audio segments corresponding to the same speaker to obtain the user-marked audio signal.
An embodiment of the present application further provides another audio processing system, comprising: a sound pickup device and a server-side device. The sound pickup device is deployed in a multi-person speaking scene and is used for collecting the audio signal in that scene, identifying the speaker change points in the audio signal, and segmenting the audio signal into a plurality of audio segments according to the speaker change points. The server-side device is used for extracting the voiceprint features corresponding to the plurality of audio segments, performing hierarchical clustering on the plurality of audio segments according to their durations and voiceprint features to obtain the audio segments corresponding to the same speaker, and adding the same user mark to the audio segments corresponding to the same speaker to obtain the user-marked audio signal.
An embodiment of the present application further provides a sound pickup device, comprising a processor and a memory, where the memory stores a computer program and the processor, coupled with the memory, executes the computer program to: identify the speaker change points in an audio signal collected in a multi-person speaking scene; segment the audio signal into a plurality of audio segments according to the speaker change points, and extract the voiceprint features of the plurality of audio segments; perform hierarchical clustering on the plurality of audio segments according to their durations and voiceprint features to obtain the audio segments corresponding to the same speaker; and add the same user mark to the audio segments corresponding to the same speaker to obtain the user-marked audio signal.
An embodiment of the present application further provides another sound pickup device, comprising a processor and a memory, where the memory stores a computer program and the processor, coupled with the memory, executes the computer program to: perform sound source localization on an audio signal collected in a multi-person speaking scene to obtain the change points of the sound source position; segment the audio signal into a plurality of audio segments according to the change points of the sound source position, and extract the voiceprint features of the plurality of audio segments; cluster the plurality of audio segments according to their voiceprint features and sound source positions to obtain the audio segments corresponding to the same speaker; and add the same user mark to the audio segments corresponding to the same speaker to obtain the user-marked audio signal.
An embodiment of the present application further provides a server-side device, comprising a processor and a memory, where the memory stores a computer program and the processor, coupled with the memory, executes the computer program to: receive a plurality of audio segments and their corresponding voiceprint features sent by a sound pickup device; perform hierarchical clustering on the plurality of audio segments according to their durations and voiceprint features to obtain the audio segments corresponding to the same speaker; and add the same user mark to the audio segments corresponding to the same speaker to obtain the user-marked audio signal.
An embodiment of the present application further provides another server-side device, comprising a processor and a memory, where the memory stores a computer program and the processor, coupled with the memory, executes the computer program to: receive a plurality of audio segments sent by a sound pickup device; extract the voiceprint features corresponding to the plurality of audio segments, and perform hierarchical clustering on the plurality of audio segments according to their durations and voiceprint features to obtain the audio segments corresponding to the same speaker; and add the same user mark to the audio segments corresponding to the same speaker to obtain the user-marked audio signal.
Embodiments of the present application further provide a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to implement the steps in the methods provided by the embodiments of the present application.
Embodiments of the present application further provide a computer program product, which includes a computer program/instruction, and when the computer program/instruction is executed by a processor, the processor is caused to implement the steps in the methods provided by the embodiments of the present application.
In the embodiments of the present application, for an audio signal from a multi-speaker speaking scene, the audio signal is first cut into a plurality of audio segments at the speaker change points; the plurality of audio segments are then hierarchically clustered according to their durations and voiceprint features, the audio segments corresponding to the same speaker are identified, and a user mark is added. Hierarchical clustering clusters the audio segments with more stable voiceprint features first; compared with clustering all audio segments at once, it reduces the errors introduced by segments with unstable voiceprint features, identifies the audio segments corresponding to the same speaker more accurately, improves recognition efficiency, and makes the user marking results more accurate.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1a is a schematic flowchart of an audio signal processing method according to an exemplary embodiment of the present application;
FIG. 1b is a schematic flow chart of another audio signal processing method provided by an exemplary embodiment of the present application;
FIG. 1c is a schematic flow chart of another audio signal processing method provided in an exemplary embodiment of the present application;
FIG. 2a is a schematic diagram of clustering audio segments in each layer;
FIG. 2b is a schematic diagram of clustering audio segments in each layer;
FIG. 2c is a schematic diagram of clustering audio segments in a first layer;
fig. 3a is a schematic view of a sound pickup apparatus in a multi-person conference scenario;
FIG. 3b is a schematic diagram illustrating a usage status of a sound pickup device in a business partner negotiation scenario;
fig. 3c is a schematic view of a use state of the sound pickup device in a teaching scene;
fig. 3d is a schematic flowchart of a conference recording method according to an exemplary embodiment of the present application;
fig. 3e is a schematic flowchart of a method for presenting a conference record according to an exemplary embodiment of the present application;
FIG. 4a is a schematic diagram of an audio processing system according to an exemplary embodiment of the present application;
FIG. 4b is a schematic diagram of another audio processing system according to an exemplary embodiment of the present application;
fig. 5 is a schematic structural diagram of a sound pickup apparatus according to an exemplary embodiment of the present application;
fig. 6 is a schematic structural diagram of a server device according to an exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only a few embodiments of the present application, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
In practical applications, a multi-person speaking scene may contain strong noise interference, and a speaker's voiceprint features may also change under the influence of emotion; both effects can cause recognition results based on voiceprint features to be misjudged, so the recognition accuracy is low. In some embodiments of the present application, for an audio signal from a multi-speaker speaking scene, the audio signal is cut into a plurality of audio segments at the speaker change points, and the plurality of audio segments are then hierarchically clustered according to their durations and voiceprint features, so that the audio segments corresponding to the same speaker are identified and a user mark is added. Hierarchical clustering clusters the audio segments with more stable voiceprint features first; compared with clustering all audio segments at once, it reduces the errors introduced by segments with unstable voiceprint features, identifies the audio segments corresponding to the same speaker more accurately, improves recognition efficiency, and makes the user marking results more accurate.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1a is a schematic flowchart of an audio signal processing method according to an exemplary embodiment of the present application. As shown in fig. 1a, the method comprises:
101a, identifying a speaker change point in an audio signal collected in a multi-person speaking scene;
102a, segmenting an audio signal into a plurality of audio segments according to a speaker change point, and extracting voiceprint characteristics of the plurality of audio segments;
103a, according to the duration and the voiceprint characteristics of the multiple audio segments, performing hierarchical clustering on the multiple audio segments to obtain audio segments corresponding to the same speaker;
104a, adding the same user mark to the audio segments corresponding to the same speaker to obtain the audio signal added with the user mark.
In this embodiment, a speaker change point is a position in the audio signal that separates different speakers, that is, the position at which a speaker change event occurs. There may be one speaker change point or several, for example 2, 3 or 5. This embodiment does not limit the manner of identifying speaker change points; examples are described below.
For example, speaker change points in an audio signal may be identified through Voice Activity Detection (VAD, also called voice endpoint detection). Endpoints in VAD are the critical points where the signal changes between silence and active speech. Applying VAD to an audio signal collected in a multi-person speaking scene finds the start point and tail point of each speech period, distinguishes speech from non-speech, and removes silence, noise and the like. In this embodiment, the speaker change points may be determined in combination with the length of the pauses between these start and tail points. For example, when the pause between a tail point and the following start point exceeds a set threshold, those voice endpoints (i.e., the start point and the tail point) may be taken as a speaker change point.
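To make the pause-based rule concrete, here is a minimal sketch in Python, assuming VAD has already produced (start, tail) times in seconds for each speech period; the function name, the input format and the 1.0 s pause threshold are illustrative assumptions rather than anything specified by the patent.

    def change_points_from_vad(segments, pause_threshold_s=1.0):
        """Treat a boundary as a speaker change point when the silence gap
        before the next speech period exceeds the threshold."""
        points = []
        for (_, tail), (next_start, _) in zip(segments, segments[1:]):
            if next_start - tail > pause_threshold_s:
                points.append(tail)  # tail point of the earlier speech period
        return points

    # change_points_from_vad([(0.0, 4.2), (5.8, 9.1), (9.3, 12.0)]) -> [4.2]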
As another example, voiceprint feature extraction may be performed on the audio signal collected in the multi-person speaking scene, and the positions where the voiceprint changes may be taken as speaker change points. Alternatively, VAD can be combined with voiceprint features: for the start point and tail point of each speech period detected by VAD, the voiceprint features around adjacent start and tail points are further compared, and if the voiceprint features change across a pair of adjacent endpoints, the position of that voice endpoint (i.e., the start point or tail point) is determined to be a speaker change point.
For another example, when an audio signal is collected, a sound source of the audio signal may be localized based on the microphone array to obtain a change point of the sound source position, and a speaker change point in the audio signal may be determined according to the change point of the sound source position. For example, in a speech scene in which the position of each speaker is fixed, the changing point of the sound source position may be set as the speaker changing point.
Of course, in some speaking scenes a speaker may move around, i.e., the speaking position is not fixed. For this case, sound source localization may be combined with VAD: sound source localization locates the change points of the sound source position, while VAD determines the start point and tail point of each speech period in the audio signal; the change points of the sound source position are then corrected against the VAD start and tail points, yielding accurate speaker change points. Specifically, the change points of the sound source position may be aligned with the VAD detection result on the time axis, checking whether a detected voice endpoint (a start point or a tail point) exists within a certain time before or after each change point of the sound source position. In this way the speaker change points can be determined more accurately, so the recognized speech is cut off more precisely and utterances are not clipped at the beginning or the end.
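The correction step can be sketched as follows: each sound-source change point is kept only if a VAD endpoint lies within a small window around it, and is snapped onto that endpoint. The names and the 0.5 s window are assumptions for illustration.

    def align_change_points(source_change_points, endpoints, window_s=0.5):
        """Keep a sound-source change point only when a VAD endpoint (start
        or tail point) lies within the window around it, snapping the change
        point onto that endpoint."""
        aligned = []
        for t in source_change_points:
            nearest = min(endpoints, key=lambda e: abs(e - t), default=None)
            if nearest is not None and abs(nearest - t) <= window_s:
                aligned.append(nearest)
        return aligned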
In this embodiment, the audio signal may be divided into a plurality of audio segments according to the speaker change points. For example, if a section of the audio signal starts at position A1 and ends at position A2, and one speaker change point B1 is recognized in it, the audio signal may be divided at B1 into the audio segment A1->B1 and the audio segment B1->A2.
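A minimal sketch of this cutting step, assuming the signal is an indexable array of samples with sample rate sr; all names are illustrative.

    def split_at_change_points(signal, change_points_s, sr):
        """Cut the sample array at the recognized change points (seconds)."""
        cuts = [0] + sorted(int(t * sr) for t in change_points_s) + [len(signal)]
        return [signal[a:b] for a, b in zip(cuts, cuts[1:])]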
In this embodiment, after the plurality of audio segments are cut at the speaker change points, the voiceprint features of the audio segments may be extracted; a voiceprint feature may be represented by a feature vector. Voiceprint features are characteristic representations of audio segments, and the voiceprint features of audio segments from different speakers generally differ. This embodiment does not limit how the voiceprint features are extracted. For example, a neural network model for extracting voiceprint features may be trained in advance and then used to extract the voiceprint features of the audio segments; such a model may be, but is not limited to, a model based on Mel-Frequency Cepstral Coefficients (MFCC) or a Gaussian Mixture Model-Universal Background Model (GMM-UBM).
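As a hedged sketch, the extraction step might look like the following, where the model object and its embed method stand in for whatever MFCC- or GMM-UBM-based extractor is actually used; that interface is hypothetical, not a real library API.

    import numpy as np

    def extract_voiceprints(audio_segments, model):
        """Return one L2-normalized embedding vector per audio segment."""
        features = []
        for segment in audio_segments:
            v = np.asarray(model.embed(segment))  # hypothetical embed() call
            features.append(v / np.linalg.norm(v))
        return features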
In this embodiment, each of the audio segments cut at the speaker change points corresponds to one speaker, and different segments may correspond to the same speaker or to different speakers. In applications that add user marks to audio segments, the segments corresponding to the same speaker must be identified so that they receive the same user mark. To identify them more accurately, the audio segments may be clustered based on their voiceprint features, grouping segments with the same or similar voiceprint features together as far as possible; such segments are regarded as corresponding to the same user. In addition, because of speakers' habits, manners of speaking, particular needs and other factors, a multi-person speaking scene often contains very short utterances, such as "yes" or "good", so some of the cut audio segments may be short. The longer an audio segment is, the more stable its voiceprint features are; conversely, the shorter the segment, the less stable and less distinctive its voiceprint features. For example, a short "oh" from user A and a short "oh" from user B are hard to distinguish by voiceprint features. In view of this, this embodiment further takes the durations of the audio segments into account and clusters them hierarchically: the segments are layered and then clustered layer by layer, which exploits the advantages of the longer segments and reduces the interference the shorter segments may cause. Therefore, after the audio segments and their voiceprint features are obtained, the segments are hierarchically clustered according to their durations and voiceprint features to obtain the audio segments corresponding to the same speaker.
Because the voiceprint features of a longer audio segment are more stable, in some optional embodiments the audio segments may be layered by duration to obtain several layers corresponding to different duration ranges; the layers are then clustered hierarchically, from the longest duration range to the shortest, according to the voiceprint features of the segments, producing at least one clustering result, where each clustering result contains the audio segments corresponding to one speaker. Hierarchical clustering thus uses not only the voiceprint features but also the durations: the segments in the longer duration ranges are clustered by voiceprint features first, and the shorter segments are then clustered by voiceprint features, where each shorter segment is tested against the clustering results already formed from the longer segments; if it belongs to none of them, a new clustering result is created, and so on until the segments in every layer have been clustered. By clustering in order of duration from long to short, the clustering results are anchored by the longer segments, the recognition errors caused by the unstable voiceprint features of the shorter segments are reduced, and the accuracy of identifying the audio segments corresponding to the same speaker is improved.
In this embodiment, after the audio segments corresponding to the same speaker are obtained, the same user mark may be added to them, giving the user-marked audio signal. The way the user mark is added is not limited. For example, a speech segment carrying the user mark may be inserted before each audio segment, e.g., the announcement "user C1 please speak" may be inserted before each audio segment corresponding to user C1. As another example, the same mark point may be added on the audio track for all segments of one speaker: a red mark point for the segments of speaker C2, a green mark point for the segments of speaker C3, a yellow mark point for the segments of speaker E, and so on.
In the embodiments of the present application, for an audio signal from a multi-speaker speaking scene, the audio signal is first cut into a plurality of audio segments at the speaker change points; the segments are then hierarchically clustered according to their durations and voiceprint features, the audio segments corresponding to the same speaker are identified, and a user mark is added. Rather than clustering on voiceprint features alone, the durations are combined with the voiceprint features, so that the segments with more stable voiceprint features are clustered first. Compared with clustering all audio segments at once, hierarchical clustering reduces the errors introduced by segments with unstable voiceprint features, identifies the audio segments corresponding to the same speaker more accurately, improves recognition efficiency, and makes the user marking results more accurate.
This embodiment does not limit how the audio segments are layered by duration into layers corresponding to different duration ranges. In one optional embodiment, a per-layer count threshold is set: the audio segments are sorted by duration and then split into layers according to the preset count threshold of each layer. In another optional embodiment, preset per-layer duration thresholds are used: the segments are layered according to their durations and the duration threshold of each layer, where the lower the layer number, the larger its duration threshold, and every segment in a layer has a duration greater than or equal to that layer's threshold. For example, segments longer than 20s may go to the first layer, segments between 10s and 20s to the second layer, segments between 5s and 10s to the third layer, and segments shorter than 5s to the fourth layer.
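The duration-threshold variant can be sketched as follows, using the illustrative 20 s / 10 s / 5 s boundaries from the example above; segment objects carrying a duration attribute are an assumption.

    def layer_by_duration(segments, thresholds=(20.0, 10.0, 5.0)):
        """First layer gets the longest segments; the last layer the rest.
        Thresholds must be given in decreasing order."""
        layers = [[] for _ in range(len(thresholds) + 1)]
        for seg in segments:
            for i, t in enumerate(thresholds):
                if seg.duration >= t:
                    layers[i].append(seg)
                    break
            else:  # shorter than every threshold
                layers[-1].append(seg)
        return layers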
In this embodiment, after the layers of audio segments are obtained, the implementation of hierarchically clustering them to obtain at least one clustering result is not limited. This is described in detail below.
In an optional embodiment, the audio segments in each layer can first be clustered according to their voiceprint features to obtain a per-layer clustering result, and the clustering results of adjacent layers are then merged in order of increasing layer number, according to the voiceprint features of each layer's clustering results, to obtain at least one final clustering result. Each layer may contain one clustering result or several, for example 2, 3 or 5; this is not limited here. As shown in fig. 2a, the audio segments are divided into three layers by duration, and each layer is clustered by voiceprint features: the first layer yields two clustering results D1 and D2, the second layer yields three clustering results D3, D4 and D5, and the third layer yields two clustering results D6 and D7. The second layer's results are then merged into the first layer's results D1 and D2 according to their voiceprint features: whether D3, D4 and D5 can be merged into D1 or D2 is judged from the voiceprint features of D3, D4 and D5; if D3 and D4 merge into D1, giving clustering result E1, and D5 merges into D2, giving clustering result E2, then two clustering results E1 and E2 remain after the second layer is merged into the first. Finally, the third layer's results are merged into the existing results E1 and E2 according to their voiceprint features: if D6 merges into E1, giving E3, and D7 merges into E2, giving E4, two final clustering results E3 and E4 are obtained, i.e., the audio segments of two speakers.
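A sketch of this per-layer-then-merge variant (fig. 2a), assuming each layer has already been clustered internally and each clustering result is represented as a dict holding its member segments and a center voiceprint vector; the dict layout, the cosine measure and the 0.8 threshold are assumptions.

    import numpy as np

    def center_similarity(a, b):
        """Cosine similarity between the center voiceprints of two clusters."""
        ca, cb = a["center"], b["center"]
        return float(np.dot(ca, cb) / (np.linalg.norm(ca) * np.linalg.norm(cb)))

    def merge_layer_results(per_layer_clusters, threshold=0.8):
        """Fold each deeper layer's clusters into the running result set."""
        result = list(per_layer_clusters[0])  # first layer seeds the result
        for layer_clusters in per_layer_clusters[1:]:
            for c in layer_clusters:
                best = max(result, key=lambda r: center_similarity(r, c),
                           default=None)
                if best is not None and center_similarity(best, c) >= threshold:
                    best["members"].extend(c["members"])
                    best["center"] = np.mean(
                        [m.voiceprint for m in best["members"]], axis=0)
                else:
                    result.append(c)  # nothing matched: keep as its own result
        return result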
In another optional embodiment, the audio segments of the first layer are clustered first, and the segments of each subsequent layer are then merged into the existing clustering results in order of increasing layer number. Specifically, the audio segments in the first layer are clustered according to their voiceprint features to obtain at least one clustering result. Then, in order of increasing layer number, the segments of each non-first layer are clustered into the existing clustering results according to their voiceprint features. If a non-first layer contains remaining segments that fit none of the existing clustering results, those remaining segments are clustered among themselves, by their voiceprint features, into new clustering results, until every segment in every layer belongs to a clustering result. The whole hierarchical clustering process is illustrated below with audio segments divided into three layers.
As shown in fig. 2b, the first layer's audio segments are first clustered according to their voiceprint features, giving two clustering results F1 and F2, where F1 contains audio segments g1 and g2 and F2 contains audio segment g3. Next, the second layer's segments, g4, g5 and g6, are clustered into the existing first-layer results F1 and F2: whether g4, g5 and g6 can join F1 or F2 is judged from their voiceprint features. If g5 and g6 join F2 while g4 fits neither F1 nor F2, g4 becomes its own clustering result F3, so three clustering results F1, F2 and F3 exist after the second layer is processed. Finally, the third layer's segments, g7 and g8, are clustered into the existing results F1, F2 and F3: judging from their voiceprint features, suppose g7 joins F1 and g8 joins F2. Three clustering results F1, F2 and F3 are finally obtained, i.e., the audio segments of three speakers.
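A sketch of this second variant (fig. 2b) in a simplified greedy form: the first (longest) layer seeds the clusters, and every later segment either joins its best-matching existing cluster or opens a new one. Segment objects with a voiceprint attribute, the cosine measure and the 0.8 threshold are assumptions.

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def hierarchical_cluster(layers, threshold=0.8):
        """Greedy layer-by-layer clustering over layers ordered from the
        longest duration range to the shortest."""
        clusters = []  # each cluster: {"members": [segments], "center": vector}
        for layer in layers:
            for seg in layer:
                best = max(clusters,
                           key=lambda c: cosine(c["center"], seg.voiceprint),
                           default=None)
                if best is not None and cosine(best["center"], seg.voiceprint) >= threshold:
                    best["members"].append(seg)
                    best["center"] = np.mean(
                        [m.voiceprint for m in best["members"]], axis=0)
                else:
                    clusters.append({"members": [seg],
                                     "center": np.asarray(seg.voiceprint)})
        return clusters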
This embodiment does not limit the clustering algorithm used; examples include, but are not limited to: K-Means clustering, mean-shift clustering, density-based clustering (DBSCAN), Expectation-Maximization (EM) clustering with Gaussian Mixture Models (GMM), agglomerative hierarchical clustering, and graph community detection.
This embodiment does not limit how the first layer's audio segments are clustered by their voiceprint features into at least one clustering result. One implementation is as follows. When the first layer contains at least two audio segments, the overall similarity between segments is computed from their voiceprint features; optionally, the voiceprint feature similarity between two segments is used directly as their overall similarity. The first layer's segments are divided into at least one clustering result according to these overall similarities, and the clustering center of each clustering result is computed from the voiceprint features of the segments it contains, the clustering center including a central voiceprint feature. Specifically, for any segment in the first layer, its overall similarity to the other first-layer segments is computed from their voiceprint features. If some target segment's overall similarity to it satisfies the set similarity condition, the two segments are clustered into a target clustering result, and the clustering center of that result is updated from the voiceprint features of both segments. It can further be tested whether the target clustering result can absorb the remaining first-layer segments; the remaining segments that cannot join it are clustered among themselves by their voiceprint features into new clustering results, until every first-layer segment belongs to a clustering result.
For example, suppose the first layer contains three audio segments H1, H2 and H3. The voiceprint feature similarity of H1 and H2 is computed first and used as their overall similarity. If it satisfies the set condition, H1 and H2 are considered to come from the same speaker and form one clustering result h1. The clustering center of h1, i.e., its central voiceprint feature, may be obtained by taking H1's voiceprint feature directly, by taking H2's voiceprint feature, or by averaging the voiceprint features of H1 and H2; this is not limited. After h1 is obtained, the similarity between h1's central voiceprint feature and H3's voiceprint feature is computed and used as the overall similarity between h1 and H3. If it satisfies the set condition, h1 and H3 are considered to come from the same speaker and are clustered into a result h2, whose central voiceprint feature is computed from the voiceprint features of h1 and H3. If the overall similarity does not satisfy the set condition, h1 and H3 are not from the same speaker, and H3 becomes its own clustering result h3.
This embodiment likewise does not limit how, in order of increasing layer number, the segments of a non-first layer are clustered into the existing clustering results by their voiceprint features. For example, for each audio segment in a non-first layer, the overall similarity between the segment and each existing clustering result is computed from the segment's voiceprint feature and the result's clustering center; if some target clustering result's overall similarity to the segment satisfies the set similarity condition, the segment is added to that target clustering result and the result's clustering center is updated using the segment's voiceprint feature.
This embodiment does not limit how the cluster center of the target clustering result is updated from the voiceprint features of its segments. In one optional embodiment, the voiceprint features of the segments in the target clustering result are simply averaged, and the mean becomes the new central voiceprint feature of the updated result. In another optional embodiment, the layer of each segment in the target clustering result is determined and different weights are assigned to different layers, with lower layer numbers receiving larger weights; the voiceprint features of the segments are then summed, weighted by the weight of each segment's layer, to produce the new central voiceprint feature of the updated result. For example, suppose the target clustering result contains first-layer segments j1 and j2 and second-layer segment j3, the first layer has weight k1 and the second layer weight k2, with k1 > k2 and k1 + k2 = 1. The central voiceprint feature of the target clustering result is then: feature(j1) x k1 + feature(j2) x k1 + feature(j3) x k2.
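The layer-weighted update can be sketched as below; member objects carrying layer and voiceprint attributes, and the example weight table, are assumptions.

    import numpy as np

    def weighted_center(members, layer_weights):
        """Sum of member voiceprints, weighted by each member's layer weight."""
        return np.sum(
            [layer_weights[m.layer] * np.asarray(m.voiceprint) for m in members],
            axis=0)

    # With members j1, j2 from layer 1 and j3 from layer 2, and weights
    # {1: k1, 2: k2}, this reproduces
    # feature(j1) * k1 + feature(j2) * k1 + feature(j3) * k2.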
In a specific multi-person speaking scene such as a multi-person conference, each speaker usually speaks from his or her own seat, and that position usually does not change during the conference. Based on this, an embodiment of the present application further provides an audio signal processing method. As shown in fig. 1b, the method includes:
101b, positioning a sound source of the audio signal collected in the multi-person speaking scene to obtain a change point of the sound source position;
102b, segmenting the audio signal into a plurality of audio segments according to the change point of the sound source position, and extracting the voiceprint characteristics of the plurality of audio segments;
103b, according to the time length, the voiceprint characteristics and the sound source position of the plurality of audio segments, performing hierarchical clustering on the plurality of audio segments to obtain audio segments corresponding to the same speaker;
104b, adding the same user mark to the audio segments corresponding to the same speaker to obtain the audio signals added with the user marks.
In this embodiment, sound source localization is performed on the audio signal to obtain the change points of the sound source position, and the audio signal is segmented at those change points into a plurality of audio segments, each corresponding to a unique sound source position. When speakers do not change position, each sound source position can be taken to correspond to one speaker, i.e., each audio segment corresponds to one speaker.
In this embodiment, after the audio segment is segmented into a plurality of audio segments according to the change point of the sound source position, the voiceprint features of the plurality of audio segments can be extracted, and for the implementation of extracting the voiceprint features, reference may be made to the foregoing embodiment, which is not described herein again.
In this embodiment, each audio segment cut at a change point of the sound source position corresponds to one speaker, and different segments may correspond to the same speaker or different speakers. In applications that add user marks to audio segments, the segments corresponding to the same speaker must be identified so that they receive the same user mark. To identify them more accurately, the segments may be clustered on their voiceprint features, grouping segments with the same or similar voiceprint features together as far as possible; such segments are regarded as corresponding to the same user. The sound source positions of the segments can also be combined: if two segments have the same or similar voiceprint features and come from the same sound source position, the probability that they correspond to the same user is higher. Moreover, as before, the longer an audio segment is, the more stable its voiceprint features, while a short segment's voiceprint features are less stable and less distinctive. Therefore this embodiment again takes the durations into account and clusters the segments hierarchically (layering them and clustering layer by layer), exploiting the longer segments and reducing the interference the shorter ones may cause. Hence, after the audio segments with their voiceprint features and sound source positions are obtained, the segments are hierarchically clustered according to their durations, voiceprint features and sound source positions to obtain the audio segments corresponding to the same speaker.
In an optional embodiment of the present application, an implementation of hierarchical clustering on a plurality of audio segments according to time lengths, voiceprint characteristics, and sound source positions of the plurality of audio segments includes: layering the plurality of audio clips according to the time lengths of the plurality of audio clips to obtain a plurality of layers of audio clips corresponding to different time length ranges; and according to the voiceprint characteristics and the sound source positions corresponding to the plurality of audio segments, carrying out hierarchical clustering on the multi-layer audio segments according to the sequence of the time length range from long to short so as to obtain at least one clustering result, wherein each clustering result comprises the audio segments corresponding to the same speaker.
For the implementation of layering the audio segments by their durations, reference may be made to the foregoing embodiment. This embodiment does not limit how the layers of audio segments are hierarchically clustered, in order of duration range from long to short, according to the voiceprint features and sound source positions of the segments, to obtain at least one clustering result. Examples follow.
In an optional embodiment, the audio segments in each layer can be clustered according to the voiceprint features and the sound source positions corresponding to the audio segments in each layer, so as to obtain a clustering result of each layer; and sequentially clustering the clustering results of two adjacent layers according to the sequence of the number of layers from small to large and according to the voiceprint characteristics and the sound source position of the clustering result of each layer to obtain at least one clustering result.
In another alternative embodiment, the audio segments of the first layer may be clustered first, and the segments of each subsequent layer are then merged into the existing clustering results in order of increasing layer number. Specifically, the audio segments in the first layer are clustered according to their voiceprint features and sound source positions to obtain at least one clustering result. Then, in order of increasing layer number, the segments of each non-first layer are clustered into the existing clustering results according to their voiceprint features and sound source positions. If a non-first layer contains remaining segments that fit none of the existing clustering results, those segments are clustered among themselves, by their voiceprint features and sound source positions, into new clustering results, until every segment in every layer belongs to a clustering result.
In an optional embodiment of the present application, an implementation of clustering the audio segments in the first layer according to the voiceprint features and sound source positions corresponding to the audio segments in the first layer includes: if the first layer comprises only one audio segment, that audio segment forms a clustering result by itself; if the first layer comprises at least two audio segments, the overall similarity between the at least two audio segments in the first layer is calculated according to their corresponding voiceprint features and sound source positions. For example, the voiceprint feature similarity of the at least two audio segments may be calculated first, then the sound source position similarity of the at least two audio segments may be calculated, and the voiceprint feature similarity and the sound source position similarity may be weighted to obtain the overall similarity between the at least two audio segments in the first layer. Further, the at least two audio segments in the first layer may be partitioned into at least one clustering result according to the overall similarity between them. Further, it is also necessary to calculate the clustering center of the at least one clustering result according to the voiceprint features and sound source positions corresponding to the audio segments it contains, where the clustering center includes a central voiceprint feature and a central sound source position and provides a basis for clustering the audio segments on the non-first layers into the at least one clustering result. For example, for each clustering result, the average value of the voiceprint features corresponding to the audio segments contained in the clustering result may be used as its central voiceprint feature, and the average value of the sound source positions corresponding to those audio segments may be used as its central sound source position. For another example, the voiceprint feature of any audio segment included in the clustering result can be directly used as its central voiceprint feature, and the sound source position of any audio segment included in the clustering result can be used as its central sound source position.
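As one possible reading of the weighted combination described above, the following sketch computes an overall similarity from a cosine voiceprint similarity and an azimuth-based position similarity. The 0.7/0.3 weights, the use of cosine similarity, and the azimuth representation are all assumptions for illustration; the embodiment only requires that the two similarities be weighted.

```python
import math

def voiceprint_similarity(v1, v2):
    """Cosine similarity between two voiceprint embeddings."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

def position_similarity(p1, p2, max_angle=180.0):
    """Similarity of two azimuths: 1.0 when identical, 0.0 when opposite."""
    diff = abs(p1 - p2) % 360.0
    diff = min(diff, 360.0 - diff)
    return 1.0 - diff / max_angle

def overall_similarity(seg_a, seg_b, w_voice=0.7, w_pos=0.3):
    """Weighted combination of voiceprint and sound-source-position similarity."""
    return (w_voice * voiceprint_similarity(seg_a.voiceprint, seg_b.voiceprint)
            + w_pos * position_similarity(seg_a.source_position, seg_b.source_position))
```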
As shown in fig. 2c, the first layer includes audio segments M1 to M6. The overall similarity between any two audio segments may be calculated, and two audio segments whose overall similarity is higher than a set similarity threshold (e.g., 90%) are clustered together. For example, if the overall similarity between audio segment M1 and audio segment M3 is 91%, the overall similarity between audio segment M2 and audio segment M4 is 93%, and the overall similarity between audio segment M5 and audio segment M6 is 95%, then audio segments M1 and M3 are clustered to obtain a clustering result m1, audio segments M2 and M4 are clustered to obtain a clustering result m2, and audio segments M5 and M6 are clustered to obtain a clustering result m3. The clustering centers of the clustering results m1, m2, and m3 are then calculated respectively, the overall similarity of each pair of clustering results is calculated based on their clustering centers, and if that overall similarity exceeds the set threshold (e.g., 90%), the two clustering results are further merged. For example, if the overall similarity between clustering results m1 and m2 is 90%, the overall similarity between m1 and m3 is 85%, and the overall similarity between m2 and m3 is 80%, then m1 and m2 are merged into a clustering result m4, m3 is kept as a separate clustering result, and the audio segments in the first layer finally yield two clustering results, m3 and m4.
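The first-layer clustering in the example above can be sketched as a greedy agglomerative loop: repeatedly merge the most similar pair of clusters whose overall similarity exceeds the threshold, comparing clusters by their centers. The centroid helper below averages voiceprints and source positions, as in the first center-calculation example; any similarity function with the shape sketched earlier (such as overall_similarity) can be passed in. This is a hypothetical illustration, not the embodiment's prescribed procedure.

```python
from types import SimpleNamespace

def centroid(cluster):
    """Cluster center: mean voiceprint and mean source position of the members."""
    n = len(cluster)
    voice = [sum(vals) / n for vals in zip(*(seg.voiceprint for seg in cluster))]
    pos = sum(seg.source_position for seg in cluster) / n
    return SimpleNamespace(voiceprint=voice, source_position=pos)

def cluster_first_layer(segments, similarity, threshold=0.90):
    """Merge the most similar pair of clusters above the threshold until none remains."""
    clusters = [[seg] for seg in segments]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = similarity(centroid(clusters[i]), centroid(clusters[j]))
                if s >= threshold and (best is None or s > best[0]):
                    best = (s, i, j)
        if best is None:
            break
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters
```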
Further optionally, for each audio segment in any non-first layer, the process of clustering the audio segment into an existing clustering result includes: calculating the overall similarity between the audio segment and each existing clustering result according to the voiceprint feature and sound source position corresponding to the audio segment and the clustering center of the existing clustering result; and if, among the existing clustering results, there is a target clustering result whose overall similarity with the audio segment meets a set similarity condition, the audio segment and the audio segments in the target clustering result can be considered to come from the same speaker, the audio segment is added to the target clustering result, and the clustering center of the target clustering result is updated according to the voiceprint feature and sound source position corresponding to the audio segment.
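A minimal sketch of this assignment step follows, assuming a center_of helper (such as the centroid function above) and a similarity function of the shape sketched earlier; the threshold value is an assumed stand-in for the set similarity condition.

```python
def assign_to_existing(segment, clusters, similarity, center_of, threshold=0.90):
    """Cluster one non-first-layer segment into the best-matching existing result.

    Returns True if the segment joined a cluster (whose center is then
    recomputed by center_of on the next call), False if no cluster met the
    similarity condition and the segment must later seed a new cluster.
    """
    if not clusters:
        return False
    best_score, best_cluster = max(
        ((similarity(segment, center_of(c)), c) for c in clusters),
        key=lambda pair: pair[0],
    )
    if best_score >= threshold:
        best_cluster.append(segment)
        return True
    return False
```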
For the target clustering result, when a new audio segment is added to it, the clustering center of the target clustering result can be updated in, but is not limited to, the following manners. For example, the voiceprint features of all the audio segments contained in the target clustering result may be averaged, and the average value used as the central voiceprint feature of its clustering center; likewise, the sound source positions of those audio segments may be averaged, and the average value used as the central sound source position of its clustering center. For another example, the layer number to which each audio segment contained in the target clustering result belongs may be determined, with different weights set for different layer numbers, where the smaller the layer number, the larger the corresponding weight; the voiceprint features corresponding to the audio segments contained in the target clustering result are then weighted and summed according to the weights corresponding to their layer numbers to obtain a new central voiceprint feature; the sound source positions corresponding to those audio segments are weighted and summed in the same way to obtain a new central sound source position; and the new central voiceprint feature and the new central sound source position form the updated clustering center of the target clustering result.
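The layer-weighted variant of the center update might look like the following sketch, where seg.layer is an assumed attribute recording each segment's layer number, the 1/layer weights are one arbitrary choice satisfying "smaller layer number, larger weight", and the normalization of the weighted sums is an interpretation rather than something the embodiment prescribes.

```python
def weighted_cluster_center(cluster):
    """Layer-weighted update of a cluster center.

    Smaller layer numbers (longer, more voiceprint-stable segments) get
    larger weights; the weighted sums are normalized by the total weight.
    """
    weights = [1.0 / seg.layer for seg in cluster]  # seg.layer: 1, 2, 3, ...
    total = sum(weights)
    dim = len(cluster[0].voiceprint)
    center_voice = [
        sum(w * seg.voiceprint[k] for w, seg in zip(weights, cluster)) / total
        for k in range(dim)
    ]
    center_pos = sum(w * seg.source_position
                     for w, seg in zip(weights, cluster)) / total
    return center_voice, center_pos
```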
In the embodiment of the present application, for an audio signal from a multi-person speaking scene, the audio signal is first segmented into a plurality of audio segments based on the sound source position, the plurality of audio segments are then hierarchically clustered according to their durations, voiceprint features, and sound source positions, the audio segments corresponding to the same speaker are identified, and a user mark is added. Instead of simply clustering on voiceprint features, the sound source position, the voiceprint features, and hierarchical clustering are combined: the sound source position allows the audio signal to be segmented accurately, and hierarchical clustering reduces the influence of short speech on the recognition result. On this basis, using the voiceprint features to identify the audio segments corresponding to the same speaker can greatly improve recognition efficiency and makes the user marking result more accurate.
The present embodiment further provides an audio signal processing method, as shown in fig. 1c, the method includes:
101c, positioning a sound source of the audio signal collected in the multi-person speaking scene to obtain a change point of the sound source position;
102c, segmenting the audio signal into a plurality of audio segments according to the change point of the sound source position, and extracting the voiceprint characteristics of the plurality of audio segments;
103c, clustering the plurality of audio segments according to the voiceprint characteristics and the sound source positions of the plurality of audio segments to obtain audio segments corresponding to the same speaker;
104c, adding the same user mark for the audio segments corresponding to the same speaker to obtain the audio signal added with the user mark.
Wherein segmenting the audio signal into a plurality of audio segments according to the change points of the sound source position includes: taking the change points of the sound source position directly as speaker change points, and dividing the audio signal into a plurality of audio segments accordingly; or, in combination with VAD (voice activity detection) technology, using VAD to detect the starting points and end points of speech in the audio signal, correcting the change points of the sound source position according to these starting points and end points to obtain the speaker change points, and then dividing the audio signal into a plurality of audio segments according to the speaker change points.
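The VAD-based correction could be sketched as follows, assuming some external VAD detector has already produced sorted (start, end) speech intervals; the snapping policy (move a change point that falls in silence to the next speech start, drop trailing points) is one plausible interpretation of correcting change points with the starting points and end points, not the embodiment's prescribed rule.

```python
def correct_change_points(change_points, vad_intervals):
    """Snap sound-source change points to VAD speech boundaries.

    vad_intervals is a sorted list of (start, end) speech intervals. A change
    point inside speech is kept; one in a silence gap is moved to the start
    of the next speech interval; one after the last interval is dropped.
    """
    corrected = []
    for cp in sorted(change_points):
        for start, end in vad_intervals:
            if start <= cp <= end:      # inside speech: keep as-is
                corrected.append(cp)
                break
            if cp < start:              # in a silence gap: snap forward
                corrected.append(start)
                break
    return sorted(set(corrected))

def split_by_points(total_duration, points):
    """Divide [0, total_duration] into audio segments at the change points."""
    bounds = [0.0] + [p for p in points if 0.0 < p < total_duration] + [total_duration]
    return list(zip(bounds[:-1], bounds[1:]))
```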
In an optional embodiment, clustering the plurality of audio segments according to the voiceprint characteristics and the sound source position of the plurality of audio segments to obtain audio segments corresponding to the same speaker includes: layering the plurality of audio clips according to the time lengths of the plurality of audio clips to obtain a plurality of layers of audio clips corresponding to different time length ranges; and according to the voiceprint characteristics and the sound source positions corresponding to the plurality of audio segments, carrying out hierarchical clustering on the multi-layer audio segments according to the sequence of the time length range from long to short so as to obtain at least one clustering result, wherein each clustering result comprises the audio segments corresponding to the same speaker. For the detailed description of the steps in this embodiment, reference may be made to the foregoing embodiments, which are not repeated herein.
In the embodiment of the present application, for an audio signal from a multi-person speaking scene, the audio signal is first segmented into a plurality of audio segments based on the speaker change points, the plurality of audio segments are then hierarchically clustered according to their durations and voiceprint features, the audio segments corresponding to the same speaker are identified, and a user mark is added. Hierarchical clustering gives priority to the audio segments whose voiceprint features are more stable; compared with clustering all audio segments at once, it reduces the errors caused by audio segments with unstable voiceprint features, identifies the audio segments corresponding to the same speaker more accurately, improves recognition efficiency, and makes the user marking result more accurate.
The audio signal processing method provided by the embodiments of the present application can be applied to various multi-person speaking scenes, such as a multi-person conference scene, a business negotiation scene, or a teaching scene. In these application scenarios, the sound pickup apparatus of this embodiment may be deployed to capture audio signals in the multi-person speaking scene and to implement the other functions described in the foregoing method embodiments and the following system embodiments of the present application. To obtain a better capture effect and facilitate sound source localization of the audio signals, the placement of the sound pickup apparatus can be chosen according to the specific layout of the multi-person speaking scene. As shown in fig. 3a, in a multi-person conference scene, the sound pickup apparatus is deployed at the center of the conference table, and the speakers are distributed in different directions around it, so that the voice of each speaker can be picked up conveniently. As shown in fig. 3b, in a business negotiation scene, a first business party and a second business party sit opposite each other, a conference organizer sits between them and is responsible for organizing the negotiation, and the sound pickup apparatus is deployed at a central location among the conference organizer, the first business party, and the second business party, so that the three parties are in different directions around the sound pickup apparatus and their voices can be picked up conveniently. As shown in fig. 3c, in a teaching scene, the sound pickup apparatus is deployed on the teacher's desk, and the teacher and the students are located in different directions around it, so that the voices of the teacher and the students can be picked up at the same time.
Taking the application of the audio signal processing method provided in the above embodiment in a multi-person conference scene as an example, the conference recording can be performed for the multi-person conference scene, and further, the conference recording can be presented or reproduced. As shown in fig. 3d, a conference recording method provided in an exemplary embodiment of the present application includes the following steps:
301d, collecting audio signals in a multi-person conference scene, and identifying speaker change points in the audio signals;
302d, segmenting the audio signal into a plurality of audio segments according to the speaker change point, and extracting voiceprint features of the plurality of audio segments;
303d, performing hierarchical clustering on the plurality of audio segments according to the durations and the voiceprint features of the plurality of audio segments to obtain audio segments corresponding to the same speaker;
304d, adding the same user mark for the audio clip corresponding to the same speaker to obtain an audio signal added with the user mark;
and 305d, generating conference recording information according to the audio signal added with the user mark, wherein the conference recording information includes a conference identifier.
For a detailed description of steps 301d-304d, reference is made to the foregoing embodiments, which are not repeated herein. In this embodiment, the description focuses on step 305d. Specifically, after the audio signal added with the user mark is obtained, the conference recording information can be generated from it, and a corresponding conference identifier is added to the conference recording information, where the conference identifier is unique and can uniquely identify one multi-person conference. In an alternative embodiment, the audio signal to which the user mark is added may be used directly as the conference recording information. In another alternative embodiment, the audio signal with the added user marks may be converted into text information with speaker information, which may include, but is not limited to, content in a format similar to: speaker A: xxxx; speaker B: yyy; and so on; this text information with speaker information is then used as the conference recording information. With either form of conference recording information, the conference scene can be reproduced based on it, and the conference content can be conveniently queried or consulted.
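As a purely illustrative sketch of step 305d, the following builds text-form conference recording information from user-marked segments, assuming an external transcribe() speech-to-text call (not part of this embodiment) and a dictionary record layout chosen only for the example; the conference identifier here is generated as a random unique id.

```python
import uuid

def build_meeting_record(tagged_segments, transcribe):
    """Render user-marked audio segments as 'speaker: text' lines plus a conference id.

    tagged_segments: iterable of (user_mark, audio) pairs in playback order.
    transcribe: callable turning one audio segment into its text.
    """
    lines = [f"{user_mark}: {transcribe(audio)}"
             for user_mark, audio in tagged_segments]
    return {
        "meeting_id": uuid.uuid4().hex,  # a unique conference identifier
        "record": "\n".join(lines),
    }
```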
Fig. 3e is a schematic flowchart of a method for presenting a meeting record according to an exemplary embodiment of the present application, and as shown in fig. 3e, the method includes:
301e, receiving a conference lookup request, wherein the conference lookup request includes a conference identifier to be presented;
302e, obtaining conference recording information to be presented according to the conference identifier;
303e, presenting the conference recording information, wherein the conference recording information is generated according to an audio signal added with user marks in a multi-person conference scene; the same user mark is added to the audio segments corresponding to the same speaker among the plurality of audio segments obtained by segmenting the audio signal at the speaker change points, and the audio segments corresponding to the same speaker are obtained by hierarchically clustering the plurality of audio segments according to their durations and voiceprint features.
In this embodiment, for a multi-person conference scene, a conference recording may be performed, and the conference recording process is: collecting audio signals in a multi-person conference scene, and identifying a speaker change point in the audio signals; segmenting the audio signal into a plurality of audio segments according to a speaker change point in the audio signal, and extracting voiceprint characteristics of the plurality of audio segments; further, according to the duration and the voiceprint characteristics of the audio clips, performing hierarchical clustering on the multiple audio clips to obtain audio clips corresponding to the same speaker; adding the same user mark to the audio frequency segments corresponding to the same speaker to obtain an audio frequency signal added with the user mark; and generating conference recording information according to the audio signal added with the user mark, and adding a corresponding conference identifier for the conference recording information. For the related process of the conference recording, reference may be made to the foregoing embodiments, and details are not repeated herein.
After the conference recording information is obtained, the relevant conference content can be consulted through it, and a conference lookup service can be provided externally. On this basis, a conference lookup request sent from the outside can be received, the request carrying the conference identifier to be presented; and the conference recording information to be presented is obtained and presented based on that conference identifier and the conference identifiers in the stored conference recording information. Optionally, if the conference recording information is an audio signal added with user marks, the audio signal may be played through a player, or converted into text information and then displayed; if the conference recording information is text information with speaker information converted from such an audio signal, the text information may be displayed through a display, or played through a player. In this way, the need to query or consult the conference content can be met.
In addition, because the conference recording information includes the speaker information or the corresponding user marks, when the conference record is consulted, the conference content corresponding to a particular speaker can be consulted or played back separately, instead of the information of multiple speakers being mixed together, which improves the recognizability of the conference speakers and the conference content. For example, in addition to the conference identifier to be presented, the conference lookup request may include speaker information or a user mark, where the speaker information and the user mark have a corresponding relationship; the conference recording information to be presented is then obtained according to the conference identifier, the part of the conference content corresponding to the speaker information or user mark is obtained from the conference recording information, and that part of the conference content is presented.
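A minimal sketch of serving such a lookup follows, assuming conference records are stored in memory as a mapping from conference identifier to (user_mark, text) entries; the storage layout and all names are assumptions for illustration.

```python
def lookup_meeting(records, meeting_id, user_mark=None):
    """Return the whole conference record, or only one speaker's entries."""
    entries = records.get(meeting_id, [])
    if user_mark is not None:
        entries = [(mark, text) for mark, text in entries if mark == user_mark]
    return entries

# Usage: lookup_meeting(db, "m-42")           -> full conference record
#        lookup_meeting(db, "m-42", "spk_1")  -> only that speaker's content
```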
It should be noted that the method provided in the embodiments of the present application may be completed entirely by the sound pickup apparatus, or a part of its functions may be implemented on a server device; this is not limited. The sound pickup apparatus can be implemented as a recording pen, a recording rod, a recorder, or a sound pickup, or as a terminal device with a recording function, an audio/video conference device, or the like. On this basis, this embodiment provides an audio processing system and explains the process in which the audio signal processing method is implemented jointly by a sound pickup apparatus and a server device. As shown in fig. 4a, the audio processing system 400 includes: a sound pickup apparatus 401 and a server device 402. The audio processing system 400 may be applied to a multi-person speaking scene, such as the multi-person conference scene shown in fig. 3a, the business negotiation scene shown in fig. 3b, or the teaching scene shown in fig. 3c. In these scenes, the sound pickup apparatus 401 may cooperate with the server device 402 to implement the above-described method embodiments of the present application; the server device 402 is not shown in the multi-person speaking scenes of fig. 3a to 3c.
The sound pickup apparatus 401 of this embodiment has functional modules such as a power key, adjustment keys, a microphone array, and a speaker, and optionally may further include a display screen. The sound pickup apparatus 401 may implement functions such as automatic recording, MP3 playback, FM tuning, digital camera functions, remote recording, timed recording, external transcription, repeating, or editing. As shown in fig. 4a, in a multi-person speaking scene, the sound pickup apparatus 401 may collect the audio signal, identify the speaker change points in the audio signal, segment the audio signal into a plurality of audio segments according to the speaker change points, extract the voiceprint features corresponding to the plurality of audio segments, and send the plurality of audio segments and their corresponding voiceprint features to the server device 402.
In this embodiment, the server device 402 may receive multiple audio clips and corresponding voiceprint features thereof sent by the sound pickup device 401, and perform hierarchical clustering on the multiple audio clips according to the durations and the voiceprint features of the multiple audio clips to obtain audio clips corresponding to the same speaker; and adding the same user mark to the audio segments corresponding to the same speaker to obtain the audio signal added with the user mark.
In this embodiment, the sound pickup apparatus 401 may pick up the audio signal in the multi-person speaking scene with a microphone array, and the sound source position of a sound signal can be calculated based on the intensity of the same sound signal picked up by the microphones at different positions in the array. On this basis, in an optional embodiment of the present application, when identifying the speaker change points in the audio signal, the sound pickup apparatus 401 may perform sound source localization on the audio signal to obtain the change points of the sound source position, and may then segment a plurality of audio segments according to the speaker change points, where each audio segment corresponds to a unique sound source position. Accordingly, when the server device 402 hierarchically clusters the multiple audio segments according to their durations and voiceprint features to obtain the audio segments corresponding to the same speaker, it may hierarchically cluster the multiple audio segments according to their durations, voiceprint features, and sound source positions to obtain the audio segments corresponding to the same speaker.
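The intensity-based localization mentioned above could be caricatured as follows: the microphone channel with the highest frame energy indicates the rough direction of the speaker. This is a deliberate simplification for illustration; practical systems usually estimate time differences of arrival instead, and the per-microphone azimuth table below is an assumption.

```python
def estimate_azimuth(frame_by_mic, mic_azimuths):
    """Pick a rough source direction from per-microphone signal intensity.

    frame_by_mic: list of per-microphone sample lists for one audio frame.
    mic_azimuths: azimuth (degrees) of each microphone in the array.
    """
    energies = [sum(s * s for s in ch) / len(ch) for ch in frame_by_mic]
    loudest = max(range(len(energies)), key=energies.__getitem__)
    return mic_azimuths[loudest]  # direction of the loudest microphone
```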
In this embodiment, as shown in fig. 4b, another audio processing system is also provided. The embodiment shown in fig. 4b differs from the embodiment shown in fig. 4a in that: in fig. 4a, the voiceprint features corresponding to the multiple audio segments are extracted on the sound pickup apparatus 401, while in fig. 4b they are extracted on the server device 402; the other contents of fig. 4b are the same as or similar to those of fig. 4a, and for details reference may be made to the foregoing embodiments, which are not repeated here.
In this embodiment, after the server device 402 adds the same user tag to the audio clip corresponding to the same speaker, the audio signal with the user tag added may be stored for subsequent query and use. In an alternative embodiment, as shown in fig. 4a, the audio processing system further includes a transcription device 403, the server device 402 may send the audio signal with the added user mark to the transcription device 403, the transcription device 403 receives the audio signal with the added user mark, converts the audio signal with the added user mark into text information with the added user mark, and returns the text information with the added user mark to the server device 402 or stores the text information with the added user mark in the database 406. Further, as shown in fig. 4a, the audio processing system further includes a query end 404, where the query end 404 may send a first query request to the server device 402, where the first query request includes a user tag to be queried, and the server device 402 receives the first query request, obtains text information corresponding to the user tag to be queried from the text information with the user tag, and returns the text information to the query end 404.
In another alternative embodiment, as shown in fig. 4a, after the server device 402 generates the audio signal with the user mark, the audio signal with the user mark may be output to an upper layer application on the server device 402, for example, the upper layer application may be a teleconference application or a social application, and the like, the upper layer application may acquire user information in a multi-person speaking scene, for example, identification information of a user, such as a name, a nickname, or a voiceprint feature, and the upper layer application may associate the user information with the audio signal with the user mark. For example, the upper layer application stores a corresponding relationship between the user tag and the user information, and based on the corresponding relationship, the user information corresponding to the user tag can be found and associated with the audio signal with the user tag.
Further, as shown in fig. 4a, the query end 404 may send a second query request to the server end device 402, where the second query request includes an audio segment to be queried, and the server end device 402 receives the second query request, extracts a user tag corresponding to the audio segment to be queried from the audio signal to which the user tag is added, and returns the user tag and/or user information corresponding to the user tag to the query end 404.
In yet another alternative embodiment, as shown in fig. 4b, the audio processing system further includes a playback device 405, the playback device 405 may send an audio signal acquisition request to the server device 402, the server device 402 may output an audio signal with a user mark added to the playback device 405 based on the request, and the playback device 405 receives and plays the audio signal with the user mark added.
It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may be the same device, or different devices may be used as the execution subjects of the methods. For example, the execution subject of steps 101a to 103a may be device a; for another example, the execution subject of steps 101a and 102a may be device a, and the execution subject of step 103a may be device B; and so on.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations occurring in a specific order are included, but it should be clearly understood that these operations may be executed out of order or in parallel as they appear herein, and the sequence numbers of the operations, such as 101a, 102a, etc., are merely used to distinguish various operations, and the sequence numbers themselves do not represent any execution order. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor do they limit the types of "first" and "second".
Fig. 5 is a schematic structural diagram of a sound pickup apparatus according to an exemplary embodiment of the present application. As shown in fig. 5, the sound pickup apparatus includes: a memory 54 and a processor 55.
The memory 54 is used for storing computer programs and may be configured to store various other data to support operations on the sound pickup apparatus. Examples of such data include instructions for any application or method operating on the sound pickup apparatus, contact data, phonebook data, messages, pictures, videos, and so forth.
The memory 54 may be implemented by any type or combination of volatile and non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A processor 55 coupled to the memory 54 for executing computer programs in the memory 54 for: identifying a speaker change point in an audio signal collected in a multi-person speaking scene; segmenting an audio signal into a plurality of audio segments according to a speaker change point, and extracting voiceprint characteristics of the plurality of audio segments; according to the duration and the voiceprint characteristics of the plurality of audio segments, performing hierarchical clustering on the plurality of audio segments to obtain audio segments corresponding to the same speaker; and adding the same user mark to the audio segments corresponding to the same speaker to obtain the audio signal added with the user mark.
The above process may be completed entirely on the sound pickup apparatus, or some functions may be executed on the server device, for example: extracting the voiceprint features of the multiple audio segments; hierarchically clustering the multiple audio segments according to their durations and voiceprint features to obtain the audio segments corresponding to the same speaker; and adding the same user mark to the audio segments corresponding to the same speaker to obtain the audio signal added with the user mark. This part of the process may be completed by the server device in cooperation with the sound pickup apparatus.
In an optional embodiment, when the processor 55 performs hierarchical clustering on the multiple audio segments according to the time lengths and the voiceprint characteristics of the multiple audio segments to obtain the audio segments corresponding to the same speaker, the processor is specifically configured to: layering the plurality of audio clips according to the time lengths of the plurality of audio clips to obtain a plurality of layers of audio clips corresponding to different time length ranges; and according to the voiceprint characteristics corresponding to the plurality of audio segments, carrying out hierarchical clustering on the multi-layer audio segments according to the sequence of the time length range from long to short so as to obtain at least one clustering result, wherein each clustering result comprises the audio segments corresponding to the same speaker.
In an optional embodiment, when the processor 55 performs layering on multiple audio clips according to durations of the multiple audio clips to obtain multiple layers of audio clips corresponding to different duration ranges, the processor is specifically configured to: layering the plurality of audio clips according to the time lengths of the plurality of audio clips and preset time length thresholds of all layers to obtain a plurality of layers of audio clips corresponding to different time length ranges; the smaller the number of layers is, the larger the corresponding time length threshold is, and the time length of the audio clip in each layer is greater than or equal to the time length threshold of the layer.
In an optional embodiment, when the processor 55 performs hierarchical clustering on the multiple layers of audio segments according to the voiceprint features corresponding to the multiple audio segments and according to the sequence from long to short in the duration range to obtain at least one clustering result, the processor is specifically configured to: for the audio segments in the first layer, clustering the audio segments in the first layer according to the voiceprint characteristics corresponding to the audio segments in the first layer to obtain at least one clustering result; for the audio segments in the non-first layer, clustering the audio segments in the non-first layer to an existing clustering result according to the voiceprint characteristics corresponding to the audio segments in the non-first layer in sequence from small to large according to the layer number; and if the residual audio segments which are not clustered into the existing clustering results exist in the non-first layer, clustering the residual audio segments according to the voiceprint characteristics corresponding to the residual audio segments to generate a new clustering result until each audio segment on all the layers is clustered into one clustering result.
In an alternative embodiment, the processor 55, when identifying a speaker change point in an audio signal captured in a multi-person speaking scene, is specifically configured to: carrying out sound source positioning on the audio signal to obtain a change point of a sound source position; determining a speaker change point in the audio signal according to the change point of the sound source position; wherein each audio clip segmented by the speaker change point corresponds to a unique sound source position.
In an optional embodiment, when the processor 55 performs hierarchical clustering on the multiple layers of audio segments according to the voiceprint features corresponding to the multiple audio segments and according to the order of the time length range from long to short to obtain at least one clustering result, the processor is specifically configured to: and according to the voiceprint characteristics and the sound source positions corresponding to the plurality of audio fragments, carrying out hierarchical clustering on the multi-layer audio fragments according to the sequence from long to short in the time length range to obtain at least one clustering result, wherein each clustering result comprises the audio fragments corresponding to the same speaker.
In an optional embodiment, when the processor 55 performs hierarchical clustering on the multiple layers of audio segments according to the voiceprint features and the sound source locations corresponding to the multiple audio segments and according to the sequence from long to short in the duration range to obtain at least one clustering result, the hierarchical clustering method is specifically configured to: for the audio segments of the first layer, clustering the audio segments in the first layer according to the voiceprint features and the sound source positions corresponding to the audio segments in the first layer to obtain at least one clustering result; for the audio segments in the non-first layer, clustering the audio segments in the non-first layer to an existing clustering result according to the voiceprint characteristics and the sound source positions corresponding to the audio segments in the non-first layer in sequence from small to large in layer number; and if the residual audio segments which are not clustered into the existing clustering results exist in the non-first layer, clustering the residual audio segments according to the voiceprint features and the sound source positions corresponding to the residual audio segments to generate new clustering results until each audio segment on all the layers is clustered into one clustering result.
In an optional embodiment, for the audio segment in the first layer, when the processor 55 clusters the audio segments in the first layer according to the voiceprint features and the sound source locations corresponding to the audio segments in the first layer to obtain at least one clustering result, the processor is specifically configured to: under the condition that the first layer at least comprises two audio clips, calculating the overall similarity between the at least two audio clips in the first layer according to the voiceprint features and the sound source positions corresponding to the at least two audio clips in the first layer; dividing at least two audio segments in the first layer into at least one clustering result according to the overall similarity between the at least two audio segments in the first layer; and respectively calculating the clustering centers of the at least one clustering result according to the voiceprint characteristics and the sound source positions corresponding to the audio segments contained in the at least one clustering result, wherein the clustering centers comprise central voiceprint characteristics and central sound source positions.
In an optional embodiment, for the audio segments in the non-first layer, when the processor 55 sequentially clusters the audio segments in the non-first layer to an existing clustering result according to the voiceprint features and the sound source locations corresponding to the audio segments in the non-first layer according to the order from the small number of layers to the large number of layers, the processor is specifically configured to: for each audio clip in any non-first layer, calculating the overall similarity of the audio clip and the existing clustering result according to the voiceprint feature and the sound source position corresponding to the audio clip and the clustering center of the existing clustering result; and if the existing clustering results have target clustering results with the overall similarity of the audio segments meeting the set similarity condition, adding the audio segments into the target clustering results, and updating the clustering centers of the target clustering results according to the voiceprint features and the sound source positions corresponding to the audio segments.
In an optional embodiment, when the processor 55 updates the clustering center of the target clustering result according to the voiceprint feature and the sound source position corresponding to the audio segment, it is specifically configured to: determining the number of layers to which each audio clip contained in the target clustering result belongs, wherein different number of layers correspond to different weights, and the smaller the number of layers, the larger the corresponding weight; and respectively carrying out weighted summation on the voiceprint characteristics and the sound source positions corresponding to the audio segments according to the weights corresponding to the layer numbers of the audio segments, so as to obtain new central voiceprint characteristics and new central sound source positions which are used as the clustering centers after the target clustering results are updated.
In an optional embodiment, when the processor 55 performs hierarchical clustering on the multiple layers of audio segments according to the voiceprint features and the sound source locations corresponding to the multiple audio segments and according to the sequence of the time length range from long to short, so as to obtain at least one clustering result, the processor is specifically configured to: clustering the audio clips in each layer according to the voiceprint characteristics and the sound source positions corresponding to the audio clips in each layer to obtain a clustering result of each layer; and sequentially clustering the clustering results of two adjacent layers according to the sequence of the number of layers from small to large and the voiceprint characteristics and the sound source positions of the clustering results of each layer to obtain at least one clustering result.
In an alternative embodiment, processor 55 is further configured to: outputting the audio signal added with the user mark to the transcription equipment, so that the transcription equipment converts the audio signal added with the user mark into text information with the user mark; or outputting the audio signal added with the user mark to a playback device so that the playback device can play the audio signal with the user mark; or the audio signal added with the user mark is output to the upper layer application, so that the upper layer application acquires the user information corresponding to the user mark and associates the user information with the audio clip with the user mark.
In an alternative embodiment, processor 55 is further configured to: receiving a first query request, wherein the first query request comprises a user mark to be queried, acquiring text information corresponding to the user mark to be queried from the text information with the user mark, and returning the text information to a query end initiating the first query request; or receiving a second query request, wherein the second query request comprises an audio segment to be queried, extracting a user mark corresponding to the audio segment to be queried from the audio signal added with the user mark, and returning the user mark and/or user information corresponding to the user mark to a query end initiating the second query request.
For detailed description of each operation, reference may be made to the description in the foregoing method embodiments, and details are not repeated here.
Further, as shown in fig. 5, the sound pickup apparatus further includes: a communication component 56, a display 57, a power supply component 58, an audio component 59, and the like. Only some components are schematically shown in fig. 5, which does not mean that the sound pickup apparatus includes only the components shown in fig. 5.
Accordingly, embodiments of the present application further provide a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, causes the processor to implement the steps that may be performed by the sound pickup apparatus in the method embodiments shown in fig. 1a and fig. 1 b.
Accordingly, embodiments of the present application further provide a computer program product, which includes a computer program/instruction, and when the computer program/instruction is executed by a processor, the processor is enabled to implement the steps that can be executed by the sound pickup apparatus in the method embodiments shown in fig. 1a and fig. 1 b.
The embodiment of the present application further provides a sound pickup apparatus whose implementation structure is the same as or similar to that of the sound pickup apparatus shown in fig. 5 and which can be implemented with reference to that structure. The sound pickup apparatus provided in this embodiment differs from that of the embodiment shown in fig. 5 mainly in the functions performed by the processor when executing the computer program stored in the memory. For the sound pickup apparatus provided in this embodiment, the processor executes the computer program stored in the memory so as to be operable to: carry out sound source localization on an audio signal collected in a multi-person speaking scene to obtain the change points of the sound source position; segment the audio signal into a plurality of audio segments according to the change points of the sound source position, and extract the voiceprint features of the plurality of audio segments; cluster the plurality of audio segments according to the voiceprint features and sound source positions of the plurality of audio segments to obtain the audio segments corresponding to the same speaker; and add the same user mark to the audio segments corresponding to the same speaker to obtain the audio signal added with the user mark. For a detailed description of each operation, reference may be made to the description in the foregoing method embodiments, which is not repeated here.
The above process may be completely completed on the sound pickup device, or a part of functions may be performed on the server device, for example, extracting voiceprint features of a plurality of audio clips; clustering the plurality of audio segments according to the voiceprint characteristics and the sound source position of the plurality of audio segments to obtain audio segments corresponding to the same speaker; the process of adding the same user tag to the audio segments corresponding to the same speaker to obtain the audio signal added with the user tag can be completed by the cooperation of the server device.
Accordingly, the present application further provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, causes the processor to implement the steps that can be performed by the sound pickup apparatus in the method embodiment shown in fig. 1 c.
Accordingly, an embodiment of the present application further provides a computer program product, which includes a computer program/instruction, and when the computer program/instruction is executed by a processor, the processor is enabled to implement the steps that can be executed by the sound pickup apparatus in the method embodiment shown in fig. 1 c.
Fig. 6 is a schematic structural diagram of a server device according to an exemplary embodiment of the present application. As shown in fig. 6, the server device includes: a memory 64 and a processor 65.
The memory 64 is used for storing computer programs and may be configured to store various other data to support operations on the server device. Examples of such data include instructions for any application or method operating on the server device, contact data, phonebook data, messages, pictures, videos, and the like.
The memory 64 may be implemented by any type or combination of volatile and non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A processor 65, coupled to the memory 64, for executing computer programs in the memory 64 for: receiving a plurality of audio clips sent by pickup equipment and corresponding voiceprint characteristics of the audio clips; according to the duration and the voiceprint characteristics of the multiple audio segments, performing hierarchical clustering on the multiple audio segments to obtain audio segments corresponding to the same speaker; and adding the same user mark to the audio segments corresponding to the same speaker to obtain the audio signal added with the user mark. For detailed description of each operation, reference may be made to the description in the foregoing method embodiments, and details are not repeated here.
Further, as shown in fig. 6, the server device further includes: communication components 66, power components 68, and the like. Only some of the components are schematically shown in fig. 6, and it is not meant that the server device includes only the components shown in fig. 6.
An embodiment of the present application further provides a server device, where an implementation structure of the server device is the same as or similar to the implementation structure of the server device shown in fig. 6, and may be implemented with reference to the structure of the server device shown in fig. 6. The difference between the server device provided in this embodiment and the server device in the embodiment shown in fig. 6 mainly lies in: the functions performed by the processor executing the computer programs stored in the memory are different. For the server-side device provided in this embodiment, the processor of the server-side device executes the computer program stored in the memory, and is configured to: receiving a plurality of audio clips sent by a pickup device; extracting voiceprint characteristics corresponding to the plurality of audio segments, and performing hierarchical clustering on the plurality of audio segments according to the duration and the voiceprint characteristics of the plurality of audio segments to obtain audio segments corresponding to the same speaker; and adding the same user mark to the audio segments corresponding to the same speaker to obtain the audio signal added with the user mark. For a detailed description of each operation, reference may be made to the description in the foregoing method embodiments, and details are not repeated here.
Accordingly, the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed by a processor, the processor is enabled to implement the steps that can be executed by the server device in the audio signal processing method embodiment.
Accordingly, the present application further provides a computer program product, which includes a computer program/instruction, and when the computer program/instruction is executed by a processor, the processor is enabled to implement the steps that can be executed by the server device in the foregoing audio signal processing method embodiment.
In addition to the above devices, an embodiment of the present application further provides a conference recording device, which includes: a memory and a processor; the memory is used for storing a computer program; the processor is coupled to the memory and executes the computer program stored in the memory to: collect audio signals in a multi-person conference scene, and identify the speaker change points in the audio signals; segment the audio signal into a plurality of audio segments according to the speaker change points, and extract the voiceprint features of the plurality of audio segments; perform hierarchical clustering on the plurality of audio segments according to their durations and voiceprint features to obtain the audio segments corresponding to the same speaker; add the same user mark to the audio segments corresponding to the same speaker to obtain an audio signal added with the user mark; and generate conference recording information according to the audio signal added with the user mark, wherein the conference recording information includes a conference identifier.
An embodiment of the present application further provides a conference record presenting device, which includes: a memory and a processor; the memory is used for storing a computer program; the processor is coupled to the memory and executes the computer program stored in the memory to: receive a conference lookup request, wherein the conference lookup request includes a conference identifier to be presented; acquire the conference recording information to be presented according to the conference identifier; and present the conference recording information, wherein the conference recording information is generated according to an audio signal added with user marks in a multi-person conference scene; the same user mark is added to the audio segments corresponding to the same speaker among the plurality of audio segments obtained by segmenting the audio signal at the speaker change points, and the audio segments corresponding to the same speaker are obtained by hierarchically clustering the plurality of audio segments according to their durations and voiceprint features.
Accordingly, the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed by a processor, the processor is enabled to implement the steps in the method embodiments shown in fig. 3d or fig. 3 e.
Accordingly, embodiments of the present application further provide a computer program product, which includes a computer program/instruction, when executed by a processor, causes the processor to implement the steps in the method embodiments shown in fig. 3d or fig. 3 e.
The communication components of fig. 5 and 6 described above are configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices. The device where the communication component is located can access a wireless network based on a communication standard, such as WiFi, a mobile communication network such as 2G, 3G, 4G/LTE, 5G, or the like, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further comprises a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The display in fig. 5 described above includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The power supply components of fig. 5 and 6 described above provide power to the various components of the device in which the power supply components are located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
The audio component of fig. 5 described above may be configured to output and/or input an audio signal. For example, the audio component includes a Microphone (MIC) configured to receive an external audio signal when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in a memory or transmitted via a communication component. In some embodiments, the audio assembly further comprises a speaker for outputting audio signals.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (26)

1. An audio signal processing method, comprising:
identifying a speaker change point in an audio signal collected in a multi-person speaking scene;
segmenting the audio signal into a plurality of audio segments according to the speaker change point, and extracting the voiceprint features of the plurality of audio segments;
according to the duration and the voiceprint characteristics of the plurality of audio segments, performing hierarchical clustering on the plurality of audio segments to obtain audio segments corresponding to the same speaker;
and adding the same user mark to the audio segments corresponding to the same speaker to obtain the audio signal added with the user mark.
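The following Python sketch is provided for illustration only and is not part of the claimed subject matter: it shows the final marking step of claim 1, in which every clustering result receives one user mark. The TimeSpan tuples, the "speaker_N" mark format, and the dictionary layout are assumptions introduced here, not features of the claims.

```python
from typing import Dict, List, Tuple

TimeSpan = Tuple[float, float]  # (start_sec, end_sec) of one audio segment

def add_user_marks(clusters: List[List[TimeSpan]]) -> List[Dict]:
    """Give every clustering result (one speaker) a single user mark, so all
    audio segments of the same speaker carry the same mark."""
    marked = []
    for idx, cluster in enumerate(clusters):
        mark = f"speaker_{idx + 1}"  # hypothetical mark format
        for start, end in cluster:
            marked.append({"start": start, "end": end, "user_mark": mark})
    return sorted(marked, key=lambda m: m["start"])

# Toy usage: two clustering results yield two user marks on one timeline.
print(add_user_marks([[(0.0, 4.2), (9.1, 12.0)], [(4.2, 9.1)]]))
```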
2. The method of claim 1, wherein performing hierarchical clustering on the plurality of audio segments according to the durations and voiceprint features of the plurality of audio segments to obtain audio segments corresponding to the same speaker comprises:
layering the plurality of audio segments according to the durations of the plurality of audio segments to obtain multiple layers of audio segments corresponding to different duration ranges;
and performing hierarchical clustering on the multiple layers of audio segments, in order of duration range from long to short, according to the voiceprint features corresponding to the plurality of audio segments, to obtain at least one clustering result, wherein each clustering result comprises the audio segments corresponding to the same speaker.
3. The method of claim 2, wherein layering the plurality of audio segments according to the durations of the plurality of audio segments to obtain multiple layers of audio segments corresponding to different duration ranges comprises:
layering the plurality of audio segments according to the durations of the plurality of audio segments and a preset duration threshold for each layer, to obtain multiple layers of audio segments corresponding to different duration ranges;
wherein a smaller layer number corresponds to a larger duration threshold, and the duration of each audio segment in a layer is greater than or equal to the duration threshold of that layer.
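For illustration only, a minimal Python sketch of the layering rule in claim 3. The descending threshold values (6.0, 2.0 and 0.0 seconds) and the AudioSegment type are assumptions; the trailing zero threshold simply guarantees that every segment lands in some layer.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class AudioSegment:
    start: float  # seconds
    end: float    # seconds

    @property
    def duration(self) -> float:
        return self.end - self.start

def layer_segments(segments: List[AudioSegment],
                   thresholds: Sequence[float] = (6.0, 2.0, 0.0)) -> List[List[AudioSegment]]:
    """thresholds are sorted descending, so layer 1 (index 0) has the largest
    duration threshold; a segment joins the first layer whose threshold its
    duration meets, keeping every layer's durations >= that layer's threshold."""
    layers: List[List[AudioSegment]] = [[] for _ in thresholds]
    for seg in segments:
        for i, th in enumerate(thresholds):
            if seg.duration >= th:
                layers[i].append(seg)  # smallest qualifying layer number
                break
    return layers

# Toy usage: a 7 s segment goes to layer 1, a 3 s segment to layer 2, etc.
print(layer_segments([AudioSegment(0, 7), AudioSegment(7, 10), AudioSegment(10, 11)]))
```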
4. The method according to claim 3, wherein performing hierarchical clustering on the multiple layers of audio segments, in order of duration range from long to short, according to the voiceprint features corresponding to the plurality of audio segments, to obtain at least one clustering result comprises:
for the audio segments in the first layer, clustering the audio segments in the first layer according to the voiceprint features corresponding to the audio segments in the first layer to obtain at least one clustering result;
for the audio segments in the non-first layers, sequentially clustering, in order of layer number from small to large, the audio segments in each non-first layer into the existing clustering results according to the voiceprint features corresponding to those audio segments; and
if a non-first layer contains remaining audio segments that are not clustered into the existing clustering results, clustering the remaining audio segments according to the voiceprint features corresponding to the remaining audio segments to generate a new clustering result, until every audio segment in all layers is clustered into a clustering result.
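For illustration only, a simplified greedy reading of the layer-by-layer clustering in claim 4: segments in the first layer form the initial clustering results, segments in lower layers then attach to the existing results, and unmatched segments seed new results. The cosine similarity measure, the 0.75 threshold and the running-mean center update are assumptions, not the claimed algorithm's parameters.

```python
import numpy as np

def _cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def cluster_layers(layers, threshold=0.75):
    """layers: per-layer lists of voiceprint vectors, layer 1 first.
    Greedily clusters layer 1 among itself, then attaches lower-layer
    segments to existing clusters; unmatched segments seed new clusters.
    Returns, per layer, the cluster index assigned to each segment."""
    centers: list = []
    labels = [[] for _ in layers]
    for li, layer in enumerate(layers):
        for vp in layer:
            vp = np.asarray(vp, dtype=float)
            sims = [_cosine(vp, c) for c in centers]
            best = int(np.argmax(sims)) if sims else -1
            if best >= 0 and sims[best] >= threshold:
                labels[li].append(best)                     # join a cluster
                centers[best] = (centers[best] + vp) / 2.0  # crude re-center
            else:
                labels[li].append(len(centers))             # new cluster
                centers.append(vp)
    return labels
```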
5. The method of claim 3, wherein identifying speaker change points in an audio signal captured in a multi-person speaking scene comprises:
carrying out sound source positioning on the audio signal to obtain a change point of a sound source position;
determining a speaker change point in the audio signal according to the change point of the sound source position; wherein each audio segment segmented at a speaker change point corresponds to a unique sound source position.
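For illustration only, a sketch of how the change points of claim 5 might be derived from a per-frame direction-of-arrival (DOA) track. The 0.1 s frame hop and the 20 degree jump threshold are assumptions.

```python
from typing import List

def change_points_from_doa(doa_degrees: List[float],
                           frame_sec: float = 0.1,
                           min_jump: float = 20.0) -> List[float]:
    """doa_degrees: per-frame azimuth estimates of the active sound source.
    Returns times (seconds) where the azimuth jumps by at least min_jump,
    treated here as speaker change points."""
    points = []
    for i in range(1, len(doa_degrees)):
        jump = abs(doa_degrees[i] - doa_degrees[i - 1]) % 360.0
        jump = min(jump, 360.0 - jump)  # shortest arc on the circle
        if jump >= min_jump:
            points.append(i * frame_sec)
    return points

# Toy usage: the source sits near 30 degrees and then jumps to ~150 degrees.
print(change_points_from_doa([30, 31, 29, 150, 151, 149]))
```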
6. The method according to claim 5, wherein performing hierarchical clustering on the multiple layers of audio segments, in order of duration range from long to short, according to the voiceprint features corresponding to the plurality of audio segments, to obtain at least one clustering result comprises:
performing hierarchical clustering on the multiple layers of audio segments, in order of duration range from long to short, according to the voiceprint features and the sound source positions corresponding to the plurality of audio segments, to obtain at least one clustering result, wherein each clustering result comprises the audio segments corresponding to the same speaker.
7. The method according to claim 6, wherein performing hierarchical clustering on the multiple layers of audio segments, in order of duration range from long to short, according to the voiceprint features and the sound source positions corresponding to the plurality of audio segments, to obtain at least one clustering result comprises:
for the audio segments in the first layer, clustering the audio segments in the first layer according to the voiceprint features and the sound source positions corresponding to the audio segments in the first layer to obtain at least one clustering result;
for the audio segments in the non-first layers, sequentially clustering, in order of layer number from small to large, the audio segments in each non-first layer into the existing clustering results according to the voiceprint features and the sound source positions corresponding to those audio segments; and
if a non-first layer contains remaining audio segments that are not clustered into the existing clustering results, clustering the remaining audio segments according to the voiceprint features and the sound source positions corresponding to the remaining audio segments to generate a new clustering result, until every audio segment in all layers is clustered into a clustering result.
8. The method of claim 7, wherein, for the audio segments in the first layer, clustering the audio segments in the first layer according to the voiceprint features and the sound source positions corresponding to the audio segments in the first layer to obtain at least one clustering result comprises:
when the first layer comprises at least two audio segments, calculating the overall similarity between the at least two audio segments in the first layer according to the voiceprint features and the sound source positions corresponding to the at least two audio segments in the first layer;
dividing the at least two audio segments in the first layer into at least one clustering result according to the overall similarity between the at least two audio segments in the first layer; and
respectively calculating the clustering center of each of the at least one clustering result according to the voiceprint features and the sound source positions corresponding to the audio segments contained therein, wherein each clustering center comprises a central voiceprint feature and a central sound source position.
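For illustration only, one possible realization of the overall similarity of claim 8: a weighted fusion of cosine voiceprint similarity with an azimuth proximity score. The fusion weight alpha and the representation of sound source positions as azimuth angles are assumptions.

```python
import numpy as np

def overall_similarity(vp_a, az_a, vp_b, az_b, alpha=0.8):
    """Weighted fusion of cosine voiceprint similarity and an angular
    proximity score in [0, 1]; az_* are azimuths in degrees."""
    vp_a, vp_b = np.asarray(vp_a, float), np.asarray(vp_b, float)
    vp_sim = float(vp_a @ vp_b /
                   (np.linalg.norm(vp_a) * np.linalg.norm(vp_b) + 1e-9))
    diff = abs(az_a - az_b) % 360.0
    diff = min(diff, 360.0 - diff)
    pos_sim = 1.0 - diff / 180.0  # 1 when co-located, 0 when opposite
    return alpha * vp_sim + (1.0 - alpha) * pos_sim
```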
9. The method of claim 8, wherein, for the audio segments in the non-first layers, sequentially clustering, in order of layer number from small to large, the audio segments in each non-first layer into the existing clustering results according to the voiceprint features and the sound source positions corresponding to those audio segments comprises:
for each audio segment in any non-first layer, calculating the overall similarity between the audio segment and each existing clustering result according to the voiceprint feature and the sound source position corresponding to the audio segment and the clustering center of the existing clustering result;
and if, among the existing clustering results, there is a target clustering result whose overall similarity with the audio segment meets a set similarity condition, adding the audio segment into the target clustering result, and updating the clustering center of the target clustering result according to the voiceprint feature and the sound source position corresponding to the audio segment.
10. The method of claim 9, wherein updating the clustering center of the target clustering result according to the voiceprint feature and the sound source position corresponding to the audio segment comprises:
determining the layer number to which each audio segment contained in the target clustering result belongs, wherein different layer numbers correspond to different weights, and a smaller layer number corresponds to a larger weight;
and performing weighted summation, according to the weight corresponding to the layer number of each audio segment, on the voiceprint features and the sound source positions corresponding to the audio segments respectively, to obtain a new central voiceprint feature and a new central sound source position as the updated clustering center of the target clustering result.
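For illustration only, a sketch of the layer-weighted center update of claim 10. The LAYER_WEIGHTS table is an assumed example of the rule that a smaller layer number carries a larger weight, and the azimuth average ignores circular wrap-around for brevity.

```python
import numpy as np

LAYER_WEIGHTS = {1: 1.0, 2: 0.6, 3: 0.3}  # hypothetical per-layer weights

def update_center(members):
    """members: list of (layer_no, voiceprint, azimuth) for one target
    clustering result. Returns (central_voiceprint, central_azimuth) via
    layer-weighted summation."""
    w = np.array([LAYER_WEIGHTS[layer] for layer, _, _ in members])
    w = w / w.sum()  # normalize so the weighted sum acts as a weighted mean
    vps = np.array([vp for _, vp, _ in members], dtype=float)
    azs = np.array([az for _, _, az in members], dtype=float)
    return w @ vps, float(w @ azs)
```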
11. The method according to claim 6, wherein performing hierarchical clustering on the multiple layers of audio segments, in order of duration range from long to short, according to the voiceprint features and the sound source positions corresponding to the plurality of audio segments, to obtain at least one clustering result comprises:
clustering the audio segments in each layer according to the voiceprint features corresponding to the audio segments in that layer, to obtain a clustering result for each layer;
and sequentially clustering the clustering results of adjacent layers, in order of layer number from small to large, according to the voiceprint features of the clustering results of each layer, to obtain at least one clustering result.
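For illustration only, a reading of the second step of claim 11: the per-layer clustering results are folded together starting from layer 1, merging a lower-layer center into its best match when the voiceprint similarity clears a threshold. The cosine measure and the 0.75 threshold are assumptions.

```python
import numpy as np

def _cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def merge_adjacent_layers(layer_centers, threshold=0.75):
    """layer_centers: per-layer lists of cluster-center voiceprints, layer 1
    first. Folds each lower layer's clustering results into the running set,
    merging a center into its best match when similarity clears the
    threshold and keeping it as a new result otherwise."""
    merged = [np.asarray(c, dtype=float) for c in layer_centers[0]]
    for layer in layer_centers[1:]:
        for c in layer:
            c = np.asarray(c, dtype=float)
            sims = [_cosine(c, m) for m in merged]
            best = int(np.argmax(sims)) if sims else -1
            if best >= 0 and sims[best] >= threshold:
                merged[best] = (merged[best] + c) / 2.0
            else:
                merged.append(c)
    return merged
```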
12. The method according to any one of claims 1-11, further comprising:
outputting the audio signal added with the user mark to a transcription device, so that the transcription device converts the audio signal added with the user mark into text information with the user mark;
or
outputting the audio signal added with the user mark to a playback device, so that the playback device plays the audio signal with the user mark;
or
outputting the audio signal added with the user mark to an upper-layer application, so that the upper-layer application acquires user information corresponding to the user mark and associates the user information with the audio segments having the user mark.
13. The method of claim 12, further comprising:
receiving a first query request, wherein the first query request comprises a user mark to be queried; acquiring text information corresponding to the user mark to be queried from the text information with the user mark; and returning the text information to a query end initiating the first query request;
or
receiving a second query request, wherein the second query request comprises an audio segment to be queried; extracting a user mark corresponding to the audio segment to be queried from the audio signal added with the user mark; and returning the user mark and/or user information corresponding to the user mark to a query end initiating the second query request.
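For illustration only, a sketch of the two query paths of claim 13. The in-memory list-of-dictionaries store and its field names are assumptions standing in for whatever index the query end actually consults.

```python
from typing import Dict, List, Optional, Tuple

def query_text_by_mark(texts: List[Dict], mark: str) -> List[str]:
    """First query path: return all transcribed lines carrying the mark."""
    return [t["text"] for t in texts if t["user_mark"] == mark]

def query_mark_by_span(marked: List[Dict], span: Tuple[float, float]) -> Optional[str]:
    """Second query path: return the user mark of the stored segment whose
    time span overlaps the queried segment's span."""
    qs, qe = span
    for m in marked:
        if m["start"] < qe and qs < m["end"]:  # interval overlap test
            return m["user_mark"]
    return None
```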
14. An audio signal processing method, comprising:
carrying out sound source positioning on an audio signal collected in a multi-person speaking scene to obtain a change point of a sound source position;
segmenting the audio signal into a plurality of audio segments according to the change point of the sound source position, and extracting the voiceprint features of the plurality of audio segments;
clustering the plurality of audio segments according to the voiceprint characteristics and the sound source positions of the plurality of audio segments to obtain audio segments corresponding to the same speaker;
and adding the same user mark to the audio segments corresponding to the same speaker to obtain the audio signal added with the user mark.
15. The method of claim 14, wherein clustering the plurality of audio segments according to the voiceprint features and the sound source positions of the plurality of audio segments to obtain audio segments corresponding to the same speaker comprises:
layering the plurality of audio segments according to the durations of the plurality of audio segments to obtain multiple layers of audio segments corresponding to different duration ranges;
and performing hierarchical clustering on the multiple layers of audio segments, in order of duration range from long to short, according to the voiceprint features and the sound source positions corresponding to the plurality of audio segments, to obtain at least one clustering result, wherein each clustering result comprises the audio segments corresponding to the same speaker.
16. A conference recording method, comprising:
collecting an audio signal in a multi-person conference scene, and identifying speaker change points in the audio signal;
segmenting the audio signal into a plurality of audio segments according to the speaker change point, and extracting the voiceprint features of the plurality of audio segments;
according to the duration and the voiceprint characteristics of the plurality of audio segments, performing hierarchical clustering on the plurality of audio segments to obtain audio segments corresponding to the same speaker;
adding the same user mark to the audio segments corresponding to the same speaker to obtain an audio signal added with the user mark;
and generating conference recording information according to the audio signal added with the user mark, wherein the conference recording information comprises a conference identifier.
17. A method for presenting a conference record, comprising:
receiving a conference look-up request, wherein the conference look-up request comprises a conference identifier to be presented;
acquiring conference record information to be presented according to the conference identifier;
presenting the conference record information, wherein the conference record information is generated according to an audio signal added with a user mark in a multi-person conference scene;
wherein the audio signal added with the user mark is obtained by adding the same user mark to the audio segments corresponding to the same speaker among a plurality of audio segments segmented at speaker change points in the audio signal, and the audio segments corresponding to the same speaker are obtained by performing hierarchical clustering on the plurality of audio segments according to the durations and voiceprint features of the plurality of audio segments.
18. An audio processing system, comprising: the system comprises sound pickup equipment and server-side equipment;
the sound pickup equipment is deployed in a multi-person speaking scene and is used for collecting an audio signal in the multi-person speaking scene, identifying speaker change points in the audio signal, segmenting the audio signal into a plurality of audio segments according to the speaker change points, and extracting voiceprint features corresponding to the plurality of audio segments;
the server-side equipment is used for performing hierarchical clustering on the plurality of audio segments according to the durations and voiceprint features of the plurality of audio segments to obtain audio segments corresponding to the same speaker, and adding the same user mark to the audio segments corresponding to the same speaker to obtain the audio signal added with the user mark.
19. The system of claim 18,
the sound pickup equipment is specifically configured to: carry out sound source positioning on the audio signal to obtain a change point of a sound source position, and determine a speaker change point in the audio signal according to the change point of the sound source position, wherein each audio segment segmented at a speaker change point corresponds to a unique sound source position;
the server-side equipment is specifically configured to: perform hierarchical clustering on the plurality of audio segments according to the durations, voiceprint features and sound source positions of the plurality of audio segments to obtain the audio segments corresponding to the same speaker.
20. An audio processing system, comprising: the system comprises sound pickup equipment and server-side equipment;
the sound pickup equipment is deployed in a multi-person speaking scene and is used for collecting an audio signal in the multi-person speaking scene, identifying speaker change points in the audio signal, and segmenting the audio signal into a plurality of audio segments according to the speaker change points;
the server-side equipment is used for extracting voiceprint features corresponding to the plurality of audio segments, performing hierarchical clustering on the plurality of audio segments according to the durations and voiceprint features of the plurality of audio segments to obtain audio segments corresponding to the same speaker, and adding the same user mark to the audio segments corresponding to the same speaker to obtain the audio signal added with the user mark.
21. A sound pickup apparatus, comprising: a processor and a memory;
the memory for storing a computer program;
the processor is coupled with the memory for executing the computer program for: identifying a speaker change point in an audio signal collected in a multi-person speaking scene; segmenting the audio signal into a plurality of audio segments according to the speaker change point, and extracting the voiceprint features of the plurality of audio segments; according to the duration and the voiceprint characteristics of the plurality of audio segments, performing hierarchical clustering on the plurality of audio segments to obtain audio segments corresponding to the same speaker; and adding the same user mark to the audio segments corresponding to the same speaker to obtain the audio signal added with the user mark.
22. A sound pickup apparatus, comprising: a processor and a memory;
the memory for storing a computer program;
the processor is coupled with the memory for executing the computer program for: carrying out sound source positioning on an audio signal collected in a multi-person speaking scene to obtain a change point of a sound source position; segmenting the audio signal into a plurality of audio segments according to the change point of the sound source position, and extracting the voiceprint features of the plurality of audio segments; clustering the plurality of audio segments according to the voiceprint features and the sound source positions of the plurality of audio segments to obtain audio segments corresponding to the same speaker; and adding the same user mark to the audio segments corresponding to the same speaker to obtain the audio signal added with the user mark.
23. A server-side device, comprising: a processor and a memory;
the memory for storing a computer program;
the processor is coupled with the memory for executing the computer program for: receiving a plurality of audio segments sent by a sound pickup apparatus and the voiceprint features corresponding to the plurality of audio segments; performing hierarchical clustering on the plurality of audio segments according to the durations and voiceprint features of the plurality of audio segments to obtain audio segments corresponding to the same speaker; and adding the same user mark to the audio segments corresponding to the same speaker to obtain the audio signal added with the user mark.
24. A server-side device, comprising: a processor and a memory;
the memory for storing a computer program;
the processor is coupled with the memory for executing the computer program for: receiving a plurality of audio segments sent by a sound pickup apparatus; extracting voiceprint features corresponding to the plurality of audio segments, and performing hierarchical clustering on the plurality of audio segments according to the durations and voiceprint features of the plurality of audio segments to obtain audio segments corresponding to the same speaker; and adding the same user mark to the audio segments corresponding to the same speaker to obtain the audio signal added with the user mark.
25. A computer-readable storage medium storing a computer program, which when executed by a processor causes the processor to carry out the steps of the method according to any one of claims 1 to 17.
26. A computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, cause the processor to implement the steps in the method of any one of claims 1-17.
CN202110105959.1A 2021-01-26 2021-01-26 Audio signal processing method, conference recording and presenting method, apparatus, system and medium Pending CN114792522A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110105959.1A CN114792522A (en) 2021-01-26 2021-01-26 Audio signal processing method, conference recording and presenting method, apparatus, system and medium
PCT/CN2022/073092 WO2022161264A1 (en) 2021-01-26 2022-01-21 Audio signal processing method, conference recording and presentation method, device, system, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110105959.1A CN114792522A (en) 2021-01-26 2021-01-26 Audio signal processing method, conference recording and presenting method, apparatus, system and medium

Publications (1)

Publication Number Publication Date
CN114792522A true CN114792522A (en) 2022-07-26

Family

ID=82460469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110105959.1A Pending CN114792522A (en) 2021-01-26 2021-01-26 Audio signal processing method, conference recording and presenting method, apparatus, system and medium

Country Status (2)

Country Link
CN (1) CN114792522A (en)
WO (1) WO2022161264A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115828907A (en) * 2023-02-16 2023-03-21 南昌航天广信科技有限责任公司 Intelligent conference management method, system, readable storage medium and computer equipment

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115168643B (en) * 2022-09-07 2023-04-07 腾讯科技(深圳)有限公司 Audio processing method, device, equipment and computer readable storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021785A (en) * 2014-05-28 2014-09-03 华南理工大学 Method of extracting speech of most important guest in meeting
CN106657865B (en) * 2016-12-16 2020-08-25 联想(北京)有限公司 Conference summary generation method and device and video conference system
CN107733666A (en) * 2017-10-31 2018-02-23 珠海格力电器股份有限公司 A kind of meeting implementation method, device and electronic equipment
US10847162B2 (en) * 2018-05-07 2020-11-24 Microsoft Technology Licensing, Llc Multi-modal speech localization
CN111128223B (en) * 2019-12-30 2022-08-05 科大讯飞股份有限公司 Text information-based auxiliary speaker separation method and related device
CN111613249A (en) * 2020-05-22 2020-09-01 云知声智能科技股份有限公司 Voice analysis method and equipment
CN111739553B (en) * 2020-06-02 2024-04-05 深圳市未艾智能有限公司 Conference sound collection, conference record and conference record presentation method and device
CN112165599A (en) * 2020-10-10 2021-01-01 广州科天视畅信息科技有限公司 Automatic conference summary generation method for video conference

Also Published As

Publication number Publication date
WO2022161264A1 (en) 2022-08-04

Similar Documents

Publication Publication Date Title
US10819811B2 (en) Accumulation of real-time crowd sourced data for inferring metadata about entities
CN106024009B (en) Audio processing method and device
US7995732B2 (en) Managing audio in a multi-source audio environment
US10236017B1 (en) Goal segmentation in speech dialogs
US20080235018A1 (en) Method and System for Determing the Topic of a Conversation and Locating and Presenting Related Content
WO2020238209A1 (en) Audio processing method, system and related device
CN106155470B (en) A kind of audio file generation method and device
WO2022161264A1 (en) Audio signal processing method, conference recording and presentation method, device, system, and medium
US20210232776A1 (en) Method for recording and outputting conversion between multiple parties using speech recognition technology, and device therefor
CN104978145A (en) Recording realization method and apparatus and mobile terminal
CN107945806B (en) User identification method and device based on sound characteristics
CN105931642B (en) Voice recognition method, device and system
WO2016197708A1 (en) Recording method and terminal
CN107533850A (en) Audio content recognition methods and device
US20120035919A1 (en) Voice recording device and method thereof
CN104867494A (en) Naming and classification method and system of sound recording files
CN101867742A (en) Television system based on sound control
CN110570847A (en) Man-machine interaction system and method for multi-person scene
CN111785291A (en) Voice separation method and voice separation device
KR20230118089A (en) User Speech Profile Management
KR20160108874A (en) Method and apparatus for generating conversation record automatically
EP2913822B1 (en) Speaker recognition
JP2015094811A (en) System and method for visualizing speech recording
KR102540177B1 (en) Method for providing transcript service by seperating overlapping voices between speakers
KR20140086853A (en) Apparatus and Method Managing Contents Based on Speaker Using Voice Data Analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination